Monitoring systems may be used to secure environments and, more generally, to track activity in these environments. A monitoring system may provide a variety of functionalities and may include a variety of controllable and configurable options and parameters. These features may greatly benefit from a user-friendly control interface.
In general, in one aspect, the invention relates to a method for natural language-based interaction with a vision-based monitoring system. The method includes obtaining a request input from a user, by the vision-based monitoring system. The request input is directed to an object detected by a classifier of the vision-based monitoring system. The method further includes obtaining an identifier associated with the request input, identifying a site of the vision-based monitoring system, from a plurality of sites, based on the identifier, generating a database query based on the request input and the identified site, and obtaining, from a monitoring system database, video frames that relate to the database query. The video frames include the detected object. The method also includes providing the video frames to the user.
In general, in one aspect, the invention relates to a non-transitory computer readable medium including instructions that enable a vision-based monitoring system to obtain a request input from a user. The request input is directed to an object detected by a classifier of the vision-based monitoring system. The instructions further enable the system to obtain an identifier associated with the request input, identify a site of the vision-based monitoring system from a plurality of sites, based on the identifier, generate a database query based on the request input and the identified site, and obtain, from a monitoring system database, video frames that relate to the database query. The video frames include the detected object. The instructions also enable the system to provide the video frames to the user.
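For purposes of illustration only, the flow summarized above, namely obtaining a request input, identifying a site based on an identifier, generating a database query, and obtaining matching video frames, might be sketched as follows. The class and function names (e.g., MonitoringSystemDatabase, handle_request) and the simplified query logic are assumptions made for this sketch and are not part of the claimed method.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class VideoFrame:
    site: str
    camera: str
    timestamp: float
    tags: FrozenSet[str]        # labels assigned by the classifier, e.g., {"person", "dog"}


class MonitoringSystemDatabase:
    """Toy stand-in for the monitoring system database."""

    def __init__(self, frames: List[VideoFrame], user_sites: Dict[str, str]):
        self._frames = frames
        self._user_sites = user_sites       # identifier -> site

    def identify_site(self, identifier: str) -> str:
        # identify the site, from a plurality of sites, based on the identifier
        return self._user_sites[identifier]

    def query(self, site: str, object_label: str) -> List[VideoFrame]:
        # return archived frames from the identified site that include the detected object
        return [f for f in self._frames if f.site == site and object_label in f.tags]


def handle_request(db: MonitoringSystemDatabase, identifier: str, object_label: str) -> List[VideoFrame]:
    """Obtain the request, identify the site, query the database, return matching frames."""
    site = db.identify_site(identifier)
    return db.query(site, object_label)


db = MonitoringSystemDatabase(
    frames=[VideoFrame("jeffs_condominium", "living_room", 1200.0, frozenset({"dog"}))],
    user_sites={"jeff": "jeffs_condominium"})
print(handle_request(db, identifier="jeff", object_label="dog"))
```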
Specific embodiments of the invention will now be described in detail with reference to the accompanying figures. In the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a more thorough understanding of the invention. However, it will be apparent to one of ordinary skill in the art that the invention may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.
In general, embodiments of the invention relate to a monitoring system used for monitoring and/or securing an environment. More specifically, one or more embodiments of the invention enable speech interaction with the monitoring system for various purposes, including the configuration of the monitoring system and/or the control of functionalities of the monitoring system. In one or more embodiments of the technology, the monitoring system supports spoken language queries, thereby allowing a user to interact with the monitoring system using common language. Consider, for example, a scenario in which a user of the monitoring system returns home after work and wants to know whether the dog-sitter has walked the dog. The owner may ask the monitoring system: “Tell me when the dog sitter was here.” In response, the monitoring system may analyze the activity registered throughout the day and may, for example, reply by providing the time when the dog sitter was last seen by the monitoring system, or it may alternatively or in addition play back a video recorded by the monitoring system when the dog sitter was at the house. Speech interaction may thus be used to request and review activity captured by the monitoring system. Those skilled in the art will recognize that the above-described scenario is merely an example, and that the invention is not limited to this example. A detailed description is provided below.
In one embodiment of the invention, the monitoring system (100) may classify certain objects, e.g., stationary objects such as a table (background element B (152B)) as background elements. Further, in one embodiment of the invention, the monitoring system (100) may classify other objects, e.g., moving objects such as a human or a pet, as foreground objects (154A, 154B). The monitoring system (100) may further classify detected foreground objects (154A, 154B) as threats, for example, if the monitoring system (100) determines that a person (154A) detected in the monitored environment (150) is an intruder, or as harmless, for example, if the monitoring system (100) determines that the person (154A) detected in the monitored environment (150) is the owner of the monitored premises, or if the classified object is a pet (154B). Embodiments of the invention may be based on classification schemes ranging from a mere distinction between moving and non-moving objects to the distinction of many classes of objects, including for example the recognition of particular people and/or the distinction of different pets, without departing from the invention.
In one embodiment of the invention, the monitoring system (100) includes a camera system (102) and a remote processing service (112). In one embodiment of the invention, the monitoring system further includes one or more remote computing devices (114). Each of these components is described below.
The camera system (102) may include a video camera (108) and a local computing device (110), and may further include a depth sensing camera (104). The camera system (102) may be a portable unit that may be positioned such that the field of view of the video camera (108) covers an area of interest in the environment to be monitored. The camera system (102) may be placed, for example, on a shelf in a corner of a room to be monitored, thereby enabling the camera to monitor the space between the camera system (102) and a back wall of the room. Other locations of the camera system may be used without departing from the invention.
The video camera (108) of the camera system (102) may be capable of continuously capturing a two-dimensional video of the environment (150). The video camera may use, for example, an RGB or CMYG color or grayscale CCD or CMOS sensor with a spatial resolution of, for example, 320×240 pixels, and a temporal resolution of 30 frames per second (fps). Those skilled in the art will appreciate that the invention is not limited to the aforementioned image sensor technologies, temporal, and/or spatial resolutions. Further, the video camera's frame rate may vary, for example, depending on the lighting situation in the monitored environment.
In one embodiment of the invention, the camera system (102) further includes a depth-sensing camera (104) that may be capable of reporting multiple depth values from the monitored environment (150). For example, the depth-sensing camera (104) may provide depth measurements for a set of 320×240 pixels (Quarter Video Graphics Array (QVGA) resolution) at a temporal resolution of 30 frames per second (fps). The depth-sensing camera (104) may be based on scanner-based or scannerless depth measurement techniques such as, for example, LIDAR, using time-of-flight measurements to determine a distance to an object in the field of view of the depth-sensing camera (104). The field of view and the orientation of the depth sensing camera may be selected to cover a portion of the monitored environment (150) similar (or substantially similar) to the portion of the monitored environment captured by the video camera. In one embodiment of the invention, the depth-sensing camera (104) may further provide a two-dimensional (2D) grayscale image, in addition to the depth-measurements, thereby providing a complete three-dimensional (3D) grayscale description of the monitored environment (150). Those skilled in the art will appreciate that the invention is not limited to the aforementioned depth-sensing technology, temporal, and/or spatial resolutions. For example, stereo cameras may be used rather than time-of-flight-based cameras.
In one embodiment of the invention, the camera system (102) further includes components that enable communication between a person in the monitored environment and the monitoring system. The camera system may thus include a microphone (122) and/or a speaker (124). The microphone (122) and the speaker (124) may be used to support acoustic communication, e.g., verbal communication, as further described below.
In one embodiment of the invention, the camera system (102) includes a local computing device (110). Any combination of mobile, desktop, server, embedded, or other types of hardware may be used to implement the local computing device. For example, the local computing device (110) may be a system on a chip (SOC), i.e., an integrated circuit (IC) that integrates all components of the local computing device (110) into a single chip. The SOC may include one or more processor cores, associated memory (e.g., random access memory (RAM), cache memory, flash memory, etc.), a network interface (e.g., to a local area network (LAN), a wide area network (WAN) such as the Internet, a mobile network, or any other type of network) via a network interface connection (not shown), and interfaces to storage devices, input and output devices, etc. The local computing device (110) may further include one or more storage device(s) (e.g., a hard disk, an optical drive such as a compact disk (CD) drive or digital versatile disk (DVD) drive, a flash memory stick, etc.), and numerous other elements and functionalities. In one embodiment of the invention, the local computing device includes an operating system (e.g., Linux) that may include functionality to execute the methods further described below. Those skilled in the art will appreciate that the invention is not limited to the aforementioned configuration of the local computing device (110). In one embodiment of the invention, the local computing device (110) may be integrated with the video camera (108) and/or the depth-sensing camera (104). Alternatively, the local computing device (110) may be detached from the video camera (108) and/or the depth-sensing camera (104) and may use wired and/or wireless connections to interface with them. In one embodiment of the invention, the local computing device (110) executes methods that include functionality to implement at least portions of the various methods described below.
Continuing with the discussion of the monitoring system (100), the remote processing service (112) may be a platform that is remote from the camera system (102) and that processes data received from the camera system (102). The remote processing service (112) may, for example, host the monitoring system database and other components of the monitoring system, as further described below.
In one or more embodiments of the invention, the monitoring system (100) includes one or more remote computing devices (114). A remote computing device (114) may be a device (e.g., a personal computer, laptop, smart phone, tablet, etc.) capable of receiving notifications from the remote processing service (112) and/or from the camera system (102). A notification may be, for example, a text message, a phone call, a push notification, etc. In one embodiment of the invention, the remote computing device (114) may include functionality to enable a user of the monitoring system (100) to interact with the camera system (102) and/or the remote processing service (112), as subsequently described below.
The components of the monitoring system (100), i.e., the camera system(s) (102), the remote processing service (112) and the remote computing device(s) (114) may communicate using any combination of wired and/or wireless communication protocols. In one embodiment of the invention, the camera system(s) (102), the remote processing service (112) and the remote computing device(s) (114) communicate via a wide area network (116) (e.g., over the Internet), and/or a local area network (116) (e.g., an enterprise or home network). The communication between the components of the monitoring system (100) may include any combination of secured (e.g., encrypted) and non-secure (e.g., un-encrypted) communication. The manner in which the components of the monitoring system (100) communicate may vary based on the implementation of the invention.
Additional details regarding the monitoring system and the detection of events that is based on the distinction of foreground objects from the background of the monitored environment are provided in U.S. patent application Ser. No. 14/813,907 filed Jul. 30, 2015, the entire disclosure of which is hereby expressly incorporated by reference herein.
One skilled in the art will recognize that the monitoring system is not limited to the components described above.
Turning to the speech interaction with the monitoring system (200), a user (250) may interact with the monitoring system via an input device (202), a speech-to-text conversion engine (204), a database query generation engine (206), and a monitoring system database (208). Each of these components is described below.
The user (250), in accordance with one or more embodiments of the invention, may be any user of the monitoring system, including, but not limited to, the owner of the monitoring system, a family member, or an administrative user that configures the monitoring system, but also a person that is not affiliated with the monitoring system, including, for example, a stranger that is detected in the monitored environment (150) by the monitoring system (200). In one embodiment of the invention, the user (250) directs a request to an input device (202) of the monitoring system (200). The request may be a spoken request or a text request, e.g., a typed text. Accordingly, the input device may include the microphone (122) of the camera system (102), or it may include a microphone (not shown) of a remote computing device (114), e.g., of a smartphone, if the request is a spoken request. Alternatively, if the request is a text request, the input device may include a keyboard (not shown) of the remote computing device. The request may also be obtained as a file that includes the recorded audio of a spoken request or a typed text. The interaction of the user (250) with the monitoring system may, thus, be local, with the user being in the monitored environment (150), or it may be remote, with the user being anywhere, and being remotely connected to the monitoring system via a remote computing device (114). The request, issued by the user (250), may be any kind of spoken or typed request and may be, e.g., a question or a command. Multiple exemplary user requests are discussed in the subsequently introduced use cases. In one embodiment of the invention, the request is provided using natural, spoken language and therefore does not require the user to be familiar with a particular request syntax. In one embodiment of the invention, the input device (202) captures other audio signals, in addition to the user request. For example, the input device may capture additional interactions with the user, after the user has provided an original user request, as further discussed below. Accordingly, the audio signal captured by the input device (202) may be any kind of spoken user input, without departing from the invention.
In one or more embodiments of the invention, the input device further includes a speech-to-text conversion engine (204) that is configured to convert the recorded audio signal, e.g., the spoken user input, to text. The speech-to-text conversion engine (204) may be a software module hosted on either the local computing device (110) of the camera system (102), or on the remote computing device (114), or it may be a component of the remote processing service (112). In one embodiment of the invention, the speech-to-text conversion engine is a cloud service (e.g., a Software as a Service (SaaS), provided by a third party). The speech-to-text conversion engine may convert the recorded spoken user input to a text in the form of a string.
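As an illustrative sketch only, the speech-to-text conversion engine might be hidden behind a small interface so that a local module, a module on the remote computing device, or a third-party cloud service can be substituted. The SpeechToTextEngine interface and the canned test engine below are assumptions of this sketch, not an actual service API.

```python
from abc import ABC, abstractmethod


class SpeechToTextEngine(ABC):
    """Abstract speech-to-text conversion engine; concrete engines may run on the
    local computing device, on a remote computing device, or as a cloud service."""

    @abstractmethod
    def transcribe(self, audio: bytes, sample_rate_hz: int = 16000) -> str:
        """Convert recorded spoken user input to a text string."""


class CannedSpeechToText(SpeechToTextEngine):
    """Stand-in engine that returns a fixed transcription; useful only for testing."""

    def __init__(self, canned_text: str):
        self._canned_text = canned_text

    def transcribe(self, audio: bytes, sample_rate_hz: int = 16000) -> str:
        return self._canned_text


engine = CannedSpeechToText("tell me when the dog sitter was here")
text = engine.transcribe(b"\x00" * 32000)   # raw audio captured by the microphone (122)
print(text)
```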
The text, in one or more embodiments of the invention, is provided to the database query generation engine (206). The database query generation engine (206) may be a software and/or hardware module hosted on either the local computing device (110) of the camera system (102) or on a remote computing device (114). The database query generation engine converts the text into a database query in a format suitable for querying the monitoring system database. To do so, the database query generation engine may analyze the text to extract a message or meaning from the text and generate a database query that reflects the meaning of the text. The database query generation engine may rely on natural language processing methods, which may include probabilistic models of word sequences and may be based on, for example, n-gram models. Other natural language processing methods may be used without departing from the invention. Further, the database query generation engine may recognize regular expressions such as, in the case of the monitoring system, camera names, user names, dates, times, ranges of dates and times, etc. Those skilled in the art will appreciate that various methods may be used by the database query generation engine to generate a database query based on the text.
In one embodiment of the invention, the database query generation engine is further configured to resolve texts for which it is initially unable to completely understand all content. This may be the case, for example, if the text includes elements that are ambiguous or unknown to the database query generation engine. In such a scenario, the database query generation engine may attempt to obtain the missing information as supplementary data from the monitoring system database, and/or the database query generation engine may contact the user with a clarification request, enabling the user to provide clarification using spoken language. A description of the obtaining of supplementary data from the monitoring system database (208) and the obtaining of user clarification is provided below.
Continuing with the discussion of the database query generation engine, once a complete database query has been generated, the database query is directed to the monitoring system database. The monitoring system database (208), upon receipt of the database query, addresses the query. Addressing the query may include providing a query result to the user and/or updating content of the monitoring system database. The use cases, introduced below, provide illustrative examples of query results returned to the user and of updates of the monitoring system database.
Turning to the monitoring system database (300), in one or more embodiments of the invention, the monitoring system database includes a video data archive (310) and a metadata archive (330). Each of these components is described below.
In one or more embodiments of the invention, the video data archive (310) stores video data captured by the camera system (102) of the monitoring system (100). The video archive (310) may be implemented using any format suitable for the storage of video data. The video data may be provided by the camera system as a continuous stream of frames, e.g., in the H.264 format, or in any other video format, with or without compression. The video data may further be accompanied by depth data and/or audio data. Accordingly, the video archive may include archived video streams (312) and archived depth data streams (314). An archived video stream (312) may be the continuously or non-continuously recorded stream of video frames received from a camera and may be stored in any currently available or future video format. Similarly, an archived depth data stream (314) may be the continuously or non-continuously recorded stream of depth data frames received from a depth-sensing camera. The video archive may include multiple video streams and/or audio streams. More specifically, the video archive may include a stream for each camera system installed on a site, such as a house protected by the monitoring system. Consider, for example, a home with two floors. On the first floor, a first camera system monitors the front door and a second camera system monitors the living room. On the second floor, a third camera system monitors the master bedroom. The site thus includes three camera systems (102), and the video archive (310) includes three separate archived video streams, one for each of the three camera systems. The video archive, as previously noted, may archive video data obtained from many sites.
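A minimal sketch of such a per-site, per-camera archive layout is shown below. The ArchivedFrame and VideoArchive structures and the in-memory lists are illustrative assumptions; an actual video archive would store encoded streams on persistent storage, as described in the following paragraphs.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ArchivedFrame:
    timestamp: float
    video_data: bytes                      # e.g., an encoded H.264 video frame
    depth_data: Optional[bytes] = None     # optional frame from the depth-sensing camera


@dataclass
class VideoArchive:
    """One archived stream per camera system, grouped by site."""
    # streams[site_id][camera_name] -> ordered list of archived frames
    streams: Dict[str, Dict[str, List[ArchivedFrame]]] = field(
        default_factory=lambda: defaultdict(lambda: defaultdict(list)))

    def append(self, site_id: str, camera_name: str, frame: ArchivedFrame) -> None:
        self.streams[site_id][camera_name].append(frame)


archive = VideoArchive()
# the example site with three camera systems: front door, living room, master bedroom
for camera in ("front_door", "living_room", "master_bedroom"):
    archive.append("example_home", camera, ArchivedFrame(timestamp=0.0, video_data=b""))
print(sorted(archive.streams["example_home"]))
```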
As video data are received and archived in the video archive (310), tags may be added to label the content of the video streams, as subsequently described. The tags may label objects and/or actions detected in video streams, thus enabling a later retrieval of the video frames in which the object and/or action occurred.
The video archive (310) may be hosted on any type of non-volatile (or persistent) storage, including, for example, a hard disk drive, NAND Flash memory, NOR Flash memory, Magnetic RAM Memory (M-RAM), Spin Torque Magnetic RAM Memory (ST-MRAM), Phase Change Memory (PCM), or any other memory defined as a non-volatile Storage Class Memory (SCM). Further, the video archive (310) may be implemented using a redundant array of independent disks (RAID), network attached storage (NAS), cloud storage, etc. At least some of the content of the video archive may alternatively or in addition be stored in volatile memory, e.g., Dynamic Random-Access Memory (DRAM), Synchronous DRAM, SDR SDRAM, and DDR SDRAM. The storage used for the video archive (310) may be a component of the remote processing service (112), or it may be located elsewhere, e.g., in a dedicated storage array or in a cloud storage service, where the video archive (310) may be stored in logical pools that are decoupled from the underlying physical storage environment.
In one or more embodiments of the invention, the metadata archive (330) stores data that accompanies the data in the video archive (310). Specifically, the metadata archive (330) may include labels for the content stored in the video archive, using tags, and other additional information that is useful or necessary for the understanding and/or retrieval of content stored in the video archive. In one embodiment of the invention, the labels are organized as site-specific data (332) and camera-specific data (352).
The metadata archive (330) may be a document-oriented database or any other type of database that enables the labeling of video frames in the video archive (310). Similar to the video archive (310), the metadata archive (330) may also be hosted on any type of non-volatile (or persistent) storage, in redundant arrays of independent disks, network attached storage, cloud storage, etc. At least some of the content of the metadata archive may alternatively or in addition be stored in volatile memory. The storage used for the metadata archive (330) may be a component of the remote processing service (112), or it may be located elsewhere, e.g., in a dedicated storage array or in a cloud storage service.
The site-specific data (332) may provide definitions and labeling of elements in the archived video streams that are site-specific, but not necessarily camera-specific. For example, referring to the previously introduced home protected by the three camera systems (102), people moving within the house are not camera-specific, as they may appear anywhere in the house. In the example, the owner of the home would be recognized by the monitoring system (100) as a moving object regardless of which camera system (102) sees the owner. Accordingly, the owner is considered a moving object that is site-specific but not camera-specific. As previously noted, the monitoring system database may store data for many sites. The use of site-specific data (332) may enable strict separation of data for different sites. For example, while one site may have a moving object that is the owner of one monitoring system, another site may have a moving object that is considered the owner of another monitoring system. While both owners are considered moving objects, they are distinguishable because they are associated with different sites. Accordingly, there may be a set of site-specific data (332) for each site for which data are stored in the monitoring system database (300).
In one or more embodiments of the invention, frames of the archived video streams in which a moving object is recognized are tagged using site-specific moving object tags (336). Moving object tags (336) may be used to tag frames that include moving objects detected by any camera system of the site, such that the frames can be located, for example, for later playback. For example, a user request to show the dog's activity throughout the day may be served by identifying, in the archived video streams (312), the frames that show the dog, as indicated by moving object tags (336) for the dog. Separate moving object tags may be generated for moving objects including, but not limited to, persons, pets, specific persons, etc., if the monitoring system is capable of distinguishing between these. In other words, site-specific moving object tags may enable the identification of video and/or depth data frames that include the site-specific moving object. Those skilled in the art will appreciate that any kind of moving object that is detectable by the monitoring system may be tagged. For example, if the monitoring system is capable of distinguishing different pets, e.g., cats and dogs, it may use separate tags for cats and dogs, rather than classifying both as pets. Similarly, the monitoring system may be able to distinguish between adults and children, and/or the monitoring system may be able to distinguish between different people, e.g., using face recognition. Accordingly, the moving object tags (336) may include person-specific tags.
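A minimal sketch of how moving object tags might reference archived frames, under the assumption of a simple in-memory tag list, is shown below. The MovingObjectTag structure and the frames_with_object helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MovingObjectTag:
    """Site-specific tag marking frames of an archived stream that contain a moving object."""
    site_id: str
    object_label: str          # e.g., "dog", "person", or a person-specific label
    camera_name: str
    frame_indices: List[int]   # indices into the archived video stream of that camera


def frames_with_object(tags: List[MovingObjectTag],
                       site_id: str, label: str) -> List[Tuple[str, int]]:
    """Return (camera, frame index) pairs showing the requested moving object at the site."""
    return [(tag.camera_name, index)
            for tag in tags
            if tag.site_id == site_id and tag.object_label == label
            for index in tag.frame_indices]


tags = [MovingObjectTag("jeffs_condominium", "dog", "living_room", [120, 121, 122])]
print(frames_with_object(tags, "jeffs_condominium", "dog"))   # the dog's activity
```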
Moving object tags may be generated as follows. As a video stream is received and archived in the video archive (310), a foreground object detection may be performed. In one embodiment of the invention, a classifier that is trained to distinguish foreground objects (e.g., humans, dogs, cats, etc.) is used to classify the foreground object(s) detected in a video frame. The classification may be performed based on the foreground object appearing in a single frame or based on a foreground object track, i.e., the foreground object appearing in a series of subsequent frames.
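One possible sketch of this tagging step, assuming the classifier is provided as a callable that maps detection features to a label, is shown below; the toy classifier and the frame representation are placeholders for illustration only.

```python
from typing import Callable, Dict, List


def tag_foreground_objects(
        frames: List[Dict],                      # each frame: {"index": int, "detections": [features]}
        classify: Callable[[List[float]], str],  # trained classifier: features -> label
) -> Dict[str, List[int]]:
    """Run the classifier over detected foreground objects and group frame indices by label."""
    tags: Dict[str, List[int]] = {}
    for frame in frames:
        for features in frame["detections"]:
            label = classify(features)
            tags.setdefault(label, []).append(frame["index"])
    return tags


def toy_classifier(features: List[float]) -> str:
    # toy rule standing in for a trained classifier: tall detections are people
    return "person" if features[0] > 1.2 else "pet"


frames = [{"index": 0, "detections": [[1.8]]}, {"index": 1, "detections": [[0.4]]}]
print(tag_foreground_objects(frames, toy_classifier))   # {'person': [0], 'pet': [1]}
```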
The site-specific data (332) of the metadata archive (330) may further include moving object definitions (334). A moving object definition may establish characteristics of the moving object that make the moving object uniquely identifiable. The moving object definition may include, for example, a name of the moving object, e.g., a person's or a pet's name. The moving object definition may further include a definition of those characteristics that are being used by the monitoring system to uniquely identify the moving object. These characteristics may include, but are not limited to, the geometry or shape of the moving object, color, texture, etc., i.e., visual characteristics. A moving object definition may further include other metadata such as the gender of a person, and/or any other descriptive information.
In one or more embodiments of the invention, the moving object definitions (334) may grow over time and may be completed by additional details as they become available. Consider, for example, a person that is newly registered with the site. The monitoring system may initially know only the name of the person. Next, assume that the person's cell phone is registered with the monitoring system, for example, by installing an application associated with the monitoring system on the person's cell phone. The moving object definitions may now include an identifier of the person's cell phone. Once the person visits the site, the monitoring system may recognize the presence of the cell phone, e.g., based on the cell phone with the identifier connecting to a local wireless network or by the cell phone providing location information (e.g., based on global positioning system data or cell phone tower information). If, while the cell phone is present, an unknown person is seen by a camera of the monitoring system, the monitoring system may infer that the unknown person is the person associated with the cell phone, and thus corresponds to the newly registered person. Based on this inferred identity, the monitoring system may store visual characteristics, captured by the camera, under the moving object definition to enable future visual identification of the person. The monitoring system may rely on any of the information stored in the moving object definition to recognize the person. For example, the monitoring system may conclude that the person is present based on the detection of the cell phone, even when the person is not visually detected.
The site-specific data (332) of the metadata archive (330), in one embodiment of the invention, further include action tags (340). Action tags may be used to label particular actions that the monitoring system is capable of recognizing. For example, the monitoring system may be able to recognize a person entering the monitored environment, e.g., through the front door. The corresponding video frames of the videos stored in the video archive may thus be tagged with the recognized action “entering through front door”. Action tags may be used to serve database queries that are directed toward an action. For example, the user may submit the request “Who was visiting today?”, to which the monitoring system may respond by providing a summary video clip that shows all people that were seen entering through the front door. Action tags in combination with moving object tags may enable a targeted retrieval of video frames from the video archive. For example, the combination of the action tag “entering through front door” with the moving object tag “Fred” will only retrieve video frames in which Fred is shown entering through the front door, while not retrieving video frames of other persons entering through the front door.
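For illustration, the combination of action tags and moving object tags might amount to intersecting the frame sets referenced by the two kinds of tags, as sketched below with hypothetical tag dictionaries.

```python
from typing import Dict, List, Set


def frames_matching(action_tags: Dict[str, Set[int]],
                    object_tags: Dict[str, Set[int]],
                    action: str, obj: str) -> List[int]:
    """Frames carrying both the requested action tag and the requested moving object tag."""
    return sorted(action_tags.get(action, set()) & object_tags.get(obj, set()))


action_tags = {"entering through front door": {10, 11, 42}}
object_tags = {"fred": {11, 42, 99}, "unknown person": {10}}
# only frames in which Fred is shown entering through the front door
print(frames_matching(action_tags, object_tags, "entering through front door", "fred"))  # [11, 42]
```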
Action tags may be generated based on foreground object tracks. More specifically, in the subsequent video frames that form a foreground object track, motion descriptors such as speed, trajectories, and particular movement patterns (e.g., waving, walking) may be detected. If a particular set of motion descriptors, corresponding to an action, is detected, the video frames that form the foreground object track are tagged with the corresponding action tag.
The site-specific data (332) of the metadata archive (330) may further include action definitions (338). An action definition may establish characteristics of an action that make the action uniquely identifiable. The action definition may include, for example, a name of the action. In the above example of a person entering through the front door, the action may be named "person entering through front door". The action definition may further include a definition of those characteristics that are being used by the monitoring system to uniquely identify the action. These characteristics may include, for example, a definition of an object track, spanning multiple video frames, that defines the action.
In one embodiment of the invention, the metadata archive (330) further includes a site configuration (342). The site configuration may include the configuration information of the monitoring system. For example, the site configuration may specify accounts for users and administrators of the monitoring system, including credentials (e.g., user names and passwords), privileges, and access restrictions. The site configuration may further specify the environments that are being monitored and/or the camera systems being used to monitor these environments.
Continuing with the discussion of the metadata archive (330), in one embodiment of the invention, camera-specific data (352) include static object definitions (354) and/or a camera configuration (356). Separate static object definitions (354) and camera configurations (356) may exist for each of the camera systems (102) of the monitoring system (100). The camera-specific data (352) may provide labeling of elements in the archived video streams that are camera-specific, i.e., elements that may not be seen by other camera systems. For example, referring to the previously introduced home protected by the three camera systems, the bedroom door is camera-specific, because only the camera system installed in the bedroom can see the bedroom door.
The static objects (354), in accordance with an embodiment of the invention, include objects that are continuously present in the environment monitored by a camera system. Unlike moving objects that may appear and disappear, static objects are thus permanently present and therefore do not need to be tagged in the archived video streams. However, a definition of the static objects may be required in order to detect interactions of moving objects with these static objects. Consider, for example, a user submitting the question: "Who entered through the front door?" To answer this question, a classification of all non-moving objects as background without further distinction is not sufficient. The camera-specific data (352) therefore include definitions of static objects (354) that enable the monitoring system to detect interactions of moving objects with these static objects. Static objects may thus be defined in the camera-specific data (352), e.g., based on their geometry, location, texture, or any other feature that enables the detection of moving objects' interaction with these static objects. Static objects may include, but are not limited to, doors, windows, and furniture.
The presence and appearance of static objects in a monitored environment may change under certain circumstances, e.g., when the camera system is moved, or when the lighting in the monitored environment changes. Accordingly, the static object definitions (354) may be updated under these conditions. Further, an entirely new set of static object definitions (354) may be generated if a camera system is relocated to a different room. In such a scenario, the originally defined static objects become meaningless and may therefore be discarded, whereas the relevant static objects in the new monitored environment are captured by a new set of static object definitions (354) in the camera-specific data (352).
Continuing with the discussion of the camera-specific data (352), the camera configuration (356), in accordance with an embodiment of the invention, includes settings and parameters that are specific to a particular camera system (102) of the monitoring system (100). A camera configuration may exist for each camera system of the monitoring system. The camera configuration may include, for example, a name of the camera system, an address of the camera system, a location of the camera system, and/or any other information that is necessary or beneficial for the operation of the monitoring system. Names of camera systems may be selected by the user and may be descriptive. For example, a camera system that is set up to monitor the front door may be named “front door”. Addresses of camera systems may be network addresses to be used to communicate with the camera systems. A camera system address may be, for example, an Internet Protocol (IP) address.
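A camera configuration might, for example, be represented by a small record such as the following sketch; the CameraConfiguration fields shown are limited to the examples named above and are not exhaustive.

```python
from dataclasses import dataclass


@dataclass
class CameraConfiguration:
    """Per-camera settings stored in the camera-specific data of the metadata archive."""
    name: str          # descriptive, user-selected name, e.g., "front door"
    ip_address: str    # network address used to communicate with the camera system
    location: str      # room or area monitored by the camera system


front_door_camera = CameraConfiguration(name="front door",
                                        ip_address="192.168.3.66",
                                        location="entrance")
print(front_door_camera)
```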
Those skilled in the art will appreciate that the monitoring system database (300) is not limited to the elements described above.
One or more of the steps described in
Turning to the processing of a request input, in Step 400, a request input is obtained from a user of the monitoring system. The request input may be a spoken user request recorded by the input device, or a text request, as previously described.
In Step 402, the recorded spoken user request is converted to text. Any type of currently existing or future speech-to-text conversion method may be employed to obtain a text string that corresponds to the recorded spoken user request. Step 402 is optional and may be skipped, for example, if the request input was provided as a text.
In Step 404, a database query is formulated based on the text. The database query, in accordance with one or more embodiments of the invention, is a representation of the text, in a form that is suitable for querying the monitoring system database. Accordingly, the generation of the database query may be database-specific. The details regarding the generation of the database query are provided below.
In Step 406, the monitoring system database is accessed using the database query. If the query includes a question to be answered based on content of the monitoring system database, a query result, i.e., an answer to the question, is generated and returned to the user in Step 408A. Consider, for example, a scenario in which a user submits the question "Who was in the living room today?". The monitoring system database, in this scenario, is queried for any moving object that was identified as a person, during a time span limited to today's date. The querying may be performed by analyzing the moving object tags described above.
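As an illustrative sketch of such a query, assuming a simple in-memory tag index keyed by site, camera, and label, the person-related frames for the current day might be selected as follows; the index layout, dates, and helper name are assumptions of the sketch.

```python
from datetime import datetime, time
from typing import Dict, List, Tuple

# hypothetical tag index: (site, camera, label) -> list of (frame_index, timestamp)
TagIndex = Dict[Tuple[str, str, str], List[Tuple[int, datetime]]]


def persons_in_room_today(tags: TagIndex, site: str, camera: str,
                          now: datetime) -> List[Tuple[int, datetime]]:
    """Answer 'Who was in the living room today?': person-tagged frames from the
    living-room camera, restricted to the interval between midnight and now."""
    start_of_day = datetime.combine(now.date(), time.min)
    return [(index, ts) for (index, ts) in tags.get((site, camera, "person"), [])
            if start_of_day <= ts <= now]


tags: TagIndex = {
    ("jeffs_condominium", "living_room", "person"): [
        (7, datetime(2016, 5, 2, 9, 30)),   # this morning -> returned
        (3, datetime(2016, 5, 1, 9, 30)),   # yesterday -> filtered out
    ],
}
print(persons_in_room_today(tags, "jeffs_condominium", "living_room",
                            now=datetime(2016, 5, 2, 18, 0)))
```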
If, alternatively or in addition, the query includes an instruction to update a monitoring system database setting, the monitoring system database is updated in Step 408B. Consider, for example, a scenario in which a user submits the request "Change the camera system's IP address to 192.168.3.66." The monitoring system database, in this scenario, is accessed to update the IP address setting, which may be located in the camera configuration, as previously described.
In Step 410, a determination is made about whether a modification input was obtained. A modification input may be any kind of input that modifies the original request input. If a determination is made that a modification input was provided, the method may return to Step 402 in order to process the modification input. Consider, for example, the originally submitted request input "What did Cassie do today?". As a result, after the execution of Steps 400-408A, the user may receive video frames showing Cassie's activities throughout the day. In the example, the user then submits the modification input "What about yesterday?". The modification input is then interpreted in the context of the originally submitted request. In other words, the method may be repeated for the modification input, while the context established by the original request input is preserved.
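One possible way to interpret a modification input in the context of the original request is to reuse the previously resolved query and override only the fields the modification mentions, as sketched below with a hypothetical query dictionary.

```python
from datetime import date, timedelta
from typing import Dict, Optional


def apply_modification(previous_query: Dict, modification_text: str,
                       today: Optional[date] = None) -> Dict:
    """Re-interpret a follow-up such as 'What about yesterday?' in the context of the
    previously submitted request by overriding only the fields it mentions."""
    today = today or date.today()
    modified = dict(previous_query)                  # start from the original, resolved query
    if "yesterday" in modification_text.lower():
        modified["date"] = today - timedelta(days=1)
    return modified


original_query = {"site": "jeffs_condominium", "object": "cassie", "date": date(2016, 5, 2)}
print(apply_modification(original_query, "What about yesterday?", today=date(2016, 5, 2)))
# {'site': 'jeffs_condominium', 'object': 'cassie', 'date': datetime.date(2016, 5, 1)}
```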
Turning to the generation of the database query, in Step 500, an identifier associated with the request input is obtained. The identifier may, for example, establish the identity of the user that issued the request.
In Step 502, the correct site is identified, based on the identifier. The site to be used in the subsequent steps is the site to which the user belongs. It may be identified based on the moving object tag that was relied upon to validate the user's identity. For example, if user Jeff, in Step 400, issues a user request, and his identity is verified using a moving object tag for a site created for Jeff's condominium, it is the data of this site (Jeff's condominium) that are relied upon in the subsequently discussed steps, whereas data from other sites are not considered.
In Step 504, distinct filtering intents are identified in the text. A distinct filtering intent, in accordance with an embodiment of the invention, may be any kind of content fragment extracted from the text by a text processor. A filtering intent may be obtained, for example, when segmenting the text using an n-gram model. Filtering intents may further be obtained by querying the monitoring system database for regular expressions in the text. Regular expressions may include, but are not limited to, camera names, names of moving and static objects such as names of persons, and various types of background elements such as furniture, doors, and other background elements that might be of relevance and that were therefore registered as static objects in the monitoring system database. Other regular expressions that may be recognized include user names, dates, times, ranges of dates and times, etc. Filtering intents that were obtained in Step 504 are elements of the text that are considered to be "understood", i.e., a database query can be formulated based on their meaning, as further described in Step 514. Those skilled in the art will appreciate that a variety of techniques may be employed to obtain filtering intents, including, but not limited to, n-gram models, keyword matching, regular expressions, recurrent neural networks, long short-term memory networks, etc.
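A rough sketch of filtering intent extraction, assuming small hard-coded vocabularies in place of the monitoring system database and a trained language model, is shown below; the vocabularies, stop words, and function name are illustrative assumptions.

```python
import re
from typing import Dict, List, Tuple

# small hard-coded vocabularies standing in for content of the monitoring system database
KNOWN_OBJECTS = {"dog", "cat", "person", "cassie"}
KNOWN_CAMERAS = {"living room", "front door", "bedroom"}
TIME_WORDS = {"today", "yesterday", "morning"}
STOP_WORDS = {"what", "did", "do", "was", "the", "in", "who", "show", "me"}


def extract_filtering_intents(text: str) -> Tuple[Dict[str, List[str]], List[str]]:
    """Split the request text into recognized filtering intents and leftover
    fragments, which are candidate unknown filtering intents."""
    remainder = text.lower()
    intents: Dict[str, List[str]] = {"camera": [], "object": [], "time": []}
    for camera in KNOWN_CAMERAS:                     # multi-word expressions first
        if re.search(re.escape(camera), remainder):
            intents["camera"].append(camera)
            remainder = remainder.replace(camera, " ")
    tokens = re.findall(r"[a-z0-9']+", remainder)
    for token in tokens:
        if token in KNOWN_OBJECTS:
            intents["object"].append(token)
        elif token in TIME_WORDS:
            intents["time"].append(token)
    unknown = [t for t in tokens
               if t not in KNOWN_OBJECTS | TIME_WORDS | STOP_WORDS]
    return intents, unknown


print(extract_filtering_intents("What did Cassie do today?"))
# ({'camera': [], 'object': ['cassie'], 'time': ['today']}, [])
```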
In the subsequent steps, e.g., Steps 506-512, a validation of the obtained filtering intents is performed. The validation includes determining whether, within the context of the known site, all filtering intents are understood and make sense.
In Step 506, a determination is made about whether the text includes an unknown filtering intent. An unknown filtering intent, in accordance with an embodiment of the invention, is a filtering intent that, after execution of Step 504, remains unresolved, and is therefore “not understood”, thus preventing the generation of a database query. An unknown filtering intent may be, for example, a single word (e.g., an unknown name), a phrase, or an entire sentence. An unknown filtering intent may be a result of the spoken user request including content that, although properly converted to text in Step 402, could not be entirely processed in Step 504. In this scenario, the actual spoken request contained content that could not be resolved. Alternatively, the spoken user request may include only content that could have been entirely processed in Step 504, but an erroneous speech-to-text conversion in Step 402 resulted in a text that included the unknown filtering intent.
If no unknown filtering intent is detected in Step 506, the method may directly proceed to Step 514. If a determination is made that an unknown filtering intent exists, the method may proceed to Step 508.
In Step 508, a determination is made about whether the unknown filtering intent is obtainable from the monitoring system database. In one embodiment of the invention, the monitoring system database may be searched for the unknown filtering intent. In this search, database content beyond the regular expressions already considered in Step 504 may be considered. In one embodiment of the invention, the data considered in Step 508 are limited to data specific to the site that was identified in Step 502.
If a determination is made that the monitoring system database includes the unknown filtering intent, in Step 510, the unknown filtering intent is resolved using the content of the monitoring system database. Consider, for example, the previously discussed user request "Change the camera system's IP address to 192.168.3.66," and further assume that the entire sentence was correctly converted to text, using the speech-to-text conversion in Step 402. In addition, assume that, in Step 504, the text was segmented into filtering intents, with only the term "IP address" not having been resolved. In this scenario, in Step 508, the entire monitoring system database is searched, and as a result an "IP address" setting is detected in the camera configuration. The unknown filtering intent "IP address" is thus resolved. Sanity checks may be performed to verify that the resolution is meaningful. In the above example, the sanity check may include determining that the format of the IP address in the user-provided request matches the format of the IP address setting in the monitoring system database. In addition, or alternatively, the user may be asked for confirmation.
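A simplified sketch of this resolution step, including the sanity check on the IP address format, is shown below; the settings dictionary, the IPv4 pattern, and the function name are assumptions made for illustration.

```python
import re
from typing import Dict, Optional

# hypothetical per-site settings standing in for the camera configuration in the metadata archive
SITE_SETTINGS: Dict[str, Dict[str, str]] = {
    "jeffs_condominium": {"ip address": "192.168.1.20", "camera location": "living room"},
}

IPV4_PATTERN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def resolve_unknown_intent(site: str, unknown_term: str,
                           new_value: Optional[str] = None) -> Optional[str]:
    """Look up an unresolved term among the site's stored settings; when the request
    supplies a new value (e.g., an IP address), apply a simple sanity check that the
    value's format matches the setting being changed."""
    settings = SITE_SETTINGS.get(site, {})
    key = unknown_term.lower()
    if key not in settings:
        return None                                  # still unresolved: ask the user instead
    if key == "ip address" and new_value is not None and not IPV4_PATTERN.match(new_value):
        raise ValueError(f"'{new_value}' does not look like an IP address")
    return key                                       # resolved against the database


print(resolve_unknown_intent("jeffs_condominium", "IP address", "192.168.3.66"))  # ip address
```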
Returning to Step 508, if a determination is made that the unknown filtering intent is not obtainable from the monitoring system database, the method may proceed to Step 512, where the unknown filtering intent is resolved based on a user-provided clarification. The details of Step 512 are provided below.
Those skilled in the art will appreciate that above-described Steps 506-512 may be repeated if multiple unknown filtering intents were detected, until all filtering intents are resolved.
In Step 514, the database query is composed based on the filtering intents.
Depending on the user request, the complexity of the database query may vary. For example, a simple database query may be directed to merely retrieving all video frames that are tagged as including a person seen by the monitoring system. A more complex database query may be directed to retrieving all video frames that include the person, but only for a particular time interval. Another database query may be directed to retrieving all video frames that include the person, when the person performs a particular action. Other database queries may update settings in the database, without retrieving content from the database. In one or more embodiments of the invention, the database query further specifies the site identified in Step 502. A variety of use cases that include various database queries are discussed below. The database query, in one or more embodiments of the invention, is in a format compatible with the organization of the metadata archive of the monitoring system database. Specifically, the database query may be in a format that enables the identification of moving object tags and/or action tags that match the query. Further, the query may be in a format that also enables the updating of the metadata archive, including, but not limited to, the moving object definitions, the action definitions, the static object definitions, and the camera configuration.
In Step 602, a user clarification is obtained. The user clarification may be either a spoken user clarification or a clarification provided via a selection in a video frame.
The spoken user clarification may be obtained, analogous to Step 400, by recording the user's spoken response to a clarification request issued by the monitoring system. Consider, for example, a user request that includes the name "Lucky", which is not yet registered in the monitoring system database and therefore constitutes an unknown filtering intent. The monitoring system may ask the user to clarify who "Lucky" is, and the user may respond that Lucky is the dog, thereby resolving the previously unknown filtering intent.
The clarification provided via selection in a video frame may be obtained as follows. Consider the user request "Who came through the front door?", and further assume that the term "front door" is not yet registered as a static object in the metadata archive. Accordingly, the term "front door" is an unknown filtering intent. To resolve the unknown filtering intent, the user may select the front door in a video frame that shows the front door, e.g., by marking the front door using the touchscreen interface of the user's smartphone. The selection of the front door establishes an association of the term "front door" with image content that represents the front door, in the archived video streams, thus resolving the previously unknown filtering intent "front door".
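A minimal sketch of how such a selection might be turned into a new static object definition is shown below; the StaticObjectDefinition fields and the bounding-box representation are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class StaticObjectDefinition:
    """Camera-specific definition of a static object, created from a user selection."""
    name: str
    camera_name: str
    frame_index: int
    bounding_box: Tuple[int, int, int, int]   # (x, y, width, height) selected on a touchscreen


def register_static_object(definitions: Dict[str, StaticObjectDefinition],
                           name: str, camera_name: str, frame_index: int,
                           bounding_box: Tuple[int, int, int, int]) -> StaticObjectDefinition:
    """Associate a previously unknown term (e.g., 'front door') with the selected image
    region so that future requests using the term can be resolved without clarification."""
    definition = StaticObjectDefinition(name, camera_name, frame_index, bounding_box)
    definitions[name] = definition
    return definition


static_objects: Dict[str, StaticObjectDefinition] = {}
register_static_object(static_objects, "front door", "front_door_camera", 512, (40, 10, 120, 260))
print(static_objects["front door"])
```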
In Step 606, the monitoring system database may be updated to permanently store the newly resolved filtering intent. In the above examples, the dog's name "Lucky" may be stored in the moving object definition for the dog, and/or a new static object definition may be generated for the front door. Thus, future queries that include the name "Lucky" and/or the term "front door" can be directly processed without requiring a clarification request.
The use case scenarios described below are intended to provide examples of the user requests that may be processed using the methods described above.
(i) Owner requests: "Show me what was going on today." When the user request is received, the monitoring system database is queried to determine that the request was issued by Jeff, and that Jeff is associated with the site "Jeff's condominium". Accordingly, only data associated with the site "Jeff's condominium" are considered. The user request, when processed using the previously described methods, is segmented into a syntactic element for a requested activity ("show me"), an unspecific activity, i.e., any kind of activity ("what was going on"), and a time frame ("today"). Note that even though the syntactic elements convey the message of the request, the actual vocabulary used as syntactic elements may be different, without departing from the invention. Next, a database query is formulated that, when submitted, results in the non-selective retrieval of any activity captured anywhere on the site, for the specified time range ("today"). Specifically, the database query specifies that video frames are to be retrieved from any video stream, regardless of the location of the camera system that provided the video stream, and that the time frame is limited to the interval between midnight and the current time. The retrieval may be performed through identification of all tags in the database that meet these limitations. For example, all moving object tags and all action tags may be considered. Based on these tags, the video frames that these tags refer to are retrieved from the video archive, and a summary video that includes all or at least some of these video frames is generated and returned to the owner.
(ii) Owner asks: “What happened in the living room throughout the day?” This user request, in comparison to request (i) includes an additional constraint. Specifically, only activity that occurred in the living room is to be reported. This additional constraint translates into the database query including a limitation that specifies that only activity captured in the living room is to be considered. Accordingly, only tags for the video stream(s) provided by the camera system installed in the living room are considered. A summary video is thus generated that only includes activity that occurred in the living room, throughout the day.
(iii) Owner asks: “What was the dog doing in the morning?” This user request, unlike the requests (i) and (ii) specifies a particular object of interest (the dog). Accordingly, only tags for the dog are considered. These tags may be moving object tags, with the dog being a specific moving object. Further, the specified time frame is limited to “in the morning”. Accordingly, the database may be queried using a time limitation such as between 12:00 midnight and 12:00 noon, today. A summary video is then generated that only includes video frames in which the dog is present, regardless of the camera that captured the dog, in a time interval between midnight and noon.
(iv) Owner asks: “Was Lucy in the bedroom today?” This user request specifies a name and therefore requires name resolution in order to properly respond to the request. Thus, when formulating the database query the unknown syntactic element “Lucy” is detected. The unknown syntactic element is then resolved using the monitoring system database, based on the association of the name “Lucy” with the moving object “cat”. Based on this association, the syntactic element “Lucy” is no longer unknown, and a complete database query can therefore be submitted. The query may include the term “Lucy” or “cat”, as they are equivalent.
(v) Owner asks: "Did Lucky jump on the couch?" This request not only requires the resolution of the name "Lucky" as described in use case (iv), but it also requires an interaction of a moving object (Lucky, the dog) with a static object (the couch). Such an interaction, if found in the archived video streams, may be marked using action tags, stored in the metadata archive of the monitoring system database. Accordingly, the database query, in the monitoring system database, triggers a search for action tags that identify the video frames in which the dog was seen jumping onto the couch.
(vi) Owner asks: "When was the dog sitter here?" This user request requires the resolution of the term "dog sitter". While the dog sitter is a person known to the monitoring system, the term "dog sitter" has not been associated with the recognized person. Accordingly, the monitoring system, whenever the dog sitter appears, merely generates tags for the same unknown person. The term "dog sitter" can therefore not be resolved using the monitoring system database. Accordingly, the owner is requested to clarify the term "dog sitter". The owner, in response, may select, in a video frame or in a sequence of video frames displayed on the owner's smartphone, the unknown person, to indicate that the unknown person is the dog sitter. An association between the detected unknown person and the term "dog sitter" is established and stored in the monitoring system database, thus enabling resolution of requests that include the term "dog sitter".
(vii) Owner requests: "Change camera location to "Garage"." This user request involves updating a setting in the monitoring system database. The owner may want to change the camera location, for example, because he decided to move the camera from one room to another room. The update of the camera location is performed by overwriting the current camera location in the camera configuration, stored in the metadata archive. The updated camera location may then be relied upon, for example, when a request is issued that is directed to activity in the garage.
Embodiments of the invention enable the interaction of users with a monitoring system using speech commands and/or requests. Natural spoken language, as if addressing another person, may be used, thus not requiring the memorization and use of a particular syntax when communicating with the monitoring system. The interaction using spoken language may be relied upon for both the regular use and the configuration of the monitoring system. The regular use includes, for example, the review of activity that was captured by the monitoring system. The speech interface, in accordance with one or more embodiments of the invention, simplifies the use and configuration of the monitoring system because a user no longer needs to rely on a complex user interface that would potentially require extensive multi-layer menu structures to accommodate all possible user commands and requests. The speech interface thus increases user-friendliness and dramatically reduces the need for a user to familiarize herself with the user interface of the monitoring system.
Embodiments of the invention are further configured to be interactive, thus requesting clarification if an initial user request is not understood. Because the monitoring system is configured to memorize information learned from a user providing a clarification, the speech interface's ability to handle increasingly sophisticated requests that include previously unknown terminology will continuously develop.
Embodiments of the technology may be implemented on a computing system. Any combination of mobile, desktop, server, embedded, or other types of hardware may be used. For example, the computing system (700) may include one or more computer processor(s), associated memory, one or more storage device(s), input and output devices, and a communication interface connecting the computing system to a network (712).
Software instructions in the form of computer readable program code to perform embodiments of the technology may be stored, in whole or in part, temporarily or permanently, on a non-transitory computer readable medium such as a CD, DVD, storage device, a diskette, a tape, flash memory, physical memory, or any other computer readable storage medium. Specifically, the software instructions may correspond to computer readable program code that, when executed by a processor(s), is configured to perform embodiments of the technology.
Further, one or more elements of the aforementioned computing system (700) may be located at a remote location and connected to the other elements over a network (712). Further, embodiments of the technology may be implemented on a distributed system having a plurality of nodes, where each portion of the technology may be located on a different node within the distributed system. In one embodiment of the technology, the node corresponds to a distinct computing device. Alternatively, the node may correspond to a computer processor with associated physical memory. The node may alternatively correspond to a computer processor or micro-core of a computer processor with shared memory and/or resources.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein. Accordingly, the scope of the invention should be limited only by the attached claims.