Entities are increasingly adopting electronic displays to increase the versatility of signage. For example, electronic displays may be used to display content for advertising, guidance, and public awareness, among a wide variety of other applications. In particular, electronic displays enable the displayed content to be changed quickly, such as a rotating series of advertisements, rather than the static content of a traditional non-electronic display such as a poster or billboard. However, a persistent challenge is determining what kind of content should be displayed to optimize the effectiveness of the display. This challenge is further complicated by the many variables that may be present. For example, the optimal display content may differ depending on time of day, weather conditions, viewer demographics, and various other variables, some of which may even be difficult to define. Thus, current technology does not enable the full performance potential of the dynamic nature of electronic display technology.
Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:
In the following description, various embodiments will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.
Systems and methods in accordance with various embodiments of the present disclosure may overcome one or more of the aforementioned and other deficiencies experienced in conventional approaches for using electronic displays. In particular, various embodiments provide systems for optimizing content to be displayed on an electronic display. Various embodiments enable detection of certain conditions of an environment or scene (e.g., viewer demographics, weather conditions, traffic conditions) and selection of display content based at least in part on the detected conditions. Specifically, systems and methods provided herein enable the detection of objects appearing in a scene captured by an image sensor such as a camera. The detected objects may be classified as belonging to one or more object types, and display content can be selected based on the object types of the objects appearing in the scene. For example, the system may detect a group of boys estimated to be teenagers appearing in the scene and select content to display that is likely to appeal to teenage boys. The system may subsequently detect an adult female entering the scene and update the display to present content that may be more likely to appeal to the adult female. Other scenarios and conditions may be taken into account, such as combinations of object types, number of objects, and travel direction of objects, among others.
Additionally, various embodiments enable the systems to learn over time what content most optimally drives a certain performance measure under certain conditions (akin to A/B testing), thus enabling the system to optimally select content to be displayed under such conditions. A camera or other type of image sensor can be used to capture image data of a field of view containing the environment or scene, including various conditions. For example, a candidate content item intended to attract people to enter a store may be displayed for a certain period of time, and image data of a scene may be analyzed during that period to determine how many people entered the store. A second candidate content item may then be displayed and the number of people entering the store can again be detected. Thus, it can be determined which of the candidate content items is more effective, as illustrated in the sketch below. Various other functions and advantages are described and suggested below as may be provided in accordance with various embodiments.
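As a minimal, non-limiting illustration of this comparison, consider the following sketch. The function names (show_content, get_frame, count_entries) are hypothetical hooks standing in for the display control and image analysis described herein, not a prescribed implementation.

```python
# Minimal sketch of comparing two candidate content items by one
# performance measure (store entries while each item is displayed).
# show_content, get_frame, and count_entries are hypothetical hooks.
import time

def measure_entries(content_item, duration_s, show_content, get_frame, count_entries):
    """Show a content item for duration_s seconds and count store entries."""
    show_content(content_item)
    entries = 0
    end = time.time() + duration_s
    while time.time() < end:
        entries += count_entries(get_frame())  # e.g., people crossing the doorway
    return entries

def more_effective(item_a, item_b, duration_s, **hooks):
    """Return whichever candidate item produced more entries in its period."""
    a = measure_entries(item_a, duration_s, **hooks)
    b = measure_entries(item_b, duration_s, **hooks)
    return item_a if a >= b else item_b
```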
The dynamic nature of electronic displays provides the potential for optimal utilization of the content real estate of the display. However, as mentioned, current technology does not enable such a potential to be reached. Embodiments of the present disclosure aim to improve utilization of electronic displays by learning what content should be displayed based on image data captured of a field of view for driving a certain performance measure such as number of visitors to an establishment, number of sales, time spent looking at the display, among others. Conventional image or video analysis approaches may require the captured image or video data to be transferred to a server or other remote system for analysis. As mentioned, this requires significant bandwidth and causes the data to be analyzed offline and after the transmission, which prevents actions from being initiated in response to the analysis in near real time. Further, in many instances it will be undesirable, and potentially unlawful, to collect information about the locations, movements, and actions of specific people. Thus, transmission of the video data for analysis may not be a viable solution. There are various other deficiencies to conventional approaches to such tasks as well.
Accordingly, approaches in accordance with various embodiments provide systems, devices, methods, and software, among other options, that can provide for the near real time detection of a scene and/or specific types of objects within the scene, as may include people, vehicles, products, and the like, and can determine content to be displayed on an electronic display based on the detected objects, in a way that requires minimal storage and bandwidth and does not disclose information about the persons represented in the captured image or video data, unless otherwise instructed or permitted. In one example, the detected objects may include people in viewing proximity of the display (i.e., viewers), and the content displayed may be determined based at least in part on certain detected characteristics of the people. In various embodiments, machine learning techniques are utilized to learn the optimal content to display in order to drive a performance measure based on detected conditions of the scene and/or detected types of objects. Various other approaches and advantages will be appreciated by one of ordinary skill in the art in light of the teachings and suggestions contained herein.
The image sensors 104, 106 may each have a field of view and capture image data representing the respective field of view or a scene within the field of view. In various embodiments, the image sensors 104, 106 are a pair of cameras 104, 106 useful in capturing two sets of video data with partially overlapping fields of view which can be used to provide stereoscopic video data. In various embodiments, the cameras 104, 106 are positioned at an angle such that when the content display device 100 is positioned in a conventional orientation, with the front face 108 of the device being substantially vertical, the cameras 104, 106 will capture video data for items positioned in front of, and at the same height or below, the position of the cameras. As known for stereoscopic imaging and as discussed in more detail elsewhere herein, the cameras can be configured such that their separation and configuration are known for disparity determinations. Further, the cameras can be positioned or configured to have their primary optical axes substantially parallel and the cameras rectified to allow for accurate disparity determinations. It should be understood, however, that devices with a single camera or more than two cameras can be used as well within the scope of the various embodiments, and that different configurations or orientations can be used as well. Various other types of image sensors can be used as well in different devices.
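For reference, disparity-based distance estimates from such a rectified, parallel-axis camera pair follow the standard stereo relation Z = f·B/d. The sketch below applies that relation with illustrative values for focal length, baseline, and disparity; these numbers are assumptions for the example, not properties of any particular device.

```python
# Standard rectified-stereo depth relation used for disparity determinations.

def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth (meters) of a point from its pixel disparity between the
    left and right rectified images: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a visible point")
    return focal_length_px * baseline_m / disparity_px

# Example: 700 px focal length, 10 cm camera separation, 35 px disparity
print(depth_from_disparity(35, 700.0, 0.10))  # -> 2.0 meters
```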
The electronic display 102 may be directed in generally the same or an overlapping direction as the cameras 104, 106 and is configured to display various content as determined by the content display device 100. The electronic display 102 may be any type of display device capable of displaying content, such as a liquid crystal display (LCD), light-emitting diode (LED) display, organic light-emitting diode (OLED) display, cathode ray tube (CRT), electronic ink (i.e., electronic paper) display, 3D swept volume display, holographic display, laser display, or projection-based display, among others. In some embodiments, the electronic display 102 may be replaced by a mechanical display, such as a rotating display or trivision display, among others.
In various embodiments, the content display device 100 further includes one or more LEDs or other status lights that can provide basic communication to a technician or other observer of the device to indicate a state of the device. For example, in situations where it is desirable to have people be aware that they are being detected or tracked, it may be desirable to cause the device to have bright colors, flashing lights, etc. The example content display device 100 also has a set 110 of display lights, such as differently colored light-emitting diodes (LEDs), which can be off in a normal state to minimize power consumption and/or detectability in at least some embodiments. If required by law, at least one of the LEDs might remain illuminated, or flash, while active to indicate to people that they are being monitored. The LEDs 110 can be used at appropriate times, such as during installation or configuration, troubleshooting, or calibration, for example, as well as to indicate when there is a communication error or other such problem to an appropriate person. The number, orientation, placement, and use of these and other indicators can vary between embodiments. In one embodiment, the LEDs can provide an indication during installation of power, communication signal (e.g., LTE) connection/strength, wireless communication signal (e.g., WiFi or Bluetooth) connection/strength, and error state, among other such options.
The memory on the content display device 100 may include various types of storage elements, such as random access memory (e.g., DRAM) for temporary storage and persistent storage (e.g., solid state drives, hard drives). In at least some embodiments, the memory can have sufficient capacity to store a certain number of frames of video content from both cameras 104, 106 for analysis. In various embodiments, the frames are discarded or deleted from memory immediately upon analysis thereof. In a subset of embodiments, the persistent storage may have sufficient capacity to store a limited amount of video data, such as video for a particular event or occurrence detected by the device. In a further subset of such embodiments, the persistent storage has insufficient capacity to store lengthy periods of video data, which can prevent hacking of, or inadvertent access to, video data including representations of the people contained within the field of view of the cameras during the period of recording. By limiting the capacity of the storage to only the minimal amount of video data needed to perform video processing, the amount of data that could be compromised is minimal as well, which provides increased privacy in contrast to systems that store a larger amount of data.
In various embodiments, a processor on the content display device 100 analyzes the image data captured by the cameras 104, 106 to make various determinations regarding display content. The processor may access the image data (e.g., frames of video content) from memory as the image data is created and process and analyze the image data in real-time or near real-time. Real-time as used herein can refer to a processing sequence in which data is processed as soon as a designated computing resource is available and may be subject to various real-time constraints, such as hardware constraints, computing constraints, design constraints, and the like. In some embodiments, the processor may access and process every frame in a sequence of frames of video content. In some other embodiments, the processor may access and process every nth frame for analysis purposes. In some embodiments, the image data is deleted from memory as soon as it is analyzed, or as soon as the desired information is extracted from the image data. For example, analysis of the image data may include extracting certain features of the image data, such as those forming a feature vector, and the image data may be deleted as soon as the features are determined. This way, the actual image data, which may include more information than needed, can be deleted, and the extracted features can be used for further analysis.
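A minimal sketch of this privacy-oriented processing loop is shown below. Here extract_features stands in for whatever feature extractor a given embodiment uses, and the frame-skipping interval is an illustrative choice rather than a requirement.

```python
# Sketch of the loop described above: every nth frame is read, a feature
# vector is extracted, and the raw frame is discarded immediately.

def process_stream(frames, n=5, extract_features=None):
    """Extract features from every nth frame and retain only the features."""
    feature_vectors = []
    for i, frame in enumerate(frames):
        if i % n != 0:
            continue                            # skip frames not selected for analysis
        features = extract_features(frame)      # e.g., a compact feature vector
        feature_vectors.append(features)
        del frame                               # raw image data is not retained
    return feature_vectors
```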
In various embodiments, the extraction of features from the image data is performed on the processor which is local to the content display device 100. For example, the processor may be contained in the same device body as the memory and the cameras, which may be communicable with each other via hardwired connections. Thus, the image data is processed within the content display device 100 and is not transmitted to any other device, which reduces the likelihood of the image data being compromised. Additionally, as the image data is processed and subsequently deleted in real-time (or near real-time) as it is generated by the cameras 104, 106, the likelihood of the image data being compromised is further reduced, as the image data may only exist for a very short period of time.
Image data captured by the cameras 104, 106 and content displayed on the display 102 can be related in several ways. In one example, the image data can be analyzed to determine an effectiveness of displayed content, akin to performing A/B testing of various content. In this example, the image data may include information regarding a performance measure and can be analyzed to determine a value of the performance measure. For example, the content display device 100 may be placed in a display window of a store. A first content (e.g., advertisement, message) may be displayed on the display for a first period of time. The cameras 104, 106 may capture a field of view near the entrance, such that the image data can be analyzed to determine how many people walked by the store, and the same or additional cameras may capture image data used to determine how many people entered the store. In this scenario, the performance measure may be the ratio between how many people entered the store and how many people walked by the store. Thus, a first value of the performance measure can be determined for the first content. This performance measure may be interpreted as an effectiveness of the content displayed on the display. Accordingly, a second content, different at least in some respect from the first content, may be displayed for a second period of time, the cameras 104, 106 can capture the same field of view, and a ratio between the number of people who entered the store and the number of people walking by the store can be determined from the image data to determine a second value of the performance measure for the second content. Various other factors may be held constant such that the difference between the first value and the second value of the performance measure can be reasonably attributed to the first and second content. Thus, one of the first and second content can be determined to be more effective than the other for getting people to enter the store based on the first and second values of the performance measure. These techniques can be used to determine the best content item to display from a group of content items.
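For example, the ratio-based performance measure described above might be computed and compared as in the following sketch; the counts shown are hypothetical values for illustration only.

```python
# Sketch of the performance measure described above: the ratio of people
# who entered the store to people who walked past while a given content
# item was displayed. Counts are assumed to come from the image analysis.

def conversion_ratio(entered_count, passed_count):
    """Entered-to-passed ratio; zero if nobody walked by."""
    return entered_count / passed_count if passed_count else 0.0

# Hypothetical counts gathered during two comparable display periods
ratio_first = conversion_ratio(entered_count=18, passed_count=240)   # ~0.075
ratio_second = conversion_ratio(entered_count=31, passed_count=250)  # ~0.124
more_effective = "second" if ratio_second > ratio_first else "first"
```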
The above techniques may be performed for additional content options, under various other conditions, and for various types of performance measures, such that optimal content can be determined for the respective conditions. Certain machine learning techniques such as neural networks may be used. A condition refers to an additional factor beyond the content displayed that may have an effect on the performance measure, and which may affect the optimal content. Example types of conditions include object oriented conditions such as a type of object identified in the representation of the scene as captured by the cameras 104, 106, a combination of objects identified in the representation of the scene, a number of objects detected in the representation of the scene, a movement path of one or more objects detected in the scene, environmental conditions such as weather, or one or more characteristics detected in the scene from analyzing image data from the cameras. These and other conditions are examples of conditions that can be detected using the cameras 104, 106. There may be additional conditions that are not image based, such as time of day, time of year, current events, and so forth. Thus, certain such conditions may be associated with the period of time for which a value of a performance measure is obtained, such that the optimal display content can be determined given such conditions. A performance measure may refer to any qualitative or quantitative measure of performance, including positive measures where a high value is desirable or negative measures where a low value is desirable. Additional examples of performance measures may include number of sales made, number of interactions with the display where the display is an interactive display, number of website visits, and number of people who look at the display, which can be determined using image data from the cameras 104, 106, among many others.
As mentioned, in addition to determining the effectiveness of display content and determining the best content to display from a group of content items, the content display device can further determine optimal content to display given certain current conditions. For example, the cameras 104, 106 may capture image data. The image data may be analyzed by the local processor in real-time to detect a representation of a scene. One or more conditions may be determined based on the representation of the scene, such as one or more types of objects present in the scene, weather conditions, etc. In some embodiments, one or more feature values may be determined from the representation of the scene to determine a representation of one or more objects, such as humans, animals, vehicles, etc. The representations of the objects may be classified using an object identification model to determine the type of object present. For example, the object identification model may contain one or more sub-models associating feature vectors with a plurality of respective object types. Thus, by analyzing the feature vector of a representation of an object, the object identification model may identify the object as belonging to one or more types. The object identification model may be a machine learning based model, such as one including one or more neural networks that have been trained on training data for identifying and/or classifying image data into one or more object types. In some embodiments, the model may include a neural network for each object type. Additionally, an optimization model may be stored in memory, which has been trained to determine the best content to display based on one or more given conditions. For example, the optimization model may determine the content to display based on the type of object identified in the image data. For instance, an object detected in the image data may be identified as a woman with a stroller, and the content may be determined accordingly, such as an advertisement for baby clothes. In various embodiments, the number of objects detected, or a group comprising objects of one or more different types, may also be taken into consideration by the model in determining the content.
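A highly simplified sketch of this selection step is shown below, with the trained optimization model reduced to a score table keyed by (object type, content item). The object type names, content identifiers, and scores are hypothetical and stand in for the learned model described above.

```python
# Simplified sketch: detected object types are mapped to the content item
# with the highest expected performance under those conditions.

def select_content(detected_types, score_table, content_items):
    """Pick the content item with the highest expected performance given
    the set of detected object types."""
    def expected(item):
        scores = [score_table.get((t, item), 0.0) for t in detected_types]
        return sum(scores) / len(scores) if scores else 0.0
    return max(content_items, key=expected)

# Hypothetical learned scores (e.g., historical conversion ratios)
score_table = {
    ("woman_with_stroller", "baby_clothes_ad"): 0.31,
    ("woman_with_stroller", "sports_car_ad"): 0.04,
    ("teenage_group", "sneaker_ad"): 0.27,
}
print(select_content({"woman_with_stroller"}, score_table,
                     ["baby_clothes_ad", "sports_car_ad", "sneaker_ad"]))
```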
In various embodiments, the abovementioned optimization model may be a machine learning based model, such as one including one or more neural networks that have been trained using training data. The training data may include a plurality of sets of training data, in which each set of training data represents one data point. For example, one set of training data may include a value of the performance measure, a condition (e.g., detected object type, weather, time of day), and a displayed content item. In other words, the set of training data represents the value of the performance measure associated with the combination of the displayed content item and the condition, or the effectiveness of the displayed content item under the condition. Thus, given a large number of sets of training data, the model, through classification or regression, can determine the optimal content to display given queried (i.e., currently detected) conditions in order to optimize for one or more performance measures.
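As a simplified, non-limiting illustration of this training data layout, the sketch below replaces the neural network with a plain lookup that averages observed performance per (condition, content) pair; the condition names, content identifiers, and performance values are hypothetical.

```python
# Each training record couples a condition, a displayed content item, and
# the observed performance value; a simple aggregation stands in for the
# regression/classification model described above.
from collections import defaultdict

training_data = [
    {"condition": "rainy", "content": "hot_chocolate_ad", "performance": 0.12},
    {"condition": "rainy", "content": "sunglasses_ad",    "performance": 0.02},
    {"condition": "sunny", "content": "sunglasses_ad",    "performance": 0.09},
]

def fit_lookup(records):
    """Average the performance measure per (condition, content) pair."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        key = (r["condition"], r["content"])
        sums[key] += r["performance"]
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

def best_content_for(condition, model):
    """Return the content item with the highest average performance."""
    candidates = [(v, c) for (cond, c), v in model.items() if cond == condition]
    return max(candidates)[1] if candidates else None

model = fit_lookup(training_data)
print(best_content_for("rainy", model))   # -> "hot_chocolate_ad"
```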
In various embodiments, the content display device 100 may further include a housing 112 or device body, in which the display 102 makes up a front face 108 of the device housing 112 and the processor and the memory are located within the device housing 112. The cameras 104, 106 may be positioned proximate the front face 108 and have a field of view, wherein the display 102 faces the field of view. In some embodiments, the cameras are located at least partially within the housing 112 and a light-capturing component of each of the cameras 104, 106 is exposed to the field of view so as to capture image data representing a scene in the field of view.
The object detection device 204 may be at least partially embedded in the electronic display 202, for example such that a front face 210 of the object detection device 204 is flush with a front face 210 of the display 202. In some other embodiments, the object detection device 204 may be external but local to the electronic display 202 and mounted to the display 202, such as on top, on bottom, on a side, in front of, and so forth. The object detection device 204 may be communicatively coupled to the electronic display 202 via wired or wireless communications.
The content display system 400 can include at least one display 410, such as the display 102 described above.
The content display system 400 can include various other components, including those shown and not shown, that might be included in a computing device as would be appreciated by one of ordinary skill in the art. This can include, for example, at least one power component 414 for powering the device. This can include, for example, a primary power component and a backup power component in at least one embodiment. For example, a primary power component might include power electronics and a port to receive a power cord for an external power source, or a battery to provide internal power, among solar and wireless charging components and other such options. The device might also include at least one backup power source, such as a backup battery, that can provide at least limited power for at least a minimum period of time. The backup power may not be sufficient to operate the device for lengthy periods of time, but may allow for continued operation in the event of power glitches or short power outages. The device might be configured to operate in a reduced power state, or operational state, while utilizing backup power, such as to only capture data without immediate analysis, or to capture and analyze data using only a single camera, among other such options. Another option is to turn off (or reduce) communications until full power is restored, then transmit the stored data in a batch to the target destination. As mentioned, in some embodiments the device may also have a port or connector for docking with the mounting bracket to receive power via the bracket.
The system can have one or more network communications components 420, or sub-systems, that enable the device to communicate with a remote server or computing system. This can include, for example, a cellular modem for cellular communications (e.g., LTE, 5G, etc.) or a wireless modem for wireless network communications (e.g., WiFi for Internet-based communications). The system can also include one or more components 418 for “local” communications (e.g., Bluetooth) whereby the device can communicate with other devices within a given communication range of the device. Examples of such subsystems and components are well known in the art and will not be discussed in detail herein. The network communications components 420 can be used to transfer data to a remote system or service, where that data can include information such as count, object location, and tracking data, among other such options, as discussed herein. The network communications component can also be used to receive instructions or requests from the remote system or service, such as to capture specific video data, perform a specific type of analysis, or enter a low power mode of operation, etc. A local communications component 418 can enable the device to communicate with other nearby detection devices or a computing device of a repair technician, for example. In some embodiments, the device may additionally (or alternatively) include at least one input 416 and/or output, such as a port to receive a USB, micro-USB, FireWire, HDMI, or other such hardwired connection. The inputs can also include devices such as keyboards, push buttons, touch screens, switches, and the like.
The illustrated detection device also includes a camera subsystem 422 that includes a pair of matched cameras 424 for stereoscopic video capture and a camera controller 426 for controlling the cameras. Various other subsystems or separate components can be used as well for video capture as discussed herein and known or used for video capture. The cameras can include any appropriate camera, as may include a complementary metal-oxide-semiconductor (CMOS), charge coupled device (CCD), or other such sensor or detector capable of capturing light energy over a determined spectrum, as may include portions of the visible, infrared, and/or ultraviolet spectrum. Each camera may be part of an assembly that includes appropriate optics, lenses, focusing elements, shutters, and other such elements for image capture by a single camera, set of cameras, stereoscopic camera assembly including two matched cameras, or other such configuration. Each camera can also be configured to perform tasks such as autofocusing, zoom (optical or digital), brightness and color adjustments, and the like. The cameras 424 can be matched digital cameras of an appropriate resolution, such as may be able to capture HD or 4K video, with other appropriate properties, such as may be appropriate for object recognition. Thus, high color range may not be required for certain applications, with grayscale or limited colors being sufficient for some basic object recognition approaches. Further, different frame rates may be appropriate for different applications. For example, thirty frames per second may be more than sufficient for tracking person movement in a library, but sixty frames per second may be needed to get accurate information for a highway or other high speed location. As mentioned, the cameras can be matched and calibrated to obtain stereoscopic video data, or at least matched video data that can be used to determine disparity information for depth, scale, and distance determinations. The camera controller 426 can help to synchronize the capture to minimize the impact of motion on the disparity data, as different capture times would cause some of the objects to be represented at different locations, leading to inaccurate disparity calculations.
The example content display system 400 also includes a microcontroller 406 to perform specific tasks with respect to the device. In some embodiments, the microcontroller can function as a temperature monitor or regulator that can communicate with various temperature sensors (not shown) on the board to determine fluctuations in temperature and send instructions to the processor 404 or other components to adjust operation in response to significant temperature fluctuation, such as to reduce operational state if the temperature exceeds a specific temperature threshold or resume normal operation once the temperature falls below the same (or a different) temperature threshold. Similarly, the microcontroller can be responsible for tasks such as power regulation, data sequencing, and the like. The microcontroller can be programmed to perform any of these and other tasks that relate to operation of the detection device, separate from the capture and analysis of video data and other tasks performed by the primary processor 404.
In this example, the cameras capture video data which can then be processed by at least one processor on the detection device. The object recognition process can detect objects in the video data and then determine which of the objects correspond to objects of interest, in this example corresponding to people. The process can then determine a location of each person, such as by determining a boundary, centroid location, or other such location identifier. The process can then provide this data as output, where the output can include information such as an object identifier, which can be assigned to each unique object in the video data, a timestamp for the video frame(s), and coordinate data indicating a location of the object at that timestamp. In one embodiment, a location (x, y, z) and timestamp (t) can be generated, as well as a set of descriptors (d1, d2, . . . ) specific to the object or person being detected and/or tracked. Object matching across different frames within a field of view, or across multiple fields of view, can then be performed using a multidimensional vector (e.g., x, y, z, t, d1, d2, d3, . . . ). The coordinate data can be relative to a coordinate of the detection device or relative to a coordinate set or frame of reference previously determined for the detection device. Such an approach enables the number and location of people in the region of interest to be counted and tracked over time without transmitting, from the detection device, any personal information that could be used to identify the individual people represented in the video data. Such an approach maintains privacy and prevents violation of various privacy or data collection laws, while also significantly reducing the amount of data that needs to be transmitted from the detection device.
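The sketch below illustrates one possible form of the per-object output record and a simple cross-frame association based on distance over the multidimensional vector; the weights and threshold are illustrative assumptions, not values prescribed by this disclosure.

```python
# Per-object output record (id, position, timestamp, descriptors) and a
# simple nearest-neighbor match across frames. Descriptor vectors are
# assumed to have a fixed, equal length.
import math

def make_record(object_id, x, y, z, t, descriptors):
    return {"id": object_id, "pos": (x, y, z), "t": t, "desc": tuple(descriptors)}

def match_score(a, b, pos_weight=1.0, desc_weight=2.0):
    """Lower is better: spatial distance plus weighted descriptor distance."""
    return (pos_weight * math.dist(a["pos"], b["pos"])
            + desc_weight * math.dist(a["desc"], b["desc"]))

def match_across_frames(prev_records, new_record, threshold=1.5):
    """Associate a new detection with the closest previous object, if any."""
    best = min(prev_records, key=lambda r: match_score(r, new_record), default=None)
    if best is not None and match_score(best, new_record) < threshold:
        return best["id"]
    return None   # treat as a newly observed object
```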
As illustrated, however, the video data and distance information will be with respect to the cameras, and a plane of reference 506 of the cameras, which can be substantially parallel to the primary plane(s) of the camera sensors. For purposes of the coordinate data provided to a customer, the customer will often be more interested in coordinate data relative to a plane 508 of the region of interest, such as may correspond to the floor of a store or the surface of a road or sidewalk, which can be directly correlated to the physical location. Thus, in at least some embodiments a conversion or translation of coordinate data is performed such that the coordinates or position data reported to the customer corresponds to the plane 508 (or non-planar surface) of the physical region of interest. This translation can be performed on the detection device itself, or the translation can be performed by a data aggregation server or other such system or service discussed herein that receives the data, and can use information known about the detection device 502, such as position, orientation, and characteristics, to perform the translation when analyzing the data and/or aggregating/correlating the data with data from other nearby and associated detection devices. Mathematical approaches for translating coordinates between two known planes of reference are well known in the art and, as such, will not be discussed in detail herein.
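As one simple illustration of such a translation, a homography can be computed from four known correspondences between image positions and floor positions and then applied to detected coordinates. The correspondence values below are illustrative assumptions, not calibration data for any particular device.

```python
# Map detected image coordinates to floor-plane coordinates via a
# homography computed from four illustrative point correspondences
# (image pixels -> floor position in meters).
import cv2
import numpy as np

image_pts = np.float32([[120, 460], [520, 455], [480, 250], [160, 255]])
floor_pts = np.float32([[0.0, 0.0], [3.0, 0.0], [3.0, 5.0], [0.0, 5.0]])
H = cv2.getPerspectiveTransform(image_pts, floor_pts)

def to_floor_coords(image_xy):
    """Map a detected (x, y) image position to floor-plane coordinates."""
    pt = np.float32([[image_xy]])                 # shape (1, 1, 2)
    return cv2.perspectiveTransform(pt, H)[0, 0]  # (x_m, y_m) on the floor

print(to_floor_coords((300, 400)))
```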
The locations of the specific objects can be tracked over time, such as by monitoring changes in the coordinate information determined for a sequence of video frames over time. The type of object, position for each object, and quantity of objects can be reported by the detection device and/or data service, such that a customer can determine where objects of different types are located in the region of interest. In addition to the number of objects of each type, the location and movement of those types of objects can also be determined. If, for example, the types of objects represent people, automobiles, and bicycles, then such information can be used to determine how those objects move around an intersection, and can also be used to detect when a bicycle or person is in the street disrupting traffic, a car is driving on a sidewalk, or another occurrence is detected such that an action can be taken. As mentioned, an advantage of approaches discussed herein is that the position (and other) information can be provided in near real time, such that the occurrence can be determined while it is ongoing and an action can be taken. This can include, for example, generating audio instructions, activating a traffic signal, dispatching a security officer, or another such action. The real time analysis can be particularly useful for security purposes, where action can be taken as soon as a particular occurrence is detected, such as a person detected in an unauthorized area, etc. Such real time aspects can be beneficial for other purposes as well, such as being able to move employees to customer service counters or cash registers as needed based on current customer locations, line lengths, and the like. For traffic monitoring, this can help determine when to activate or deactivate metering lights, change traffic signals, and perform other such actions.
In other embodiments, the occurrence may be logged for subsequent analysis, such as to determine where such occurrences are taking place in order to make changes to reduce the frequency of such occurrences. In a store situation, such movement data can alternatively be used to determine how men and women move through a store, such that the store can optimize the location of various products or attempt to place items to direct persons to different regions in the store. The data can also help to alert when a person is in a restricted area or otherwise doing something that should generate an alarm, alert, notification, or other such action.
In various embodiments, some amount of image pre-processing can be performed for purposes of improving the quality of the image, as may include filtering out noise, adjusting brightness or contrast, etc. In cases where the camera might be moving or capable of vibrating or swaying on a pole, for example, some amount of position or motion compensation may be performed as well. Background subtraction approaches that can be utilized with various embodiments include mean filtering, frame differencing, Gaussian average processing, background mixture modeling, mixture of Gaussians (MoG) subtraction, and the like. Libraries such as the OpenCV library can also be utilized to take advantage of conventional background and foreground segmentation algorithms.
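For illustration, a background subtraction stage of this kind could be sketched with OpenCV's mixture-of-Gaussians implementation as follows; the video source path and parameter values are placeholders, and other subtraction approaches listed above could be substituted.

```python
# Background subtraction sketch using OpenCV's MOG2 implementation.
import cv2

cap = cv2.VideoCapture("entrance_camera.mp4")      # placeholder video source
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = cv2.GaussianBlur(frame, (5, 5), 0)      # simple noise filtering
    fg_mask = subtractor.apply(frame)               # foreground "blobs"
    # fg_mask can now be passed to contour extraction / object recognition
cap.release()
```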
Once the foreground portions or "blobs" of image data are determined, those portions can be processed using a computer vision algorithm for object recognition or other such process. Object recognition typically makes use of one or more classifiers that have been trained to recognize specific types or categories of objects, such as people, cars, bicycles, and the like. Algorithms used for such purposes can include convolutional or other deep neural networks (DNNs), as may utilize one or more feature extraction libraries for identifying types of feature points of various objects. In some embodiments, a histogram of oriented gradients (HOG)-based approach uses feature descriptors for object detection, such as by counting occurrences of gradient orientation in localized portions of the image data. Other approaches that can be used take advantage of features such as edge orientation histograms and shape contexts, as well as scale- and rotation-invariant feature transform descriptors, although these approaches may not provide the same level of accuracy for at least some data sets.
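As one concrete possibility among those mentioned, a HOG-based person detector can be assembled from OpenCV's default people descriptor roughly as follows; the weight threshold and detection parameters are illustrative assumptions.

```python
# HOG-based person detection sketch using OpenCV's default people detector.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_people(frame, weight_threshold=0.5):
    """Return bounding boxes (x, y, w, h) for likely people in a frame."""
    boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8), scale=1.05)
    return [tuple(int(v) for v in box)
            for box, weight in zip(boxes, weights)
            if float(weight) > weight_threshold]
```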
In some embodiments, an attempt to classify objects that does not require precision can rely on the general shapes of the blobs or foreground regions. For example, there may be two blobs detected that correspond to different types of objects. The first blob can have an outline or other aspect determined that a classifier might indicate corresponds to a human with 85% certainty. Certain classifiers might provide multiple confidence or certainty values, such that the scores provided might indicate an 85% likelihood that the blob corresponds to a human and a 5% likelihood that the blob corresponds to an automobile, based upon the correspondence of the shape to the range of possible shapes for each type of object, which in some embodiments can include different poses or angles, among other such options. Similarly, a second blob might have a shape that a trained classifier could indicate has a high likelihood of corresponding to a vehicle. For situations where the objects are visible over time, such that additional views and/or image data can be obtained, the image data for various portions of each blob can be aggregated, averaged, or otherwise processed in order to attempt to improve precision and confidence. As mentioned elsewhere herein, the ability to obtain views from two or more different cameras can help to improve the confidence of the object recognition processes.
Where more precise identifications are desired, the computer vision process used can attempt to locate specific feature points as discussed above. As mentioned, different classifiers can be used that are trained on different data sets and/or utilize different libraries, where specific classifiers can be utilized to attempt to identify or recognize specific types of objects. For example, a human classifier might be used with a feature extraction algorithm to identify specific feature points of a foreground object, and then analyze the spatial relations of those feature points to determine with at least a minimum level of confidence that the foreground object corresponds to a human. The feature points located can correspond to any features that are identified during training to be representative of a human, such as facial features and other features representative of a human in various poses. Similar classifiers can be used to determine the feature points of other foreground objects in order to identify those objects as vehicles, bicycles, or other objects of interest. If an object is not identified with at least a minimum level of confidence, that object can be removed from consideration, or another device can attempt to obtain additional data in order to attempt to determine the type of object with higher confidence. In some embodiments, the image data can be saved for subsequent analysis by a computer system or service with sufficient processing, memory, and other resource capacity to perform a more robust analysis.
After processing using a computer vision algorithm with the appropriate classifiers, libraries, or descriptors, for example, a result can be obtained that is an identification of each potential object of interest with associated confidence value(s). One or more confidence thresholds or criteria can be used to determine which objects to select as the indicated type. The setting of the threshold value can be a balance between the desire for precision of identification and the ability to include objects that appear to be, but may not be, objects of a given type. For example, there might be 1,000 people in a scene. Setting a confidence threshold too high, such as at 99%, might result in a count of around 100 people, but there will be a very high confidence that each object identified as a person is actually a person. Setting a threshold too low, such as at 50%, might result in too many false positives being counted, which might result in a count of 1,500 people, one-third of which do not actually correspond to people. For applications where approximate counts are desired, the data can be analyzed to determine the appropriate threshold where, on average, the number of false positives is balanced by the number of persons missed, such that the overall count is approximately correct on average. For many applications this can be a threshold between about 60% and about 85%, although as discussed the ranges can vary by application or situation.
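To make the trade-off concrete, the short sketch below applies a confidence threshold to a hypothetical list of classifier outputs; the detection values and thresholds are illustrative only.

```python
# Applying a confidence threshold to classifier output when an
# approximate count is the goal. The detections list is illustrative.
detections = [
    {"type": "person", "confidence": 0.91},
    {"type": "person", "confidence": 0.72},
    {"type": "person", "confidence": 0.55},
    {"type": "car",    "confidence": 0.88},
]

def count_objects(detections, object_type, threshold=0.70):
    """Count detections of a given type at or above the confidence threshold."""
    return sum(1 for d in detections
               if d["type"] == object_type and d["confidence"] >= threshold)

print(count_objects(detections, "person"))        # -> 2 at a 0.70 threshold
print(count_objects(detections, "person", 0.50))  # -> 3; a lower threshold admits more false positives
```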
As mentioned, many of the examples herein utilize image data captured by one or more detection devices with a view of an area of interest. In addition to one or more digital still image or video cameras, these devices can include infrared detectors, stereoscopic cameras, thermal sensors, motion sensors, proximity sensors, and other such sensors or components. The image data captured can include one or more images, or video, indicating pixel values for pixel locations of the camera sensor, for example, where the pixel values can represent data such as the intensity or color of ambient, infrared (IR), or ultraviolet (UV) radiation detected by the sensor. A device may also include non-visual based sensors, such as radio or audio receivers, for detecting energy emanating from various objects of interest. These energy sources can include, for example, cell phone signals, voices, vehicle noises, and the like. This can include looking for distinct signals or a total number of signals, as well as the bandwidth, congestion, or throughput of signals, among other such options. Audio and other signature data can help to determine aspects such as type of vehicle, regions of activity, and the like, as well as providing another input for counting or tracking purposes. The overall audio level and direction of the audio can also provide an additional input for potential locations of interest. In various embodiments, the devices may also include position or motion sensing devices such as global positioning system (GPS) devices, gyroscopes, and accelerometers, among others.
In some embodiments, a detection device can include an active, structured-light sensor. Such an approach can utilize a set of light sources, such as a laser array, that projects a pattern of light of a certain wavelength, such as in the infrared (IR) spectrum that may not be detectable by the human eye. One or more structured light sensors can be used, in place of or in addition to the ambient light camera sensors, to detect the reflected IR light. In some embodiments sensors can be used that detect light over the visible and infrared spectrums. The size and placement of the reflected pattern components can enable the creation of a three-dimensional mapping of the objects within the field of view. Such an approach may require more power, due to the projection of the IR pattern, but may provide more accurate results in certain situations, such as low light situations or locations where image data is not permitted to be captured, etc. The information obtained through the above-described computer vision and analysis techniques can be used to determine the conditions present, and thus make decisions regarding the content to display based on the detected conditions.
As mentioned, the above techniques can be applied in various ways to determine content to display. In an example scenario, the content determined for display may be customized depending on a number of people detected in a group. For example, the content display device may detect a group of 5 people walking together consistently and make a determination that the group of 5 people make up a single party. The display device may then display content that includes information about a nearby restaurant currently having an open table for 5 people as well as other helpful information such as directions or pictures of example food items.
In another example scenario, the content determined for display may be customized depending on the estimated age or height of people detected in a scene. For example, at a theme park, the content display device may detect a child of a certain height and display rides in the theme park that the child is likely to be tall enough to ride, and other optional information such as directions or a map showing the locations of the rides.
In another example scenario, the content determined for display may be determined based on a detected flow of people. For example, it may be detected that an increasing number of people are entering a store, and the display may display content indicating that a certain number of additional checkout lanes should be opened in anticipation of the influx of customers. In this scenario, the display and the image sensor may be located remotely from one another. For example, the image sensor may be located near a customer entrance of the store, and the display may be located in an employee room or management office of the store. In another example, a number of people inside a particular store in a shopping plaza may be detected, and the display may display content letting others know that the store is currently crowded.
In another example scenario, the content determined for display may be determined based on a combination of types of objects detected in a scene. For example, a person and an umbrella may be detected in the scene, which may indicate that it is a rainy day. Thus, the content display device may select content that is designated for a rainy day, such as an advertisement for a nearby hot chocolate shop.
In various embodiments, as content displayed by the content display device may change dynamically based on detected conditions, such as types of objects, the content may not necessarily be displayed on a set schedule or based on a certain share of display time. For example, the display may include content from a plurality of different content providers (e.g., companies). For example, a content provider can dictate that their content be displayed to a certain demographic (i.e., object type). The content providers may be charged each time their content is displayed, or for the total time during which their content was displayed, and/or depending on how well the audience matches their preferred demographic. For example, a content provider may be charged a certain amount for their content being shown to teenagers and a different amount for their content being shown to adults. In some embodiments, based on historical demographic data, the content display device may determine an estimated amount of "inventory" for various demographic types, and plan the display content accordingly to optimize the match between content and audience. In some embodiments, the content providers may provide a maximum amount of time to display their content. In some embodiments, the display value of the display may vary depending on various factors, such as time of day, number of people walking by the display, or various combinations of factors. In one embodiment, the value of the display may be determined based at least in part on the number of people detected to walk past the display. Thus, the present systems and methods enable values to be determined for time slots of a display.
The representation of the object may be compared 708 to one or more object models to determine an object type. Specifically, in various embodiments, the object type is determined 710 based on the representation of the object matching one of the object models. In various embodiments, the one or more object models may each be associated with a particular object type (e.g., adult male, baby, car, truck, stroller, shopping bag, hat). For example, an object model for a stroller may include example sets of feature points that are known to represent a stroller, and if the feature points of the detected object match (i.e., are similar to, within a certain confidence level) the example feature points, then a determination can be made that the detected feature points indicate a stroller in the scene, and the object type is determined to be "stroller". In various embodiments, the image data and/or the extracted representation of the one or more objects can be analyzed using any appropriate object recognition process, computer vision algorithm, artificial neural network, or other such mechanism for analyzing image data to detect and identify objects in the image data. The detection can include, for example, determining feature points or vectors in the image data that can then be compared against patterns or criteria for specific types of objects, in order to identify or recognize objects of specific types. For example, a neural network can be trained for a certain object type such that the neural network can identify objects occurring in an image as belonging to that object type. A neural network could also classify objects occurring in an image into one or more of a plurality of classes, each of the classes corresponding to a certain object type. In various embodiments, a neural network can be trained by providing training data which includes image data having representations of objects which are annotated as belonging to certain object types. Given a sufficient amount of training data, the neural network can learn how to classify representations of new objects.
In various embodiments, if the object is a person, the type of object may also include certain emotional states of the person, such as happy, sad, worried, angry, etc. In some embodiments, the emotional state may be determined using real-time inference, in which feature points in a detected facial region of the person are analyzed through various techniques, such as neural networks, to determine an emotional state of the person represented in the image data. The neural networks may be trained using training data which includes images of faces annotated with the correct emotional state. In some embodiments, body position may also be used in the analysis.
Thus, content is then determined 712 based on the object type. For example, the content may be an advertisement for baby food if the object type is "stroller". Accordingly, the content is displayed 714 on the display. In an example embodiment, the position of the one or more objects may also be determined from the image data and the content may be determined based at least in part on the position of the one or more objects. For example, one or more objects being relatively close to one another in position may be determined to make up a group or party and thus treated as such in determining the content to display, as sketched below.
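One simple way such proximity-based grouping might be sketched is shown below; the positions, distance threshold, and greedy grouping rule are illustrative assumptions rather than a required implementation.

```python
# Group detected people into a party when their floor-plane positions
# (in meters) fall within a distance threshold of one another.
import math

def group_by_proximity(positions, max_gap_m=1.5):
    """Greedy grouping: each person joins the first group containing a
    member within max_gap_m; otherwise a new group is started."""
    groups = []
    for p in positions:
        for g in groups:
            if any(math.dist(p, q) <= max_gap_m for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

people = [(0.0, 0.0), (0.8, 0.3), (1.4, 0.1), (6.0, 2.0), (6.4, 2.1)]
print([len(g) for g in group_by_proximity(people)])   # -> [3, 2]
```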
The image data in this example can correspond to a single digital image or a frame of digital video, among other such options. The captured image data can be analyzed, on the detection device, to extract image features (e.g., a feature vector) or other points or aspects that may be representative of objects in the image data. These can include any appropriate image features discussed or suggested herein. Once the features are extracted, the image data can be deleted. Object recognition, or another object detection process, can be performed on the detection device using the extracted image features. The object recognition process can attempt to determine a presence of objects represented in the image data, such as those that match object patterns or have feature vectors that correspond to various defined object types, among other such options. In at least some embodiments, each potential object determination will come with a corresponding confidence value, and objects with at least a minimum confidence value that correspond to specified types of objects may be selected as objects of interest. If it is determined that no objects of interest are represented in the frame of image data, then new image data may be captured.
If, however, one or more objects of interest are detected in the image data, the objects can be analyzed to determine relevant information. In the example process the objects will be analyzed individually for purposes of explanation, but it should be understood that object data can be analyzed concurrently as well in at least some embodiments. An object of interest can be selected and at least one descriptor for that object can be determined. The types of descriptor in some embodiments can depend at least in part upon the type of object. For example, a human object might have descriptors relating to height, clothing color, gender, or other aspects discussed elsewhere herein. A vehicle, however, might have descriptors such as vehicle type and color, etc. The descriptors can vary in detail, but should be sufficiently specific such that two objects in similar locations in the area can be differentiated based at least in part upon those descriptors. Content for display can then be determined based on the at least one descriptor, and the content can then be displayed.
The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.