For a number of reasons, it would be useful if a home entertainment device or system were able to determine if people were present in the room. If viewers leave the room in order to go to the kitchen, for example, the system could go into a low power consumption state, perhaps by dimming or powering down the display, or by shutting down completely. In this way, power could be conserved. If recorded media were being viewed, the playback could be automatically paused when a viewer leaves the room.
In addition, the next generation of smart televisions may be service platforms offering viewers several services such as banking, on-line shopping, etc. Human presence detection would also be useful for such TV-based services. For example, if a viewer were accessing a bank/brokerage account using the TV, but then left the room without closing the service, a human presence detection capability could be used to automatically log off or shut down the service after a predetermined time. In another case, if another person enters the room while the on-line banking service is running, the human presence detection could be used to automatically turn off the banking service for security or privacy reasons.
Detecting human presence would also be useful to advertisers and content providers. Actual viewership could be determined. Content providers could determine the number of people viewing a program. Advertisers could use this information to determine the number of people who are exposed to a given advertisement. Moreover, an advertiser could determine how many people viewed a particular airing of an advertisement, i.e., how many people saw an ad at a particular time and channel, and in the context of a particular program. This in turn could allow the advertiser to perform a cost-benefit analysis. The exposure of an advertisement could be compared to the cost to produce the advertisement, to determine if the advertisement, as aired at a particular time and channel, is a worthwhile expense.
FIG. 7 is a flow chart illustrating feature extraction of room audio in order to determine the presence of more than one person, according to an embodiment.
In the drawings, the leftmost digit(s) of a reference number identifies the drawing in which the reference number first appears.
An embodiment is now described with reference to the figures, where like reference numbers may indicate identical or functionally related elements. While specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the relevant art will recognize that other configurations and arrangements can be used without departing from the spirit and scope of the description. It will be apparent to a person skilled in the relevant art that this can also be employed in a variety of systems and applications other than what is described herein.
Disclosed herein are methods, systems and computer program products that may allow for the determination of human presence in a room where content is being presented. The audio that is associated with the content may be captured, along with the audio being generated in the room by whatever sources are present. Features may be extracted from both the content audio and the room audio. These features may then be compared, and the differences may be quantified. If the differences are significant, then human presence may be inferred. Insignificant differences may be used to infer the absence of people.
The overall context of the system is illustrated in
Content 110 may be presented to a user through one or more output devices, such as television (TV) 150. The presentation of content 110 may be controlled through the use of a remote control 160, which may transmit control signals to STB 120. The control signals may be received by a radio frequency (RF) interface I/O 130 at STB 120.
Room audio 170 may also be present, including all sound generated in the room. Sources for the room audio 170 may include ambient noise and sounds made by any users, including but not limited to speech. Room audio 170 may also include sound generated by the consumer electronics in the room, such as the content audio 115 produced by TV 150. The room audio may be captured by a microphone 140. In the illustrated embodiment, microphone 140 may be incorporated in STB 120. In alternative embodiments, the microphone 140 may be incorporated in TV 150 or elsewhere.
The processing of the system described herein is shown generally at
Process 200 is illustrated in greater detail in
Room audio may be processed in an analogous manner. At 315, room audio may be received. As noted above, room audio may be captured using a microphone incorporated into an STB or other consumer electronics component in the room, and may then be recorded for processing purposes. At 325, the room audio may be sampled. In an embodiment, the room audio may be sampled at 8 kHz, although other sampling rates may be used. At 335, the sampled room audio may be divided into intervals for subsequent processing. In an embodiment, the intervals may be 0.5 second long. The intervals of sampled room audio may correspond, with respect to time, to respective intervals of sampled content audio. At 345, features may be extracted from each interval of sampled room audio. As in the case of content audio, a coefficient of variation (the ratio of the standard deviation to the mean) or other statistical measure may be calculated for each interval and used as the feature for subsequent processing.
At 350, the extracted features may be compared. In an embodiment, this includes comparison of the coefficients of variation, as a common statistical measure, for temporally corresponding intervals of sampled room audio and sampled content audio. The comparison process will be described in greater detail below. In an embodiment, this may comprise calculating the difference between the coefficients of variation of the room audio and the content audio, for corresponding intervals. At 360, a normalization or smoothing process may take place. This may comprise calculation of a function of the differences between the coefficients of variation of the room audio and the content audio over a sequence of successive intervals. At 370, an inference may be reached regarding the presence of people in the room, where the inference may be based on the statistic(s) resulting from the normalization performed at 360. In an embodiment, if the coefficients of variation are sufficiently different between temporally corresponding intervals of room and content audio, then the presence of one or more people may be inferred.
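The sampling and interval steps (325 through 345) can be sketched as follows. This is a minimal illustration, not the disclosed firmware: the function name `interval_features` is hypothetical, and computing the statistic over absolute sample amplitudes is an assumption made here because signed audio samples have a mean close to zero. The same function would be applied to content audio so that the i-th room-audio feature lines up with the i-th content-audio feature.

```python
import numpy as np

def interval_features(samples, sample_rate=8000, interval_s=0.5):
    """Return one coefficient of variation (std / mean) per interval.

    Hypothetical helper. Assumption: the statistic is taken over absolute
    sample amplitudes, since signed audio samples average to roughly zero.
    """
    samples = np.asarray(samples, dtype=float)
    n = int(sample_rate * interval_s)        # 4000 samples per 0.5 s at 8 kHz
    usable = (len(samples) // n) * n         # drop a trailing partial interval
    frames = np.abs(samples[:usable]).reshape(-1, n)
    means = frames.mean(axis=1)
    stds = frames.std(axis=1)
    # Silent intervals (mean amplitude of zero) are reported as 0.0.
    return np.divide(stds, means, out=np.zeros_like(means), where=means > 0)
```

With these defaults, two seconds of audio (16000 samples) yields four features, one per 0.5 second interval.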
In an alternative embodiment, additional processing may be performed in conjunction with feature extraction.
The comparison of coefficients of variation is illustrated in
Note that the magnitude of the percentage difference may allow greater or lesser confidence in the human presence inference. If the percentage difference is less than the threshold, then human presence may be unlikely, as discussed above. If the percentage is significantly less than the threshold, e.g., close to zero, then this may suggest that the room audio and the content audio are extremely similar, so that a higher degree of confidence may be placed in the inference that human presence is unlikely. Conversely, if the percentage difference exceeds the threshold, then human presence may be likely. If the percentage difference exceeds the threshold by a significant amount, then this may suggest that the room audio and the content audio are very different, and a higher degree of confidence may be placed in the inference that human presence is likely.
In an embodiment, the data related to a given interval may be normalized by considering this interval in addition to a sequence of immediately preceding intervals. In this way, the significance of outliers may be diminished, while the implicit confidence level of an interval may influence the inferences derived in succeeding intervals. Numerically, the normalization process may use any of several functions. Normalization may use a moving average of data from past intervals, or may use linear or exponential decay functions of this data.
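The threshold-and-confidence reasoning above can be illustrated with a short sketch. The text does not define the percentage difference precisely, so taking it relative to the content-audio coefficient is an assumption, and the `threshold` and `margin` values are purely illustrative; both function names are hypothetical.

```python
def percent_difference(cv_room, cv_content):
    """Hypothetical percentage difference, taken relative to the content-audio CV."""
    if cv_content == 0:
        return 0.0 if cv_room == 0 else float("inf")
    return abs(cv_room - cv_content) / cv_content * 100.0

def infer_presence(pct_diff, threshold=20.0, margin=10.0):
    """Return (present, confident) for one interval.

    present: the difference exceeds the threshold.
    confident: the difference lies far (at least `margin` points) from the
    threshold, on either side. Both parameter values are illustrative only.
    """
    present = pct_diff > threshold
    confident = abs(pct_diff - threshold) >= margin
    return present, confident
```

For example, `percent_difference(0.75, 0.5)` is 50.0, which `infer_presence` classifies as `(True, True)`: well above the threshold, so presence is inferred with higher confidence. Identical coefficients give a difference of 0.0 and a confident inference of absence.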
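The three normalization options named above (a moving average, linear decay, and exponential decay over preceding intervals) might look like this when applied to a sequence of per-interval differences. The function names, window size, and decay factor are assumptions for illustration.

```python
def moving_average(values, window=4):
    """Mean of the current interval and up to window-1 preceding intervals."""
    out = []
    for i in range(len(values)):
        past = values[max(0, i - window + 1): i + 1]
        out.append(sum(past) / len(past))
    return out

def linear_decay(values, window=4):
    """Weighted mean whose weights fall off linearly with interval age."""
    out = []
    for i in range(len(values)):
        past = values[max(0, i - window + 1): i + 1]
        # Oldest interval gets the smallest weight, current interval gets `window`.
        weights = range(window - len(past) + 1, window + 1)
        out.append(sum(w * v for w, v in zip(weights, past)) / sum(weights))
    return out

def exponential_decay(values, alpha=0.5):
    """Exponentially weighted average: s_i = alpha * x_i + (1 - alpha) * s_(i-1)."""
    out, s = [], None
    for v in values:
        s = v if s is None else alpha * v + (1 - alpha) * s
        out.append(s)
    return out
```

Applied to the differences `[0, 0, 0, 100]`, a single outlying interval is damped to 25.0 by the moving average, 40.0 by linear decay, and 50.0 by exponential decay, which illustrates how the choice of function trades responsiveness against outlier suppression.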
The processes of
This is shown in
At 1050, the extracted features of a content audio interval and a room audio interval may be compared. This comparison may be performed in the same manner as shown in
As noted above, the systems, methods and computer program products described herein may be implemented in the context of a home entertainment system that may include an STB and/or a smart television, or may be implemented in a personal computer. Moreover, the systems, methods and computer program products described herein may also be implemented in the context of a laptop computer, ultra-laptop or netbook computer, tablet, touch pad, portable computer, handheld computer, palmtop computer, personal digital assistant (PDA), cellular telephone, combination cellular telephone/PDA, smart device (e.g., smart phone, smart tablet or smart television), mobile internet device (MID), messaging device, data communication device, and so forth.
One or more features disclosed herein may be implemented in hardware, software, firmware, and combinations thereof, including discrete and integrated circuit logic, application specific integrated circuit (ASIC) logic, and microcontrollers, and may be implemented as part of a domain-specific integrated circuit package, or a combination of integrated circuit packages. The term software, as used herein, refers to a computer program product including a computer readable medium having computer program logic stored therein to cause a computer system to perform one or more features and/or combinations of features disclosed herein. The computer readable medium may be transitory or non-transitory. An example of a transitory computer readable medium may be a digital signal transmitted over a radio frequency or over an electrical conductor, through a local or wide area network, or through a network such as the Internet. An example of a non-transitory computer readable medium may be a compact disk, a flash memory, random access memory (RAM), read-only memory (ROM), or other data storage device.
An embodiment of a system that may perform the processing described herein is shown in
A microphone 1105 may capture room audio 1107. Content audio 1117 may be received and routed to PIC 1110. The sampling of the room and content audio and the decomposition of these signals into intervals may be performed in PIC 1110 or elsewhere. After sampling and decomposition into intervals, the content and room audio may be processed by the feature extraction firmware 1115 in PIC 1110. As discussed above, the feature extraction process may produce coefficients of variation for each interval, for both sampled room audio and sampled content audio. In the illustrated embodiment, feature extraction may take place in the PIC 1110 through the execution of feature extraction firmware 1115. Alternatively, the feature extraction functionality may be implemented in an execution engine of system on a chip (SOC) 1120.
If feature extraction is performed at PIC 1110, the coefficients of variation may be sent to SOC 1120, and then made accessible to operating system (OS) 1130. Comparison of coefficients from corresponding room audio and content audio intervals may be performed by logic 1160 in presence middleware 1140. Normalization may be performed by normalization logic 1150, which may also be part of presence middleware 1140. An inference regarding human presence may then be made available to a presence-enabled application 1170. Such an application may, for example, put system 1100 into a low power state if it is inferred that no one is present. Another example of a presence-enabled application 1170 may be a program that collects presence inferences from system 1100 and others like it in other households, to determine viewership of a television program or advertisement.
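The division of labor between presence middleware 1140 and a presence-enabled application 1170 can be sketched as a simple publish/subscribe arrangement. Everything here is hypothetical: the class names, the callback interface, the relative percentage difference, and the rule of entering a low-power state only after several consecutive "absent" inferences (loosely modeled on the "predetermined time" mentioned for the banking example).

```python
from collections import deque
from typing import Callable, List

class PresenceMiddleware:
    """Hypothetical middleware: compares per-interval CVs, smooths the
    differences with a moving average, and publishes a presence inference."""

    def __init__(self, threshold_pct=20.0, window=4):
        self.threshold_pct = threshold_pct
        self.history = deque(maxlen=window)   # differences of recent intervals
        self.listeners: List[Callable[[bool], None]] = []

    def subscribe(self, callback: Callable[[bool], None]) -> None:
        self.listeners.append(callback)

    def on_features(self, cv_room: float, cv_content: float) -> bool:
        # Comparison: percentage difference relative to the content-audio CV.
        diff = abs(cv_room - cv_content) / max(cv_content, 1e-9) * 100.0
        self.history.append(diff)
        # Normalization: moving average over the retained intervals.
        smoothed = sum(self.history) / len(self.history)
        present = smoothed > self.threshold_pct
        for callback in self.listeners:
            callback(present)
        return present

class PowerSaverApp:
    """Hypothetical presence-enabled application: enters a low-power state after
    `absent_intervals` consecutive absent inferences, and leaves it as soon as
    presence is inferred again."""

    def __init__(self, absent_intervals=3):
        self.absent_intervals = absent_intervals
        self.absent_run = 0
        self.low_power = False

    def __call__(self, present: bool) -> None:
        self.absent_run = 0 if present else self.absent_run + 1
        self.low_power = self.absent_run >= self.absent_intervals
```

Keeping the policy (how long to wait before dimming the display) in the application rather than the middleware lets different applications, such as a power saver and a viewership counter, consume the same inferences.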
As noted above with respect to
Items 1105, 1110, 1120, and 1130 may all be located in one or more components in a user's home entertainment system or computer system, in an embodiment. They may be located in an STB, digital video recorder, or television, for example. Presence middleware 1140 and presence-enabled application 1170 may also be located in one or more components of the user's home entertainment system or computer system. In alternative embodiments, one or both of presence middleware 1140 and presence-enabled application 1170 may be located elsewhere, such as the facility of a content provider, for example.
Note that in some embodiments, the audio captured by the microphone 1105 may be muted. A user may choose to do this via a button on remote control 1180 or on the home entertainment system. This microphone mute function is separate from, and does not interfere with, the conventional remote-control mute that silences the audio coming out of the TV. A “mute” command for the microphone would be sent to audio selection logic in PIC 1110. As a result of such a command, audio from microphone 1105 would not be received by OS 1130. Nonetheless, room audio 1107 may still be received at PIC 1110, where feature extraction may be performed. This capability may be enabled by the presence of the feature extraction firmware 1115 in the PIC 1110. The statistical data, i.e., the coefficients of variation, may then be made available to the OS 1130, even though the room audio itself has been muted. Because each coefficient of variation summarizes an entire interval of samples as a single value, the coefficients may not be usable for recreating room audio 1107.
Computer program logic 1240 may include feature extraction code 1250. This code may be responsible for determining the standard deviation and mean for intervals of sampled room audio and content audio, as discussed above. Feature extraction code 1250 may also be responsible for implementing Fourier transformation and bandpass filtering as discussed above with respect to
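Where feature extraction code 1250 also applies Fourier transformation and bandpass filtering, one possible arrangement is to filter each interval in the frequency domain before computing its coefficient of variation. This is a sketch under stated assumptions: zeroing FFT bins is a crude brick-wall filter chosen for brevity (a designed IIR or FIR filter would be more typical in practice), the 300 to 3400 Hz passband is an illustrative speech band not taken from the text, and the function name is hypothetical.

```python
import numpy as np

def bandpassed_cv(interval, sample_rate=8000, low_hz=300.0, high_hz=3400.0):
    """Coefficient of variation of one interval after an FFT-domain bandpass.

    Hypothetical helper; the passband limits are illustrative assumptions.
    """
    x = np.asarray(interval, dtype=float)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    # Zero every frequency bin outside the passband (brick-wall filter).
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0
    filtered = np.fft.irfft(spectrum, n=len(x))
    amp = np.abs(filtered)
    mean = amp.mean()
    return float(amp.std() / mean) if mean > 0 else 0.0
```

Because the DC bin lies outside the passband, a constant offset added to an in-band tone does not change the resulting coefficient, whereas it would strongly reduce the coefficient computed on the unfiltered samples.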
A software embodiment of the comparison and normalization functionality is illustrated in
Computer program logic 1340 may include comparison code 1350. This module may be responsible for comparing coefficients of variation of corresponding intervals of room audio and content audio, and generating a quantitative indication of the difference, e.g., a percentage difference, as discussed above. Computer program logic 1340 may include normalization code 1360. This module may be responsible for normalizing the data generated by comparison code 1350 using a moving average or other process, as noted above. Computer program logic 1340 may include inference code 1370. This module may be responsible for generating an inference regarding the presence or absence of people, given the results of normalization code 1360.
The systems, methods, and computer program products described above may have a number of applications. If a viewer leaves a room, for example, the absence of people could be detected as described above, and the entertainment or computer system could go into a low power consumption state, perhaps by dimming or powering down the display, or by shutting down completely. In this way, power could be conserved. If recorded media were being viewed, the playback could be automatically paused when a viewer leaves the room.
In addition, service platforms may offer viewers services such as banking, on-line shopping, etc. Human presence detection as described above would be useful for such TV-based services. For example, if a viewer were accessing a bank/brokerage account using the TV, but then leaves the room without closing the service, a human presence detection capability could be used to automatically log off or shut down the service after a predetermined time. In another case, if another person enters the room while the on-line banking service is running, the human presence detection could be used to automatically turn off the banking service for security or privacy reasons.
Detecting human presence would also be useful to advertisers and content providers. Actual viewership could be determined. Content providers could determine the number of people viewing a program. Advertisers could use this information to determine the number of people who are exposed to a given advertisement. Moreover, an advertiser could determine how many people viewed a particular airing of an advertisement, i.e., how many people saw an ad at a particular time and channel, and in the context of a particular program. This in turn could allow the advertiser to perform a cost-benefit analysis. The exposure of an advertisement could be compared to the cost to produce the advertisement, to determine if the advertisement, as aired at a particular time and channel, is a worthwhile expense.
Methods and systems are disclosed herein with the aid of functional building blocks illustrating the functions, features, and relationships thereof. At least some of the boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries may be defined so long as the specified functions and relationships thereof are appropriately performed.
While various embodiments are disclosed herein, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various changes in form and detail may be made therein without departing from the spirit and scope of the methods and systems disclosed herein. Thus, the breadth and scope of the claims should not be limited by any of the exemplary embodiments disclosed herein.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US2011/049228 | 8/25/2011 | WO | 00 | 6/23/2014 |