The present disclosure relates generally to audience measurement, and more particularly, to methods and apparatus to operate an audience metering device with voice commands.
Determining the demographics of a television viewing audience helps television program producers improve their television programming and determine a price for advertising during such programming. In addition, accurate television viewing demographics allow advertisers to target certain types of audiences. To collect the demographics of a television viewing audience, an audience measurement company may enlist a number of television viewers to cooperate in an audience measurement study for a predefined length of time. The viewing behavior of these enlisted viewers, as well as demographic data about these enlisted viewers, is collected and used to statistically determine the demographics of a television viewing audience. In some cases, automatic measurement systems may be supplemented with survey information recorded manually by the viewing audience members.
Audience measurement systems typically require some amount of on-going input from the participating audience member. One method of collecting viewer input involves the use of a people meter. A people meter is an electronic device that is typically disposed in the viewing area and that is proximate to one or more of the viewers. The people meter is adapted to communicate with a television meter disposed in, for example, a set top box, that measures various signals associated with the television for a variety of purposes including, but not limited to, determining the operational status of the television (i.e., whether the television is on or off), and identifying the programming being displayed by the television. Based on any number of triggers, including, for example, a channel change or an elapsed period of time, the people meter prompts the household viewers to input information by depressing one of a set of buttons, each of which is assigned to represent a different household member. For example, the people meter may prompt the viewers to register (i.e., log in), or to indicate that they are still present in the viewing audience. Although periodically inputting information in response to a prompt may not be burdensome when required for an hour, a day or even a week or two, some participants find the prompting and data input tasks to be intrusive and annoying over longer periods of time. Thus, audience measurement companies are researching different ways for participants to input information to collect viewing data and provide greater convenience for the participants.
Today, several voice-activated systems are commercially available to perform a variety of tasks including inputting information. For example, users can log in to a computer network by a unique voice command detected by a microphone and authenticated by an algorithm that analyzes the speech signal. In another example, there are home automation appliances that can be turned on and off by voice commands. However, current voice-activated systems are designed to operate in acoustically clean environments. In the case of logging into a computer network, for example, the user speaks directly into a microphone and very little ambient noise is present. In contrast, a major source of interference in an audience measurement system is present in the form of audio output by, for example, speakers of a media presentation device such as a television. If a microphone is built into a people meter, the microphone may pick up significant audio signals from the television speakers that make it difficult to recognize voice commands.
Although the following discloses example systems including, among other components, software executed on hardware, it should be noted that such systems are merely illustrative and should not be considered as limiting. For example, it is contemplated that any or all of the disclosed hardware and software components could be embodied exclusively in dedicated hardware, exclusively in firmware, exclusively in software or in some combination of hardware, firmware, and/or software.
In the example of
In the illustrated example, an audience metering device 140 is provided to collect viewing information with respect to the household member(s) 160 in the viewing area 150. The audience metering device 140 provides this viewing information as well as other tuning and/or demographic data via a network 170 to a data collection facility 180. The network 170 may be implemented using any desired combination of hardwired and wireless communication links, including for example, the Internet, an Ethernet connection, a digital subscriber line (DSL), a telephone line, a cellular telephone system, a coaxial cable, etc. The data collection facility 180 may be configured to process and/or store data received from the audience metering device 140 to produce ratings information.
The service provider 110 may be implemented by any service provider such as, for example, a cable television service provider 112, a radio frequency (RF) television service provider 114, and/or a satellite television service provider 116. The television 120 receives a plurality of television signals transmitted via a plurality of channels by the service provider 110 and may be adapted to process and display television signals provided in any format such as a National Television System Committee (NTSC) television signal format, a high definition television (HDTV) signal format, an Advanced Television Systems Committee (ATSC) television signal format, a phase alternating line (PAL) television signal format, a digital video broadcasting (DVB) television signal format, an Association of Radio Industries and Businesses (ARIB) television signal format, etc.
The user-operated remote control device 125 allows a user to cause the television 120 to tune to and receive signals transmitted on a desired channel, and to cause the television 120 to process and present the programming content contained in the signals transmitted on the desired channel. The processing performed by the television 120 may include, for example, extracting a video component and/or an audio component delivered via the received signal, causing the video component to be displayed on a screen/display associated with the television 120, and causing the audio component to be emitted by speakers associated with the television 120. The programming content contained in the television signal may include, for example, a television program, a movie, an advertisement, a video game, and/or a preview of other programming content that is currently offered or will be offered in the future by the service provider 110.
While the components shown in
The audience metering device 140 may include several sub-systems to perform tasks such as determining the channel being viewed. For example, the audience metering device 140 may be configured to identify the tuned channel from audio watermarks that have been embedded in the television audio. Alternatively, the audience metering device 140 may be configured to identify the tuned program by taking program signatures and/or detecting video and/or audio codes embedded in the broadcast signal. For example, the audience metering device 140 may have audio inputs to receive a line signal directly from an audio line output of the television 120. If the television 120 does not have an audio line output, probes may be attached to one or more leads of the television speaker (not shown).
For the purpose of identifying the demographic information of an audience, the measurement device is configured to identify the member of the audience viewing the associated television. To this end, the audience metering device 140 is provided with a prompting mechanism to request the audience member to identify themselves as present in the audience. These prompts can be generated at particular time intervals and/or in response to predetermined events such as channel changes. The prompting mechanism may be implemented by, for example, light emitting diodes (LEDs), an on-screen prompt, an audible request via a speaker, etc.
Whereas prior art devices were structured to respond to electronic inputs from the household member(s) 160 (e.g., inputs via remote control devices, push buttons, switches, etc.) to identify the individual(s) in the audience, the audience metering device 140 of the illustrated example is configured to respond to voice commands from the household member(s) 160 as described in detail below. In particular, the household member(s) 160 are able to signal their presence in and/or their exit from the viewing area 150 by a voice command. In general, the voice commands may be received by the audience metering device 140 via a microphone or a microphone array and processed by the audience metering device 140. The household member(s) 160 may be more likely to respond to prompts from the audience metering device 140 using voice commands than by using other input methods because providing a voice command only requires one to speak.
The voice activation system of the audience metering device 140 may be implemented in many different ways. For example, several voice-activated systems are commercially available to perform a variety of tasks such as logging into a computer and activating home automation appliances via voice commands. However, many of the current voice-activated systems are designed to operate in acoustically clean environments. For example, a user may log into a computer by speaking directly into a microphone such that very little ambient noise is present to interfere with the received signal. In contrast, in the context of
In the example of
In general, the audience metering device 140 uses an adaptive filter to reduce or remove the television audio signals from the audio input signal 260. The audience metering device 140 uses a signal representation of the television audio signals received from a line audio output of the television 120 to substantively filter these television audio signals from the audio input signal 260. The filtered audio signal is then processed by a voice command recognizer algorithm. More particularly, the audience metering device 140 of
The MFCC extractor 240 extracts feature vectors from the residual signal output by the television audio subtractor 230. The feature vectors correspond to the one or more voice commands from the household member(s) 160. Through a cross-correlation operation described in detail below, the matcher 250 then compares the feature vectors against stored vector sequences to identify valid voice commands. For example, the stored vector sequences may be generated during a training phase when each of the household member(s) 160 issues voice commands that are recorded and processed. The stored vector sequences may be stored in a memory (e.g., the main memory 1030 and/or the mass storage device 1080 of
Preferably, the voice recognition algorithm is speaker-dependent and uses a relatively small set of particular voice commands. This contrasts with commercially-available speech recognizers that are speaker-independent and use relatively large vocabulary sets. Because of this difference, the audience metering device 140 may be implemented with a much lower-power processor than the processor required by the commercially-available speech recognizers.
In one manner of operating the audience metering device 140 with voice commands, consider an example in which the audio input signal 260 is sampled at a sampling rate of 16 kHz (persons of ordinary skill in the art will appreciate that other sampling rates such as 8 kHz may alternatively be used). In general, the television program audio signal(s) received by the audio input device 210 are delayed relative to the television line audio signal 270 because of the propagation delay of sound waves emanating from the speakers of the television 120 and arriving at the audio input device 210. Further, multiple sound wave paths may exist because of reflections from walls and other objects in the viewing area 150. Also, the acoustic wave associated with the television program audio signals is attenuated in amplitude within its path to the audio input device 210.
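As a rough illustration of the propagation delay discussed above, the following sketch converts a speaker-to-microphone distance into a delay in sample counts at the example 16 kHz sampling rate. The speed of sound (approximately 343 m/s in air at room temperature) and the function names are assumptions introduced for illustration only and do not appear in the disclosure.

```python
# Illustrative arithmetic only; names and the speed-of-sound value are assumptions.
SPEED_OF_SOUND_M_S = 343.0   # approximate value in air at about 20 degrees C
SAMPLE_RATE_HZ = 16000       # example sampling rate from the text


def delay_in_samples(distance_m, sample_rate_hz=SAMPLE_RATE_HZ):
    """Direct-path acoustic delay, rounded to the nearest whole sample."""
    return round(distance_m / SPEED_OF_SOUND_M_S * sample_rate_hz)


def max_distance_m(num_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Longest acoustic path length that fits within num_samples of delay."""
    return num_samples / sample_rate_hz * SPEED_OF_SOUND_M_S


print(delay_in_samples(3.0))   # 140 samples for a 3 m direct path
print(max_distance_m(400))     # path length covered by 400 samples (25 ms)
```

Under these assumptions, a 400-sample (25 ms) span corresponds to an acoustic path of roughly 8.6 meters, which may accommodate the direct path and some reflected paths in a typical viewing area.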
To reduce the differences between the television line audio signal 270 and the audio signal 260 received by the audio input device 210, the television audio subtractor 230 may include a difference detector 310 and a finite impulse response (FIR) filter 320 having adaptive weights to delay and attenuate the television line audio signal 270 in accordance with the condition in the viewing area 150. An example television audio subtractor 230 is shown in greater detail in
In the example of
where Wm, m=0,1, . . . M−1 are filter weights 340 with initial values set to 0. The signal Xd is defined as the current audio input sample 260 from the audio input device 210. The filter 320 is configured to output XT≈Xd. In the illustrated example, the weight adjustor 350 adjusts the filter weights 340 to new values based on the error signal Xe(n)=Xd(n)−XT(n). In particular, the new values of the filter weights 340 are represented by the equation Wm(n+1)=Wm(n)+μXe(n)Xm(n) where the index n is an iteration index denoting the time in sample counts at which the modification is made and μ is a learning factor usually set to a low value such as 0.05. Persons of ordinary skill in the art will readily recognize that this filter gradually minimizes the mean squared error in the manner of a least mean squares (LMS) adaptive filter. In fact, the error signal Xe is the desired signal because the error signal Xe contains the one or more voice commands from the household member(s) 160. The difference detector 310 generates the error signal Xe based on the output of the filter 320 XT and the current audio input sample Xd.
In a practical implementation using a 16 kHz sampling rate, for example, the filter weights 340 include W0 through WM−1 where M=400. A maximum time delay of 25 milliseconds exists between the television line audio signals 270 and the audio input signal 260 received by the audio input device 210 after propagation delays. In less than a second, the filter weights 340 adapt themselves to relatively stationary values and the error signal Xe contains virtually no television program audio signals. Accordingly, the MFCC vectors are extracted from the sequence of samples s(n)=Xe(n) (i.e., from the difference between the audio input signal 260 and the weighted television line audio signal 270). These vectors can then be compared with the MFCC vectors of stored voice commands to identify a voice command in the audio input signal 260 (if any).
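The adaptive filtering described above may be sketched as follows. This is a minimal illustration rather than the disclosed implementation: it assumes that Xm(n) denotes the television line audio sample delayed by m sample periods and that the filter output XT(n) is the weighted sum of those delayed samples. The function name lms_subtract and its parameters are hypothetical.

```python
def lms_subtract(line_audio, mic_audio, num_taps=400, mu=0.05):
    """Return (error_signal, weights) after LMS filtering.

    line_audio: samples of the television line audio signal.
    mic_audio:  samples of the audio input signal (Xd).
    Assumption: Xm(n) is the line sample delayed by m sample periods.
    """
    weights = [0.0] * num_taps    # Wm, initial values set to 0
    delayed = [0.0] * num_taps    # Xm(n), newest sample at index 0
    error_signal = []
    for x_line, x_d in zip(line_audio, mic_audio):
        # Shift the newest line sample into the delay line.
        delayed.insert(0, x_line)
        delayed.pop()
        # Filter output XT(n): weighted sum of delayed line samples.
        x_t = sum(w * x for w, x in zip(weights, delayed))
        # Error signal Xe(n) = Xd(n) - XT(n); this residual carries the voice.
        x_e = x_d - x_t
        # Weight update Wm(n+1) = Wm(n) + mu * Xe(n) * Xm(n).
        for m in range(num_taps):
            weights[m] += mu * x_e * delayed[m]
        error_signal.append(x_e)
    return error_signal, weights
```

In this sketch, a microphone signal that is merely a delayed and attenuated copy of the line audio drives the error signal toward zero as the weights settle, so that any remaining content in the error signal may be attributed to other sources such as a spoken command. Whether a given learning factor is stable depends on the input signal power and the number of taps, so the signals are assumed here to be scaled to a modest amplitude.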
To compare the extracted MFCC vectors to the stored vectors, an audio buffer consisting of 400 samples (25 ms duration) sk, k=0,1, . . . 399 is processed as shown by the flow diagram 400 of
The FFT spectrum of the 512-sample block is
for u=0,1, . . . 511. Persons of ordinary skill in the art will readily recognize that the MFCC coefficients are computed from 24 log spectral energy values Ec, c=0,1, . . . 23 obtained by grouping the FFT spectrum into a set of overlapping mel filter frequency bands:
where bclow and bchigh are the lower and upper bounds of the mel frequency band c (block 460). The 24 log spectral energy values are transformed by a Discrete Cosine Transform (DCT) to yield 23 coefficients:
for k=1 through 23 and N=24 is the number of filter outputs (block 470). Of these 23 coefficients, the first twelve coefficients are usually retained as the MFCC elements because the first twelve coefficients represent the slowly varying spectral envelope corresponding to the vocal tract resonances. The coefficient C0, which represents the total energy in the block, may be calculated separately as,
and included as the thirteenth element of the MFCC feature vectors (block 480).
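The feature extraction described above may be sketched as follows. The sketch makes several assumptions that are not confirmed by the text: the 400-sample buffer is zero-padded to 512 samples, the mel bands are rectangular and overlap their neighbors, the DCT has the common DCT-II form, and C0 is the logarithm of the sum of squared samples. All function names are hypothetical, and a direct DFT is used in place of an FFT purely for brevity.

```python
import cmath
import math

SAMPLE_RATE = 16000   # example sampling rate from the text
FFT_SIZE = 512        # block size from the text
NUM_BANDS = 24        # log spectral energy values Ec
NUM_KEPT = 12         # DCT coefficients retained


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(block):
    """Power spectrum for bins 0..N/2 via a direct DFT (an FFT gives the same values)."""
    n = len(block)
    twiddle = [cmath.exp(-2j * math.pi * j / n) for j in range(n)]
    return [abs(sum(block[k] * twiddle[(u * k) % n] for k in range(n))) ** 2
            for u in range(n // 2 + 1)]


def mel_band_edges():
    """FFT bin indices of NUM_BANDS + 2 points equally spaced on the mel scale."""
    top = hz_to_mel(SAMPLE_RATE / 2)
    return [int(round(mel_to_hz(top * i / (NUM_BANDS + 1)) * FFT_SIZE / SAMPLE_RATE))
            for i in range(NUM_BANDS + 2)]


def mfcc_vector(samples):
    """Return a 13-element vector: 12 DCT coefficients followed by C0."""
    block = list(samples) + [0.0] * (FFT_SIZE - len(samples))  # assumed zero-padding
    spectrum = power_spectrum(block)
    edges = mel_band_edges()
    log_energy = []
    for c in range(NUM_BANDS):
        low, high = edges[c], edges[c + 2]   # band c overlaps bands c-1 and c+1
        log_energy.append(math.log(sum(spectrum[low:high + 1]) + 1e-12))
    # Assumed DCT-II form; only k = 1..12 of the 23 coefficients are kept.
    coeffs = [sum(e * math.cos(math.pi * k * (c + 0.5) / NUM_BANDS)
                  for c, e in enumerate(log_energy))
              for k in range(1, NUM_KEPT + 1)]
    c0 = math.log(sum(s * s for s in samples) + 1e-12)  # assumed total-energy term
    return coeffs + [c0]
```

One consequence of this form is that a change in overall gain shifts every log energy by the same constant, which the DCT coefficients for k≥1 ignore, so that in this sketch loudness is carried only by the separately computed C0 element.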
Prior to operating the audience metering device 140 with voice commands, the audience metering device 140 captures a set of voice commands from each of the household member(s) 160 as data files during a learning/training phase. The voice commands are edited so that each voice command contains the same number of samples. For example, a suitable value is 8000 samples with a duration of 500 ms. When analyzed as 10 ms segments, each voice command yields a sequence of 50 MFCC feature vectors. These MFCC feature vectors are stored as references in the matcher 250 for use during the operating phase of the audience metering device 140.
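The segment arithmetic above may be checked with a short sketch. At a 16 kHz sampling rate, a 10 ms segment contains 160 samples, so an 8000-sample voice command yields 50 segments. The function name split_into_segments is hypothetical, and treating the segments as consecutive and non-overlapping is an assumption made for illustration.

```python
SAMPLE_RATE_HZ = 16000
SEGMENT_MS = 10
SEGMENT_LEN = SAMPLE_RATE_HZ * SEGMENT_MS // 1000   # 160 samples


def split_into_segments(samples, segment_len=SEGMENT_LEN):
    """Split samples into consecutive segments, dropping any partial tail."""
    count = len(samples) // segment_len
    return [samples[i * segment_len:(i + 1) * segment_len] for i in range(count)]


command = [0.0] * 8000                  # placeholder for a 500 ms voice command
segments = split_into_segments(command)
print(len(segments), len(segments[0]))  # prints "50 160"
```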
When the audio input signal 260 is received at the audio input device 210 in either the learning/training phase or the operating phase, the audio input signal 260 is sampled at 16 kHz and 160-sample segments are used to generate a sequence of MFCC vectors using, for example, the process explained above in connection with
To identify a voice command, an example matching process 500 of
Returning to block 530, if the current dot product score is greater than the stored dot product score, the matcher 250 may replace the stored dot product score with the current dot product score as the highest dot product score (block 550). Further, the matcher 250 may determine if the current dot product score exceeds a predetermined threshold (which may be pre-set at, for example, 0.5) (block 560). If the current dot product score is less than or equal to the threshold, the matcher 250 proceeds to block 540 to determine whether there are other reference sequences to compare to the current sequence of MFCC vectors as described above. In particular, the matcher 250 may return to block 520 if there are other reference sequences to compare to the current sequence of MFCC vectors or the matcher 250 may terminate the process 500 if there is no additional reference sequence. Otherwise if the current dot product score exceeds the threshold (block 560), the voice command is recognized, and the audience metering device 140 issues an LED prompt and/or any other suitable type of indicator to the household member(s) 160 acknowledging the voice command (block 570).
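The comparison loop described above may be sketched as follows. Because the definition of the dot product score is not reproduced here, the sketch assumes a normalized dot product (cosine similarity) computed over the flattened sequences of MFCC vectors, which keeps the score comparable to a threshold such as 0.5. The function names and the dictionary of reference sequences are hypothetical.

```python
import math

THRESHOLD = 0.5   # example threshold value from the text


def normalized_dot(seq_a, seq_b):
    """Assumed score: cosine similarity of two flattened vector sequences."""
    a = [x for vec in seq_a for x in vec]
    b = [x for vec in seq_b for x in vec]
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def match_command(current, references, threshold=THRESHOLD):
    """Return (command_name, score) on recognition, else (None, highest_score)."""
    highest = float("-inf")   # assumed initial value of the stored score
    for name, reference in references.items():
        score = normalized_dot(current, reference)
        if score > highest:
            highest = score            # replace the stored score
            if score > threshold:
                return name, score     # command recognized; acknowledge it
    return None, highest               # no reference sequence matched
```

For example, with one reference sequence stored per command, match_command returns the name of the first reference whose score exceeds the threshold, at which point the device could issue its acknowledgment prompt.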
A flow diagram 600 representing machine accessible instructions that may be executed by a processor to operate an audience metering device with voice commands is illustrated in
In the example of
The processor system 1000 illustrated in
As is conventional, the memory controller 1012 performs functions that enable the processor 1020 to access and communicate with a main memory 1030 including a volatile memory 1032 and a non-volatile memory 1034 via a bus 1040. The volatile memory 1032 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM), and/or any other type of random access memory device. The non-volatile memory 1034 may be implemented using flash memory, Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), and/or any other desired type of memory device.
The processor system 1000 also includes an interface circuit 1050 that is coupled to the bus 1040. The interface circuit 1050 may be implemented using any type of well known interface standard such as an Ethernet interface, a universal serial bus (USB), a third generation input/output (3GIO) interface, and/or any other suitable type of interface.
One or more input devices 1060 are connected to the interface circuit 1050. The input device(s) 1060 permit a user to enter data and commands into the processor 1020. For example, the input device(s) 1060 may be implemented by a keyboard, a mouse, a touch-sensitive display, a track pad, a track ball, an isopoint, and/or a voice recognition system.
One or more output devices 1070 are also connected to the interface circuit 1050. For example, the output device(s) 1070 may be implemented by display devices (e.g., a light emitting diode (LED) display, a liquid crystal display (LCD), a cathode ray tube (CRT) display), a printer, and/or speakers. The interface circuit 1050, thus, typically includes, among other things, a graphics driver card.
The processor system 1000 also includes one or more mass storage devices 1080 configured to store software and data. Examples of such mass storage device(s) 1080 include floppy disks and drives, hard disk drives, compact disks and drives, and digital versatile disks (DVD) and drives.
The interface circuit 1050 also includes a communication device such as a modem or a network interface card to facilitate exchange of data with external computers via a network. The communication link between the processor system 1000 and the network may be any type of network connection such as an Ethernet connection, a digital subscriber line (DSL), a telephone line, a cellular telephone system, a coaxial cable, etc.
Access to the input device(s) 1060, the output device(s) 1070, the mass storage device(s) 1080 and/or the network is typically controlled by the I/O controller 1014 in a conventional manner. In particular, the I/O controller 1014 performs functions that enable the processor 1020 to communicate with the input device(s) 1060, the output device(s) 1070, the mass storage device(s) 1080 and/or the network via the bus 1040 and the interface circuit 1050.
While the components shown in
Although certain example methods, apparatus, and articles of manufacture have been described herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus, and articles of manufacture fairly falling within the scope of the appended claims either literally or under the doctrine of equivalents.
This patent arises from a continuation of PCT Application Ser. No. PCT/US2004/028171, filed Aug. 30, 2004, which is incorporated herein by reference and which claims priority from U.S. Provisional Application Ser. No. 60/503,737, filed Sep. 17, 2003.
Number | Date | Country |
---|---|---|
20060203105 A1 | Sep 2006 | US |
Number | Date | Country |
---|---|---|
60503737 | Sep 2003 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/US2004/028171 | Aug 2004 | US |
Child | 11375648 | | US |