Various technologies are known for measuring media exposure, where an audio component of the media is processed to either (a) extract code that is embedded in the audio, and/or (b) process the audio itself to extract features and form an audio signature or fingerprint. Exemplary techniques are known and described in U.S. Pat. No. 5,436,653 to Ellis et al., titled “Method and System for Recognition of Broadcast Segments,” U.S. Pat. No. 5,574,962 to Fardeau et al., titled “Method and Apparatus for Automatically Identifying a Program Including a Sound Signal,” U.S. Pat. No. 5,450,490 to Jensen et al., titled “Apparatus and Methods for Including Codes in Audio Signals and Decoding,” U.S. Pat. No. 6,871,180, titled “Decoding of Information in Audio Signals,” U.S. Pat. No. 7,222,071 to Neuhauser et al., titled “Audio Data Receipt/Exposure Measurement with Code Monitoring and Signature Extraction,” and U.S. Pat. No. 7,623,823 to Zito et al. titled “Detecting and Measuring Exposure to Media Content Items.” Each of these references is incorporated by reference in its entirety herein.
Obviously, one of the most important aspects of audience measurement in this field of technology is the processing of the audio to insert and detect codes and/or form and detect audio signatures. For audio codes, it is important to ensure that the codes are capable of being inserted into audio with minimal interference with the audio itself (steganographic encoding), while at the same time having sufficient robustness to be easily detected during the decoding process. For audio signatures, it is important to process the audio so that salient features of the audio may be properly extracted to form an audio signature that effectively identifies the underlying audio.
In addition to audio processing, other aspects must be considered as well; for audience measurement involving many devices over a given area, time processing becomes an important consideration. Typically, devices are equipped with a real-time clock, which may be adjusted using technologies such as a time server and/or Network Time Protocol (NTP). Using techniques such as Cristian's Algorithm, a time server keeps a reference time (e.g., Coordinated Universal Time, or “UTC”), and a device (or client) asks the server for a time. The server responds with its current time, and the client uses the received value T to set its clock. Using techniques such as the Berkeley Algorithm, an elected “master” may be used to synchronize clients without the presence of a time server. The elected master broadcasts time to all requesting devices, adjusts times received for “round-trip delay time” (RTT) and latency, averages times, and tells each machine how to adjust. In certain cases, multiple masters may be used.
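By way of illustration only, the client side of Cristian's Algorithm may be sketched as follows. The sketch assumes the common refinement in which the client adds half of the measured round-trip delay to the received value T before setting its clock. The callable `request_server_time` and the injectable `local_clock` are hypothetical names introduced for this example and do not appear in the disclosure.

```python
import time


def cristian_offset(request_server_time, local_clock=time.monotonic):
    """Estimate how far the local clock is from the server clock.

    request_server_time: hypothetical callable returning the server time T (seconds).
    local_clock: callable returning the local clock reading (seconds).
    Returns (offset, rtt); adding offset to the local clock approximates server time.
    """
    t0 = local_clock()                  # local time when the request is sent
    server_time = request_server_time()  # value T reported by the server
    t1 = local_clock()                  # local time when the reply arrives
    rtt = t1 - t0
    # Assumption: the reply spent half of the round trip in transit.
    estimated_now = server_time + rtt / 2.0
    return estimated_now - t1, rtt
```

A device would typically repeat the exchange several times and keep the estimate obtained with the smallest round-trip delay, since that sample bounds the transit error most tightly.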
For NTP, a network of time servers may be used to synchronize all processes on a network. Time servers are connected via a synchronization subnet tree. The “root” of the tree may directly receive UTC information and forward it to other nodes, where each node synchronizes its time with its children nodes. An NTP subnet operates with a hierarchy of levels, or “strata.” Each level of this hierarchy is assigned a layer number starting with 0 (zero) at the top. The stratum level defines its distance from the reference clock and exists to prevent cyclical dependencies in the hierarchy. Stratum 0 devices exist at the top level and include devices such as atomic (caesium, rubidium) clocks, GPS clocks or other radio clocks. Stratum 1 devices include computers that are attached to Stratum 0 devices. Normally they act as servers for timing requests from Stratum 2 servers via NTP. These computers are also referred to as time servers. Stratum 2 devices include computers that send NTP requests to Stratum 1 servers. Normally a Stratum 2 computer will reference a number of Stratum 1 servers and use the NTP algorithm to gather the best data sample. Stratum 2 computers will peer with other Stratum 2 computers to provide more stable and robust time for all devices in the peer group. Stratum 2 computers normally act as servers for Stratum 3 NTP requests. Stratum 3 devices may employ the same NTP functions of peering and data sampling as Stratum 2, and can themselves act as servers for lower strata. Further strata (up to 256) may be used as needed for additional peering and data sampling. The architecture and operation of various NTP arrangements, along with more comprehensive descriptions, may be found at http://www.ntp.org/.
To date, time processing for audio audience measurement has not been sufficiently utilized to provide accurate time measurements and synchronization for audio codes and/or audio signatures. Systems, devices and techniques are needed to ensure time-based data relating to detected codes and/or captured signatures is accurate for proper content identification. Additionally, there are instances where encoding devices and other devices are unwilling or incapable of directly connecting to time-correcting and time-synchronization devices. A configuration is needed to provide additional ways in which encoders and other devices may accurately keep and synchronize time data when monitoring audio.
In one embodiment, a method is disclosed for synchronizing a processing device, comprising the steps of receiving an audio signal in the processing device; producing first time data in the processing device; receiving second time data via a coupling interface on the processing device; processing the second time data in the processing device to establish if the second time data is a predetermined type; processing the audio signal in the device in order to generate at least one identifiable characteristic relating to the audio; associating the second time data with the identifiable characteristics if the predetermined type is established; and transmitting the identifiable characteristics together with the associated second time data.
In another embodiment, a processing device is disclosed, comprising an audio interface for receiving an audio signal in the processing device; a processor coupled to the audio interface; a timing device for producing first time data in the processing device; a coupling interface for receiving second time data, wherein the processor (i) processes the second time data to establish if the second time data is a predetermined type, (ii) processes the audio signal to generate at least one identifiable characteristic relating to the audio, and (iii) associates the second time data with the identifiable characteristics if the predetermined type is established; and an output for transmitting the identifiable characteristics together with the associated second time data.
In yet another embodiment, a system is disclosed comprising a portable device comprising a data interface for receiving first time data; a processing device, comprising an audio interface for receiving an audio signal in the processing device; a processor coupled to the audio interface; a timing device for producing second time data in the processing device; a coupling interface for receiving the first time data from the portable device, wherein the processor (i) processes the first time data to establish if the first time data is a predetermined type, (ii) processes the audio signal to generate at least one identifiable characteristic relating to the audio, and (iii) associates the second time data with the identifiable characteristics if the predetermined type is established; and an output for transmitting the identifiable characteristics together with the associated second time data.
Embodiments of the present invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
Turning to
Since the ability of the audio signal to mask the code components when they are reproduced as sound depends on the reproduced audio signal's energy content as it varies with frequency as well as over time, the encoder may analyze the audio signal repeatedly over time by producing data representing its frequency spectrum for a time period extending for only a small fraction of a second. This analysis is performed by a digital signal processor of the encoder, a microcomputer specially programmed to perform the analysis using a fast Fourier transform (FFT) that converts digital data representing the audio signal as it varies over time within such brief time period to digital data representing the energy content of the audio signal within that time period as it varies with frequency. This audio signal energy spectrum extends from approximately 1 kHz to 3 kHz and includes separate energy values of the audio signal within hundreds of distinct frequency intervals or “bins”, each only several Hz wide.
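By way of illustration only, the spectral analysis described above may be sketched as follows. For brevity the sketch evaluates a direct discrete Fourier transform only at the bins of interest, standing in for the FFT that an encoder would actually run. The function name and its parameters are assumptions made for this example.

```python
import cmath
import math


def band_energy_spectrum(samples, sample_rate, f_lo=1000.0, f_hi=3000.0):
    """Return (frequency_hz, energy) pairs for the DFT bins between f_lo and f_hi.

    samples: one brief frame of time-domain audio samples.
    A direct DFT is used here as a stand-in for an FFT; both yield the same bins.
    """
    n = len(samples)
    bin_width = sample_rate / n                       # width of each bin in Hz
    k_lo = math.ceil(f_lo / bin_width)                # first bin at or above f_lo
    k_hi = min(math.floor(f_hi / bin_width), n // 2)  # last bin at or below f_hi
    spectrum = []
    for k in range(k_lo, k_hi + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        spectrum.append((k * bin_width, abs(coeff) ** 2))
    return spectrum
```

Bins only several Hz wide require a frame spanning a correspondingly long stretch of samples, since the bin width equals the sample rate divided by the frame length.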
Multiple overlapping messages may be inserted into the audio signal so that all messages are present simultaneously; each such message is regarded as a distinct message “layer.” A first layer may carry a message encoding the identity of the broadcaster, multi-caster, cablecaster, etc., as well as time code data. A second layer may carry a message encoding a network or content provider that distributes the program. This layer may also include a time code. A third layer may encode a program identification, but does not necessarily require a time code. The third layer message is particularly useful for identifying content such as commercials, public service announcements and other broadcast segments having a short duration, such as fifteen or thirty seconds.
During encoding, an encoder may evaluate the ability of the audio signal to mask the code components of each message symbol using tonal masking and/or narrow band masking. Each of the two evaluations indicates a highest energy level for each code component that will be masked according to the tonal masking effect or the narrow band masking effect, as the case may be. The encoder assigns an energy level to each code component that is equal to the sum of the two highest energy levels that are masked according to the tonal masking effect and the narrow band masking effect. The masking abilities of the audio signal based both on the tonal masking effect and on the narrow band masking effect are separately determined for each code component cluster. More specifically, for each cluster, a group of sequential frequency bins of the audio signal that fall within a frequency band including the frequency bins of the cluster are used in the masking evaluation for that cluster. Each such group may be several hundred Hz wide. Accordingly, a different group of audio signal frequency bins is used in evaluating the masking ability of the audio signal for each cluster (although the groups may overlap). Additional configurations and other details regarding encoding and decoding processes may be found in U.S. Pat. No. 5,574,962, U.S. Pat. No. 5,450,490, and U.S. Pat. No. 6,871,180 referenced above. It is understood by those skilled in the art that other encoding techniques incorporating time data are equally applicable and are contemplated by the present disclosure.
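By way of illustration only, the per-cluster bin selection and the energy assignment described above may be sketched as follows. The tonal and narrow band masking evaluations themselves are not reproduced; their results are taken as inputs. The function names and the `margin_hz` parameter are assumptions made for this example.

```python
def masking_group(bin_freqs, cluster_freqs, margin_hz):
    """Indexes of the sequential audio bins used to evaluate masking for one cluster.

    The group covers a band that includes the cluster's own bins, widened by
    margin_hz on each side (an assumed way of choosing the band).
    """
    lo = min(cluster_freqs) - margin_hz
    hi = max(cluster_freqs) + margin_hz
    return [i for i, f in enumerate(bin_freqs) if lo <= f <= hi]


def assign_code_energies(tonal_limits, narrow_band_limits):
    """Energy per code component: sum of the two highest masked levels.

    tonal_limits[i] and narrow_band_limits[i] are the highest energy levels for
    component i that remain masked under each effect.
    """
    if len(tonal_limits) != len(narrow_band_limits):
        raise ValueError("one limit of each kind is required per code component")
    return [t + n for t, n in zip(tonal_limits, narrow_band_limits)]
```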
Continuing with
Device 110 comprises a computer processing device such as a smart phone, a Personal People Meter™, a laptop, a personal computer, a tablet, and the like. Device 110 is configured to communicate, in a wired or wireless manner, with any of coupling interfaces 115-117 of encoder 111. Device 110 is also communicatively coupled to time service network 130 that comprises one or more servers 132 and/or cell towers 131. Network 130 is configured to provide a source of time data that may be used as a primary synchronization point for encoder 111, encoders 119-121 or other hardware 122. Server 132 may include one or more time servers, NTP servers, GPS, and the like.
Clock 113 of encoder 111 (as well as clocks of encoders 119-121 and/or other hardware 122) produces a timer that generates an interrupt H times per second. Denoting the value of the encoder clock by Ce(t), where t is accurate (UTC) time, an ideal clock for each encoder satisfies Ce(t)=t, or, in other words, dCe/dt=1. As the encoder physical clock will not always interrupt exactly H times per second, a drift will inevitably be introduced. Turning briefly to
When processes x and y are performed on an encoder, cx(t) and cy(t) may be used to designate the reading of the clock at each process (x, y) when the real time is t. In this case, the skew may be defined as s(t)=cx(t)−cy(t). Turning to
In the embodiment of
Time synchronization between device 100 and encoder 111 may be arranged to be symmetric so that device 100 synchronizes with encoder 111 and vice versa. Such an arrangement is desirable if time source 118 and network 130 are operating within one or more common NTP networks. If one time source is known to be accurate, then it is placed in a different stratum. For example, if time source 118 is known to be more accurate than time from network 130 (e.g., it is a known reference clock), it would communicate as part of stratum 1 through encoder 111. Thus, as device 110 establishes communication with encoder 111 via interface 112, encoder 111 would not synchronize with device 110, and may further provide time data for updating synchronization for network 130. Such communication can take place with a Berkeley Algorithm using a time daemon. A time daemon would poll each device periodically to ask what times are being registered. As each device responds, the time daemon may then compute an average time and tell each respective device to slow down or speed up.
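By way of illustration only, the averaging step performed by the time daemon may be sketched as follows. The sketch assumes that the reported readings have already been corrected for round-trip delay and that the daemon's own clock takes part in the average. The function name and the "daemon" key are hypothetical.

```python
def berkeley_adjustments(daemon_time, reported_times):
    """Compute the average time and the correction each participant should apply.

    daemon_time: the time daemon's own clock reading.
    reported_times: dict mapping a device id to its reported clock reading
                    (assumed already corrected for round-trip delay).
    Returns (average, adjustments); a positive adjustment means "speed up",
    a negative one means "slow down".
    """
    all_times = [daemon_time] + list(reported_times.values())
    average = sum(all_times) / len(all_times)
    adjustments = {dev: average - t for dev, t in reported_times.items()}
    adjustments["daemon"] = average - daemon_time  # hypothetical key for the daemon itself
    return average, adjustments
```

For example, with readings expressed in minutes, a daemon at 180 polling devices at 205 and 170 obtains an average of 185 and tells the devices to adjust by -20 and +15, while correcting its own clock by +5.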
In the embodiment of
Under a preferred embodiment, devices connecting with network 130 are enabled with access control to determine who communicates with whom, at what level of service, and/or who will trust whom when advertising time synchronization services. Time synchronization clients (e.g., encoder 111) may be configured to accept some or all of the services from one or more devices or servers or, conversely, to access only select services on a specific device, server or group. Filtering mechanisms for time synchronization access control include IP addresses, the type of synchronization service being offered or requested, and the direction of communication. For example, access control could allow an encoder to send time requests but not time synchronization control query requests. Alternately, access control could allow the sending of control query requests without allowing the requestor to synchronize its time with the time source to which the query requests are being sent. The level of granularity in access control may vary as a function of the type of device in which time synchronization is being implemented.
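By way of illustration only, access control filtered by IP address, service type and direction of communication may be sketched as follows. The rule fields, the service labels "time_request" and "control_query", and the default-deny policy are assumptions made for this example.

```python
import ipaddress
from collections import namedtuple

# Hypothetical rule layout: network in CIDR form, service label, direction, verdict.
Rule = namedtuple("Rule", ["network", "service", "direction", "allow"])


def is_allowed(rules, peer_ip, service, direction):
    """Return the verdict of the first rule matching all three filters.

    Unmatched traffic is denied (an assumed default policy).
    """
    addr = ipaddress.ip_address(peer_ip)
    for rule in rules:
        if (addr in ipaddress.ip_network(rule.network)
                and rule.service == service
                and rule.direction == direction):
            return rule.allow
    return False
```

A rule set that permits outbound "time_request" messages to a given network while denying outbound "control_query" messages to the same network corresponds to the first example given above.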
Cryptographic authentication may also be used as a security mechanism for enforcing time synchronization data integrity and to authenticate time synchronization messages and resulting associations. Here, symmetric (private) key cryptography may be used to produce a one-way hash that can be used to verify the identity of a device/peer in a time synchronization network. Under one embodiment, communicating devices may be configured with the same key and key identifier, which, for example, could include a 128-bit key and a 32-bit key identifier. On systems where each participating device is under the direct physical control of an administrator, key distribution could be manual and/or use an Autokey protocol. Keys may also be distributed via asymmetric (public) key cryptography where a public key is used to encrypt a time synchronization message, and only a private key can be used to decrypt it (and vice versa).
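By way of illustration only, authentication of a time synchronization message with a shared 128-bit key and a 32-bit key identifier may be sketched as follows. HMAC-SHA256 is used here as the keyed one-way hash; this is a choice made for the example and is not the digest format of any particular time protocol. The packet layout and function names are likewise assumptions.

```python
import hashlib
import hmac
import struct

TRAILER_LEN = 4 + 32  # 32-bit key identifier followed by a SHA-256 digest


def sign_message(key_id, key, payload):
    """Append the key identifier and a keyed digest of the payload."""
    digest = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + struct.pack("!I", key_id) + digest


def verify_message(keys, packet):
    """Return the payload if the digest checks out under a known key, else None.

    keys: dict mapping a 32-bit key identifier to its shared key bytes.
    """
    if len(packet) < TRAILER_LEN:
        return None
    payload, trailer = packet[:-TRAILER_LEN], packet[-TRAILER_LEN:]
    key_id = struct.unpack("!I", trailer[:4])[0]
    key = keys.get(key_id)
    if key is None:
        return None  # sender used a key this device does not share
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    return payload if hmac.compare_digest(expected, trailer[4:]) else None
```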
Autokey protocol is a mechanism used to counter attempts to tamper with accurate and synchronized timekeeping. Autokey may be based on the Public Key Infrastructure (PKI) algorithms from within an OpenSSL library. Autokey relies on the PKI to generate a timestamped digital signature to “sign” a session key. The Autokey protocol may be configured to correspond to different time synchronization modes including broadcast, server/client and symmetric active/passive. Depending on the type of synchronization mode used, Autokey operations may be configured to (1) detect packet modifications via keyed message digests, (2) identify and verify a source via digital signatures, and (3) decipher cookie encryption.
Turning to
As time-domain encoding tends to be less robust, frequency-based encoding may be used for inserting code, as shown in
Various techniques may be used to encode audio for the purposes of monitoring media exposure. For example, television viewing or radio listening habits, including exposure to commercials therein, are monitored utilizing a variety of techniques. In certain techniques, acoustic energy to which an individual is exposed is monitored to produce data which identifies or characterizes a program, song, station, channel, commercial, etc. that is being watched or listened to by the individual. Where audio media includes ancillary codes that provide such information, suitable decoding techniques are employed to detect the encoded information, such as those disclosed in U.S. Pat. No. 5,450,490 and No. 5,764,763 to Jensen, et al., U.S. Pat. No. 5,579,124 to Aijala, et al., U.S. Pat. Nos. 5,574,962, 5,581,800 and 5,787,334 to Fardeau, et al., U.S. Pat. No. 6,871,180 to Neuhauser, et al., U.S. Pat. No. 6,862,355 to Kolessar, et al., U.S. Pat. No. 6,845,360 to Jensen, et al., U.S. Pat. No. 5,319,735 to Preuss et al., U.S. Pat. No. 5,687,191 to Lee, et al., U.S. Pat. No. 6,175,627 to Petrovich et al., U.S. Pat. No. 5,828,325 to Wolosewicz et al., U.S. Pat. No. 6,154,484 to Lee et al., U.S. Pat. No. 5,945,932 to Smith et al., US 2001/0053190 to Srinivasan, US 2003/0110485 to Lu, et al., U.S. Pat. No. 5,737,025 to Dougherty, et al., US 2004/0170381 to Srinivasan, and WO 06/14362 to Srinivasan, et al., all of which hereby are incorporated by reference herein.
Examples of techniques for encoding ancillary codes in audio, and for reading such codes, are provided in Bender, et al., “Techniques for Data Hiding”, IBM Systems Journal, Vol. 35, Nos. 3 & 4, 1996, which is incorporated herein in its entirety. Bender, et al. disclose a technique for encoding audio termed “phase encoding” in which segments of the audio are transformed to the frequency domain, for example, by a discrete Fourier transform (DFT), so that phase data is produced for each segment. Then the phase data is modified to encode a code symbol, such as one bit. Processing of the phase encoded audio to read the code is carried out by synchronizing with the data sequence, and detecting the phase encoded data using the known values of the segment length, the DFT points and the data interval. Bender, et al. also describe spread spectrum encoding and decoding, of which multiple embodiments are disclosed in the above-cited Aijala, et al. U.S. Pat. No. 5,579,124. Still another audio encoding and decoding technique described by Bender, et al. is echo data hiding in which data is embedded in a host audio signal by introducing an echo. Symbol states are represented by the values of the echo delays, and they are read by any appropriate processing that serves to evaluate the lengths and/or presence of the encoded delays.
A further technique, or category of techniques, termed “amplitude modulation” is described in R. Walker, “Audio Watermarking”, BBC Research and Development, 2004. In this category fall techniques that modify the envelope of the audio signal, for example by notching or otherwise modifying brief portions of the signal, or by subjecting the envelope to longer term modifications. Processing the audio to read the code can be achieved by detecting the transitions representing a notch or other modifications, or by accumulation or integration over a time period comparable to the duration of an encoded symbol, or by another suitable technique.
Another category of techniques identified by Walker involves transforming the audio from the time domain to some transform domain, such as a frequency domain, and then encoding by adding data or otherwise modifying the transformed audio. The domain transformation can be carried out by a Fourier, DCT, Hadamard, Wavelet or other transformation, or by digital or analog filtering. Encoding can be achieved by adding a modulated carrier or other data (such as noise, noise-like data or other symbols in the transform domain) or by modifying the transformed audio, such as by notching or altering one or more frequency bands, bins or combinations of bins, or by combining these methods. Still other related techniques modify the frequency distribution of the audio data in the transform domain to encode. Psychoacoustic masking can be employed to render the codes inaudible or to reduce their prominence. Processing to read ancillary codes in audio data encoded by techniques within this category typically involves transforming the encoded audio to the transform domain and detecting the additions or other modifications representing the codes.
A still further category of techniques identified by Walker involves modifying audio data encoded for compression (whether lossy or lossless) or other purpose, such as audio data encoded in an MP3 format or other MPEG audio format, AC-3, DTS, ATRAC, WMA, RealAudio, Ogg Vorbis, APT X100, FLAC, Shorten, Monkey's Audio, or other. Encoding involves modifications to the encoded audio data, such as modifications to coding coefficients and/or to predefined decision thresholds. Processing the audio to read the code is carried out by detecting such modifications using knowledge of predefined audio encoding parameters.
It will be appreciated that various known encoding techniques may be employed, either alone or in combination with the above-described techniques. Such known encoding techniques include, but are not limited to FSK, PSK (such as BPSK), amplitude modulation, frequency modulation and phase modulation. By using the aforementioned time synchronization techniques, audio encoders may provide improved time data which in turn produces more accurate results.
In some cases a signature is extracted from transduced media data for identification by matching with reference signatures of known media data. Suitable techniques for this purpose include those disclosed in U.S. Pat. No. 5,612,729 to Ellis, et al. and in U.S. Pat. No. 4,739,398 to Thomas, et al., each of which is assigned to the assignee of the present application and both of which are incorporated herein by reference in their entireties.
Still other suitable techniques are the subject of U.S. Pat. No. 2,662,168 to Scherbatskoy, U.S. Pat. No. 3,919,479 to Moon, et al., U.S. Pat. No. 4,697,209 to Kiewit, et al., U.S. Pat. No. 4,677,466 to Lert, et al., U.S. Pat. No. 5,512,933 to Wheatley, et al., U.S. Pat. No. 4,955,070 to Welsh, et al., U.S. Pat. No. 4,918,730 to Schulze, U.S. Pat. No. 4,843,562 to Kenyon, et al., U.S. Pat. No. 4,450,551 to Kenyon, et al., U.S. Pat. No. 4,230,990 to Lert, et al., U.S. Pat. No. 5,594,934 to Lu, et al., European Published Patent Application EP 0887958 to Bichsel and PCT publication WO91/11062 to Young, et al., all of which are incorporated herein by reference in their entireties.
An advantageous signature extraction technique transforms audio data within a predetermined frequency range to the frequency domain by a transform function, such as an FFT. The FFT data from an even number of frequency bands (for example, eight, ten, sixteen or thirty two frequency bands) spanning the predetermined frequency range are used two bands at a time during successive time intervals. When each band is selected, the energy values of the FFT bins within such band and such time interval are processed to form one bit of the signature. If there are ten FFT's for each interval of the audio signal, for example, the values of all bins of such band within the first five FFT's are summed to form a value “A” and the values of all bins of such band within the last five FFT's are summed to form a value “B”. In the case of a received broadcast audio signal, the value A is formed from portions of the audio signal that were broadcast prior to those used to form the value B. To form a bit of the signature, the values A and B are compared. If B is greater than A, the bit is assigned a value “1” and if A is greater than or equal to B, the bit is assigned a value of “0”. Thus, during each time interval, two bits of the signature are produced.
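By way of illustration only, the formation of signature bits from the values A and B may be sketched as follows. The nested-list data layout and the rule that steps through the bands two at a time in sequential order are assumptions made for this example; the comparison of A and B follows the description above.

```python
def signature_bit(band_frames):
    """Form one bit from the FFT frames of one band within one time interval.

    band_frames: list of FFT frames, each a list of bin energy values.
    A sums the first half of the frames, B sums the second half.
    """
    half = len(band_frames) // 2
    a = sum(sum(frame) for frame in band_frames[:half])
    b = sum(sum(frame) for frame in band_frames[half:])
    return 1 if b > a else 0  # A >= B yields 0


def extract_signature(intervals, num_bands):
    """Produce two bits per interval, using two bands at a time.

    intervals[i][band] holds the frames of that band in interval i.
    Band pairs are taken in sequential order (an assumed selection rule).
    """
    bits = []
    for i, interval in enumerate(intervals):
        first = (2 * i) % num_bands
        for band in (first, first + 1):
            bits.append(signature_bit(interval[band]))
    return bits
```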
One advantageous technique carries out either or both of code detection and signature extraction remotely from the location where the research data is gathered, as disclosed in US Published Patent Application 2003/0005430 published Jan. 2, 2003 to Ronald S. Kolessar, which is assigned to the assignee of the present application and is hereby incorporated herein by reference in its entirety.
Turning to
After a transform is applied, feature extraction block 304 identifies perceptually meaningful parameters from the audio that may be based on Mel-Frequency Cepstrum Coefficients (MFCC) or Spectral Flatness Measure (SFM), which is an estimation of the tone-like or noise-like quality for a band in the spectrum. Additionally, feature extraction block 304 may use band representative vectors that are based on indexes of bands having prominent tones, such as peaks. Alternately, the energy levels of each band may be used, and may further use energies of bark-scaled bands to obtain a hash string indicating energy band differences both in the time and the frequency analysis. In post-processing 305, temporal variations in the audio are determined to produce feature vectors, and the results may be normalized and/or quantized for robustness.
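By way of illustration only, the Spectral Flatness Measure for one band may be computed as the ratio of the geometric mean to the arithmetic mean of the bin energies in that band. Values near 1 indicate a noise-like band and values near 0 indicate a tone-like band. The function name and the small `floor` guard against zero energies are assumptions made for this example.

```python
import math


def spectral_flatness(band_energies, floor=1e-12):
    """Ratio of geometric mean to arithmetic mean of the bin energies in a band."""
    values = [max(e, floor) for e in band_energies]  # avoid log(0)
    log_mean = sum(math.log(v) for v in values) / len(values)
    geometric_mean = math.exp(log_mean)
    arithmetic_mean = sum(values) / len(values)
    return geometric_mean / arithmetic_mean
```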
Fingerprint modeling 306 receives a sequence of feature vectors calculated from 305 and processes/models the vectors for later retrieval. Here, the vectors are subjected to (distance) metrics and indexing algorithms to assist in later retrieval. Under one embodiment, multidimensional vector sequences for audio fragments may be summarized in a single vector using means and variances of multi-bank-filtered energies (e.g., 16 banks) to produce a multi-bit signature (e.g., 512 bits). In another embodiment, the vector may include an average zero-crossing rate, an estimated beats per minute (BPM), and/or average spectrum representing a portion of the audio. In yet another embodiment, modeling 306 may be based on sequences (traces, trajectories) of features to produce binary vector sequences. Vector sequences may further be clustered to form codebooks, although temporal characteristics of the audio may be lost in this instance. It is understood by those skilled in the art that multiple modeling techniques may be utilized, depending on the application and processing power of the system used. Once modeled, the resulting signature 307 is stored and ultimately transmitted for subsequent matching.
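By way of illustration only, two of the summaries mentioned above may be sketched as follows: collapsing a sequence of feature vectors into per-dimension means and variances, and computing an average zero-crossing rate. Quantizing the summary into a multi-bit signature is not shown. The function names and the use of population variance are assumptions made for this example.

```python
def summarize_vectors(vectors):
    """Collapse a sequence of equal-length feature vectors into one vector.

    Returns the per-dimension means followed by the per-dimension
    (population) variances.
    """
    count = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / count for d in range(dims)]
    variances = [sum((v[d] - means[d]) ** 2 for v in vectors) / count
                 for d in range(dims)]
    return means + variances


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)
```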
Continuing with
Turning to
Synchronization signal 336 is received from a local external source (e.g., device 110), where interface 337 updates the accurate time for clock 338. Time data from clock 338 is used for generating timestamps 339 for signatures extracted in 335. Again, utilizing clock synchronization techniques such as those described above increases the accuracy of the subsequent identification of the audio signatures produced. Audio signatures of
Although various embodiments have been described with reference to a particular arrangement of parts, features and the like, these are not intended to exhaust all possible arrangements or features, and indeed many other embodiments, modifications and variations will be ascertainable to those of skill in the art.