The subject matter of this application broadly relates to systems and methods that facilitate remote identification of audio or audiovisual content being viewed by a user.
In many instances, it is useful to precisely identify audio or audiovisual content presented to a person, such as broadcasts on live television or radio, content being played on a DVD or CD, time-shifted content recorded on a DVR, etc. As one example, when compiling television or other broadcast ratings, or determining which commercials are shown during particular time slots, it is beneficial to capture the content played on the equipment of an individual viewer, particularly when local broadcast affiliates either display geographically-varying content, or insert local commercial content within a national broadcast. As another example, content providers may wish to provide supplemental material synchronized with broadcast content, so that when a viewer watches a particular show, the supplemental material may be provided to a secondary display device of that viewer, such as a laptop computer, tablet, etc. In this manner, if a viewer is determined to be watching a live baseball broadcast, each batter's statistics may be streamed to a user's laptop as the player is batting.
Contemporaneously determining what content a user is watching at a particular instant is not a trivial task. Some techniques rely on special hardware in a set-top box that analyzes video as the set-top box decodes frames. The requisite processing capability for such systems, however, is often cost-prohibitive. In addition, correct identification of decoded frames typically presumes an aspect ratio for a display, e.g. 4:3, when a user may be viewing content at another aspect ratio such as 16:9, thereby precluding a correct identification of the program content being viewed. Similarly, such systems are too sensitive to a program frame rate that may also be altered by the viewer's system, also inhibiting correct identification of viewed content.
Still other identification techniques add ancillary codes in audiovisual content for later identification. There are many ways to add an ancillary code to a signal so that it is not noticed. For example, a code can be hidden in non-viewable portions of television video by inserting it into either the video's vertical blanking interval or horizontal retrace interval. Other known video encoding systems bury the ancillary code in a portion of a signal's transmission bandwidth that otherwise carries little signal energy. Still other methods and systems add ancillary codes to the audio portion of content, e.g. a movie soundtrack. Such arrangements have the advantage of being applicable not only to television, but also to radio and pre-recorded music. Moreover, ancillary codes that are added to audio signals may be reproduced in the output of a speaker, and therefore offer the possibility of non-intrusively intercepting and distinguishing the codes using a microphone proximate the viewer.
While the use of embedded codes in audiovisual content can effectively identify content being presented to a user, such codes have disadvantages in practical use. For example, the code would need to be embedded at the source encoder, the code might not be completely imperceptible to a user, or might not be robust to sensor distortions in consumer-grade cameras and microphones.
For a better understanding of the invention, and to show how the same may be carried into effect, reference will now be made, by way of example, to the accompanying drawings, in which:
The second device 14 is preferably operatively connected to a microphone 16 or other device capable of receiving an audio signal. The microphone 16 receives the primary audio signal associated with a segment of the content presented on the first device 12. The second device 14 then generates an audio signature of the received signal using either an internal processor or any other processor accessible to it. If one or more additional microphones are used, then the second device preferably processes and combines the received signal from the multiple microphones before generating the audio signature of the received signal. Once an audio signature is generated that corresponds to content contemporaneously displayed on the first device 12, that audio signature is sent to a server 18 through a network 20 such as the Internet, or other network such as a LAN or WAN. The server 18 will usually be at a location remote from the first device 12 and the second device 14.
It should be understood that an audio signature, which may sometimes be called an audio fingerprint, may be represented using any number of techniques. To recite merely a few such examples, a pattern in a spectrogram of the captured audio signal may form an audio signature; a sequence of time and frequency pairs corresponding to peaks in a spectrogram may form an audio signature; sequences of time differences between peaks in frequency bands of a spectrogram may form an audio signature; and a binary matrix in which each entry corresponds to high or low energy in quantized time periods and quantized frequency bands may form an audio signature. Often, an audio signature is encoded into a string to facilitate a database search by a server.
The server 18 preferably stores a plurality of audio signatures in a database, where each audio signature is associated with content that may be displayed on the first device 12. The stored audio signatures may each be associated with a pre-selected interval within a particular item of audio or audiovisual content, such that a program is represented in the database by multiple, temporally sequential audio signatures. Alternatively, stored audio signatures may each continuously span the entirety of a program such that an audio signature for any defined interval of that program may be generated. Upon receipt of an audio signature from the second device 14, the server 18 attempts to match the received signature to one in its database. If a successful match is found, the server 18 may send to the second device 14 supplementary content associated with the matching programming segment. For example, if a person is watching a James Bond movie on the first device 12, at a moment displaying an image of a BMW or other automobile, the server 18 can use the received audio signature to identify the segment viewed, and send to the second device 14 supplementary information about that automobile such as make, model, pricing information, etc. In this manner, the supplementary material provided to the second device 14 is preferably not only synchronized to the program or other content presented by the first device 12 as a whole, but is also synchronized to particular portions of that content, such that transmitted supplementary content may relate to what is contemporaneously displayed on the first device 12.
In operation, the foregoing procedure may preferably be initiated by the second device 14, either by manual selection or by automatic activation. In the latter instance, for example, many existing tablet devices, PDAs, laptops, etc., can be used to remotely operate a television or a set top box, or to access a program guide for viewed programming. Thus, such a device may be configured to begin an audio signature generation and matching procedure whenever such functions are performed on the device. Once a signature generation and matching procedure is initiated, the microphone 16 is periodically activated to capture audio from the first device 12, and a spectrogram is approximated from the captured audio over each interval for which the microphone is activated. For example, let S[f,b] represent the energy at a band “b” during a frame “f” of a signal s(t) having a duration T, e.g. T=120 frames, 5 seconds, etc. The set of S[f,b], as the bands (b=1, . . . , B) and the frames (f=1, . . . , F) are varied within the signal s(t), forms an F-by-B matrix S, which resembles the spectrogram of the signal. The set of all S[f,b] is not necessarily the equivalent of a spectrogram, because the bands “b” are not Fast Fourier Transform (FFT) bins but rather linear combinations of the energy in each FFT bin. For purposes of this disclosure, however, it will be assumed either that such a procedure does generate the equivalent of a spectrogram, or that some alternate procedure to generate a spectrogram from an audio signal is used, many of which are well known in the art.
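By way of illustration only, the band-energy matrix S[f,b] described above might be approximated as in the following sketch. The function name band_energy_matrix, the non-overlapping frames, the equal-width bands, and the naive discrete Fourier transform are all assumptions made for brevity and are not taken from this disclosure; a practical implementation would likely use an FFT and overlapping, windowed frames.

```python
import cmath
import math

def band_energy_matrix(samples, frame_len, num_bands):
    """Return an F-by-B list of lists S, where S[f][b] is the energy
    in band b during frame f (hypothetical helper, illustrative only)."""
    num_frames = len(samples) // frame_len
    num_bins = frame_len // 2              # non-negative frequency bins only
    bins_per_band = num_bins // num_bands  # assumes equal-width bands
    S = []
    for f in range(num_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        # Naive O(N^2) DFT; stands in for an FFT in this sketch.
        bin_energy = []
        for k in range(num_bins):
            acc = sum(x * cmath.exp(-2j * math.pi * k * t / frame_len)
                      for t, x in enumerate(frame))
            bin_energy.append(abs(acc) ** 2)
        # Each band is a linear combination (here a plain sum) of bin energies.
        S.append([sum(bin_energy[b * bins_per_band:(b + 1) * bins_per_band])
                  for b in range(num_bands)])
    return S

# A pure tone falling in DFT bin 5 of a 32-sample frame lands in band 1 of 4.
tone = [math.sin(2 * math.pi * 5 * t / 32) for t in range(64)]
S = band_energy_matrix(tone, frame_len=32, num_bands=4)
```

Here S has F=2 rows (frames) and B=4 columns (bands), following the F-by-B convention used in the text.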
Using the generated spectrogram from a captured segment of audio, the second device 14 generates an audio signature of that segment. The second device 14 preferably applies a threshold operation to the respective energies recorded in the spectrogram S[f,b] to generate the audio signature, so as to identify the position of peaks in audio energy within the spectrogram 22. Any appropriate threshold may be used. For example, assuming that the foregoing matrix S[f,b] represents the spectrogram of the captured audio signal, the second device 14 may preferably generate a signature S*, which is a binary F-by-B matrix in which S*[f,b]=1 if S[f,b] is among the P% (e.g. P%=10%) of peaks with highest energy among all entries of S, and S*[f,b]=0 otherwise. Other possible techniques to generate an audio signature could include a threshold selected as a percentage of the maximum energy recorded in the spectrogram. Alternatively, a threshold may be selected that retains a specified percentage of the signal energy recorded in the spectrogram.
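As a minimal sketch of the top-P% threshold operation described above, the following assumes the spectrogram is held as a list of rows (one per frame). The name binary_signature and the tie-handling rule are assumptions of this example, not part of the disclosure.

```python
def binary_signature(S, top_percent=10.0):
    """Return a binary matrix S* with 1 wherever S[f][b] is among the
    top_percent highest-energy entries of S (hypothetical helper).
    Entries tied with the cutoff are all kept, so slightly more than
    top_percent of the entries may be set in the presence of ties."""
    flat = sorted((e for row in S for e in row), reverse=True)
    keep = max(1, int(len(flat) * top_percent / 100.0))
    cutoff = flat[keep - 1]
    return [[1 if e >= cutoff else 0 for e in row] for row in S]

# 3 frames x 4 bands; keeping the top 25% retains the 3 largest entries.
S = [[1, 9, 3, 2],
     [8, 0, 4, 7],
     [5, 6, 10, 11]]
S_star = binary_signature(S, top_percent=25.0)
```

The alternative thresholds mentioned in the text (a percentage of the maximum energy, or a retained fraction of total energy) would change only how cutoff is chosen.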
Specifically, server 18 may be operatively connected to a database from which individual ones of a plurality of audio signatures may be extracted. The database may store a plurality of M audio signals s(t), where sm(t) represents the audio signal of the mth asset. For each asset “m,” a sequence of audio signatures {Sm*[fn, b]} may be extracted, in which Sm*[fn, b] is a matrix extracted from the signal sm(t) between frame n and frame n+F. Assuming that most audio signals in the database have roughly the same duration and that each sm(t) contains a number of frames Nmax>>F, after processing all M assets, the database would have approximately MNmax signatures, which would be expected to be a very large number (on the order of 10^7 or more). However, with modern processing power, even this number of extractable audio signatures in the database may be quickly searched to find a match to an audio signature 24 received from the second device 14.
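The extraction of one reference signature per starting frame n can be sketched as follows. This is an illustrative assumption about how the windows might be enumerated; the name reference_signatures and the per-window top-P% threshold are hypothetical choices for this example.

```python
def reference_signatures(S_full, window_frames, top_percent=10.0):
    """Yield (n, signature) for each window of window_frames consecutive
    frames of an asset's spectrogram S_full, starting at frame n."""
    def threshold(window):
        # Per-window top-percent threshold (assumed; see lead-in).
        flat = sorted((e for row in window for e in row), reverse=True)
        keep = max(1, int(len(flat) * top_percent / 100.0))
        cutoff = flat[keep - 1]
        return [[1 if e >= cutoff else 0 for e in row] for row in window]

    for n in range(len(S_full) - window_frames + 1):
        yield n, threshold(S_full[n:n + window_frames])

# An asset with 5 frames and 2 bands, and a window of F=3 frames,
# yields 5 - 3 + 1 = 3 reference signatures.
asset = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
sigs = list(reference_signatures(asset, window_frames=3, top_percent=50.0))
```

For an asset with Nmax frames, this yields Nmax-F+1 signatures, which is approximately Nmax when Nmax>>F, consistent with the MNmax estimate above.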
It should be understood that the audio signatures for the database may be generated ahead of time for pre-recorded programs or in real-time for live broadcast television programs. It should also be understood that, rather than storing audio signals s(t), the database may store individual audio signatures, each associated with a segment of programming available to a user of the first device 12 and the second device 14. In another embodiment, the server 18 may store individual audio signatures, each corresponding to an entire program, such that individual segments may be generated upon query by the server 18. Still another embodiment would store audio spectrograms from which audio signatures would be generated. Also, it should be understood that some embodiments may store a database of audio signatures locally on the second device 14, or in storage available to it through, e.g., a home network or local area network (LAN), obviating the need for a remote server. In such an embodiment, the second device 14 or some other processing device may perform the functions of the server described in this disclosure.
The matching score between the query signature Sq* received from the second device 14 and each reference signature Sm*[n] may be computed as score[n,m] = <Sm*[n], Sq*>, where, for any two binary matrixes A and B of the same dimensions, <A,B> is defined as being the sum of all elements of the matrix in which each element of A is multiplied by the corresponding element of B and divided by the number of elements summed. In this case, score[n,m] is proportional to the number of entries that are 1 in both Sm*[n] and Sq*. After collecting score[n,m] for all possible “m” and “n”, the matching algorithm determines that the audio collected by the second device 14 corresponds to the database signal sm(t) at the delay n corresponding to the highest score[n,m].
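A minimal sketch of the operation <A,B> and of an exhaustive search for the highest score is given below. The function names and the dictionary keyed by (asset m, delay n) are assumptions of this example.

```python
def score(A, B):
    """<A,B>: sum of element-wise products of two same-sized binary
    matrices, divided by the number of elements summed."""
    total = sum(a * b for row_a, row_b in zip(A, B)
                for a, b in zip(row_a, row_b))
    count = sum(len(row) for row in A)
    return total / count

def best_match(query, references):
    """references maps (m, n) -> reference signature (assumed layout).
    Returns ((m, n), score) for the highest-scoring reference."""
    return max(((key, score(ref, query)) for key, ref in references.items()),
               key=lambda item: item[1])

query = [[1, 0], [0, 1]]
references = {
    ("asset_a", 0): [[1, 0], [0, 0]],   # one coinciding 1 out of 4 entries
    ("asset_b", 3): [[1, 0], [0, 1]],   # two coinciding 1s out of 4 entries
}
match_key, match_score = best_match(query, references)
```

Because every signature has the same F-by-B size, dividing by the element count does not change which reference scores highest.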
Referring to
In an alternative procedure, the database may be searched in a pre-defined sequence, with a match declared when a matching score exceeds a fixed threshold. To facilitate such a technique, a hashing operation may be used to reduce the search time. There are many possible hashing mechanisms suitable for the audio signature method. For example, a simple hashing mechanism begins by partitioning the set of integers 1, . . . , F (where F is the number of frames in the audio capture and represents one of the dimensions of the signature matrix) into GF groups; e.g., if F=100 and GF=5, the partition would be {1, . . . , 20}, {21, . . . , 40}, . . . , {81, . . . , 100}. The set of integers 1, . . . , B is likewise partitioned into GB groups, where B is the number of bands in the spectrogram and represents the other dimension of the signature matrix. A hashing function H is defined as follows: for any F-by-B binary matrix S*, HS*=S′, where S′ is a GF-by-GB binary matrix in which each entry equals 1 if one or more entries equal 1 in the corresponding two-dimensional partition of S*, and equals 0 otherwise.
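The hashing function H might be sketched as follows, assuming for simplicity that F is divisible by GF and B is divisible by GB. The name hash_signature and the tuple-of-tuples return type (chosen so the result can serve as a dictionary key) are assumptions of this example.

```python
def hash_signature(sig, gf, gb):
    """Reduce an F-by-B binary matrix to a gf-by-gb binary matrix whose
    entry (i, j) is 1 if any entry in the corresponding block of sig is 1.
    Assumes F % gf == 0 and B % gb == 0."""
    num_frames, num_bands = len(sig), len(sig[0])
    fs, bs = num_frames // gf, num_bands // gb   # block height and width
    return tuple(
        tuple(
            1 if any(sig[f][b]
                     for f in range(i * fs, (i + 1) * fs)
                     for b in range(j * bs, (j + 1) * bs)) else 0
            for j in range(gb))
        for i in range(gf))

# A 4x4 signature with a single 1 in the top-right block.
sig = [[0, 0, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
hashed = hash_signature(sig, gf=2, gb=2)
```

The result is effectively a block-wise logical OR, so signatures that differ only within a block share the same hash.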
Referring to
When applying the hashing function H to all MNmax signatures in the database, the database is partitioned into 2^(GF·GB) bins, which can each be represented by a matrix Aj of 0's and 1's, where j=1, . . . , 2^(GF·GB). A table T indexed by the bin number is created and, for each of the 2^(GF·GB) bins, the table entry T[j] stores the list of the signatures Sm*[n] that satisfy HSm*[n]=Aj. The table entries T[j] for the various values of j are generated ahead of time for pre-recorded programs or in real-time for live broadcast television programs. The matching operation starts by selecting the bin entry given by HSq*. Then the score is computed between Sq* and each of the signatures listed in the entry T[HSq*]. If a high enough score is found, the process is concluded. Alternatively, if a high enough score is not found, the process selects the bin whose matrix Aj is closest to HSq* in the Hamming distance (the Hamming distance counts the number of different bits between two binary objects), and scores are computed between Sq* and each of the signatures listed in the entry T[j]. If a high enough score is still not found, the process selects the next bin whose matrix Aj is closest to HSq* in the Hamming distance. The same procedure is repeated until a high enough score is found or until a maximum number of searches is reached. The process concludes either with no match declared or with a match declared to the reference signature with the highest score. In the above procedure, since the hashing operation for all the stored content in the database is performed ahead of time (only live content is hashed in real time), and since the matching is first attempted against the signatures listed in the bins that are most likely to contain the correct signature, the number of searches and the processing time of the matching process are significantly reduced.
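The table-based search described above can be sketched as follows. All names here are hypothetical. The hash and score functions are passed in as parameters, and the sketch visits only non-empty bins, ordered by Hamming distance from the hash of the query; these are simplifying assumptions rather than requirements of the disclosure.

```python
def hamming(a, b):
    """Number of differing bits between two same-sized binary matrices."""
    return sum(x != y for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

def build_table(references, hash_fn):
    """Group reference signatures by hash: T[H(sig)] -> [(key, sig), ...]."""
    table = {}
    for key, sig in references.items():
        table.setdefault(hash_fn(sig), []).append((key, sig))
    return table

def search(query, table, hash_fn, score_fn, threshold, max_bins):
    """Visit bins in order of increasing Hamming distance from H(query),
    stopping once the best score so far reaches threshold or after
    max_bins bins. Returns (key, score); key is None if no match."""
    hq = hash_fn(query)
    ordered = sorted(table, key=lambda h: hamming(h, hq))
    best_key, best_score = None, 0.0
    for h in ordered[:max_bins]:
        for key, sig in table[h]:
            s = score_fn(sig, query)
            if s > best_score:
                best_key, best_score = key, s
        if best_score >= threshold:
            return best_key, best_score
    return None, best_score
```

A bin whose hash equals that of the query has Hamming distance zero and is therefore visited first, which corresponds to the initial lookup of T[HSq*] in the text.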
Intuitively speaking, the hashing operation performs a “two-level hierarchical matching.” The matrix HSq* is used to prioritize which bins of the table T in which to attempt matches, and priority is given to bins whose associated matrix Aj are closer to HSq* in the Hamming distance. Then, the actual query Sq* is matched against each of the signatures listed in the prioritized bins until a high enough match is found. It may be necessary to search over multiple bins to find a match. In
The preceding techniques that match an audio signature captured by the second device 14 to corresponding signatures in a remote database work well, so long as the captured audio signal has not been corrupted by, for instance, high energy noise. As one example, given that the second device 14 will be proximate to one or more persons viewing the program on a television or other such first device 12, high energy noise from a user (e.g., speaking, singing, or clapping noises) may also be picked up by the microphone 16. Still other examples might be similar incidental sounds such as doors closing, sounds from passing trains, etc.
Providing an accurate match between an audio signature generated at a location of a user and a corresponding reference audio signature in a remote database, in the presence of extraneous noise that corrupts the captured audio signature, is problematic. An audio signature derived from a spectrogram only preserves peaks in signal energy, and because the source of noise in the recorded audio frequently has more energy than the signal sought to be recorded, portions of an audio signal represented in a spectrogram and corrupted by noise cannot easily be recovered, if at all. Possibly, an audio signal captured by a microphone 16 could be processed to try to filter any extraneous noise from the signal prior to generating a spectrogram, but automating such a solution would be difficult given the unpredictability of the presence of noise. Also, given the possibility of actual program segments being mistaken for noise (segments involving shouting, or explosions, etc.), any effective noise filter would likely depend on the ability to model noise accurately. This might be accomplished by, e.g., including multiple microphones in the second device 14 such that one microphone is configured to primarily capture noise (by being directed at the user, for example). Thus, the audio captured by the respective microphones could be used to model the noise and filter it out. However, such a solution might entail increased cost and complexity, and noise such as user-generated audio would still corrupt the audio signal intended to be recorded, given the close proximity between the second device 14 and the user.
In view of such difficulties,
As noted previously, the spectrogram generated by the audio signature generator 50 may be corrupted by noise from a user, for example. To correct for this noise, the system 42 preferably also includes an audio analyzer 48 that has as an input the audio signal received by the one or more microphones 16. It should also be noted that, although the audio analyzer 48 is shown as simply receiving an audio signal from the microphone 16, the microphone 16 may be under control of the audio analyzer 48, which would issue commands to activate and deactivate the microphone 16, resulting in the audio signal that is subsequently treated by the Audio Analyzer 48 and Audio Signature Generator 50. The audio analyzer 48 processes the audio signal to identify both the presence and temporal location of any noise, e.g. user generated audio. As noted previously with respect to
Once the audio analyzer 48 has identified the temporal location of any detected noise in the audio signal received by the one or more microphones 16, the audio analyzer 48 provides that information to the audio signature generator 50, which may use that information to nullify those portions of the spectrogram it generates that are corrupted by noise. This process can be generally described with reference to
The above procedure can be used not only in audio signature extraction methods in which signatures are formed by binary matrixes, but also in methods in which signatures are formed by various sequences of time differences between peaks, each sequence from a particular frequency band of the spectrogram.
The server 64 includes a matching module 70 that uses the results provided by the audio analyzer 68 to match the audio signature provided by the audio signature generator 66. As one example, let S[f,b] represent the energy in band “b” during a frame “f” of a signal s(t) and let F̂ denote the subset of {1, . . . , F} that corresponds to frames located within regions that were identified by the Audio Analyzer 68 as containing user-generated audio or other such noise corrupting a signal, as explained before; the matching module 70 may disregard portions of the received audio signature determined to contain noise, i.e. perform a matching analysis between the received signature and those in a database only for time intervals not corrupted by noise. More precisely, the query audio signature Sq* used in the matching score is replaced by Sq** defined as follows: if f is not in F̂, Sq**[f,b]=Sq*[f,b] for all b; and if f is in F̂, Sq**[f,b]=0 for all b; and the final matching score is given by <Sm*[n], Sq**>, with the operation <.,.> as defined before. In such an example, the server may select the audio signature from the database with the highest matching score (i.e. the most matches) as the matching signature. Alternatively, the Matching Module 70 may adopt a temporarily different matching score function; i.e., instead of using the operation <Sm*[n], Sq*>, the Matching Module 70 uses an alternative matching operation <Sm*[n], Sq*>F̂, where the operation <A,B>F̂ between two binary matrixes A and B is defined as being the sum of all elements in the columns not included in F̂ of the matrix in which each element of A is multiplied by the corresponding element of B and divided by the number of elements summed. In this latter alternative, the matching module 70 in effect uses a temporally normalized score to compensate for any excluded intervals.
In other words, the normalized score is calculated as the number of matches divided by the ratio of the signature's time intervals that are being considered (not excluded) to the entire time interval of the signature, with the normalized score compared to the threshold. Alternatively, the normalization procedure could simply express the threshold in matches per unit time. In all of the above examples, the Matching Module 70 may adopt a different threshold score above which a match is declared. Once the matching module 70 has either identified a match or determined that no match has been found, the results may be returned to the client device 62.
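The two alternatives described above, nullifying the excluded frames of the query or computing a score normalized over only the retained frames, might be sketched as follows. The function names are hypothetical, and the sketch assumes the frame index f is the first index of the signature, with the excluded set holding zero-based frame indices.

```python
def nullify(query, excluded):
    """Sq**: a copy of the query in which every frame listed in
    excluded (the set F-hat) is set to all zeros."""
    return [[0] * len(row) if f in excluded else list(row)
            for f, row in enumerate(query)]

def masked_score(ref, query, excluded):
    """<A,B> restricted to frames not in excluded: element-wise products
    summed over retained frames only, divided by the number of elements
    actually summed (a temporally normalized score)."""
    kept = [f for f in range(len(query)) if f not in excluded]
    if not kept:
        return 0.0   # nothing left to compare
    total = sum(a * b for f in kept for a, b in zip(ref[f], query[f]))
    return total / (len(kept) * len(query[0]))

ref = [[1, 0], [1, 1], [0, 1]]
query = [[1, 0], [1, 1], [1, 1]]    # frame 2 assumed corrupted by noise
excluded = {2}

q_nullified = nullify(query, excluded)
plain = sum(a * b for ra, rb in zip(ref, q_nullified)
            for a, b in zip(ra, rb)) / 6
normalized = masked_score(ref, query, excluded)
```

In this example three entries coincide in the retained frames, so the nullified query scores 3/6 against the full matrix while the normalized score is 3/4, illustrating why the normalized form compensates for excluded intervals when compared against a fixed threshold.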
The system of
The audio signature generator 78 receives both the audio and the information from the audio analyzer 80. The audio signature generator 78 uses the information from the audio analyzer 80 to nullify the segments with user generated audio when generating a single audio signature, as explained in the description of the system 42 of
The matching module 82 receives the audio signature Sq* from the Audio Signature Generator 78 and receives the information about user-generated audio from the Audio Analyzer 80. This information may be represented by the set F̂ of frames located within regions that were identified by the Audio Analyzer 80 as containing user-generated audio. It should be understood that other techniques may be used to send information to the server 76 indicating the existence and location of corruption in an audio signature. For example, the audio signature generator 78 may communicate the set F̂ to the Matching Module 82 by making all entries in the audio signature Sq* equal to “1” over the frames contained in F̂; thus, when the Matching Server 76 receives a binary matrix in which a column has all entries marked as “1”, it will identify the frame corresponding to such a column as being part of the set F̂ of frames to be excluded from the matching procedure.
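The in-band signalling of the set F̂ described above can be sketched as a pair of helper functions. The names are hypothetical, and the sketch assumes the frame index is the first index of the signature.

```python
def mark_excluded(sig, excluded):
    """Client side: set every entry to 1 in each frame listed in excluded."""
    return [[1] * len(row) if f in excluded else list(row)
            for f, row in enumerate(sig)]

def recover_excluded(sig):
    """Server side: treat any frame whose entries are all 1 as excluded."""
    return {f for f, row in enumerate(sig) if all(v == 1 for v in row)}

sig = [[1, 0], [0, 0], [0, 1]]
sent = mark_excluded(sig, excluded={1})
received_excluded = recover_excluded(sent)
```

One caveat of this sketch is that a frame that legitimately has every band set to 1 would be misread as excluded. With a threshold that keeps only a small fraction of entries, such frames would be expected to be rare.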
The matching server 76 is operatively connected to a database storing a plurality of reference audio signatures with which to match the audio signature received by the client device 74. The database may preferably be constructed in the same manner as described with reference to
Alternatively, if a hashing procedure is desired during the matching operation, the procedure described above with respect to
The process may conclude with either a “no-match” declaration, or the reference signature with the highest score may be declared a match. The results of this procedure may be returned to the client device 74.
The benefit of providing information to both the Audio Signature Generator 78 and the Matching Module 82 was evaluated in
It should be understood that the system 72 may incorporate many of the features described with respect to the systems 42 and 60 in
Specifically,
In addition, however, the system 90 includes at least one group audio signature generator 98 capable of synthesizing the audio signatures generated by the respective devices 92a and 92b, using the results of both the audio analyzer 96a and the audio analyzer 96b. Specifically, the system 90 is capable of synchronizing the two devices 92a and 92b such that the audio signatures generated by the respective devices encompass the same temporal intervals. With such synchronization, the group audio signature generator 98 may determine whether any portions of an audio signature produced by one device 92a or 92b have temporal segments analyzed as noise, but where the same interval in the audio signature of the other device 92a or 92b was analyzed as being not noise (i.e. the signal), and vice versa. In this manner, the group audio signature generator 98 may use the respective analyses of the incoming audio signal by each of the respective devices 92a and 92b to produce a cleaner audio signature over an interval than either of the devices 92a and 92b could produce alone. The group audio signature generator 98 may then forward the improved signature to the matching server 100 to compare to reference signatures in a database. In order to perform such a task, the Audio Analyzers 96a and 96b may forward raw audio features to the group audio signature generator 98 in order to allow it to perform the combination of audio signatures and generate the cleaner audio signature mentioned above. Such raw audio features may include the actual spectrograms captured by the devices 92a and 92b, or a function of such spectrograms; furthermore, such raw audio features may also include the actual audio samples. In this last alternative, the group audio signature generator may employ audio cancelling techniques before producing the audio signature.
More precisely, the group audio signature generator 98 could use the samples of the audio segment captured by both devices 92a and 92b in order to produce a single audio segment that contains less user-generated audio, and produce a single audio signature to be sent to the matching module.
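One simple way a group audio signature generator might combine two synchronized signatures is sketched below: for each frame, it takes the signature of whichever device analyzed that frame as free of noise. This per-frame selection rule and the function name are assumptions of this example; the disclosure also contemplates combining spectrograms or raw audio samples, which is not shown here.

```python
def combine_signatures(sig_a, noisy_a, sig_b, noisy_b):
    """Merge two time-aligned signatures frame by frame.
    noisy_a and noisy_b are sets of frame indices each device's analyzer
    flagged as noise. Returns (combined, still_excluded), where
    still_excluded holds frames flagged as noise by both devices."""
    combined, still_excluded = [], set()
    for f, (row_a, row_b) in enumerate(zip(sig_a, sig_b)):
        if f not in noisy_a:
            combined.append(list(row_a))
        elif f not in noisy_b:
            combined.append(list(row_b))
        else:
            combined.append([0] * len(row_a))   # nullify; no clean source
            still_excluded.add(f)
    return combined, still_excluded

sig_a = [[1, 0], [9, 9], [0, 1], [9, 9]]   # 9 stands in for corrupted data
sig_b = [[1, 0], [0, 1], [8, 8], [8, 8]]
combined, still_excluded = combine_signatures(sig_a, {1, 3}, sig_b, {2, 3})
```

Frames that remain excluded after combination could then be handled by the nullification or normalized-score techniques described earlier.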
The group audio signature generator 98 may be present in either one, or both, of the devices 92a and 92b. In one instance, each of the devices 92a and 92b may be capable of hosting the group audio signature generator 98, where the users of the devices 92a and 92b are prompted through a user interface to select which device will host the group audio signature generator 98, and upon selection, all communication with the matching server may proceed through the selected host device 92a or 92b, until this cooperative mode is deselected by either user, or the devices 92a and 92b cease communicating with each other (e.g. one device is turned off, or taken to a different room, etc). Alternatively, an automated procedure may randomly select which device 92a or 92b hosts the group audio signature generator. Still further, the group audio signature generator could be a stand-alone device in communication with both devices 92a and 92b. One of ordinary skill in the art will also appreciate that this system could easily be expanded to encompass more than two client devices.
It should also be understood that, in any of the systems of
It should also be understood that, although several of the foregoing systems of matching audio signatures to reference signatures redressed corruption in audio signatures by nullifying corrupted segments, other systems consistent with the present disclosure may use alternative techniques to address corruption. As one example, a client device such as device 14 in
Given such techniques, a client device, after initially identifying the program being watched or listened to by the user, may receive a sequence of audio signatures corresponding to still-to-come audio segments from the program. These still-to-come audio signatures are readily available from a remote server when the program was pre-recorded. However, even when the program is live, there is a non-zero delay in the transmission of the program through the broadcast network; thus, it is still possible to generate still-to-come audio signatures and transmit them to the client device before its matching operation is attempted. These still-to-come audio signatures are the audio signatures that are expected to be generated in the client device if the user continues to watch the same program in a linear manner. Having received these still-to-come audio signatures, the client device may collect audio samples, extract audio features, generate audio signatures, and compare them against the stored, expected audio signatures to confirm that the user is still watching or listening to the same program. In other words, both the audio signature generation and matching procedures are done within the client device during this procedure. Since the audio signatures generated during this procedure may also be corrupted by user generated audio, the methods of the systems in
Alternatively, in such techniques, corruption in the audio signal may be redressed by first identifying the presence or absence of corruption such as user-generated audio. If such noise or other corruption is identified, no initial attempt at a match may be made until an audio signature is received where the analysis of the audio indicates that no noise is present. Similarly, once an initial match is made, any subsequent audio signatures containing noise may be either disregarded, or alternatively may be compared to an audio signature of a segment anticipated at that point in time to verify a match. In either case, however, if a “no match” is declared between an audio signature corrupted by, e.g. noise, a decision on whether the user has entered a trick play mode or switched channels is deferred until a signature is received that does not contain noise.
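The deferred-decision behavior described above might be sketched as a simple per-signature rule applied after an initial match has been made. The function name, the decision labels, and the representation of each observation as a pair of Boolean flags are assumptions of this example.

```python
def track(observations):
    """observations: iterable of (is_noisy, matches_expected) pairs, one
    per captured signature. Returns one decision label per observation:
      'same_program' - signature matches the anticipated segment
      'deferred'     - no match, but the signature contains noise, so no
                       decision on trick play or channel change is made
      'changed'      - no match on a noise-free signature
    """
    decisions = []
    for is_noisy, matches_expected in observations:
        if matches_expected:
            decisions.append("same_program")
        elif is_noisy:
            decisions.append("deferred")
        else:
            decisions.append("changed")
    return decisions

decisions = track([(False, True), (True, False), (True, True), (False, False)])
```

In this sketch a noisy signature can still confirm the program if it happens to match, but it is never used on its own to conclude that the user has left the program.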
It should also be understood that, although the foregoing discussion of redressing corruption in an audio signature was illustrated using the example of user-generated audio that introduced noise in the signal, other forms of corruption are possible and may easily be redressed using the techniques previously described. For example, satellite dish systems that deliver programming content frequently experience brief signal outages due to high wind, rain, etc., and audio signals may be briefly sporadic. As another example, if programming content stored on a DVR or played on a DVD is being matched to programming content in a database, the audio signal may be corrupted due to imperfections in digital storage media. In any case, however, such corruption can be modelled and therefore identified and redressed as previously disclosed.
It will be appreciated that the disclosure is not restricted to the particular embodiment that has been described, and that variations may be made therein without departing from the scope of the disclosure as well as the appended claims, as interpreted in accordance with principles of prevailing law, including the doctrine of equivalents or any other principle that enlarges the enforceable scope of a claim beyond its literal scope. Unless the context indicates otherwise, a reference in a claim to the number of instances of an element, be it a reference to one instance or more than one instance, requires at least the stated number of instances of the element but is not intended to exclude from the scope of the claim a structure or method having more instances of that element than stated. The word “comprise” or a derivative thereof, when used in a claim, is used in a nonexclusive sense that is not intended to exclude the presence of other elements or steps in a claimed structure or method.