The present disclosure relates to identifying content in a media stream. For example, the present disclosure relates to cascading methods of performing a content identification of content in a media stream.
Content identification systems for various data types, such as audio or video, use many different methods. A client device may capture a media sample recording of a media stream (such as radio), and may then request a server to perform a search in a database of media recordings (also known as media tracks) for a match to identify the media stream. For example, the sample recording may be passed to a content identification server module, which can perform content identification of the sample and return a result of the identification to the client device. A recognition result may then be displayed to a user on the client device or used for various follow-on services, such as purchasing or referencing related information. Other applications for content identification include broadcast monitoring or content-sensitive advertising, for example.
Existing content identification systems may require user interaction to initiate a content identification request. Oftentimes, a user may initiate a request only after a song has ended, for example, and thus miss the opportunity to identify the song.
In addition, within content identification systems, a central server receives content identification requests from client devices and performs computationally intensive procedures to identify content of the sample. A large number of requests can cause delays when providing results to client devices due to a limited number of servers available to perform a recognition.
In one example, a method is provided that comprises receiving at a client device media content rendered by a media rendering source, and the client device making an attempt to determine an identity of the media content based on information stored on the client device. The method also includes based on the attempt of the client device to determine the identity of the media content, determining an identity of the media rendering source. The method further includes based on the attempt of the client device to determine the identity of the media content and on determining the identity of the media rendering source, sending information indicative of the media content to a content recognition server to determine the identity of the media content.
Any of the methods described herein may be provided in a form of instructions stored on a non-transitory, computer readable medium, that when executed by a computing device, cause the computing device to perform functions of the method. Further examples may also include articles of manufacture including tangible computer-readable media that have computer-readable instructions encoded thereon, and the instructions may comprise instructions to perform functions of the methods described herein.
As one example, a non-transitory computer readable medium having stored therein instructions executable by a computing device to cause the computing device to perform functions is provided. The functions comprise receiving media content rendered by a media rendering source, and making an attempt to determine an identity of the media content based on information stored on the client device. The functions also comprise based on the attempt to determine the identity of the media content, determining an identity of the media rendering source. The functions also comprise based on the attempt to determine the identity of the media content and on determining the identity of the media rendering source, sending information indicative of the media content to a content recognition server to determine the identity of the media content.
In still further examples, any type of device may be used or configured to perform logical functions in any processes or methods described herein. As one example, a device is provided that comprises a database and a content identification module coupled to the database. The database is configured to receive and store information indicative of one or more features of media content and information identifying the media content. The content identification module is configured to (i) make an attempt to determine an identity of received media content rendered by a media rendering source based on a comparison with the stored information in the database, (ii) based on the attempt, to determine an identity of the media rendering source, and (iii) based on the attempt and on determining the identity of the media rendering source, to send information indicative of the media content to a content recognition server to determine the identity of the media content.
The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the figures and the following detailed description.
In the following detailed description, reference is made to the accompanying figures, which form a part hereof. In the figures, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, figures, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein.
This disclosure may describe, inter alia, methods and systems for identifying information of a broadcast station and information of broadcasted content. In one example, a method includes receiving at a client device media content rendered by a media rendering source, and the client device making an attempt to determine an identity of the media content based on information stored on the client device. The method also includes based on the attempt of the client device to determine the identity of the media content, determining an identity of the media rendering source. The method further includes based on the attempt of the client device to determine the identity of the media content and on determining the identity of the media rendering source, sending information indicative of the media content to a content recognition server to determine the identity of the media content.
Referring now to the figures,
A client device 104 receives a rendering of the media stream from the media rendering source 102 through an input interface 106. In one example, the input interface 106 may include an antenna, in which case the media rendering source 102 may broadcast the media stream wirelessly to the client device 104. However, depending on a form of the media stream, the media rendering source 102 may render the media using wireless or wired communication techniques. In other examples, the input interface 106 can include any of a microphone, video camera, vibration sensor, radio receiver, network interface, etc. As a specific example, the media rendering source 102 may play music, and the input interface 106 may include a microphone to receive a sample of the music.
Within examples, the client device 104 may not be operationally coupled to the media rendering source 102, other than to receive the rendering of the media stream. In this manner, the client device 104 may not be controlled by the media rendering source 102, and may not be an integral portion of the media rendering source 102. In the example shown in
The input interface 106 is configured to capture a media sample of the rendered media stream. The input interface 106 may be preprogrammed to capture media samples continuously without user intervention, such as to record all audio received and store recordings in a buffer 108. The buffer 108 may store a number of recordings, or may store recordings for a limited time, such that the client device 104 may record and store recordings in predetermined intervals, for example, or in a way so that a history of a certain length backwards in time is available for analysis. In other examples, capturing of the media sample may be caused or triggered by a user activating a button or other application to trigger the sample capture. For example, a user of the client device 104 may press a button to record a ten second digital sample of audio through a microphone, or to capture a still image or video sequence using a camera.
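As an illustrative, non-limiting sketch, a buffer that retains a history of a certain length backwards in time might be arranged as follows. The class name `SampleBuffer`, its methods, and the representation of a recording as a (capture time, duration, samples) tuple are hypothetical and are chosen only for illustration; they are not taken from the disclosure.

```python
from collections import deque


class SampleBuffer:
    """Hypothetical rolling buffer keeping roughly the last `max_seconds` of recordings."""

    def __init__(self, max_seconds):
        self.max_seconds = max_seconds
        self._chunks = deque()  # each entry: (capture_time, duration, samples)

    def append(self, capture_time, duration, samples):
        """Store a new recording and discard recordings older than the history window."""
        self._chunks.append((capture_time, duration, samples))
        window_start = capture_time + duration - self.max_seconds
        # Drop recordings that ended at or before the start of the retained window.
        while self._chunks:
            oldest_time, oldest_duration, _ = self._chunks[0]
            if oldest_time + oldest_duration <= window_start:
                self._chunks.popleft()
            else:
                break

    def history(self):
        """Return the retained recordings, oldest first."""
        return list(self._chunks)
```

In such an arrangement, a content identification performed after a song has ended could still draw on recordings captured while the song was playing, provided the history window is long enough.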
The client device 104 can be implemented as a portion of a small-form factor portable (or mobile) electronic device such as a cell phone, a wireless cell phone, a personal data assistant (PDA), tablet computer, a personal media player device, a wireless web-watch device, a personal headset device, an application specific device, or a hybrid device that includes any of the above functions. The client device 104 can also be implemented as a personal computer including both laptop computer and non-laptop computer configurations. The client device 104 can also be a component of a larger device or system as well.
The client device 104 further includes a position identification module 110 and a content identification module 112. The position identification module 110 is configured to receive a media sample from the buffer 108 and to identify a corresponding estimated time position (TS) indicating a time offset of the media sample into the rendered media stream (or into a segment of the rendered media stream) based on the media sample that is being captured at that moment. The time position (TS) may also, in some examples, be an elapsed amount of time from a beginning of the media stream. For example, the media stream may be a radio broadcast, and the time position (TS) may correspond to an elapsed amount of time of a song being rendered.
The content identification module 112 is configured to receive the media sample from the buffer 108 and to perform a content identification on the received media sample. The content identification identifies a media stream, or identifies information about or related to the media sample. The content identification module 112 may be configured to receive samples of environmental audio, identify a musical content of the audio sample, and provide information about the music, including the track name, artist, album, artwork, biography, discography, concert tickets, etc.
In this regard, the content identification module 112 includes a media search engine 114 and may include or be coupled to a database 116 that indexes reference media streams, for example, to compare the received media sample with the stored information so as to identify tracks within the received media sample. Once tracks within the media stream have been identified, track identities or other information may be displayed on a display of the client device 104.
The database 116 may store content patterns that include information to identify pieces of content. The content patterns may include media recordings such as music, advertisements, jingles, movies, documentaries, television and radio programs. Each recording may be identified by a unique identifier (e.g., sound_ID). Alternatively, the database 116 may not necessarily store audio or video files for each recording, since the sound_IDs can be used to retrieve audio files from elsewhere. The content patterns may include other information (in addition to or rather than media recordings), such as reference signature files including a temporally mapped collection of features describing content of a media recording that has a temporal dimension corresponding to a timeline of the media recording, and each feature may be a description of the content in a vicinity of each mapped timepoint. Generally, features in the signature file can be chosen to be reproducible in the presence of noise and distortion, for example. The features may be extracted from media recordings sparsely at discrete time positions, and each feature may correspond to a feature of interest. Examples of sparse features include Lp norm power peaks, spectrogram energy peaks, linked salient points, etc. For more examples, the reader is referred to U.S. Pat. No. 6,990,453, by Wang and Smith, which is hereby entirely incorporated by reference.
Alternatively, a continuous time axis could be represented densely, in which every value of time has a corresponding feature value that may be included or represented in a signature file for a media recording. Examples of such dense features include feature waveforms (as described in U.S. Pat. No. 7,174,293 to Kenyon, which is hereby entirely incorporated by reference), spectrogram bitmap rasters (as described in U.S. Pat. No. 5,437,050, which is hereby entirely incorporated by reference), an activity matrix (as described in U.S. Publication Patent Application No. 2010/0145708, which is hereby entirely incorporated by reference), and an energy flux bitmap raster (as described in U.S. Pat. No. 7,549,052, which is hereby entirely incorporated by reference).
In one example, a signature file includes a sparse feature representation of a media recording. The features of the recording may be obtained from a spectrogram extracted using overlapped short-time Fast Fourier Transforms (FFT). Peaks in the spectrogram can be chosen at time-frequency locations where a corresponding energy value is a local maximum. For example, peaks may be selected by identifying maximum points in a region surrounding each candidate location. A psychoacoustic masking criterion may also be used to suppress inaudible energy peaks. Each peak can be coded as a pair of time and frequency values. Additionally, an energy amplitude of the peaks may be recorded. In one example, an audio sampling rate is 8 kHz, and an FFT frame size may vary between about 64 and 1024 bins, with a hop size between frames chosen to give about 25-75% overlap with the previous frame. Increasing a frequency resolution may result in less temporal accuracy. Additionally, a frequency axis could be warped and interpolated onto a logarithmic scale, such as mel-frequency.
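As a minimal sketch of the peak selection described above, and not a definitive implementation, the following computes a power spectrogram from overlapped short-time FFTs and keeps the time-frequency locations whose energy is a maximum within a surrounding region. The function names, the Hann window, the frame size of 256 with a hop of 128 (50% overlap), and the neighborhood radius are assumptions made for illustration; psychoacoustic masking is omitted.

```python
import numpy as np


def spectrogram(signal, frame_size=256, hop=128):
    """Power spectrogram from overlapped, Hann-windowed short-time FFTs."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_size] * window for i in range(n_frames)]
    )
    # Result has shape (n_frames, frame_size // 2 + 1).
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def local_peaks(spec, radius=2):
    """Return (frame_index, frequency_bin, energy) for each local energy maximum."""
    peaks = []
    n_t, n_f = spec.shape
    for t in range(n_t):
        for f in range(n_f):
            t0, t1 = max(0, t - radius), min(n_t, t + radius + 1)
            f0, f1 = max(0, f - radius), min(n_f, f + radius + 1)
            # A candidate is kept when it equals the maximum of its neighborhood.
            if spec[t, f] > 0 and spec[t, f] == spec[t0:t1, f0:f1].max():
                peaks.append((t, f, float(spec[t, f])))
    return peaks
```

A frame index can be converted to a time value by multiplying by the hop size and dividing by the sampling rate, so that each peak is coded as a pair of time and frequency values.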
A number of features or information associated with the features may be combined into a signature file. A signature file may order features as a list arranged in increasing time. Each feature Fj can be associated with a time value tj in a data construct, and the list can be an array of such constructs; here j is the index of the j-th construct, for example. In an example using a continuous time representation, e.g., successive frames of a spectrogram, the time axis could be implicit in the index into the list array. The time axis within each media recording can be obtained as an offset from a beginning of the recording, and thus time zero refers to the beginning of the recording.
The feature extraction module 204 may extract features from the media recording, using any of the example methods described above, to generate a signature file 208 for the media recording. The feature extraction module 204 may store the signature file 208 in the media signature database 206. The media signature database 206 may store signature files with an associated identifier, as shown in
A size of a resulting signature file may vary depending on a feature extraction method used. In one example, a density of selected spectrogram peaks (e.g., features) may be chosen to be between about 10 and 50 points per second. The peaks can be chosen as the top N most energetic peaks per unit time, for example, the top 10 peaks in a one-second frame. In an example using 10 peaks per second, using 32 bits to encode each peak (e.g., 8 bits for the frequency value and 24 bits to encode the time offset), 40 bytes per second may be required to encode the features. With an average song length of about three minutes, a signature file size of approximately 7.2 kilobytes may result for a song. For other signature encoding methods, for example, a 32-bit feature at every offset of a spectrogram with a hop size of 100 milliseconds, a similar size fingerprint results.
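The size estimate above can be checked with a short sketch. The function names and the bit layout (frequency value in the low 8 bits, time offset in the high 24 bits) are assumptions for illustration only; the disclosure specifies the bit widths but not the ordering.

```python
def pack_peak(freq_value, time_offset):
    """Pack one peak into 32 bits: 24-bit time offset, 8-bit frequency (assumed layout)."""
    if not (0 <= freq_value < 2**8 and 0 <= time_offset < 2**24):
        raise ValueError("value does not fit the assumed 8/24-bit layout")
    return (time_offset << 8) | freq_value


def unpack_peak(word):
    """Recover (freq_value, time_offset) from a packed 32-bit word."""
    return word & 0xFF, word >> 8


def signature_size_bytes(peaks_per_second, bits_per_peak, duration_seconds):
    """Total bytes needed to encode all peaks of a recording."""
    return peaks_per_second * bits_per_peak // 8 * duration_seconds


# 10 peaks/s at 32 bits each is 40 bytes/s; a 180-second song needs 7200 bytes.
song_bytes = signature_size_bytes(10, 32, 180)
```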
In another example, a signature file may be on the order of about 5-10 KB, and may correspond to a portion of a media recording from which a sample was obtained that is about 20 seconds long and refers to a portion of the media recording after an end of a captured sample.
In some examples, the signature file may represent a fingerprint of a media recording by describing features of the recording. In this regard, signatures of a media recording may be considered fingerprints of the recording, and signatures or fingerprints may be included in a signature file.
The system shown in
Referring back to
The database 116 may also include information for each stored signature file, such as metadata that indicates information about the signature file like an artist name, a length of song, lyrics of the song, time indices for lines or words of the lyrics, album artwork, or any other identifying or related information to the file. Metadata may also comprise data and hyperlinks to other related content and services, including recommendations, ads, offers to preview, bookmark, and buy musical recordings, videos, concert tickets, and bonus content; as well as to facilitate browsing, exploring, discovering related content on the world wide web.
The database 116 may further include information associated with the media rendering source 102, such as playlists of the media rendering source 102 (e.g., including identity of broadcasted content as well as times at which the content is broadcast). Thus, the database 116 may include both content identification and broadcast station identification in a correlated manner.
The content identification module 112 may also include a signature extractor 118 that may be configured to generate a signature stream of extracted features from captured media samples, and each feature may have a corresponding time position within the sample. The signature stream of extracted features can be used to compare to stored signature files in the database 116 to identify a corresponding media recording. In some examples, the signature extractor 118 may be configured to extract features from a media sample using any of the methods described above for generating a signature file, to generate a signature stream of extracted features. A signature stream may be determined and generated in real-time based on an observed media stream, for example.
The content identification module 112 and/or the signature extractor 118 may further be configured to compare alignment of features within the media sample and the signature file to identify matching features at corresponding times.
The content identification module 112 may further be configured to identify a source of broadcasted content by comparison of an identity of the content with a number of playlists of broadcast stations, for example.
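One possible, simplified way to compare an identified piece of content with a number of playlists is sketched below. The function name, the playlist structure (a mapping from a station name to entries of content identifier, start time, and end time), and the `tolerance` parameter are hypothetical and are not specified by the disclosure.

```python
def find_broadcast_source(content_id, observed_time, playlists, tolerance=0):
    """Return stations whose playlist shows `content_id` airing at `observed_time`.

    `playlists` maps a station name to a list of
    (content_id, start_time, end_time) entries (assumed structure).
    """
    matches = []
    for station, entries in playlists.items():
        for entry_id, start, end in entries:
            airing = start - tolerance <= observed_time <= end + tolerance
            if entry_id == content_id and airing:
                matches.append(station)
                break  # one matching entry is enough for this station
    return matches
```

When more than one station is returned, the source remains ambiguous, and further samples or other information may be needed to select among the candidates.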
The system in
In some examples, the client device 104 may capture a media sample and may send the media sample over the network 120 to the server 122 to determine an identity of content in the media sample. The position identification module 124 and the content identification module 126 of the server 122 may be configured to operate similar to the position identification module 110 and the content identification module 112 of the client device 104. In this regard, the content identification module 126 includes a media search engine 128 and may include or be coupled to a database 130 that indexes reference media streams, for example, to compare the received media sample with the stored information so as to identify tracks within the received media sample. Once tracks within the media stream have been identified, track identities or other information may be returned to the client device 104.
In response to a content identification query received from the client device 104, the server 122 may identify a media recording from which the media sample was obtained, and/or retrieve a signature file corresponding to the identified media recording. The server 122 may then return information identifying the media recording, and a signature file corresponding to the media recording, to the client device 104.
In other examples, the client device 104 may capture a sample of a media stream from the media rendering source 102, and may perform initial processing on the sample so as to create a signature file/fingerprint of the media sample. The client device 104 may then send the fingerprint information to the position identification module 124 and/or the content identification module 126 of the server 122, which may identify information pertaining to the sample based on the fingerprint information alone. In this manner, more computation or identification processing can be performed at the client device 104, rather than at the server 122, for example.
In still other examples, as described above, the client device 104 may further be configured to perform content identifications locally by comparing alignment of features within the media sample and signature files to identify matching features at corresponding times.
Various content identification techniques are known in the art for performing computational content identifications of media samples and features of media samples using a database of media tracks. The following U.S. patents and publications describe possible examples for media recognition techniques, and each is entirely incorporated herein by reference, as if fully set forth in this description: Kenyon et al, U.S. Pat. No. 4,843,562, entitled “Broadcast Information Classification System and Method”; Kenyon, U.S. Pat. No. 4,450,531, entitled “Broadcast Signal Recognition System and Method”; Haitsma et al, U.S. Patent Application Publication No. 2008/0263360, entitled “Generating and Matching Hashes of Multimedia Content”; Wang and Culbert, U.S. Pat. No. 7,627,477, entitled “Robust and Invariant Audio Pattern Matching”; Wang, Avery, U.S. Patent Application Publication No. 2007/0143777, entitled “Method and Apparatus for Identification of Broadcast Source”; Wang and Smith, U.S. Pat. No. 6,990,453, entitled “System and Methods for Recognizing Sound and Music Signals in High Noise and Distortion”; Blum, et al, U.S. Pat. No. 5,918,223, entitled “Method and Article of Manufacture for Content-Based Analysis, Storage, Retrieval, and Segmentation of Audio Information”; and Master, et al, U.S. Patent Application Publication No. 2010/0145708, entitled “System and Method for Identifying Original Music”.
Briefly, the content identification module (within the client device 104 or the server 122) may be configured to receive a media recording and sample the media recording. The recording can be correlated with digitized, normalized reference signal segments to obtain correlation function peaks for each resultant correlation segment to provide a recognition signal when the spacing between the correlation function peaks is within a predetermined limit. A pattern of RMS power values coincident with the correlation function peaks may match within predetermined limits of a pattern of the RMS power values from the digitized reference signal segments, as noted in U.S. Pat. No. 4,450,531, which is entirely incorporated by reference herein, for example. The matching media content can thus be identified. Furthermore, the matching position of the media recording in the media content is given by the position of the matching correlation segment, as well as the offset of the correlation peaks, for example.
Fingerprints can be computed by any type of digital signal processing or frequency analysis of the signal. In one example, to generate spectral slice fingerprints, a frequency analysis is performed in the neighborhood of each landmark timepoint to extract the top several spectral peaks. A fingerprint value may then be the single frequency value of a strongest spectral peak. For more information on calculating characteristics or fingerprints of audio samples, the reader is referred to U.S. Pat. No. 6,990,453, to Wang and Smith, entitled “System and Methods for Recognizing Sound and Music Signals in High Noise and Distortion,” the entire disclosure of which is herein incorporated by reference as if fully set forth in this description.
Thus, referring back to
Referring to
In one example, to generate a score for a file, a histogram of offset values can be generated. The offset values may be differences in landmark time positions between the sample and the reference file where a fingerprint matches.
In addition, systems and methods described within the publications above may return more than an identity of a media sample. For example, using the method described in U.S. Pat. No. 6,990,453 to Wang and Smith may return, in addition to metadata associated with an identified audio track, a relative time offset (RTO) of a media sample from a beginning of the identified audio track. To determine a relative time offset of the recording, fingerprints of the sample can be compared with fingerprints of the original files to which the fingerprints match. Each fingerprint occurs at a given time, so after matching fingerprints to identify the sample, a difference in time between a first matching fingerprint in the sample and the corresponding fingerprint of the stored original file will be a time offset of the sample, e.g., amount of time into a song. Thus, a relative time offset (e.g., 67 seconds into a song) at which the sample was taken can be determined. Other information may be used as well to determine the RTO. For example, a location of a histogram peak may be considered the time offset from a beginning of the reference recording to the beginning of the sample recording.
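The histogram scoring and the relative time offset described above might be sketched as follows. This is a simplified illustration under stated assumptions rather than the method of the incorporated references: fingerprints are modeled as (value, landmark time in seconds) pairs, offsets are quantized into bins of an assumed width `bin_width`, and the function name is hypothetical.

```python
from collections import Counter


def score_and_offset(sample_fps, reference_fps, bin_width=0.1):
    """Score a reference file against a sample and estimate the relative time offset.

    Each argument is a list of (fingerprint_value, landmark_time_seconds) pairs.
    Returns (score, offset_seconds); offset is None when nothing matches.
    """
    # Index the reference landmark times by fingerprint value.
    ref_index = {}
    for value, t_ref in reference_fps:
        ref_index.setdefault(value, []).append(t_ref)

    # Histogram of (reference time - sample time) for every matching fingerprint.
    histogram = Counter()
    for value, t_sample in sample_fps:
        for t_ref in ref_index.get(value, ()):
            histogram[round((t_ref - t_sample) / bin_width)] += 1

    if not histogram:
        return 0, None
    best_bin, score = histogram.most_common(1)[0]
    # The peak location is taken as the offset of the sample into the reference.
    return score, best_bin * bin_width
```

A true match tends to produce many matching fingerprints at one consistent offset, and thus a tall histogram peak, whereas chance matches scatter across many bins.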
Other forms of content identification may also be performed depending on a type of the media sample. For example, a video identification algorithm may be used to identify a position within a video stream (e.g., a movie). An example video identification algorithm is described in Oostveen, J., et al., “Feature Extraction and a Database Strategy for Video Fingerprinting”, Lecture Notes in Computer Science, 2314, (Mar. 11, 2002), 117-128, the entire contents of which are herein incorporated by reference. For example, a position of the video sample into a video can be derived by determining which video frame was identified. To identify the video frame, frames of the media sample can be divided into a grid of rows and columns, and for each block of the grid, a mean of the luminance values of pixels is computed. A spatial filter can be applied to the computed mean luminance values to derive fingerprint bits for each block of the grid. The fingerprint bits can be used to uniquely identify the frame, and can be compared or matched to fingerprint bits of a database that includes known media. The extracted fingerprint bits from a frame may be referred to as sub-fingerprints, and a fingerprint block is a fixed number of sub-fingerprints from consecutive frames. Using the sub-fingerprints and fingerprint blocks, identification of video samples can be performed. Based on which frame the media sample included, a position into the video (e.g., time offset) can be determined.
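As a rough sketch of deriving fingerprint bits from block luminance means, the following divides a frame into a grid, computes the mean luminance of each block, and applies a simple spatial filter. The specific filter (the sign of the difference between horizontally adjacent blocks), the 4-by-9 grid, and the function names are assumptions for illustration; the filter in the cited reference differs and also uses information from neighboring frames.

```python
import numpy as np


def block_means(frame, rows=4, cols=9):
    """Mean luminance of each block in a rows-by-cols grid over a 2-D frame."""
    h, w = frame.shape
    # Crop so the frame divides evenly into blocks.
    frame = frame[: h - h % rows, : w - w % cols]
    bh, bw = frame.shape[0] // rows, frame.shape[1] // cols
    return frame.reshape(rows, bh, cols, bw).mean(axis=(1, 3))


def sub_fingerprint(frame, rows=4, cols=9):
    """Derive rows * (cols - 1) bits from one frame (assumed spatial filter)."""
    means = block_means(frame, rows, cols)
    # Bit is 1 where a block is brighter than its left-hand neighbor.
    diffs = means[:, 1:] - means[:, :-1]
    return (diffs > 0).astype(np.uint8).ravel()
```

With the assumed 4-by-9 grid, each frame yields a 32-bit sub-fingerprint, and sub-fingerprints from consecutive frames could be stacked to form a fingerprint block.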
Furthermore, other forms of content identification may also be performed, such as using watermarking methods. A watermarking method can be used by the position identification module 110 of the client device 104 (and similarly by the position identification module 124 of the server 122) to determine the time offset such that the media stream may have embedded watermarks at intervals, and each watermark may specify a time or position of the watermark either directly, or indirectly via a database lookup, for example.
In some of the foregoing example content identification methods for implementing functions of the content identification module 112, a byproduct of the identification process may be a time offset of the media sample within the media stream. Thus, in such examples, the position identification module 110 may be the same as the content identification module 112, or functions of the position identification module 110 may be performed by the content identification module 112.
In some examples, the client device 104 or the server 122 may further access a media stream library database 132 through the network 120 to select a media stream corresponding to the sampled media that may then be returned to the client device 104 to be rendered by the client device 104. Information in the media stream library database 132, or the media stream library database 132 itself, may be included within the database 116.
An estimated time position of the media being rendered by the media rendering source 102 is determined by the position identification module 110 and used to determine a corresponding position within the selected media stream at which to render the selected media stream. When the client device 104 is triggered to capture a media sample, a timestamp (T0) is recorded from a reference clock of the client device 104. The timestamp corresponding to a sampling time of the media sample is recorded as T0 and may be referred to as the synchronization point. The sampling time may preferably be the beginning, but could also be an ending, middle, or any other predetermined time of the media sample. Thus, the media samples may be time-stamped so that a corresponding time offset within the media stream from a fixed arbitrary reference point in time is known. At any time t, an estimated real-time media stream position Tr(t) is determined from the estimated identified media stream position TS plus elapsed time since the time of the timestamp:
$$T_r(t) = T_S + t - T_0 \qquad \text{Equation (1)}$$
Tr(t) is an elapsed amount of time from a beginning of the media stream to a real-time position of the media stream as is currently being rendered. Thus, using TS (i.e., the estimated elapsed amount of time from a beginning of the media stream to a position of the media stream based on the recorded sample), Tr(t) can be calculated. Tr(t) is then used by the client device 104 to present the selected media stream in synchrony with the media being rendered by the media rendering source 102. For example, the client device 104 may begin rendering the selected media stream at the time position Tr(t), or at a position such that Tr(t) amount of time has elapsed, so as to render and present the selected media stream in synchrony with the media being rendered by the media rendering source 102.
In some embodiments, the estimated position Tr(t) can be adjusted according to a speed adjustment ratio R. For example, methods described in U.S. Pat. No. 7,627,477, entitled “Robust and invariant audio pattern matching”, the entire contents of which are herein incorporated by reference, can be performed to identify the media sample, the estimated identified media stream position TS, and a speed ratio R. To estimate the speed ratio R, cross-frequency ratios of variant parts of matching fingerprints are calculated, and because frequency is inversely proportional to time, a cross-time ratio is the reciprocal of the cross-frequency ratio. A cross-speed ratio R is the cross-frequency ratio (e.g., the reciprocal of the cross-time ratio).
The speed ratio R can be estimated using other methods as well. For example, multiple samples of the media can be captured, and content identification can be performed on each sample to obtain multiple estimated media stream positions TS(k) at reference clock time T0(k) for the k-th sample. Then, R could be estimated as:
$$R = \frac{T_S(k) - T_S(1)}{T_0(k) - T_0(1)} \qquad \text{Equation (2)}$$
To represent R as time-varying, the following equation may be used:
$$R(k) = \frac{T_S(k) - T_S(k-1)}{T_0(k) - T_0(k-1)} \qquad \text{Equation (3)}$$
Thus, the speed ratio R can be calculated using the estimated time positions TS over a span of time to determine the speed at which the media is being rendered by the media rendering source 102.
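As a hedged illustration of estimating the speed ratio from multiple samples, the sketch below assumes that R is the change in estimated media stream position TS divided by the corresponding change in reference clock time T0. That form, and the function names, are assumptions made for this example.

```python
def speed_ratio(positions, clock_times):
    """Estimate R from the first and last samples.

    positions[k] is the estimated stream position TS(k), and clock_times[k]
    is the reference clock time T0(k) for the k-th sample, both in seconds.
    """
    return (positions[-1] - positions[0]) / (clock_times[-1] - clock_times[0])


def speed_ratio_series(positions, clock_times):
    """Time-varying estimate: one ratio per pair of consecutive samples."""
    return [
        (positions[k] - positions[k - 1]) / (clock_times[k] - clock_times[k - 1])
        for k in range(1, len(positions))
    ]
```

For instance, if the stream position advances 21 seconds while the reference clock advances 20 seconds, the media is being rendered at about 1.05 times its nominal speed.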
Using the speed ratio R, an estimate of the real-time media stream position can be calculated as:
Tr(t)=TS+R(t−T0) Equation (4)
The real-time media stream position indicates the position in time of the media sample. For example, if the media sample is from a song that has a length of four minutes, and if Tr(t) is one minute, that indicates that one minute of the song has elapsed. The time information may be determined by the client device during content identification.
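As a non-limiting illustration of Equation (4), the following Python sketch estimates a speed ratio from two observations of (reference clock time, stream position) and then extrapolates the real-time position. The function names and the two-point slope estimator are assumptions made for illustration only and are not taken from the disclosure.

```python
def estimate_speed_ratio(t0_first, ts_first, t0_last, ts_last):
    """Slope of stream position versus reference clock time (assumed estimator)."""
    elapsed_clock = t0_last - t0_first
    if elapsed_clock <= 0:
        raise ValueError("samples must be taken at increasing reference clock times")
    return (ts_last - ts_first) / elapsed_clock


def realtime_position(ts, t0, t, speed_ratio=1.0):
    """Equation (4): Tr(t) = TS + R * (t - T0), with all values in seconds."""
    return ts + speed_ratio * (t - t0)
```

With a speed ratio of 1.0 the estimate reduces to adding the wall-clock time that has elapsed since the sample was taken.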
It should be understood that for this and other processes and methods disclosed herein, flowcharts show functionality and operation of one possible implementation of present embodiments. In this regard, each block may represent a module, a segment, or a portion of program code, which includes one or more instructions executable by a processor for implementing specific logical functions or steps in the process. The program code may be stored on any type of computer readable medium or data storage, for example, such as a storage device including a disk or hard drive. The computer readable medium may include non-transitory computer readable medium or memory, for example, such as computer-readable media that stores data for short periods of time like register memory, processor cache and Random Access Memory (RAM). The computer readable medium may also include non-transitory media, such as secondary or persistent long term storage, like read only memory (ROM), optical or magnetic disks, compact-disc read only memory (CD-ROM), for example. The computer readable media may also be any other volatile or non-volatile storage systems. The computer readable medium may be considered a tangible computer readable storage medium, for example.
In addition, each block in
The method 400 includes, at block 402, receiving a sample of a media stream at a client device. The client device may receive the media stream continuously, sporadically, or at intervals, and the media stream may include any type of data or media, such as a radio broadcast, television audio/video, or any audio being rendered. The media stream may be continuously rendered by a source, and thus, the client device may continuously receive the media stream. In some examples, the client device may receive a substantially continuous media stream, such that the client device receives a substantial portion of the media stream rendered, or such that the client device receives the media stream at substantially all times. The client device may capture a sample of the media stream using a microphone, for example.
The method 400 includes, at block 404, at the client device, determining a signature stream of features of the sample. For example, a client device may receive via an input interface (e.g., microphone) samples of the media stream in an incremental manner as a media stream is being received, and may extract features of these samples to generate corresponding signature stream increments. Each incremental sample may include content at a time after a previous sample, as the media stream rendered by the media rendering source may have been ongoing. The signature stream may be generated based on samples of the media stream using any of the methods described above for extracting features of a sample, for example.
The signature stream may be generated on an ongoing basis in real time when the media stream is an ongoing media stream. In this manner, features in the signature stream may increase in number over time.
The method 400 includes, at block 406, determining whether features between the signature stream of the sample and a signature file for at least one media recording are substantially matching over time. For example, the client device may compare the features in the signature stream with features in stored signature files. The features in the signature stream may be or include landmark-fingerprint pairs, and the signature files may include landmark-fingerprint pairs for a given reference file, for example. Thus, the client device may perform comparisons of landmark-fingerprint pairs of the signature stream and signature files.
The method 400 includes, at block 408, determining whether a number of matching features is above a threshold, and based on the number of matching features, identifying a matching media recording at block 410. For example, the client device may be configured to determine a number of matching features between the signature stream of the media sample and stored signature files, and rank the number of matching features for each signature file. A signature file that has a highest number of matching features may be considered a match, and a media recording that is identified by or referenced by the signature file may be identified as a matching recording for the sample.
In one example, block 406 may be repeated after block 408 when the number of matching features is less than a threshold, such that features between the signature stream and the signature files can be repeatedly compared. Over time, when a media stream is continuously received, the client device may receive more content for the signature stream (e.g., a longer portion of a song), and accumulation of data may be processed in aggregate with results from processing earlier segments to look for matches within longer samples.
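As a non-limiting illustration of blocks 406 to 410, the following Python sketch counts features shared between a growing signature stream and each stored signature file, and reports a match only once the best count exceeds a threshold. Representing features as hashable values in sets, and the name `best_match`, are assumptions for illustration; the sketch also ignores temporal alignment of the features, which an actual comparison "over time" would take into account.

```python
from collections import Counter


def best_match(signature_stream, signature_files, threshold):
    """Return (recording_id, count) for the best file above threshold, else None.

    signature_stream: iterable of hashable features captured so far.
    signature_files: dict mapping recording id -> set of reference features.
    """
    stream_features = set(signature_stream)
    counts = Counter()
    for recording_id, reference_features in signature_files.items():
        counts[recording_id] = len(stream_features & reference_features)
    if not counts:
        return None
    recording_id, count = counts.most_common(1)[0]
    return (recording_id, count) if count > threshold else None
```

Calling the function again after more features have been appended to the stream corresponds to repeating block 406 on the accumulated data.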
The client device may receive the media stream continuously and may continuously perform content identifications based on comparisons with stored signature files. In this manner, the client device may attempt to identify all content that is received. The content identifications may be substantially continuously performed, such that content identifications are performed at all times or substantially all the time while the client device is operating, or while an application comprising content identification functions is running, for example.
In some examples, content identifications can be performed upon receiving the media stream. The client device may be configured to continuously receive a data stream from a microphone (e.g., always capture ambient audio). The client device may be configured to continuously perform the content identifications so as to perform a passive content identification without user input (e.g., the user does not have to trigger the client device to perform the content identification). A user of the client device may initiate an application that continuously performs the content identifications or may configure a setting on the client device such that the client device continuously performs the content identifications.
Using the method 400 in
In one example, when featured content is captured by the client device, the client device can perform the content identification and provide a notification (e.g., pop-up window) indicating recognition. The method 400 may provide a zero-click (e.g., passive) tagging experience for users to notify users when featured content is identified.
In addition to determining an identity of the content, the system in
In other examples, a broadcast source may be identified by receiving a time-stamped recording of media content and recordings from broadcast channels, and then identifying characteristics of the recordings for comparison. For example, fingerprints of recordings taken at similar times can be compared, and such a comparison allows for a direct identification of the broadcast channel from which the media content was recorded. Using this method, spectrogram peaks or other characteristics of the signal rather than the direct signals can be compared. Further, the correct broadcast channel can be identified without any content identification (or identification of the content) being required, for example.
At the same time, samples from broadcast channels being monitored are recorded, as shown at block 508. Similar to user samples, each broadcast sample is also time stamped in terms of a “real-time” offset from a common time base. Further, using the technique of Wang and Smith, described below, characteristics and an estimated time offset of the broadcast sample within the “original” recording are determined, as shown at blocks 510 and 512 (e.g., to determine the point in a song when the sample was recorded).
Then the client device sample characteristics are compared with characteristics from broadcast samples that were taken at or near the time the user sample was recorded, as shown at block 514. The client device sample time stamp is used to identify broadcast samples for comparison. Further, the time offset of the client device sample is compared to the time offset of the broadcast sample to identify a match, as shown at block 516. If the real-time offsets are within a certain tolerance, e.g., one second, then the client device sample is considered to be originating from the same source as the broadcast sample, since the probability that a random performance of the same audio content (such as a hit song) is synchronized to less than one second in time is low.
The client device sample is compared with samples from all broadcast channels until a match is found, as shown at blocks 518 and 520. Once a match is found, the broadcast source of the client device sample is identified, as shown at block 522.
The user may provide the sample to a server 608 (e.g., provide the sample over a network to the server 608 or dial a service to identify broadcast information pertaining to the audio sample, such as an IVR answering system, for example). The audio sample can be provided to the server 608 in the form of acoustic waves, radio waves, a digital audio PCM stream, a compressed digital audio stream (such as Dolby Digital or MP3), or an Internet streaming broadcast. The server 608 may identify or compute characteristics or fingerprints of the sample at landmarks. The server 608 may compute the fingerprints by contacting additional recognition engines, such as a fingerprint extractor 610. The server 608 will thus have timestamped fingerprint tokens of the audio sample that can be used to compare with broadcast samples.
A broadcast monitoring station 612 is configured to monitor each broadcast channel of the radio stations 602 to obtain the broadcast samples. The monitoring station 612 includes a multi-channel radio receiver 614 to receive broadcast information from the radio stations 602. The broadcast information is sent to channel samplers 1 . . . k, as referenced by arrow 616. Each channel sampler 616 has a channel fingerprint extractor 618 for calculating fingerprints of the broadcast samples, as described above, and as described within Wang and Smith.
The monitoring station 612 can then sort and store fingerprints for each broadcast sample for a certain amount of time within a fingerprint block sorter 620. The monitoring station 612 can continually monitor audio streams from the broadcasters while noting the times corresponding to the data recording. After a predetermined amount of time, the monitoring station 612 can write over stored broadcast sample fingerprints to refresh the information to coordinate to audio samples currently being broadcast, for example. A rolling buffer of a predetermined length can be used to hold recent fingerprint history. Since the fingerprints within the rolling buffer will be compared against fingerprints generated from the incoming sample, fingerprints older than a certain cutoff time can be ignored, as they will be considered to be representing audio collected too far in the past. The length of the buffer is determined by a maximum permissible delay plausible for a real-time simultaneous recording of audio signals originating from a real-time broadcast program, such as network latencies of Voice-over-IP networks, internet streaming, and other buffered content. The delays can range from a few milliseconds to a few minutes.
A rolling buffer may be generated using batches of time blocks, e.g., perhaps M=10 seconds long each: every 10 seconds blocks of new [hash+channel ID+timestamp] are dumped into a big bucket and sorted by hash. Then each block ages, and parallel searches are done for each of N blocks to collect matching hashes, where N*M is the longest history length, and (N−1)*M is the shortest. The hash blocks can be retired in a conveyor-belt fashion.
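As a non-limiting illustration, the conveyor-belt buffer described above may be sketched in Python as follows. The class name, the default values, and the use of a dictionary keyed by hash (in place of a sorted list) are assumptions for illustration; the blocks are also searched one after another here rather than in parallel.

```python
from collections import deque


class RollingHashBuffer:
    """Keeps the most recent N blocks, each covering M seconds (assumed layout)."""

    def __init__(self, block_seconds=10.0, max_blocks=6):
        self.block_seconds = block_seconds      # M
        self.max_blocks = max_blocks            # N
        self.blocks = deque(maxlen=max_blocks)  # oldest block is retired automatically

    @property
    def longest_history_seconds(self):
        """N * M, the longest history the buffer can hold."""
        return self.max_blocks * self.block_seconds

    def add_block(self, entries):
        """entries: iterable of (hash_value, channel_id, timestamp) tuples."""
        index = {}
        for hash_value, channel_id, timestamp in entries:
            index.setdefault(hash_value, []).append((channel_id, timestamp))
        self.blocks.append(index)

    def lookup(self, hash_value):
        """Collect (channel_id, timestamp) hits for a hash across all live blocks."""
        hits = []
        for index in self.blocks:
            hits.extend(index.get(hash_value, []))
        return hits
```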
Upon receiving an inquiry from the client device 606 to determine broadcast information corresponding to a given audio sample, the monitoring station 612 searches for corresponding fingerprint hashes within the broadcast sample fingerprints (e.g., linearly corresponding). In particular, a processor 622 in the monitoring station 612 first selects a given broadcast channel to determine if a broadcast sample identity of a broadcast sample recorded at or near the client device sample time matches the client device audio sample fingerprints. If not, the sorter 620 selects the next broadcast channel and continues searching for a match.
Fingerprints of the broadcast samples and the client device audio sample are matched by generating correspondences between equivalent fingerprints, and the file that has the largest number of linearly related correspondences or whose relative locations of characteristic fingerprints most closely match the relative locations of the same fingerprints of the audio sample may be deemed the matching media file.
In particular, the client device audio sample fingerprints are used to retrieve sets of matching fingerprints stored in the sorter 620. The set of retrieved fingerprints is then used to generate correspondence pairs containing sample landmarks and retrieved file landmarks at which the same fingerprints were computed. The resulting correspondence pairs are then sorted by media file identifiers, generating sets of correspondences between sample landmarks and file landmarks for each applicable file. Each set is scanned for alignment between the file landmarks and sample landmarks. That is, linear correspondences in the pairs of landmarks are identified, and the set is scored according to the number of pairs that are linearly related. A linear correspondence occurs when a large number of corresponding sample locations and file locations can be described with substantially the same linear equation, within an allowed tolerance. The file of the set with the highest score, i.e., with the largest number of linearly related correspondences, is the winning file.
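As a non-limiting illustration of the scoring described above, the following Python sketch groups correspondence pairs by file and counts how many pairs share the same time offset. It assumes a slope of one (no speed change), so that a linear correspondence reduces to a constant offset between file landmark and sample landmark; the function names, the data layout, and the bin width used as the tolerance are assumptions for illustration.

```python
from collections import Counter, defaultdict


def score_files(sample_prints, index, bin_width=0.1):
    """Score each file by its largest group of pairs sharing one offset bin.

    sample_prints: list of (fingerprint, sample_time) tuples.
    index: dict mapping fingerprint -> list of (file_id, file_time) tuples.
    """
    offsets = defaultdict(Counter)
    for fingerprint, sample_time in sample_prints:
        for file_id, file_time in index.get(fingerprint, []):
            offset_bin = round((file_time - sample_time) / bin_width)
            offsets[file_id][offset_bin] += 1
    return {file_id: max(bins.values()) for file_id, bins in offsets.items()}


def winning_file(sample_prints, index, bin_width=0.1):
    """Return the file id with the highest score, or None if nothing matched."""
    scores = score_files(sample_prints, index, bin_width)
    return max(scores, key=scores.get) if scores else None
```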
Furthermore, fingerprint streams of combinatorial hashes from multiple channels may be grouped into sets of [hash+channel ID+timestamp], and these data structures may be placed into a rolling buffer ordered by time. The contents of the rolling buffer may further be sorted by hash values for a faster search for matching fingerprints with the audio sample, e.g., the number of matching temporally-aligned hashes is the score.
A further step of verification may be used in which spectrogram peaks may be aligned. Because the Wang and Smith technique generates a relative time offset, it is possible to temporally align the spectrogram peak records within about 10 ms in the time axis, for example. Then, the number of matching time and frequency peaks can be determined, and that is the score that can be used for comparison.
Once the correct audio sound has been identified, the result can be reported to the client device 606 or a system 624 by any suitable method. For example, the result can be reported by a computer printout, email, web search result page, SMS (short messaging service) text messaging to a mobile phone, computer-generated voice annotation over a telephone, or posting of the result to a web site or Internet account that the user can access later. The reported results can include identifying information of the source of the sound such as the name of the broadcaster, broadcast recording attributes (e.g., performers, conductor, venue); the company and product of an advertisement; or any other suitable identifiers. Additionally, biographical information, information about concerts in the vicinity, and other information of interest to fans can be provided; hyperlinks to such data may be provided. Reported results can also include the absolute score of the sound file or its score in comparison to the next highest scored file.
In alternate examples, a broadcast source may be identified by performing a timestamped identification.
At the same time, broadcast audio samples are taken periodically from each of at least one broadcast channel being monitored by a monitoring station; and similarly, a content identification step is performed for each broadcast channel, as shown at block 708. The broadcast samples should be taken frequently enough so that at least one sample is taken per audio program (i.e., per song) in each broadcast channel. For example, if the monitoring station records 10 second samples, after a content identification, the monitoring station would know the length of the song, and also how much longer before the song is over. The monitoring station could thus calculate the next time to sample a broadcast channel based on the remaining length of time of the song, for example.
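As a non-limiting illustration of the scheduling idea above, the following Python sketch computes when a channel could next be sampled from the length of the identified song and the position at which the current sample was taken. The function name and the small `guard` margin, intended to land the next sample inside the following program, are assumptions for illustration.

```python
def next_sample_time(sample_timestamp, relative_offset, song_length, guard=1.0):
    """All values in seconds; returns a time on the same clock as sample_timestamp."""
    remaining = max(song_length - relative_offset, 0.0)
    return sample_timestamp + remaining + guard
```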
For each broadcast sample, a broadcast sample timestamp (BST) is also taken to mark the beginning of each sample based on the standard reference clock, as shown at block 710. Further, a relative time offset between the beginning of the identified content file from the database and the beginning of the broadcast sample being analyzed is computed. Hence, a broadcast sample relative time offset (BSRTO) and a broadcast sample identity are noted as a result of identifying each broadcast audio sample, as shown at block 712.
To identify a broadcast source, the client device sample and broadcast audio samples are compared to first identify matching sample identities, as shown at block 714, and then to identify matching “relative times” as shown at block 716. If no matches are found, another broadcast channel is selected for comparison, as shown at blocks 718 and 720. If a match is found, the corresponding broadcast information is reported back to the client device, as shown at block 722.
The comparisons of the client device (user sample) and broadcast samples are performed as shown below:
(User sample identity)=(Broadcast sample identity) Equation (5)
USRTO+(ref. time−UST)=BSRTO+(ref. time−BST)+delay Equation (6)
where the ref. time is a common reference clock time, and (ref. time−UST) and (ref. time−BST) take into account the possibility of different sampling times by the user audio sampling device and the monitoring station (e.g., (ref. time−BST)=elapsed time between the last broadcast sample and now). For example, if broadcast stations are sampled once per minute, and since user samples can occur at any time, to find an exact match, a measure of elapsed time since the last sample for each of the broadcast and user samples may be needed. In Equation (6), the delay is a small systematic tolerance that depends on the time difference due to propagation delay of the extra path taken by the user audio sample, such as, for example, latency through a digital mobile phone network. Furthermore, any algebraic permutation of Equation (6) is within the scope of the present application.
Thus, matching the sample identities ensures that the same song, for example, is being compared. Then, matching the relative times translates the samples into equivalent time frames, and enables an exact match to be made. As a specific example, suppose the monitoring station samples songs from broadcasters every three minutes, so that at 2:02 pm the station begins recording a 10 second interval of a 4 minute long song from a broadcaster, which began playing the song at 2:00 pm. Thus, BST=2:02 pm, and BSRTO=2 minutes. Suppose a user began recording the same song at 2:03 pm. Thus, UST=2:03 pm, and USRTO=3 minutes. If the user contacts the monitoring station now at 2:04 pm to identify a broadcast source of the song, Equation (6) above will be as follows (assuming a negligible delay):
USRTO+(ref. time−UST)=BSRTO+(ref. time−BST)+delay→3+(2:04−2:03)=2+(2:04−2:02)=4
Thus, the monitoring station will know that it has made an exact match of songs, and the monitoring station also knows the origin of the song. As a result, the monitoring station can inform the user of the broadcast source.
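As a non-limiting illustration, the tests of Equations (5) and (6) may be sketched in Python as follows. The function name, the dictionary layout, and the explicit `tolerance` parameter are assumptions for illustration; times are expressed in seconds on the common reference clock.

```python
def same_broadcast(user, broadcast, ref_time, delay=0.0, tolerance=1.0):
    """user / broadcast: dicts with 'identity', 'timestamp' (UST or BST)
    and 'offset' (USRTO or BSRTO), all times in seconds."""
    if user["identity"] != broadcast["identity"]:  # Equation (5)
        return False
    left = user["offset"] + (ref_time - user["timestamp"])
    right = broadcast["offset"] + (ref_time - broadcast["timestamp"]) + delay
    return abs(left - right) <= tolerance  # Equation (6), within a tolerance
```

Taking 2:00 pm as time zero, the example above gives UST=180, USRTO=180, BST=120, BSRTO=120, and a reference time of 240, so both sides of Equation (6) evaluate to 240 seconds (4 minutes) and the samples match.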
The audio recognition engine 810 will then identify the audio sample by performing a lookup within an audio program database 812 using the technique described within Wang and Smith, as described above, for example. In particular, the audio sample may be a segment of media data of any size obtained from a variety of sources. To perform data recognition, the sample should be a rendition of part of a media file indexed in a database. The indexed media file can be thought of as an original recording, and the sample as a distorted and/or abridged version or rendition of the original recording. The sample may correspond to only a small portion of the indexed file. For example, recognition can be performed on a ten-second segment of a five-minute song indexed in the database.
The database index contains fingerprints representing features at particular locations of the indexed media files. The unknown media sample is identified with a media file in the database (e.g., a winning media file) whose relative locations of fingerprints most closely match the relative locations of fingerprints of the sample. In the case of audio files, the time evolution of fingerprints of the winning file matches the time evolution of fingerprints in the sample.
Using the database of files, a relative time offset of sample can be determined. For example, the fingerprints of the audio sample can be compared with fingerprints of original files. Each fingerprint occurs at a given time, so after matching fingerprints to identify the audio sample, a difference in time between a first fingerprint of the audio sample and a first fingerprint of the stored original file will be a time offset of the audio sample, e.g., amount of time into a song. Thus, a relative time offset (e.g., 67 seconds into a song) at which the user began recording the song can be determined.
In addition, an audio sample can be analyzed to identify its content using a localized matching technique. For example, generally, a relationship between two audio samples can be characterized by first matching certain fingerprint objects derived from the respective samples. A set of fingerprint objects, each occurring at a particular location, is generated for each audio sample. Each location is determined in dependence upon the content of respective audio sample and each fingerprint object characterizes one or more local features at or near the respective particular location. A relative value is next determined for each pair of matched fingerprint objects. A histogram of the relative values is then generated. If a statistically significant peak is found, the two audio samples can be characterized as substantially matching.
The audio recognition engine 810 will return the identity of the audio sample to the client device 806, along with a relative time offset of the audio sample as determined using the Wang and Smith technique, for example. The client device 806 may contact the monitoring station 814 and using the audio sample identity, relative time offset, and sample timestamp, the monitoring station 814 can identify the broadcast source of the audio sample.
The broadcast monitoring station 814 monitors each broadcast channel of the radio stations 802. The monitoring station 814 includes a multi-channel radio receiver 816 to receive broadcast information from the radio stations 802. The broadcast information is sent to channel samplers 1 . . . k 818, which identify content of the broadcast samples by contacting the audio recognition engine 810. In addition, the monitoring station 814 may also include a form of an audio recognition engine to reduce delays in identifying the broadcast samples, for example.
The monitoring station 814 can then store the broadcast sample identities for each broadcast channel for a certain amount of time. After a predetermined amount of time, the monitoring station 814 can write over stored broadcast sample identities to refresh the information to coordinate to audio samples currently being broadcast, for example.
Upon receiving an inquiry from the client device 806 to determine broadcast information corresponding to a given audio sample, the monitoring station 814 performs the tests according to Equations (5) and (6) above. In particular, a processor 822 in the monitoring station 814 first selects a given broadcast channel (using selector 820) to determine if a broadcast sample identity of a broadcast sample recorded at or near the user sample time matches the user audio sample identity. If not, the selector 820 selects the next broadcast channel and continues searching for an identity match.
Once an identity match is found, the processor 822 then determines if the client device sample relative time matches the broadcast sample relative time for this broadcast channel. If not, the selector 820 selects the next broadcast channel and continues searching for an identity match. If the relative times match (within an approximate error range), then the processor 822 considers the audio sample and the broadcast sample to be a match.
After finding a match, the processor 822 reports information pertaining to the broadcast channel to a reporting center 824. The processor 822 may also report the broadcast information to the user sampling device 806, for example. The broadcast information may include a radio channel identification, promotional material, advertisement material, discount offers, or other material relating to the particular broadcast station, for example.
As described with reference to
Initially, at block 902, the method 900 includes receiving media content rendered by a media rendering source. The media content may be received at a client device in a number of ways including the client device using a microphone to record ambient audio, video, etc., or via any data communications received at the client device.
At block 904, the method 900 includes the client device making an attempt to determine an identity of the media content based on information stored on the client device. As an example, after receiving the media content, the client device may initially attempt to determine an identity of the media content locally by comparing characteristics of the media content with signature files of media content stored on the client device. As described above, each signature file may be indicative of one or more features extracted from recordings of media content and also information identifying the media content. Thus, the client device may determine or extract features of the received media content, and compare the features of the received media content with the features indicated by the signature files stored on the client device to determine a match of one or more features.
At block 906, the method 900 includes, if the attempt at block 904 was successful, determining an identity of the media rendering source. The identity of the media rendering source may be determined in a number of ways. As one example, the client device may determine the identity of the media rendering source using the determined identity of the media content and referring to a playlist of content rendered by the media rendering source. The client device may store a number of playlists for a number of media rendering sources (e.g., radio playlists, television guides, etc.), and can search the playlists for the identified media content to correlate a media rendering source with the identified content.
The client device may receive playlists for media rendering sources based on predetermined settings (e.g., always receive playlists for predetermined sources each day), or based on other criteria. As one example, a broadcast server may receive information indicating a geographic location of the client device and may provide to the client device current playlists for broadcast stations that operate at or near the geographic location of the client device.
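As a non-limiting illustration of the playlist search at block 906, the following Python sketch returns every stored source whose playlist lists the identified content. The function name and the playlist layout are assumptions for illustration; because more than one source may list the same content, the sketch returns all candidates rather than a single source.

```python
def sources_for_content(playlists, content_id):
    """playlists: dict mapping source name -> list of content ids in play order."""
    return [source for source, entries in playlists.items() if content_id in entries]
```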
As another example, the client device may send information indicative of the identity of the media content to a broadcast identification server to determine the identity of the media rendering source, and the client device can receive information indicative of the identity of the media rendering source from the broadcast identification server. The broadcast identification server may be configured to determine an identity of the media rendering source using any of the methods described herein.
As still another example, the client device itself may determine the identity of the media rendering source using any of the methods described herein, and thus may be configured to perform functions of the broadcast identification server. The client device may determine the identity of the media rendering source based on a temporal comparison of characteristics of the media content with characteristics of a source sample taken from content rendered by the media rendering source, for example.
In some examples, by determining the identity of the media rendering source, a playlist of content of the media rendering source may be created and stored on the client device.
At block 908, the method 900 results in providing an identity of media content and an identity of the media rendering source. Thus, in one example, the method 900 includes the client device determining the identity of the media content, and the client device using the identity of the media content to determine the identity of the media rendering source.
At block 910, the method 900 includes, if the attempt at block 904 was unsuccessful, determining an identity of the media rendering source. As one example, if the client device is unable to determine the identity of the media content, such as in instances in which the client device does not have a matching signature file stored on the client device, the identity of the media rendering source may be determined first, followed by determining the identity of the media content.
As mentioned above, the identity of the media rendering source may be determined in a number of ways including the client device itself making the determination, or the client device sending a query to a broadcast identification server that will make the determination and provide a response to the client device.
At block 912, the method 900 includes, if the determination at block 910 was successful, determining an identity of the media content. For example, the identity of the media content may be determined via reference to a playlist and using the timestamp of the received media content. Block 912 may be performed by the client device or by the broadcast identification server.
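As a non-limiting illustration of block 912, the following Python sketch looks up which playlist entry was being rendered at the timestamp of the received media content. The function name and the playlist layout, a list of (start time, content id) pairs sorted by start time, are assumptions for illustration.

```python
from bisect import bisect_right


def content_at(playlist, timestamp):
    """Return the content id whose start time most recently precedes timestamp."""
    starts = [start for start, _ in playlist]
    position = bisect_right(starts, timestamp) - 1
    return playlist[position][1] if position >= 0 else None
```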
At block 914, the method 900 includes providing an identity of media content and an identity of the media rendering source. Thus, in one example, the method 900 includes the client device or a broadcast server first determining the identity of the media rendering source and using the identity of the media rendering source to determine the identity of the media content.
At block 916, the method 900 includes, if the determination at block 910 was unsuccessful, sending information indicative of the media content to a content recognition server. As an example, if the attempt of the client device to determine the identity of the media content was unsuccessful and the determination of the identity of the media rendering source was unsuccessful, the media content can be provided to a content recognition server to determine the identity of the media content.
At block 918, the method 900 includes providing an identity of the media content. The content recognition server may provide a response to the client device.
In examples, the method 900 provides functions for determining both an identity of the media content and an identity of the media rendering source in a cascading method so as to use functionality that avoids computational intensity when possible. As an example, a broadcast channel identification with playlist cross lookup may avoid computational identification. As another example, content identification performed by the client device locally may provide a least amount of computational intensity and provide a result in a shortest amount of time (e.g., no need to communicate with a server), and may also take load off of recognition servers. The method 900 illustrates one order of attempts that may be performed. In other examples, after an unsuccessful attempt at block 904, the client device may proceed to block 916 to send information to the content recognition server for identification.
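As a non-limiting illustration, the cascading order of the method 900 may be sketched in Python as follows. Each `*_lookup` argument is a hypothetical callable that returns a result or `None`; the function name, the returned dictionary, and the fallback to the server when the playlist lookup yields nothing are assumptions for illustration.

```python
def identify(sample, local_lookup, source_lookup, playlist_lookup, server_lookup):
    """Try the cheapest identification first and fall back step by step."""
    content = local_lookup(sample)                   # block 904
    if content is not None:
        source = source_lookup(sample, content)      # block 906
        return {"content": content, "source": source, "path": "local"}  # block 908

    source = source_lookup(sample, None)             # block 910
    if source is not None:
        content = playlist_lookup(source, sample)    # block 912
        if content is not None:
            return {"content": content, "source": source, "path": "playlist"}  # block 914

    content = server_lookup(sample)                  # block 916
    return {"content": content, "source": source, "path": "server"}     # block 918
```

The server is contacted only when neither the local attempt nor the source-based lookup yields the content, which reflects the goal of taking load off the recognition servers.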
In one example implementation, a user may be listening to a radio station, and may operate a mobile telephone to receive a sample of audio (e.g., record a sample), and the client device may be configured to determine an identity of the song/commercial as well as an identity of the radio station (using the method 900). The client device may further receive information from content servers related to the song or radio station, and provide such information for display.
While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope being indicated by the following claims. Many modifications and variations can be made without departing from its scope, as will be apparent to those skilled in the art. Functionally equivalent methods and apparatuses within the scope of the disclosure, in addition to those enumerated herein, will be apparent to those skilled in the art from the foregoing descriptions. Such modifications and variations are intended to fall within the scope of the appended claims.