The disclosed subject matter relates to methods and systems for identifying similar songs.
Being able to automatically identify similar songs is a capability with many applications. For example, a music lover may desire to identify cover versions of a favorite song in order to enjoy other interpretations of that song. As another example, copyright holders may want to be able to identify different versions of their songs, copies of those songs, etc. in order to ensure proper copyright license revenue. As yet another example, users may want to be able to identify songs with a similar sound to a particular song. As still another example, a user listening to a song may desire to know the identity of the song or artist performing the song.
While it is generally easy for a human to identify two songs that are similar, automatically doing so with a machine is much more difficult. However, with millions of songs readily available, having humans compare songs manually is practically impossible. Thus, there is a need for mechanisms which can automatically identify similar songs.
Methods and systems for identifying similar songs are provided. In accordance with some embodiments, methods for identifying similar songs are provided, the methods comprising: identifying beats in at least a portion of a song; generating beat-level descriptors of the at least a portion of the song corresponding to the beats; and comparing the beat-level descriptors to other beat-level descriptors corresponding to a plurality of songs. In accordance with some embodiments, systems for identifying similar songs are provided, the systems comprising: a digital processing device that: identifies beats in at least a portion of a song; generates beat-level descriptors of the at least a portion of the song corresponding to the beats; and compares the beat-level descriptors to other beat-level descriptors corresponding to a plurality of songs.
In accordance with various embodiments, mechanisms for comparing songs are provided. These mechanisms can be used in a variety of applications. For example, cover songs of a song can be identified. A cover song can include a song performed by one artist after that song was previously performed by another artist. As another example, very similar songs (e.g., two songs with similar sounds, whether unintentional (e.g., due to coincidence) or intentional (e.g., in the case of sampling or copying)) can be identified. As yet another example, different songs with a common, distinctive sound can also be identified. As a still further example, a song being played can be identified (e.g., when a user is listening to the radio and wants to know the name of a song, the user can use these mechanisms to capture and identify the song).
In some embodiments, these mechanisms can receive a song or a portion of a song. For example, songs can be received from a storage device, from a microphone, or from any other suitable device or interface. Beats in the song can then be identified. By identifying beats in the song, variations in tempo between different songs can be normalized. Beat-level descriptors in the song can then be generated. These beat-level descriptors can be stored in fixed-size feature vectors for each beat to create a feature array. By comparing the sequence of beat-synchronous feature vectors for two songs, e.g., by cross-correlating the feature arrays, similar songs can be identified. The results of this identification can then be presented to a user. For example, these results can include one or more names of the closest songs to the song input to the mechanism, the likelihood that the input song is very similar to one or more other songs, etc.
In accordance with some embodiments, songs (or portions of songs) can be compared using a process 100 as illustrated in
In accordance with some embodiments, in order to track beats at 104, all or a portion of a song is converted into an onset strength envelope O(t) 216 as illustrated in process 200 in
In some embodiments, the onset envelope for each musical excerpt can then be normalized by dividing by its standard deviation.
In some embodiments, a tempo estimate p for the song (or portion of the song) can next be calculated using process 400 as illustrated in
Because there can be large correlations at various integer multiples of a basic period (e.g., as the peaks line up with the peaks that occur two or more beats later), it can be difficult to choose a single best peak among many correlation peaks of comparable magnitude. However, human tempo perception (as might be examined by asking subjects to tap along in time to a piece of music) is known to have a bias towards 120 beats per minute (BPM). Therefore, in some embodiments, a perceptual weighting window can be applied at 404 to the raw autocorrelation to down-weight periodicity peaks that are far from this bias. For example, such a perceptual weighting window W(τ) can be expressed as a Gaussian weighting function on a log-time axis, such as:
$$W(\tau) = \exp\left\{-\frac{1}{2}\left(\frac{\log_2(\tau/\tau_0)}{\sigma_\tau}\right)^2\right\}$$
where τ0 is the center of the tempo period bias (e.g., 0.5 s corresponding to 120 BPM, or any other suitable value), and στ controls the width of the weighting curve and is expressed in octaves (e.g., 1.4 octaves or any other suitable number).
By applying this perceptual weighting window W(τ) to the autocorrelation of the onset strength envelope O(t), a tempo period strength 406 can be represented as:
$$TPS(\tau) = W(\tau)\sum_t O(t)\,O(t-\tau) \qquad (3)$$
Tempo period strength 406, for any given period τ, can be indicative of the likelihood of a human choosing that period as the underlying tempo of the input sound. A primary tempo period estimate τp 410 can therefore be determined at 408 by identifying the τ for which TPS(τ) is largest.
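By way of non-limiting illustration, the tempo period strength calculation described above can be sketched in Python as follows. The sketch assumes that the weighting window takes the Gaussian form exp(−½(log2(τ/τ0)/στ)²) and that the raw autocorrelation is a plain sum of lagged products of the onset strength envelope. The function names, the frame rate, and the toy envelope are hypothetical and are not part of the disclosure.

```python
import math

def weighting_window(tau, tau0=0.5, sigma=1.4):
    """Gaussian weight on a log2-time axis (tau, tau0 in seconds; sigma in octaves)."""
    return math.exp(-0.5 * (math.log2(tau / tau0) / sigma) ** 2)

def tempo_period_strength(onset, frame_rate, max_lag):
    """Map each lag (in frames) to the weighted autocorrelation of the envelope."""
    tps = {}
    for lag in range(1, max_lag + 1):
        autocorr = sum(onset[t] * onset[t - lag] for t in range(lag, len(onset)))
        tps[lag] = weighting_window(lag / frame_rate) * autocorr
    return tps

# Toy envelope (assumption): one onset every 50 frames at 100 frames/s,
# i.e. a period of 0.5 s, corresponding to 120 BPM.
onset = [1.0 if t % 50 == 0 else 0.0 for t in range(1000)]
tps = tempo_period_strength(onset, frame_rate=100, max_lag=200)
best_lag = max(tps, key=tps.get)   # primary tempo period estimate, in frames
```

In this toy case the autocorrelation also peaks at 100, 150, and 200 frames, and the weighting window is what causes the 50-frame period to be selected.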
In some embodiments, rather than simply choosing the largest peak in the base TPS, a process 600 of
TPS2(τ2)=TPS(τ2)+0.5TPS(2τ2)+0.25TPS(2τ2−1)+0.25TPS(2τ2+1) (4)
TPS3(τ3)=TPS(τ3)+0.33TPS(3τ3)+0.33TPS(3τ3−1)+0.33TPS(3τ3+1) (5)
Whichever sequence (4) or (5) results in a larger peak value TPS2(τ2) or TPS3(τ3) determines at 606 whether the tempo is considered duple 608 or triple 610, respectively. The value of τ2 or τ3 corresponding to the larger peak value is then treated as the faster target tempo metrical level at 612 or 614, with one-half or one-third of that value as the adjacent metrical level at 616 or 618. TPS can then be calculated twice using the faster target tempo metrical level and adjacent metrical level using equation (3) at 620. In some embodiments, a στ of 0.9 octaves (or any other suitable value) can be used instead of a στ of 1.4 octaves in performing the calculations of equation (3). The larger value of these two TPS values can then be used at 622 to indicate that the faster target tempo metrical level or the adjacent metrical level, respectively, is the primary tempo period estimate τp 410.
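By way of non-limiting illustration, equations (4) and (5) and the duple/triple decision at 606 can be sketched as follows. The sketch assumes that TPS is stored as a mapping from integer lags to strengths, that missing lags count as zero, and that a tie is resolved in favor of duple. These choices and the function names are hypothetical. The subsequent recalculation at 620 and 622 is not shown.

```python
def harmonic_strengths(tps, tau):
    """Return (TPS2, TPS3) at lag tau, following equations (4) and (5)."""
    g = lambda lag: tps.get(lag, 0.0)   # lags absent from the mapping count as zero
    tps2 = g(tau) + 0.5 * g(2 * tau) + 0.25 * g(2 * tau - 1) + 0.25 * g(2 * tau + 1)
    tps3 = g(tau) + 0.33 * g(3 * tau) + 0.33 * g(3 * tau - 1) + 0.33 * g(3 * tau + 1)
    return tps2, tps3

def choose_meter(tps, max_tau):
    """Pick duple or triple according to which resampled TPS has the larger peak."""
    taus = range(1, max_tau + 1)
    best2 = max((harmonic_strengths(tps, tau)[0], tau) for tau in taus)
    best3 = max((harmonic_strengths(tps, tau)[1], tau) for tau in taus)
    # Assumption: ties go to duple.
    return ("duple", best2[1]) if best2[0] >= best3[0] else ("triple", best3[1])

duple_example = choose_meter({25: 1.0, 50: 0.9, 100: 0.5}, max_tau=100)
triple_example = choose_meter({20: 1.0, 60: 0.9}, max_tau=100)
```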
Using the onset strength envelope and the tempo estimate, a sequence of beat times that correspond to perceived onsets in the audio signal and constitute a regular, rhythmic pattern can be generated using process 700 as illustrated in connection with
where {ti} is the sequence of N beat instants, O(t) is the onset strength envelope, α is a weighting to balance the importance of the two terms (e.g., α can be 400 or any other suitable value), and F(Δt, τp) is a function that measures the consistency between an inter-beat interval Δt and the ideal beat spacing τp defined by the target tempo. For example, a simple squared-error function applied to the log-ratio of actual and ideal time spacing can be used for F(Δt, τp):
$$F(\Delta t, \tau_p) = -\left(\log\frac{\Delta t}{\tau_p}\right)^2$$
which takes a maximum value of 0 when Δt=τp, becomes increasingly negative for larger deviations, and is symmetric on a log-time axis so that F(kτp,τp)=F(τp/k,τp).
A property of the objective function C(t) is that the best-scoring time sequence can be assembled recursively to calculate the best possible score C*(t) of all sequences that end at time t. The recursive relation can be defined as:
$$C^*(t) = O(t) + \max_{\tau = 0 \ldots t}\left\{\alpha F(t-\tau, \tau_p) + C^*(\tau)\right\}$$
This equation is based on the observation that the best score for time t is the local onset strength, plus the best score to the preceding beat time τ that maximizes the sum of that best score and the transition cost from that time. While calculating C*, the actual preceding beat time that gave the best score can also be recorded as:
$$P^*(t) = \arg\max_{\tau = 0 \ldots t}\left\{\alpha F(t-\tau, \tau_p) + C^*(\tau)\right\}$$
In some embodiments, a limited range of τ can be searched instead of the full range because the rapidly growing penalty term F will make it unlikely that the best predecessor time lies far from t−τp. Thus, a search can be limited to τ=t−2τp . . . t−τp/2 as follows:
$$C^*(t) = O(t) + \max_{\tau = t-2\tau_p \ldots t-\tau_p/2}\left\{\alpha F(t-\tau, \tau_p) + C^*(\tau)\right\}$$
$$P^*(t) = \arg\max_{\tau = t-2\tau_p \ldots t-\tau_p/2}\left\{\alpha F(t-\tau, \tau_p) + C^*(\tau)\right\}$$
To find the set of beat times that optimize the objective function for a given onset envelope, C*(t) and P*(t) can be calculated at 704 for every time starting from the beginning of the range (time zero) at 702 via 706. The largest value of C* (which will typically be within τp of the end of the time range) can be identified at 708. The time of this largest value of C* is the final beat instant tN (where N, the total number of beats, is still unknown at this point). The beats leading up to tN can be identified by ‘back tracing’ via P* at 710, finding the preceding beat time tN−1=P*(tN), and progressively working backwards via 712 until the beginning of the song (or portion of a song) is reached. This produces the entire optimal beat sequence {ti}* 714.
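By way of non-limiting illustration, the forward calculation of C*(t) and P*(t) and the back trace can be sketched as follows. The sketch works in integer frames, takes F to be the negative squared log-ratio of the actual spacing to the ideal spacing, and searches predecessors from t−2τp to t−τp/2. As an assumption of the sketch only, a time may also start a new sequence with a predecessor score of zero, which is how the first beat arises. The function names and the toy envelope are hypothetical.

```python
import math

def transition_score(delta, tau_p):
    """F: negative squared log-ratio of actual to ideal spacing (0 when equal)."""
    return -(math.log(delta / tau_p) ** 2)

def track_beats(onset, tau_p, alpha=400.0):
    """Return beat frame indices for envelope `onset` and target period tau_p (frames)."""
    n = len(onset)
    score = [0.0] * n   # C*(t)
    prev = [-1] * n     # P*(t); -1 marks the first beat of a sequence
    for t in range(n):
        best, best_tau = 0.0, -1   # assumption: a sequence may also start at t
        lo = max(0, t - 2 * tau_p)
        hi = t - max(1, tau_p // 2)
        for tau in range(lo, hi + 1):
            cand = alpha * transition_score(t - tau, tau_p) + score[tau]
            if cand > best:
                best, best_tau = cand, tau
        score[t] = onset[t] + best
        prev[t] = best_tau
    # Back trace from the largest C* to the start of the excerpt.
    t = max(range(n), key=score.__getitem__)
    beats = []
    while t >= 0:
        beats.append(t)
        t = prev[t]
    return beats[::-1]

# Toy envelope (assumption): one onset every 50 frames.
onset = [1.0 if t % 50 == 0 else 0.0 for t in range(500)]
beats = track_beats(onset, tau_p=50)
```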
In order to accommodate slowly varying tempos, τp can be updated dynamically during the progressive calculation of C*(t) and P*(t). For instance, τp(t) can be set to a weighted average (e.g., so that times further in the past have progressively less weight) of the best inter-beat-intervals found in the max search for times around t. For example, as C*(t) and P*(t) are calculated at 704, τp(t) can be calculated as:
τp(t)=η(t−P*(t))+(1−η)τp(P*(t)) (10)
where η is a smoothing constant having a value between 0 and 1 (e.g., 0.1 or any other suitable value) that is based on how quickly the tempo can change. During the subsequent calculation of C*(t+1), the term F(t−τ, τp) can be replaced with F(t−τ, τp(τ)) to take into account the new local tempo estimate.
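By way of non-limiting illustration, the update of equation (10) can be sketched as follows. The beat times used here are hypothetical values chosen to show a gradually slowing tempo, and the function name is not part of the disclosure.

```python
def update_tempo(t, prev_beat, tau_p_prev, eta=0.1):
    """Equation (10): blend the latest inter-beat interval with the prior estimate."""
    return eta * (t - prev_beat) + (1.0 - eta) * tau_p_prev

tau_p = 50.0                       # initial tempo period estimate, in frames
beat_times = [0, 50, 102, 156]     # hypothetical, slowly decelerating beats
history = []
for prev_beat, t in zip(beat_times, beat_times[1:]):
    tau_p = update_tempo(t, prev_beat, tau_p)
    history.append(tau_p)
```

With η of 0.1, the estimate moves only a small fraction of the way toward each new inter-beat interval, so a single irregular beat has a limited effect on the local tempo.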
In order to accommodate several abrupt changes in tempo, several different τp values can be used in calculating C*( ) and P*( ) in some embodiments. In some of these embodiments, a penalty factor can be included in the calculations of C*( ) and P*( ) to down-weight calculations that favor frequent shifts between tempo. For example, a number of different tempos can be used in parallel to add a second dimension to C*( ) and P*( ) to find the best sequence ending at time t and with a particular tempo τpi. For example, C*( ) and P*( ) can be represented as:
This approach is able to find an optimal spacing of beats even in intervals where there is no acoustic evidence of any beats. This “filling in” emerges naturally from the back trace and may be beneficial in cases in which music contains silence or long sustained notes.
Using the optimal beat sequence {ti}*, the song (or a portion of the song) can next be used to generate a single feature vector per beat as beat-level descriptors, as illustrated at 106 of
In some embodiments, beat-level descriptors are generated as the intensity associated with each of 12 semitones (e.g. piano keys) within an octave formed by folding all octaves together (e.g., putting the intensity of semitone A across all octaves in the same semitone bin A, putting the intensity of semitone B across all octaves in the same semitone bin B, putting the intensity of semitone C across all octaves in the same semitone bin C, etc.).
In generating these beat-level descriptors, phase-derivatives (instantaneous frequencies) of FFT bins can be used both to identify strong tonal components in the spectrum (indicated by spectrally adjacent bins with close instantaneous frequencies) and to get a higher-resolution estimate of the underlying frequency. For example, a 1024 point Fourier transform can be applied to 10 seconds of the song (or the portion of the song) sampled (or re-sampled) at 11 kHz with 93 ms overlapping windows advanced by 10 ms. This results in 513 frequency bins per FFT window and 1000 FFT windows.
To reduce these 513 frequency bins over each of 1000 windows to 12 (for example) chroma bins per beat, the 513 frequency bins can first be reduced to 12 chroma bins. This can be done by: removing non-tonal peaks by keeping only bins where the instantaneous frequency is within 25% (or any other suitable value) over three (or any other suitable number) adjacent bins; estimating the frequency that each energy peak relates to from the energy peak's instantaneous frequency; applying a perceptual weighting function to the frequency estimates so that frequencies closest to a given frequency (e.g., 400 Hz) have the strongest contribution to the chroma vector, and frequencies below a lower frequency (e.g., 100 Hz, 2 octaves below the given frequency, or any other suitable value) or above an upper frequency (e.g., 1600 Hz, 2 octaves above the given frequency, or any other suitable value) are strongly down-weighted; and summing all of the weighted frequency components by putting their resultant magnitudes into the chroma bin with the nearest frequency.
As mentioned above, in some embodiments, each chroma bin can correspond to the same semitone in all octaves. Thus, each chroma bin can correspond to multiple frequencies (i.e., the particular semitones of the different octaves). In some embodiments, the different frequencies (fi) associated with each chroma bin i can be calculated by applying the following formula to different values of r:
$$f_i = f_0 \cdot 2^{\,r + (i/N)} \qquad (11)$$
where r is an integer value representing the octave relative to f0 for which the specific frequency fi is to be determined (e.g., r=−1 indicates to determine fi for the octave immediately below 440 Hz), N is the total number of chroma bins (e.g., 12 in this example), and f0 is the “tuning center” of the set of chroma bins (e.g., 440 Hz or any other suitable value).
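By way of non-limiting illustration, equation (11), read as fi = f0·2^(r+i/N), can be evaluated as follows. The inverse mapping from a frequency to its nearest chroma bin is included as an assumption about how the nearest bin might be found. The function names are hypothetical.

```python
import math

def chroma_frequencies(i, octaves, f0=440.0, n_bins=12):
    """Frequencies of chroma bin i for each octave offset r, per equation (11)."""
    return [f0 * 2 ** (r + i / n_bins) for r in octaves]

def nearest_chroma_bin(freq, f0=440.0, n_bins=12):
    """Chroma bin whose center (in any octave) is nearest to freq on a log axis."""
    return round(n_bins * math.log2(freq / f0)) % n_bins

a_frequencies = chroma_frequencies(0, octaves=range(-1, 2))   # bin 0 in three octaves
middle_c_bin = nearest_chroma_bin(261.63)                      # middle C
```

With f0 at 440 Hz, bin 0 corresponds to the semitone A, so middle C falls three bins above it.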
Once there are 12 chroma bins over 1000 windows, in the example above, the 1000 windows can be associated with corresponding beats, and then each of the windows for a beat combined to provide a total of 12 chroma bins per beat. The windows for a beat can be combined, in some embodiments, by averaging each chroma bin i across all of the windows associated with a beat. In some embodiments, the windows for a beat can be combined by taking the largest value or the median value of each chroma bin i across all of the windows associated with a beat. In some embodiments, the windows for a beat can be combined by taking the N-th root of the average of the values, raised to the N-th power, for each chroma bin i across all of the windows associated with a beat.
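By way of non-limiting illustration, the combination of per-window chroma vectors into per-beat chroma vectors can be sketched as follows. The sketch assumes that a window belongs to a beat when its time falls at or after that beat and before the next beat, and that a beat with no windows yields zeros. The function name, the `how` options, and the toy data are hypothetical.

```python
import statistics

def beat_sync(frames, frame_times, beat_times, how="mean", power=2):
    """Combine per-window vectors into one vector per inter-beat interval."""
    reducers = {
        "mean": statistics.fmean,
        "max": max,
        "median": statistics.median,
        # N-th root of the average of the values raised to the N-th power.
        "power": lambda v: statistics.fmean(x ** power for x in v) ** (1.0 / power),
    }
    reduce_fn = reducers[how]
    n_bins = len(frames[0])
    out = []
    for start, end in zip(beat_times, beat_times[1:]):
        group = [f for f, t in zip(frames, frame_times) if start <= t < end]
        if not group:
            out.append([0.0] * n_bins)   # assumption: empty beats become zeros
            continue
        out.append([reduce_fn([f[i] for f in group]) for i in range(n_bins)])
    return out

# Toy data (assumption): four windows with two chroma bins and two beats.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
frame_times = [0.0, 0.1, 0.2, 0.3]
beat_times = [0.0, 0.2, 0.4]
per_beat_mean = beat_sync(frames, frame_times, beat_times, how="mean")
per_beat_max = beat_sync(frames, frame_times, beat_times, how="max")
```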
In some embodiments, the Fourier transform can be weighted (e.g., using Gaussian weighting) to emphasize energy a couple of octaves (e.g., around two with a Gaussian half-width of 1 octave) above and below 400 Hz.
In some embodiments, instead of using a phase-derivative within FFT bins in order to generate beat-level descriptors as chroma bins, the STFT bins calculated in determining the onset strength envelope O(t) can be mapped directly to chroma bins by selecting spectral peaks. For example, the magnitude of each FFT bin can be compared with the magnitudes of neighboring bins to determine if the bin is larger. The magnitudes of the non-larger bins can be set to zero, and a matrix containing the FFT bins can be multiplied by a matrix of weights that map each FFT bin to a corresponding chroma bin. This results in having 12 chroma bins per each of the FFT windows calculated in determining the onset strength envelope. These 12 bins per window can then be combined to provide 12 bins per beat in a similar manner as described above for the phase-derivative-within-FFT-bins approach to generating beat-level descriptors.
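By way of non-limiting illustration, the peak selection and the mapping of FFT bins to chroma bins can be sketched as follows for a single window. The sketch assumes a one-hot weight matrix in which each FFT bin contributes only to its nearest chroma bin. The disclosure does not specify the weights, so this choice, the function names, and the toy spectrum are hypothetical.

```python
import math

def pick_peaks(mags):
    """Keep bins larger than both neighbors; set all other bins to zero."""
    out = [0.0] * len(mags)
    for k in range(1, len(mags) - 1):
        if mags[k] > mags[k - 1] and mags[k] > mags[k + 1]:
            out[k] = mags[k]
    return out

def chroma_weights(n_fft, sample_rate, f0=440.0, n_chroma=12):
    """Weight matrix (n_chroma rows, n_fft // 2 + 1 columns), one-hot per FFT bin."""
    n_bins = n_fft // 2 + 1
    weights = [[0.0] * n_bins for _ in range(n_chroma)]
    for k in range(1, n_bins):   # the DC bin has no pitch and is skipped
        freq = k * sample_rate / n_fft
        chroma = round(n_chroma * math.log2(freq / f0)) % n_chroma
        weights[chroma][k] = 1.0
    return weights

def to_chroma(mags, weights):
    """Multiply the peak-picked spectrum by the weight matrix."""
    peaks = pick_peaks(mags)
    return [sum(w * p for w, p in zip(row, peaks)) for row in weights]

# Toy spectrum (assumption): a single peak at bin 41, about 441 Hz at 11025 Hz.
mags = [0.0] * 513
mags[40], mags[41], mags[42] = 0.5, 2.0, 0.5
chroma = to_chroma(mags, chroma_weights(1024, 11025))
```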
In some embodiments, the mapping of frequencies to chroma bins can be adjusted for each song (or portion of a song) by up to ±0.5 semitones (or any other suitable value) by making the single strongest frequency peak from a long FFT window (e.g., 10 seconds or any other suitable value) of that song (or portion of that song) line up with a chroma bin center.
In some embodiments, the magnitude of the chroma bins can be compressed by applying a square root function to the magnitude to improve performance of the correlation between songs.
In some embodiments, each chroma bin can be normalized to have zero mean and unit variance within each dimension (i.e., the chroma bin dimension and the beat dimension). In some embodiments, the chroma bins are also high-pass filtered in the time dimension to emphasize changes. For example, a first-order high-pass filter with a 3 dB cutoff at around 0.1 radians/sample can be used.
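By way of non-limiting illustration, the normalization and high-pass filtering of one chroma bin across beats can be sketched as follows. The filter is a simple one-pole design whose coefficient is derived from the cutoff using a discretized RC approximation, so its 3 dB point is only approximately at the stated cutoff. This filter design and the function names are assumptions of the sketch.

```python
import statistics

def standardize(row):
    """Scale one chroma bin's per-beat values to zero mean and unit variance."""
    mean = statistics.fmean(row)
    std = statistics.pstdev(row) or 1.0   # constant rows are left at zero
    return [(v - mean) / std for v in row]

def high_pass(row, cutoff=0.1):
    """One-pole high-pass along the beat axis; cutoff in radians/sample (approximate)."""
    a = 1.0 / (1.0 + cutoff)
    out, prev_x, prev_y = [], row[0], 0.0
    for x in row:
        prev_y = a * (prev_y + x - prev_x)   # passes changes, attenuates steady levels
        prev_x = x
        out.append(prev_y)
    return out

normalized = standardize([1.0, 2.0, 3.0, 4.0])
steady = high_pass([5.0] * 50)                 # a constant input is removed entirely
step = high_pass([0.0] * 3 + [1.0] * 3)        # a step produces a decaying response
```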
In some embodiments, Mel-Frequency Cepstral Coefficients (MFCCs) can also be used to provide beat-level descriptors. The MFCCs can be calculated from the song (or portion of the song) by: calculating STFT magnitudes (e.g., as done in calculating the onset strength envelope); mapping each magnitude bin to a smaller number of Mel-frequency bins (e.g., by calculating each Mel bin as a weighted average of the FFT bins ranging between the center frequencies of the two adjacent Mel bins, with linear weighting to give a triangular weighting window); converting the Mel spectrum to log scale; taking the discrete cosine transform (DCT) of the log-Mel spectrum; and keeping just the first N bins (e.g., 20 bins or any other suitable number) of the resulting transform. This results in 20 MFCCs per STFT window. These 20 MFCCs per window can then be combined to provide 20 MFCCs per beat in a similar manner as described above for combining the 12 chroma bins per window to provide 12 chroma bins per beat in the phase-derivative-within-FFT-bins approach to generating beat-level descriptors.
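By way of non-limiting illustration, the MFCC calculation for a single STFT window can be sketched as follows. The sketch assumes the common Mel mapping of 2595·log10(1+f/700), 40 Mel bins, an unnormalized DCT-II, and a small floor added before the logarithm. None of these values is specified in the disclosure, and the function names are hypothetical.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(mags, sample_rate, n_mels=40, n_coeffs=20):
    """MFCCs of one magnitude spectrum (bins span 0 Hz to sample_rate / 2)."""
    n_bins = len(mags)
    nyquist = sample_rate / 2.0
    bin_hz = [k * nyquist / (n_bins - 1) for k in range(n_bins)]
    # Mel bin centers, equally spaced on the Mel axis, plus the two outer edges.
    edges = [mel_to_hz(hz_to_mel(nyquist) * j / (n_mels + 1)) for j in range(n_mels + 2)]
    log_mel = []
    for j in range(1, n_mels + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        num = den = 0.0
        for f, m in zip(bin_hz, mags):
            if lo < f < hi:   # triangular weight between the adjacent centers
                w = (f - lo) / (mid - lo) if f <= mid else (hi - f) / (hi - mid)
                num += w * m
                den += w
        average = num / den if den else 0.0
        log_mel.append(math.log(average + 1e-10))   # floor avoids log(0)
    # DCT-II of the log-Mel spectrum, keeping the first n_coeffs terms.
    return [
        sum(v * math.cos(math.pi * k * (m + 0.5) / n_mels) for m, v in enumerate(log_mel))
        for k in range(n_coeffs)
    ]

coeffs = mfcc([math.e] * 513, sample_rate=11025)   # flat toy spectrum
```

For a flat spectrum every log-Mel value is the same, so only the zeroth coefficient is non-zero. The higher coefficients therefore describe the shape of the spectrum and not its overall level.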
In some embodiments, the MFCC values for each beat can be high-pass filtered.
In some embodiments, in addition to the beat-level descriptors described above for each beat (e.g., 12 chroma bins or 20 MFCCs), other beat-level descriptors can additionally be generated and used in comparing songs (or portions of songs). For example, such other beat-level descriptors can include the standard deviation across the windows of beat-level descriptors within a beat, and/or the slope of a straight-line approximation to the time-sequence of values of beat-level descriptors for each window within a beat. Note that if transposition of the chroma bins is performed as discussed below, the mechanism for doing so can be modified to ensure that the chroma dimension of any matrix in which the chroma bins are stored is symmetric or to account for any asymmetry in the chroma dimension.
In some of these embodiments, only components of the song (or portion of the song) up to 1 kHz are used in forming the beat-level descriptors. In other embodiments, only components of the song (or portion of the song) up to 2 kHz are used in forming the beat-level descriptors.
The lower two panes 800 and 802 of
After the beat-level descriptor processing above is completed for two or more songs (or portions of songs), those songs (or portions of songs) can be compared to determine if the songs are similar. In some embodiments, comparisons can be performed on the beat-level descriptors corresponding to specific segments of each song (or portion of a song). In some embodiments, comparisons can be performed on the beat-level descriptors corresponding to as much of the entire song (or portion of a song) as is available for comparison.
For example, comparisons can be performed using a cross-correlation of the beat-level descriptors of two songs (or portions of songs). Such a cross-correlation of beat-level descriptors can be performed using the following equation:
wherein N is the number of beat-level descriptors in the beat-level descriptor arrays x and y for the two songs (or portions of songs) being matched, tx and ty are the maximum time values in arrays x and y, respectively, and τ is the beat period (in seconds) being used for the primary song being examined. Similar songs (or portions of songs) can be indicated by cross-correlations with large magnitudes of r, where these large magnitudes occur in narrow local maxima that fall off rapidly as the relative alignment changes from its best value.
To emphasize these sharp local maxima, in some embodiments when the beat-level descriptors are chroma bins, transpositions of the chroma bins can be selected that give the largest peak correlation. A cross-correlation that facilitates transpositions can be represented as:
wherein N is the number of chroma bins in the beat level descriptor arrays x and y, tx and ty are the maximum time values in arrays x and y, respectively, c is the center chroma bin number, and τ is the beat period (in seconds) being used for the song being examined.
In some embodiments, the cross-correlation results can be normalized by dividing by the column count of the shorter matrix, so the correlation results are bounded to lie between zero and one. Additionally or alternatively, in some embodiments, the results of the cross-correlation can be high-pass filtered with a 3 dB point at 0.1 rad/sample or any other suitable filter.
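By way of non-limiting illustration, a cross-correlation over beat lags and chroma transpositions, normalized by the length of the shorter song, can be sketched as follows. The sketch is one plausible reading of the description above and is not a transcription of the equations. It represents each song as a list of per-beat chroma vectors and scores each alignment as a sum of dot products. The function names and the toy pitch pattern are hypothetical.

```python
def rotate(vec, shift):
    """Circularly transpose a chroma vector by `shift` bins."""
    return vec[shift:] + vec[:shift]

def best_alignment(x, y, max_lag):
    """Return (score, chroma_shift, beat_lag) for the strongest correlation peak."""
    norm = min(len(x), len(y))   # beat count of the shorter song
    best = (float("-inf"), 0, 0)
    for shift in range(len(x[0])):
        y_rot = [rotate(v, shift) for v in y]
        for lag in range(-max_lag, max_lag + 1):
            total = 0.0
            for i, xv in enumerate(x):
                j = i + lag
                if 0 <= j < len(y_rot):
                    total += sum(a * b for a, b in zip(xv, y_rot[j]))
            best = max(best, (total / norm, shift, lag))
    return best

def onehot(pitch_class, n_bins=12):
    return [1.0 if k == pitch_class else 0.0 for k in range(n_bins)]

# Toy songs (assumption): y is x transposed up two semitones and delayed three beats.
pattern = [0, 4, 7, 5, 9, 2, 11, 0, 3, 7, 10, 1]
x = [onehot(p) for p in pattern]
y = [[0.0] * 12 for _ in range(3)] + [onehot((p + 2) % 12) for p in pattern]
score, chroma_shift, beat_lag = best_alignment(x, y, max_lag=5)
```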
In some embodiments, the cross correlation can be performed using a fast Fourier transform (FFT). This can be done by taking the FFT of the beat-level descriptors (or a portion thereof) for each song, multiplying the results of the FFTs together, and taking the inverse FFT of the product of that multiplication. In some embodiments, after the FFT of the beat-level descriptors of the song being examined is taken, the results of that FFT can be saved to a database for future comparison. Similarly, in some embodiments, rather than calculating the results of an FFT on the beat-level descriptors for a reference song, those results can be retrieved from a database.
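By way of non-limiting illustration, a cross-correlation computed through Fourier transforms can be sketched as follows for a single row of real-valued descriptors. The sketch conjugates one of the two transforms before multiplying, which is what makes the product a correlation and not a convolution. It uses a small recursive radix-2 transform so that it has no external dependencies, and the input length must therefore be a power of two. The result is circular, so inputs would ordinarily be zero-padded to avoid wrap-around. The function names and toy sequences are hypothetical.

```python
import cmath

def fft(seq, inverse=False):
    """Recursive radix-2 FFT; len(seq) must be a power of two. Inverse is unscaled."""
    n = len(seq)
    if n == 1:
        return list(seq)
    even, odd = fft(seq[0::2], inverse), fft(seq[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + twiddle, even[k] - twiddle
    return out

def circular_xcorr(x, y):
    """r[lag] = sum over n of x[n] * y[(n + lag) mod N], for real-valued x and y."""
    fx, fy = fft(x), fft(y)
    product = [a.conjugate() * b for a, b in zip(fx, fy)]
    return [v.real / len(x) for v in fft(product, inverse=True)]

# Toy sequences (assumption): y is x delayed by two beats, both zero-padded.
x = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0]
lags = circular_xcorr(x, y)
best_lag = max(range(len(lags)), key=lags.__getitem__)
```

For multi-dimensional descriptors such as chroma, the same calculation could be applied to each row and the results summed. The transform `fx` is the quantity that could be stored in a database for later comparisons.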
As another example, segmentation time identification and Locality-Sensitive Hashing (LSH) can be used to perform comparisons between a song (or portion of a song) and multiple other songs. For example, segmentation time identification can be performed by fitting separate Gaussian models to the features of the beat-level descriptors in fixed-size windows on each side of every possible boundary, and selecting the boundary that gives the smallest likelihood of the features in the window on one side occurring in a Gaussian model based on the other side. As another example, segmentation time identification can be performed by computing statistics, such as mean and covariance, of windows to the left and right of each possible boundary, and selecting the boundary corresponding to the statistics that are most different. In some embodiments, the possible boundaries are the beat times for the two songs (or portions of songs). The selected boundary can subsequently be used as the reference alignment point for comparisons between the two songs (or portions of songs). In some embodiments, Locality-Sensitive Hashing (LSH), or any other suitable technique, can then be used to solve the nearest neighbor problem between the songs (or portions of songs) when focused on a window around the reference alignment point in each. In some embodiments, when one or more nearest neighbors are identified, a distance between those neighbors can be calculated to determine if those neighbors are similar.
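By way of non-limiting illustration, the statistics-based form of segmentation time identification can be sketched as follows. As a simplification, the sketch compares only the mean vectors of the windows on each side of a candidate boundary and measures their difference with a Euclidean distance. Covariance, the Gaussian likelihood variant, and the subsequent nearest-neighbor search are not shown. The function names and toy features are hypothetical.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def find_boundary(features, window):
    """Return the beat index whose left and right window means differ the most."""
    best_index, best_dist = None, -1.0
    for b in range(window, len(features) - window + 1):
        left = mean_vector(features[b - window:b])
        right = mean_vector(features[b:b + window])
        dist = math.dist(left, right)
        if dist > best_dist:
            best_index, best_dist = b, dist
    return best_index

# Toy features (assumption): the descriptors change abruptly at beat 10.
features = [[0.0, 0.0]] * 10 + [[1.0, 1.0]] * 10
boundary = find_boundary(features, window=4)
```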
In some embodiments, to improve correlation performance, beat-level-descriptor generation and comparisons (e.g., as described above) can be performed with any suitable multiple (e.g., double, triple, etc.) of the number of beats determined for each song (or portion of a song). For example, if song one (or a portion of song one) is determined to have a beat of 70 BPM and song two (or a portion of song two) is determined to have a beat of 65 BPM, correlations can respectively be performed for these songs at beat values of 70 and 65 BPM, 140 and 65 BPM, 70 and 130 BPM, and 140 and 130 BPM.
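By way of non-limiting illustration, the set of tempo combinations to be correlated can be enumerated as follows. The function name and the default multiples are hypothetical.

```python
from itertools import product

def tempo_pairs(bpm_a, bpm_b, multiples=(1, 2)):
    """All pairings of the two tempos at each of the given multiples."""
    return [(bpm_a * m, bpm_b * n) for m, n in product(multiples, repeat=2)]

pairs = tempo_pairs(70, 65)
```

For the example above, this yields the four pairings at 70 and 65 BPM, 70 and 130 BPM, 140 and 65 BPM, and 140 and 130 BPM, each of which would be correlated separately.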
In some embodiments, comparison results can be further refined by comparing the tempo estimates for two or more songs (or portions of songs) being compared. For example, if a first song is similar to both a second song and a third song, the tempos between the songs can be compared to determine which pair (song one and song two, or song one and song three) is closer in tempo.
An example of hardware 1000 for implementing the mechanisms described above is illustrated in
The components of hardware 1000 can be included in any suitable devices. For example, these components can be included in a computer, a portable music player, a media center, a mobile telephone, etc.
Although the invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention can be made without departing from the spirit and scope of the invention, which is only limited by the claims which follow. Features of the disclosed embodiments can be combined and rearranged in various ways.
This application claims the benefit of U.S. Provisional Patent Application No. 60/847,529, filed Sep. 27, 2006, which is hereby incorporated by reference herein in its entirety.
Number | Date | Country
---|---|---
60847529 | Sep 2006 | US