The present invention relates to music information retrieval systems, and more particularly, to a system and method for determining the similarity of music files based on a perceptual metric.
In the field of music information retrieval, it is often desirable to be able to determine the degree of similarity between music files. For example, a user may have thousands of music files stored on a hard drive, and may wish to locate songs that “sound like” certain favorite songs. As another example, a Web service may wish to provide song recommendations for purchase based on the content of the music that is already stored on the user's hard drive. These examples illustrate a need to classify individual musical compositions in a quantitative manner based on highly subjective features, in order to facilitate rapid search and retrieval.
Classifying information that has subjectively perceived attributes or characteristics is difficult. When the information is one or more musical compositions, classification is complicated by the widely varying subjective perceptions of the musical compositions by different listeners. Different listeners may perceive a particular musical composition quite differently.
In the classical music context, musicologists have developed names for various attributes of musical compositions. Terms such as adagio, fortissimo, or allegro broadly describe the tempo or strength with which instruments in an orchestra should be played to properly render a musical composition from sheet music. In the popular music context, there is less agreement upon proper terminology. Composers indicate how to render their musical compositions with annotations such as brightly, softly, etc., but there is no consistent, concise, agreed-upon system for such annotations.
Musical compositions and other information are now widely available for sampling and purchase over global computer networks through online merchants such as AMAZON.COM®, BARNESANDNOBLE.COM®, CDNOW.COM®, etc. A prospective consumer can use a computer system equipped with a standard Web browser to contact an online merchant, browse an online catalog of pre-recorded music, select a song or collection of songs (“album”), and purchase the song or album for shipment direct to the consumer. In this context, online merchants and others desire to assist the consumer in making a purchase selection and desire to suggest possible selections for purchase.
A variety of classification and search approaches are now used. In one approach, a consumer selects a musical composition for listening or for purchase based on past positive experience with the same artist. This approach has a significant disadvantage in that artists often have music of widely varying types.
In another approach, a merchant classifies musical compositions into broad categories or genres. A disadvantage of this approach is that typically the genres are too broad. For example, a wide variety of qualitatively different albums and songs may be classified in the genre of “Popular Music” or “Rock and Roll.”
In still another approach, an online merchant presents a search page to a client associated with the consumer. The merchant receives selection criteria from the client for use in searching the merchant's catalog or database of available music. Normally the selection criteria are limited to song name, album title, or artist name. The merchant searches the database based on the selection criteria and returns a list of matching results to the client. The client selects one item in the list and receives further, detailed information about that item. The merchant also creates and returns one or more critics' reviews, customer reviews, or past purchase information associated with the item.
For example, the merchant may present a review by a music critic of a magazine that critiques the album selected by the client. The merchant may also present informal reviews of the album that have been previously entered into the system by other consumers. Further, the merchant may present suggestions of related music based on prior purchases of others. For example, in the approach of AMAZON.COM®, when a client requests detailed information about a particular album or song, the system displays information stating, “People who bought this album also bought . . . ” followed by a list of other albums or songs. The list of other albums or songs is derived from actual purchase experience of the system. This is called “collaborative filtering.”
However, the use of this approach by itself has a significant disadvantage, namely that the suggested albums or songs are based on extrinsic similarity as indicated by purchase decisions of others, rather than on objective similarity of intrinsic attributes of the requested album or song and the suggested albums or songs. A decision by another consumer to purchase two albums at the same time does not indicate that the two albums are objectively similar or even that the consumer liked both. For example, the consumer might have bought one album for personal use and the second for a third party whose subjective taste differs greatly from the consumer's.
Another disadvantage of this type of collaborative filtering is that output data is normally available only for complete albums and not for individual songs. Thus, a first album that the consumer likes may be broadly similar to a second album, but the second album may contain individual songs that are strikingly dissimilar from the first album, and the consumer has no way to detect or act on such dissimilarity.
Still another disadvantage of collaborative filtering is that it requires a large mass of historical data in order to provide useful search results. The search results indicating what others bought are only useful after a large number of transactions, so that meaningful patterns and meaningful similarity emerge. Moreover, early transactions tend to over-influence later buyers, and popular titles tend to self-perpetuate.
In yet another approach, digital signal processing (DSP) analysis can be used to try to match characteristics from song to song. U.S. Pat. No. 5,918,223, assigned to Muscle Fish, a corporation of Berkeley, Calif. (hereinafter the Muscle Fish Patent), describes a DSP analysis technique. The Muscle Fish Patent describes a system having two basic components, typically implemented as software running on a digital computer. The two components are the analysis of sounds (digital audio data), and the retrieval of these sounds based upon statistical or frame-by-frame comparisons of the analysis results. In that system, the process first measures a variety of acoustical features of each sound file and the choice of which acoustical features to measure is critical to the success of the process. Loudness, bass, pitch, brightness, bandwidth, and Mel frequency cepstral coefficients (MFCCs) at periodic intervals (referred to as “frames”) over the length of the sound file are measured. The per-frame values are optionally stored, for applications that require that level of detail. Next, the per-frame first derivative of each of these features is computed. Specific statistical measurements of each of these features are computed to describe their variation over time. The specific statistical measurements that are computed are the mean and standard deviation. The first derivatives are also included. This set of statistical measurements is represented as an N-vector (a vector with N elements), referred to as the rhythm feature vector for music.
Once the feature vector of the sound file has been stored in a database with a corresponding link to the original data file, the user can query the database in order to access the corresponding sound files. The database system must be able to measure the distance in N-space between two N-vectors.
The sound file database can be searched by four specific methods, enumerated below. The result of these searches is a list of sound files rank-ordered by distance from the specified N-vector, which corresponds to sound files that are most similar to the specified N-vector or average N-vector of a user grouping of songs.
While DSP analysis may be effective for some groups or classes of songs, it is ineffective for others, and there has so far been no technique for determining what makes the technique effective for some music and not others. Specifically, such acoustical analysis as has been implemented thus far suffers defects because 1) the accuracy of its results is open to question, which diminishes the quality perceived by the user, and 2) current systems generally make recommendations only if the user manually types in a desired artist or song title, or group of songs from that specific Web site. Accordingly, DSP analysis, by itself, is unreliable and thus insufficient for widespread commercial or other use. Another problem with DSP analysis is that it ignores the observed fact that sounds with similar attributes, as calculated by a digital signal processing algorithm, will often be perceived as sounding very different. This is because, at present, no previously available digital signal processing approach can match the ability of the human brain to extract salient information from a stream of data. As a result, all previous attempts at signal classification using digital signal processing techniques miss important aspects of a signal that the brain uses for determining similarity.
In addition, previous attempts at classification based on connectionist approaches, such as artificial neural networks (ANNs) and self-organizing feature maps (SOFMs), have had only limited success classifying sounds based on similarity. This is due to the difficulty of training ANNs and SOFMs: the computing resources required to train ANNs and SOFMs of the required complexity tend to be cost-prohibitive.
The present invention is directed to providing a system that overcomes the foregoing and other disadvantages. More specifically, the present invention is directed to a system and method for determining the similarity of musical recordings based on a perceptual metric.
A system and method for determining the similarity of music files based on a perceptual metric is disclosed. In accordance with one aspect of the invention, rhythmic, harmonic, and melodic components are extracted from music files and compared to determine the degree of similarity between the music files.
In accordance with another aspect of the invention, the invention is comprised of four elements, including a preprocessor, a mapper, a comparer, and a trainer. The preprocessor generates components of the music files. The mapper maps the components of the music files to two-dimensional feature maps. Based on the two-dimensional feature maps, representative vectors are then determined for each of the music files. To compare the similarity between two music files, the comparer compares the representative vectors of the music files. The trainer is used to train the mapper.
In accordance with another aspect of the invention, the preprocessor operates in three steps: generate harmonic components, generate rhythmic components, and generate melodic components. In the first step, generating the harmonic components, the music file is broken down into a frequency/time representation. This is a two-dimensional array of numbers. The value of each number represents the energy of the musical signal present in a given frequency bin at a given time. The vertical axis is the Mel frequency scale, although the vertical scale can represent any of the many “warped” frequency mappings that are used to more closely mimic the perceptual groupings of frequency bands that occur in the human ear. The horizontal axis is time.
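By way of illustration only, the frequency/time representation described above might be sketched as follows. The function names, the frame length, the hop size, the number of bands, and the particular Mel formula are all assumptions introduced for this example and are not taken from the specification; rectangular Mel-spaced bands are used here in place of whatever filter shapes an actual embodiment may employ.

```python
import numpy as np

def hz_to_mel(f_hz):
    """One commonly used Mel-scale formula (an assumption; several variants exist)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_time_energy(signal, sample_rate, frame_len=1024, hop=512, n_bands=24):
    """Return an (n_bands, n_frames) array: energy per Mel band per time frame."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)

    # Band edges equally spaced on the Mel axis from 0 Hz to the Nyquist frequency.
    edges = np.linspace(0.0, float(hz_to_mel(sample_rate / 2.0)), n_bands + 1)
    # Assign every FFT bin to one Mel band (the Nyquist bin is clipped into the top band).
    band_of_bin = np.clip(np.digitize(hz_to_mel(freqs), edges) - 1, 0, n_bands - 1)

    out = np.zeros((n_bands, n_frames))
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        # Sum the power of all FFT bins that fall into each Mel band.
        out[:, t] = np.bincount(band_of_bin, weights=power, minlength=n_bands)
    return out
```

In this sketch, rows correspond to the vertical (Mel frequency) axis and columns to the horizontal (time) axis.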
The harmonic components are then sent to the rhythmic component generator. The rhythmic component generator analyzes the activity in a number of critical frequency bands of the harmonic components. A Fourier transform is taken of the amplitude of each of these critical bands as a function of time. The result of this transform yields information on the period of periodically occurring signals. The average intensity of each bin in the transform along with the standard deviation of that intensity is calculated. Bins with intensity greater than the average intensity plus some factor times the standard deviation of the intensity are preserved; all other bins have their intensity set to zero. The result of this truncation is returned by the rhythmic component generator.
The melodic component generator returns a data set that represents a two-dimensional decomposition of the musical partials present in the song. A musical partial is a harmonic component that is stationary for a given period of time. The melodic component generator can be thought of as a primitive musical transcriber, turning the sounds into a rough representation of the notes played in the music. The collection of component generators thus returns a set of data that represents the three major components of music: timbre (a.k.a. harmonics), rhythm, and melody.
In accordance with another aspect of the invention, the mapper reduces the dimensionality of the input to six by mapping various found patterns within the output of the component generators to positions on one of three two-dimensional feature maps. Each of the three feature maps serves its respective component generator. The top N positions in each feature map, along with their amplitudes, are then taken as the representative vectors for the input music file. It will be appreciated that the reduction of the dimensionality of the input to six by the mapper is part of what allows the present invention to process large collections of music files very quickly. This is in contrast to a system that utilizes very large feature vectors, such as one utilizing vectors on the order of 24 elements just to describe one of the aspects of the music file, for which the processing requires a large amount of data.
In accordance with another aspect of the invention, in order to compare the similarity between two music files, the comparer calculates the distance between the two representative vectors. Small distances indicate that the music files are similar while large distances indicate that the music files are not similar.
In accordance with another aspect of the invention, in order to create the mapping of the feature maps, the trainer trains the feature maps. This is done using the standard self-organizing feature map training procedure. The outputs of the three component generators are repeatedly fed to the trainer for a large corpus of music files (e.g., 100,000 songs may be utilized in one example training procedure). One epoch represents the presentation of the entire set of songs. This process is repeated over several epochs until the maps converge to stable values. The result of the training process is stored in training files for use by the mapper.
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:
The present invention is directed to a system and method for determining the similarity of music files based on perceptual metrics. As will be described in more detail below, in accordance with the present invention the rhythmic, harmonic, and melodic components are extracted from the music files and compared to determine the degree of similarity between the music files.
With reference to
A number of program modules may be stored on the hard disk 39, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37 and program data 38. A user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus 23, but may also be connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB). A display in the form of a monitor 47 is also connected to the system bus 23 via an interface, such as a video card or adapter 48. One or more speakers 57 may also be connected to the system bus 23 via an interface, such as an audio adapter 56. In addition to the display and speakers, personal computers typically include other peripheral output devices (not shown), such as printers.
The personal computer 20 may operate in a networked environment using logical connections to one or more personal computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the personal computer 20. The logical connections depicted in
When used in a LAN networking environment, the personal computer 20 is connected to the local area network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the personal computer 20 or portions thereof may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary, and other means of establishing a communications link between the computers may be used.
As will be described in more detail below, the present invention provides a system and method for extracting and comparing the harmonic, rhythmic, and melodic components of the music files. Earlier systems and methods for providing automatic classification of music files according to various properties are described in U.S. patent application Ser. No. 09/900,059, entitled “System and Methods for Providing Automatic Classification of Media Entities According to Consonance Properties”, U.S. patent application Ser. No. 09/935,349, entitled “System and Methods for Providing Automatic Classification of Media Entities According to Sonic Properties”, U.S. patent application Ser. No. 09/905,345, entitled “System and Methods for Providing Automatic Classification of Media Entities According to Tempo Properties”, and U.S. patent application Ser. No. 09/942,509, entitled “System and Methods for Providing Automatic Classification of Media Entities According to Melodic Movement Properties”, all of which are commonly assigned with the present application and all of which are hereby incorporated by reference in their entireties.
The mapper 320 reduces the dimensionality of the input to six by mapping various found patterns within the output of the component generators 311, 312, and 313 to positions on one of three two-dimensional feature maps. Each of the three feature maps serves its respective component generator. The top N positions in each feature map, along with their amplitudes, are taken as the representative vectors of the input music file.
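As a minimal sketch of how positions might be read out of the feature maps, the following example selects the strongest cells of a two-dimensional map and, for the case N = 1, concatenates the winning row and column of three maps into a six-element vector. The function names are hypothetical, and the interpretation that the six dimensions are the two grid coordinates of the winning cell on each of the three maps is an assumption made for this example.

```python
import numpy as np

def top_positions(feature_map, n=1):
    """Return [(row, col, amplitude), ...] for the n strongest cells of a 2-D map."""
    fm = np.asarray(feature_map, dtype=float)
    # Flat indices sorted by descending activation, truncated to the top n.
    flat_idx = np.argsort(fm, axis=None)[::-1][:n]
    rows, cols = np.unravel_index(flat_idx, fm.shape)
    return [(int(r), int(c), float(fm[r, c])) for r, c in zip(rows, cols)]

def representative_vector(melodic_map, rhythmic_map, harmonic_map):
    """Concatenate the winning (row, col) of each map into a six-element vector.

    Assumption: with N = 1, two coordinates per map times three maps gives six values.
    """
    vec = []
    for fm in (melodic_map, rhythmic_map, harmonic_map):
        r, c, _amplitude = top_positions(fm, n=1)[0]
        vec.extend([r, c])
    return np.array(vec, dtype=float)
```

For N greater than 1, `top_positions` returns the additional positions together with their amplitudes; how those are folded into the representative vectors is not specified here and is left open in this sketch.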
The comparer 330 compares the similarities between two music files and outputs data regarding the similarities. The comparison is performed by calculating the distance between the two representative vectors of the music files. Small distances indicate that the music files are similar, while large distances indicate that the music files are not similar. Specific numerical examples of this process are provided with respect to
The trainer 340 functions to train the feature maps of the mapper 320. In other words, in order to create the mapping of the feature maps, they must be trained. The feature maps are trained at least in part by utilizing a variation of the standard self-organizing feature map (SOFM) training procedure, as is known in the art. The specific operation of the trainer 340 will be described in more detail below with reference to
At a block 420, the harmonic components are sent to the rhythmic component generator. At block 430, the rhythmic component generator analyzes the activity in a number of critical frequency bands of the harmonic components. At block 440, a Fourier transform is taken of the amplitude of each of the critical frequency bands as a function of time. The result of this transform yields information on the period of the periodically occurring signals.
At a block 450, the average intensity of each bin in the transform along with a standard deviation of that intensity is calculated. At a block 460, the bins with intensity greater than the average intensity plus a specified factor times the standard deviation of the intensity are preserved, while all other bins have their intensity set to zero. The result of this truncation is returned by the rhythmic component generator.
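The processing of blocks 440 through 460 might be sketched as follows. This is an illustrative example under stated assumptions: the function name and the default value of `factor` are hypothetical, and the average and standard deviation are computed here across the transform bins of each critical band separately, which is one possible reading of the description.

```python
import numpy as np

def rhythmic_component(band_amplitudes, factor=1.0):
    """Threshold the Fourier magnitudes of each band's amplitude envelope.

    band_amplitudes: (n_bands, n_frames) array, amplitude of each critical
    band as a function of time.
    """
    amps = np.asarray(band_amplitudes, dtype=float)
    # Block 440: Fourier transform of each band's amplitude over time.
    intensity = np.abs(np.fft.rfft(amps, axis=1))
    # Block 450: average and standard deviation of the bin intensities
    # (per band, by assumption).
    mean = intensity.mean(axis=1, keepdims=True)
    std = intensity.std(axis=1, keepdims=True)
    # Block 460: preserve bins above mean + factor * std, zero all others.
    return np.where(intensity > mean + factor * std, intensity, 0.0)
```

A band whose amplitude pulses at a steady rate would, under this sketch, retain only the transform bins at or near that rate, so that the surviving bins indicate the dominant periodicities.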
At a block 470, the melodic component generator generates and returns a data set that represents a two-dimensional decomposition of the musical partials present in the song. A musical partial is a harmonic component that is stationary for a given period of time. The melodic component generator can be thought of as a primitive musical transcriber, turning the sounds into a rough representation of the notes played in the music. The output of the melodic component generator is described in more detail in the previously incorporated U.S. patent application Ser. Nos. 09/900,059 and 09/942,509. In general, in the operation of the melodic component generator, two 24-element vectors are calculated and combined into one vector. Each vector is independently normalized before combination so that its components sum to 1.
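The normalization and combination of the two 24-element vectors might be sketched as follows. The function name is hypothetical, and combining the vectors by concatenation is an assumption made for this example; the manner of combination is described in the incorporated applications rather than here.

```python
import numpy as np

def combine_melodic_vectors(vec_a, vec_b):
    """Normalize each vector so its components sum to 1, then concatenate."""
    parts = []
    for v in (vec_a, vec_b):
        v = np.asarray(v, dtype=float)
        total = v.sum()
        # Leave an all-zero vector unchanged rather than dividing by zero.
        parts.append(v / total if total != 0 else v)
    return np.concatenate(parts)
```

Normalizing each vector independently keeps either half from dominating the combined vector merely because its raw values are larger.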
As shown in
The operation of the SOFM training procedure that is utilized may be described as follows. The inputs to the SOFMs are the output vectors from the harmonic component generator 311, the rhythmic component generator 312, and the melodic component generator 313. The input vectors are each normalized to unit length in the L1 norm (i.e., the sum of the components equals 1). In one embodiment, an L2 norm may alternatively be employed (i.e., the square root of the sum of the squares of the components equals 1). There is a separate SOFM for each of the rhythmic and melodic component generators 312 and 313, while the output of the harmonic component generator 311 is fed into two SOFMs. During the training session, the connection weights of each SOFM are initially set to random values between 0 and 1. Over time, the weights converge. In order to maintain topological invariance (i.e., data that is close in the input space is close in the output space), a weighted “winner takes all” training process is used. In this training process, the input neuron with the strongest response is chosen as the winning neuron. The amount of change applied to neurons in the region of the winning neuron is scaled by $h_{ij} = \exp\left(-d_{ij} / (2 s^2(n))\right)$, where $d_{ij}$ is the Euclidean distance between cell $i$, the winning cell, and cell $j$, a neighboring cell, and $s^2(n) = s_0 \exp(-n/t)$, where $s_0$ is an initial scale, $n$ is the iteration number, and $t$ is the scale factor.
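A single training update of the kind described above might be sketched as follows. This is a minimal example under stated assumptions: the function names, the learning rate `lr`, and the default values of `s0` and `t` are hypothetical, and the "strongest response" is taken here to mean the cell whose weight vector is closest to the input, as in the standard self-organizing map procedure. The neighborhood function follows the formula as written in the text.

```python
import numpy as np

def neighborhood(d, n, s0=3.0, t=1000.0):
    """h_ij = exp(-d_ij / (2 s^2(n))), with s^2(n) = s0 * exp(-n / t)."""
    s2 = s0 * np.exp(-n / t)
    return np.exp(-d / (2.0 * s2))

def init_weights(rows, cols, dim, seed=0):
    """Connection weights initially set to random values between 0 and 1."""
    return np.random.default_rng(seed).random((rows, cols, dim))

def train_step(weights, x, n, lr=0.1, s0=3.0, t=1000.0):
    """Update a (rows, cols, dim) weight grid in place; return the winning cell."""
    rows, cols, _ = weights.shape
    x = np.asarray(x, dtype=float)
    x = x / x.sum()  # L1 normalization: components sum to 1

    # Winner: the cell whose weights are closest to the input (assumption).
    dists = np.linalg.norm(weights - x, axis=2)
    wi, wj = np.unravel_index(np.argmin(dists), dists.shape)

    # Euclidean grid distance from every cell to the winning cell.
    rr, cc = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    d = np.sqrt((rr - wi) ** 2 + (cc - wj) ** 2)

    # Move each cell toward the input, scaled by the neighborhood function.
    h = neighborhood(d, n, s0, t)
    weights += lr * h[:, :, None] * (x - weights)
    return int(wi), int(wj)
```

Because $s^2(n)$ shrinks as the iteration number grows, updates in this sketch become increasingly local to the winning cell, which is what allows nearby inputs to settle in nearby regions of the map.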
As shown in
$$\sqrt{(m_{1a}-m_{1b})^2+(m_{2a}-m_{2b})^2+(r_{1a}-r_{1b})^2+(r_{2a}-r_{2b})^2+(t_{1a}-t_{1b})^2+(t_{2a}-t_{2b})^2}$$ (Eq. 1)
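Eq. 1 is the Euclidean distance between two six-element representative vectors, and might be computed as in the following sketch. The function names and the `library` mapping are hypothetical and introduced only for illustration; the element order (m1, m2, r1, r2, t1, t2) follows the order of the terms in Eq. 1.

```python
import math

def perceptual_distance(a, b):
    """Euclidean distance between two representative vectors, per Eq. 1."""
    if len(a) != len(b):
        raise ValueError("representative vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_similarity(query, library):
    """Return (name, distance) pairs sorted from most to least similar.

    library: hypothetical mapping of song name to representative vector.
    """
    scored = [(name, perceptual_distance(query, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

Small distances indicate similar music files and large distances indicate dissimilar ones, so the first entries returned by `rank_by_similarity` would be the closest matches to the query.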
As shown in
At a decision block 1040, a determination is made as to whether the last time slice has been processed. If the last time slice has not been processed, then the routine returns to block 1020. If the last time slice has been processed, the routine continues to a block 1050. At block 1050, the sum of the output bins is normalized to 1. Once the entire output from the harmonic component generator is processed, the 512 accumulator bins are presented as the input to a 2d SOFM for final classification in a manner analogous to the output from the other rhythmic and melodic component generators. In one embodiment, the output of each 2d SOFM is a 36 by 36 grid of neurons.
While the preferred embodiment of the invention has been illustrated and described, it will be appreciated that various changes can be made therein without departing from the spirit and scope of the invention.