The disclosed subject matter relates to classification of digital music collections using a computational model of music similarity.
The sizes of personal digital music collections are constantly growing. Users of digital music are finding choosing music appropriate to a particular situation increasingly difficult. Furthermore, finding music that users would like to listen to from a personal collection or an online music store is also a difficult task. Since finding songs that are similar to each other is time consuming and each user has unique opinions, a need exists to create perform music classification in a machine.
Methods, systems, and media are provided for classifying digital music.
In some embodiments, methods of classifying a song are provided that include: receiving a selection of at least one seed song; receiving a label selection for at least one unlabeled song; training a support vector machine based on the at least one seed song and the label selection; and classifying a song using the support vector machine.
In some embodiments, systems for classifying a song are provided that include: memory for storing at least one seed song, at least one unlabeled song, and a song; and a processor that: receives a selection of the at least one seed song; receiving a label selection for the at least one unlabeled song; trains a support vector machine based on the at least one seed song and the label selection; and classifies the song using the support vector machine.
In some embodiments, computer-readable media containing computer-executable instructions that, when executed by a computer, cause the computer to perform a method for classifying music, wherein the method includes: receiving a selection of at least one seed song; receiving a label selection for at least one unlabeled song; training a support vector machine to based on the at least one seed song and the label selection; and classifying a song using the support vector machine.
Various objects, features, and advantages of the disclosed subject matter can be more fully appreciated with reference to the following detailed description when considered in connection with the following drawings.
a-b illustrate results of an experiment performed on some embodiments of the disclosed subject matter.
Methods, systems, and computer readable media for classifying music are described. In some embodiments Support Vector Machines (SVMs) can be used to classify music. In certain of these embodiments, relevance feedback such as SVM active learning can be used to classify music. Log-frequency cepstral statistics, such as Mel-Frequency Cepstral Coefficient statistics, can also be used to classify music.
Digital music is available in a wide variety of formats. Such formats include MP3 files, WMA files, streaming media, satellite and terrestrial broadcasts, Internet transmission, fixed media, such as CD and DVD, etc. Digital music can also be formed from analog signals using well-known techniques. A song, as that term is used in the specification and claims may be any form of music including complete songs, partial songs, musical sound clips, etc.
Generally speaking, an SVM is a supervised classification system that minimizes an upper bound on an expected error of the SVM. An SVM attempts to find a hyperplane separating two classes of data that will generalize best fit of future data. Such a hyperplane is the so-called maximum margin hyperplane, which maximizes the distance to the closest point from each class.
Given data points {X0, . . . , XN} and class labels {y0, . . . , yN}, yiε{−1,1}, any hyperplane separating the two data classes has the form:
yi(wTXi+b)>0 ∀i (1)
Let {wk} be the set of all such hyperplanes. The maximum margin hyperplane is defined by
and b is set by the Karush Kuhn Tucker conditions where the {α0, α1, . . . , αN} maximize
For linearly separable data, only a subset of the αis will be non-zero. These points are called the support vectors and all classification performed by the SVM depends on only these points and no others. Thus, an identical SVM would result from a training set that omitted all of the remaining examples. This makes SVMs an attractive complement to relevance feedback: if the feedback system can accurately identify the critical samples that will become the support vectors, training time and labeling effort can, in the best case, be reduced drastically with no impact on classifier accuracy.
Since the data points X only enter calculations via dot products, one can transform them to another feature space via a function Φ(X). The representation of the data in this feature space need never be explicitly calculated if there is an appropriate Mercer kernel operator for which
K(Xi, Xj)=Φ(Xi)·Φ(Xj) (5)
Data that is not linearly separable in the original space, may become separable in this feature space. In our implementation, we select a radial basis function (RBF) kernel
K(Xi, Xj)=e−γD
where D2(Xi,Xj) could be any distance function. See
As set forth above, SVM can be used with active learning in certain embodiment. In active learning, the user can become an integral part of the learning and classification process. As opposed to conventional (“passive”) SVM classification where a classifier is trained on a large pool of randomly selected labeled data, in an active learning system the user is asked to label only those instances that would be most informative to classification. Learning proceeds based on the feedback from the user and relevant responses are determined by the individual user's preferences and interpretations.
The duality between points and hyperplanes in feature space and parameter space enables SVM active learning. Notice that Eq. (1) can be interpreted with Xi as points and wk as the normals of hyperplanes, but it can also be interpreted with wk as points and Xi as normals. This second interpretation of the equation is known as parameter space. Within parameter space, the set {wk} is known as version space, a convex region bounded by the hyperplanes defined by the Xi. Finding the maximum margin hyperplane in the original space is equivalent to finding the point at the center of the largest hypersphere in version space.
The user's desired classifier corresponds to a point in parameter space that the SVM active learning system attempts to locate as quickly as possible. Labeled data points place constraints in parameter space, reducing the size of the version space. The fastest way to shrink the version space is to halve it with each labeled example, finding the desired classifier most efficiently. When the version space is nearly spherical, the most informative point to label is that point closest to the center of the sphere, i.e., closest to the decision boundary. In pathological cases, this is not true, nor is it true that the greedy strategy of selecting more than one point closest to a single decision boundary shrinks the version space most quickly.
Angle diversity is one heuristic that may be used for finding the most informative points to label. Angle diversity typically balances the closeness to the decision boundary with coverage of the feature space, while avoiding extra classifier re-trainings. In some cases, explicit enforcement of diversity may not be needed, for example when songs in the feature space are sparse.
In some instances, the first round of active learning can be treated as special. In such instances, the user only seeds the system with positive examples. Because of this, the first group of examples presented to the user by the system for labeling cannot be chosen by a classifier because the system cannot differentiate yet between positive and negative. Therefore, the first examples presented to the user for labeling can be chosen at random, with the expectation that since positive examples are relatively rare in the database, most of the randomly chosen examples will be negative. Additionally and/or alternatively, the first group of examples may be chosen so that they maximally cover the feature space, are farthest from the seed songs, are closest to the seed songs, or based upon any other suitable criteria or criterion. Further, in some embodiments, because features can be pre-computed, the group of songs can be the same for every query.
Various features of songs can be used by an SVM to classify those songs. In some embodiments, the features have the property that they reduce every song, regardless of its original length, into a fixed-size vector, and are based on Gaussian mixture models (GMMs) of Mel-Frequency Cepstral Coefficients (MFCCs).
Generally speaking, MFCCs are short-time spectral decompositions of audio signals that convey the general frequency characteristics important to human hearing. In some embodiments, to calculate MFCCs for a song, the song is first broken into overlapping frames, each for a given amount of time (e.g., approximately 25 ms long) and a time scale at which the signal can be assumed to be stationary. The log-magnitude of the discrete Fourier transform of each frame is then warped to the Mel frequency scale, imitating human frequency and amplitude sensitivity. Next, an inverse discrete cosine transform is used to decorrelate these “auditory spectra” and the so-called “high time” portion of the signal, corresponding to fine spectral detail, is discarded, leaving only the general spectral shape. In an example, MFCCs calculated for songs in a popular database can contain 13 coefficients each and, depending on the length of the song, approximately 30,000 temporal frames.
Although Mel scale is described herein as an example of a scale that could be used, it should be apparent that any other suitable scale could additionally or alternatively be used. For example, Bark scale, Erb scale, and Semitones scale could be used.
In the illustrative explanation set forth below, X denotes matrices of MFCCs, xt denotes individual MFCC frames, songs are indexed by i and j, GMM components are indexed by k, MFCC frames are indexed in time by t, and MFCC frames drawn from a probability distribution are indexed by n.
This first feature listed in
The feature vector is shown in
The second feature listed in
For two distributions, p(x) and q(x), the KL divergences is defined as,
There is a closed form for the KL divergence between two Gaussians,
where d is the dimensionality of the Gaussians. The symmetrized KL divergence shown in
D2(Xi, Xj)=KL(Xi∥Xj)+KL(Xj∥Xi) (9)
The third feature listed in
The distance function shown in
The fourth feature listed in
This feature tends to saturate, generating a non-zero posterior for only a single Gaussian. In order to prevent this saturation, the geometric mean of the frame probabilities can be taken instead of the product. This provides a “softened” version of the true class posteriors.
These geometric means can be compared using Euclidean distance.
The fifth feature listed in
where P(k|xt) is the posterior probability of the kth Gaussian in the mixture given MFCC frame xt, and μk and Σk are the mean and variance of the kth Gaussian. Using this approach can reduce arbitrarily sized songs to 650 dimensional features (i.e., 50 means with 13 dimensions each), for example.
Since the Fisher kernel is a gradient, it measures the partial derivative with respect to changes in each dimension of each Gaussian's mean. The sixth feature listed in
In some instances, referring to
For example, the user can enter a representative seed song 202 (e.g., John Coltrane-Cousin Mary) and begin the active retrieval system by selecting start 204. The system can then present a number of songs 206 (e.g., six songs). The user can then select to label songs as good, bad, or unlabeled. In order to select whether a song is good or bad, radio buttons 208 and 210 corresponding to good and bad for the song can be selected. Next, the user can select the number of songs to return in box 212 and begin the classification process by selecting train classifier button 214. Labeled songs can then be displayed at the bottom of the interface (i.e., songs labeled bad can be shown in box 216 and songs labeled good can be shown in box 218), and songs returned by the classifier can be displayed in list 220.
In some instances, the user can click on a song displayed in the interface to hear a representative segment of that song. After each classification round, the user can be presented with a number of new songs (e.g., six new songs) to label and can perform the process iteratively as many times as desired. Further, in some instances the user does not enter representative song 202, but rather the user relies solely on songs presented by the system for labeling.
In order to test the SVM active music retrieval system, the SVM parameters, features, and the number of training examples were varied per active retrieval round.
The experiment was run on a subset of a database of popular music. To avoid the so called “producer effect” in which songs from the same album share overall spectral characteristics that could swamp any similarities between albums, artists were selected who had enough albums in the database to designate entire albums as training, testing, or validation. Such a division required each artist to have three albums for training and two for testing, each with at least eight tracks to get enough data points per album. The validation set was made up of any albums the selected artists had in the database in addition to those five. In total there were 18 artists (out of 400) who met these criteria. Referring to
Since a goal of SVM active learning is to quickly learn an arbitrary classification task, any categorization of the data points can be used as ground truth for testing. In the experiment, music was classified by All Music Guide (AMG) moods, AMG styles, and artist. AMG is a website (www.allmusic.com) and book that reviews, rates, and categorizes music and musicians. Two ground truth datasets were AMG “moods” and “styles.” In its glossary, AMG defines moods as “adjectives that describe the sound and feel of a song, album, or overall body of work,” for example acerbic, campy, cerebral, hypnotic, rollicking, rustic, silly, and sleazy. While AMG never explicitly defines them, styles are subgenre categories such as “Punk-Pop,” “Prog-Rock/Art Rock,” and “Speed Metal.” In the experiment, styles and moods that included 50 or more songs, which amounted to 32 styles and 100 moods, were used. Referring to
While AMG, in general, only assigns moods and styles to albums and artists, for the purposes of testing, it was assumed that all of the songs on an album had the same moods and styles, namely those attributed to that album, though this assumption does not necessarily hold, for example, with a ballad on an otherwise upbeat album.
Artist identification is the task of identifying the performer of a song given only the audio of that song. While a song can have many styles and moods, it can have only one artist, making this the ground truth of choice for an N-way classification test of the various feature sets.
Before beginning the experiment, the SVM parameters γ and C, the weighting used to trade-off between classifier margin and margin violations for particular points, which are more efficiently treated as mislabeled via the so-called “slack variables,” needed to be set. Simple cross-validation grid search was used to find well-performing values. These results were not exhaustively compared for all combinations of features and ground truth, but only a representative sample. After normalizing all feature columns to be zero mean and unit variance, the best performing classifiers used C=104 and γ=0.01, although other suitable values could also have been used. Settings widely divergent from these tended to generate uninformative classifiers that labeled everything as a negative result.
The experiment compared different sized training sets in each round of active learning on the best-performing features, MFCC Statistics. Active learning should be able to achieve the same accuracy as passive learning with fewer labeled examples because it chooses more informative examples to be labeled first. To measure performance, the mean precision on the top 20 results on unlabeled songs on the test set containing completely different albums were compared.
In this experiment, five different training group sizes were compared. In each trial, an active learning system was randomly seeded with 5 elements from within the class, corresponding to a user supplying songs that they would like the results to be similar to. The system then performed simulated relevance feedback with 2, 5, 10, and 20 songs per round, and one round with 50 songs, the latter of which is equivalent to conventional SVM learning. The simulations stopped once the learner had labeled 50 results so that the different training sets could be compared.
The results of the active retrieval experiments can be seen in
Comparing the ground truth sets to one another, it appears that the system performs best on the style identification task, achieving a maximum mean precision-at-20 of 0.683 on the test set, only slightly worse than the conventional SVM trained on the entire training set which requires more than 13 times as many labels. See
As expected, labeling more songs per round suffers from diminishing returns; performance depends most heavily on the number of rounds of active learning instead of the number of labeled examples. This result is a product of the suboptimal division of the version space when labeling multiple data points simultaneously.
Opposing the use of small training sets, however, is the initial lack of negative examples. Using few training examples per round of feedback can actually hurt performance initially because the classifier has trouble identifying examples that would be most discriminative to label. It might be advantageous, then, to begin training on a larger number of examples perhaps just for the “special” first round and then, once enough negative examples have been found, to reduce the size of the training sets in order to increase the speed of learning.
In some embodiments, music classification techniques, such as SVM active learning, can be integrated with current music players to automatically generate playlists. Such an embodiment is illustrated in
In system 900, server 910 can be any suitable server for executing an application, such as a processor, a computer, a data processing device, or a combination of such devices. Communications network 906 can be any suitable computer network including the Internet, an intranet, a wide-area network (WAN), a local-area network (LAN), a wireless network, a digital subscriber line (DSL) network, a frame relay network, an asynchronous transfer mode (ATM) network, a virtual private network (VPN), telephone network, or any combination of any of the same. Communications links 904 and 908 can be any communications links suitable for communicating data between clients 902 and server 910, such as network links, dial-up links, wireless links, hard-wired links, etc. Clients 902 can be personal computers, laptop computers, mainframe computers, Internet browsers, personal digital assistants (PDAs), two-way pagers, wireless terminals, MP3 player, portable or cellular telephones, etc., or any combination of the same. Clients 902 and server 910 can be located at any suitable location. Clients 902 and server 910 can each contain any suitable memory and processors for performing the functions described herein.
In such a client-server architecture, the server could be used for performing the SVM calculations and storing music content, and the client could be used for viewing the output of the SVM, downloading music from the server, purchasing music from the server, etc.
Although a client-server architecture is illustrated in
Although the present invention has been described and illustrated in the foregoing illustrative embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention can be made without departing from the spirit and scope of the invention, which is limited only by the claims which follow.
This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 60/708,664 filed Aug. 16, 2005, which is hereby incorporated by reference herein in its entirety.
The invention disclosed herein was made with U.S. Government support from the National Science Foundation grant IIS-0238301. Accordingly, the U.S. Government may have certain rights in this invention.
Number | Name | Date | Kind |
---|---|---|---|
7091409 | Li et al. | Aug 2006 | B2 |
7348184 | Rich et al. | Mar 2008 | B2 |
7363279 | Ma et al. | Apr 2008 | B2 |
7366325 | Fujimura et al. | Apr 2008 | B2 |
7444018 | Qi et al. | Oct 2008 | B2 |
7461048 | Teverovskiy et al. | Dec 2008 | B2 |
7467119 | Saidi et al. | Dec 2008 | B2 |
7480639 | Bi | Jan 2009 | B2 |
7483554 | Kotsianti et al. | Jan 2009 | B2 |
7487151 | Yamamoto | Feb 2009 | B2 |
7510842 | Podust et al. | Mar 2009 | B2 |
7519994 | Judge et al. | Apr 2009 | B2 |
7548936 | Liu et al. | Jun 2009 | B2 |
7561741 | Song et al. | Jul 2009 | B2 |
Number | Date | Country | |
---|---|---|---|
20080022844 A1 | Jan 2008 | US |
Number | Date | Country | |
---|---|---|---|
60708664 | Aug 2005 | US |