The present invention relates to a method for classifying audio data. More particularly, the present invention relates to a fast music similarity computation method based on relationships in an N-dimensional music mood space.
Recently, the classification of audio data, and in particular of pieces of music, has become more and more important, as many electronic devices, and in particular consumer devices, enable a respective user to store and manage a large number of music items and titles. In order to enhance the managing mechanisms for such music databases, it is necessary to compare different pieces of audio data or different pieces of music in an easy and fast manner.
Therefore, a variety of mechanisms have been developed to extract particular properties and features from an analysis of audio data, so that pieces of music can be compared by comparing the respective sets or n-tuples of properties and features. However, many of the known features to be evaluated within such a comparison mechanism are difficult to calculate, and the computational burden is in some cases unreasonable.
It is an object underlying the present invention to provide a method for classifying audio data which enables a reliable, easy, and fast-to-compute comparison and classification of audio data.
The object is achieved according to the present invention by a method for classifying audio data with the features of independent claim 1. Preferred embodiments of the inventive method for classifying audio data are within the scope of the dependent subclaims. The object underlying the present invention is also achieved by an apparatus for classifying audio data, by a computer program product, as well as by a computer readable storage medium according to independent claims 18, 19 and 20, respectively.
The method for classifying audio data according to the present invention comprises a step (S1) of providing audio data in particular as input data, a step (S2) of providing mood space data which define and/or which are descriptive or representative for a mood space according to which audio data can be classified, a step (S3) of generating a mood space location within said mood space for said given audio data, a step (S4) of providing at least one comparison mood space location within said mood space, a step (S5) of comparing said mood space location for said given audio data with said at least one comparison mood space location and thereby generating comparison data, and a step (S6) of providing as a classification result said comparison data in particular as output data which can be used in subsequent classification steps, mainly in detailed comparison steps.
It is therefore a key idea of the present invention to obtain, from an analysis of given audio data, a position or location within a mood space, wherein said mood space is pre-defined or given by mood space data. The given audio data can then be classified or compared by comparing the derived mood space location for said given audio data with said at least one comparison mood space location. The thereby generated comparison data or classification data are provided as a classification result or a comparison result. It is therefore essential to have, for a given piece of audio data, a position or location, e.g. by means of a coordinate n-tuple, which can easily be compared with other locations or positions in said mood space, e.g. by simply comparing the respective coordinates of the positions or locations. Therefore audio data can easily be classified and compared with other audio data.
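As a purely illustrative, hedged sketch (not part of the claimed method), comparing two mood space locations can be as cheap as comparing coordinate n-tuples; the function name, the example coordinates, and the choice of the Euclidean metric below are assumptions of this sketch:

```python
import math

def mood_distance(loc_a, loc_b):
    """Euclidean distance between two mood space locations,
    each given as an n-tuple of coordinates (a sketch; any
    metric on the mood space could be substituted)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(loc_a, loc_b)))

# Example: two-dimensional (stress, energy) locations.
loc_x = (0.7, 0.2)   # hypothetical coordinates for piece x
loc_y = (0.6, 0.3)   # hypothetical coordinates for piece y
print(mood_distance(loc_x, loc_y))  # small value -> similar mood
```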
According to a preferred embodiment of the method for classifying audio data according to the present invention, said mood space may be or may be modelled by at least one of a Euclidean space model, a Gaussian mixture model, a neural network model, and a decision tree model.
Additionally or alternatively, according to a further preferred embodiment of the method for classifying audio data according to the present invention, said mood space may be or may be modelled by an N-dimensional space or manifold, wherein N is a given and fixed integer.
Further additionally or alternatively, according to another embodiment of the method for classifying audio data according to the present invention, said comparison data may be descriptive of, representative of, and/or comprise at least one of a topology, a metric, a norm, and a distance defined in or on said mood space.
Additionally or alternatively, said comparison data, and in particular said topology, metric, norm, and distance, may be obtained based on at least one of said Euclidean space model, said Gaussian mixture model, said neural network model, and said decision tree model, according to an advantageous embodiment of the method for classifying audio data according to the present invention.
Said comparison data may be derived based on said mood space location within said mood space for said given audio data and on said comparison mood space location within said mood space, according to an additional or alternative embodiment of the method for classifying audio data according to the present invention.
Said mood space and/or the model thereof may be defined based on Thayer's music mood model according to an additional or alternative embodiment of the method for classifying audio data according to the present invention.
According to a further preferred embodiment of the method for classifying audio data according to the present invention, said mood space and/or the model thereof may be at least two-dimensional and may be defined based on the measured or measurable entities stress S( ), describing positive (e.g. happy) and negative (e.g. anxious) moods, and energy E( ), describing calm and energetic moods, as emotional or mood parameters or attributes.
Further additionally or alternatively, according to a still further preferred embodiment of the method for classifying audio data according to the present invention said mood space and/or the model thereof are at least three-dimensional and are defined based on the measured or measurable entities for happiness, passion, and excitement.
Said step (S4) of providing said at least one comparison mood space location may additionally or alternatively comprise a step of providing at least one additional audio data, in particular as additional input data, and a step of generating a respective additional mood space location for said additional audio data, wherein said respective additional mood space location for said additional audio data is used as said at least one comparison mood space location, according to an additional or alternative embodiment of the method for classifying audio data according to the present invention.
At least two samples of audio data may be compared with respect to each other, one of said samples of audio data being assigned to said derived mood space location and the other one of said samples of audio data being assigned to said additional mood space location or said comparison mood space location, in particular by comparing said derived mood space location and said additional mood space location or said comparison mood space location.
Further additionally or alternatively, according to a still further preferred embodiment of the method for classifying audio data according to the present invention, said at least two samples of audio data to be compared with respect to each other may be compared based on said comparison data in a pre-selection process or comparing pre-process, and then based on additional features, e.g. features that are more complicated to calculate and/or frequency-domain-related features, in a more detailed comparing process.
In this case, said at least two samples of audio data to be compared with respect to each other may be compared in said more detailed comparing process based on said additional features, if said comparison data obtained from said pre-selection process or comparing pre-process are indicative of a sufficient neighbourhood, i.e. closeness, of said at least two samples of audio data.
Alternatively, a plurality of more than two samples of audio data may be compared with respect to each other.
Alternatively or additionally, said given audio data may be compared to a plurality of additional samples of audio data.
In these cases, a comparison list, and in particular a play list, may be generated from said comparison which is descriptive of those additional samples of audio data of said plurality of additional samples of audio data which are similar to said given audio data.
According to a further preferred and advantageous embodiment of the method for classifying audio data according to the present invention, music pieces are used as samples of audio data.
According to a further aspect of the present invention, an apparatus for classifying audio data is provided which is adapted and which comprises means for carrying out a method for classifying audio data according to the present invention and the steps thereof.
According to a further aspect of the present invention a computer program product is provided comprising computer program means which is adapted to realize the method for classifying audio data according to the present invention and the steps thereof, when it is executed on a computer or a digital signal processing means.
Additionally a computer readable storage medium is provided which comprises a computer program product according to the present invention.
These and further aspects of the present invention will be further discussed in the following:
The present invention inter alia relates to a fast music similarity computation method which is in particular based on an N-dimensional music mood space.
It is proposed that an N-dimensional music mood space can be used to limit the number of candidates and hence reduce the computation in similarity list generation. For each music piece in a huge database, its location in an N-dimensional music mood space is first determined; only music pieces which are close to the given music in the mood space are selected, and the similarity is computed between the given music and the pre-selected music pieces.
Music similarity is a relatively new topic, and at this moment the interest in it is largely academic. Systems have been developed that compare music pieces with one another using statistics over what is called 'timbre', a mixture of a variety of low-level features. Various distance measures have been proposed, including expensive methods such as Monte-Carlo simulation of samples of a distribution and probability estimation of the artificial samples using the statistics from the other music piece. See e.g. [3] for details.
Emotion recognition in music is a rather new topic. While a huge number of papers have been written about music processing in general, few papers have been published regarding emotion in music. State-of-the-art systems for emotion classification in music use classifiers such as Gaussian mixture models, support vector machines, and neural networks.
There are also studies about the perception of emotion in music, but the results are still very preliminary. References [1] and [2] provide information about state-of-the-art mood detection techniques.
For applications which involve music retrieval or music suggestion, a music play list is usually displayed, and the songs in the play list are usually selected based on the similarity between the query music and the rest of the music in the database. Nowadays, a typical commercial music database consists of hundreds of thousands of music pieces. For each piece in the database, state-of-the-art systems usually compute its similarity to all the other music pieces in the database to generate a similarity list. Depending on the application, a play list is then generated from the similarity list. The computation required in similarity list generation involves approximately N*N/2 similarity measure computations, where N is the number of songs in the database. For example, if the number of songs in the database is 500,000, then about 500,000*500,000/2 = 1.25*10^11 similarity computations are required, which is not practical for real applications.
In this proposal, a fast music similarity list generation method based on a mood space is proposed. The emotions expressed in different pieces of music are usually different: some music is perceived as happy by listeners, while other songs might be perceived as sad. On the other hand, among songs with a similar mood or emotion, listeners can generally distinguish differences in the degree of emotion expression, e.g. one piece of music is happier than another. In addition, pieces of music with different moods are usually considered dissimilar. The music similarity list generation approach described in this invention proposal exploits this emotion perception.
In this proposal, we first propose that the emotion of music can be described by an N-dimensional mood space. Each dimension describes the extent of a particular emotion attribute. For each piece of music in the database, the value of each emotion attribute is first generated. According to the coordinates of a particular piece of music in this N-dimensional space, pieces of music that are located in the proximity of the given music are first selected. After the pre-selection stage, instead of computing the similarity of the given music to the rest of the database, only the similarities between the given music and the pre-selected pieces are computed, as sketched below.
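A minimal sketch of this two-stage procedure, assuming every piece has already been mapped to mood space coordinates and that some (expensive) detailed similarity function exists; all names, the Euclidean pre-selection metric, and the radius threshold are assumptions of this sketch:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity_list(query_id, locations, detailed_similarity, radius=0.3):
    """Two-stage similarity list generation (sketch).

    locations: dict mapping piece id -> mood space coordinates.
    detailed_similarity: expensive pairwise measure, called only
    for pieces pre-selected by the cheap mood space distance.
    """
    query_loc = locations[query_id]
    # Stage 1: cheap pre-selection in the mood space.
    candidates = [pid for pid, loc in locations.items()
                  if pid != query_id and euclidean(query_loc, loc) <= radius]
    # Stage 2: expensive similarity computed only for the candidates.
    scored = [(pid, detailed_similarity(query_id, pid)) for pid in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```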
Any music emotion/mood model proposed in the literature can be used to construct the N-dimensional mood space, for example the two-dimensional model proposed by Thayer [1]. The model adopts the theory that mood is entailed by two factors: stress (positive/negative) and energy (calm/energetic). According to Thayer's mood model, any piece of music can be described by a stress value and an energy value; these values give the coordinates of a given piece of music and hence determine the location of its emotion in the mood space.
The coordinates of a piece of music in the mood space are proposed to be generated by any machine learning algorithm, such as neural networks, decision trees, or Gaussian mixture models; one possible realization is sketched below.
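As one hedged illustration of this idea, a small neural network regressor could map low-level audio features to (stress, energy) coordinates, assuming annotated training data is available; the feature dimensionality, the placeholder labels, and the scikit-learn setup below are assumptions of this sketch:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical training data: rows of low-level audio features
# (e.g. tempo, spectral statistics) with annotated (stress, energy)
# coordinates; both are placeholders for this sketch.
X_train = np.random.rand(200, 12)          # placeholder feature vectors
y_train = np.random.rand(200, 2) * 2 - 1   # placeholder (stress, energy) in [-1, 1]

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

features_of_new_piece = np.random.rand(1, 12)
stress, energy = model.predict(features_of_new_piece)[0]
print(stress, energy)   # coordinates of the piece in the mood space
```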
After the location of a piece of music in the mood space has been determined, music pieces that are close to the given music in the mood space are identified using a simple distance measure such as the Euclidean distance, the Mahalanobis distance, or cosine angles.
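All three of these distance measures operate directly on the coordinate tuples; a brief sketch using NumPy/SciPy, where the example coordinates and the covariance estimate used for the Mahalanobis distance are assumptions:

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis, cosine

# Hypothetical mood space coordinates for a small database.
coords = np.array([[0.7, 0.2], [0.6, 0.3], [-0.5, -0.4], [0.1, 0.9]])
x, y = coords[0], coords[1]

print(euclidean(x, y))                 # straight-line distance
inv_cov = np.linalg.inv(np.cov(coords, rowvar=False))
print(mahalanobis(x, y, inv_cov))      # distance scaled by the data covariance
print(cosine(x, y))                    # 1 minus the cosine of the angle between x and y
```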
To generate a similarity list for a piece of music x, a similarity measure is introduced to compute the similarity between music x and each pre-selected music piece. The similarity measure can be any known similarity measure algorithm; e.g., each piece of music is modelled by a Gaussian mixture model, and any model distance criterion (see e.g. [3]) can then be used to measure the distance between the two Gaussian models.
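As a hedged sketch of one such measure, following the Monte-Carlo idea mentioned above (cf. [3]) rather than any specific published method: fit a Gaussian mixture to the feature frames of each piece, sample artificial frames from each model, and compare log-likelihoods under both models. The feature dimensionality, component count, and sample count are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_distance(frames_a, frames_b, n_components=3, n_samples=500, seed=0):
    """Symmetrized Monte-Carlo log-likelihood distance between two
    pieces, each represented by a matrix of per-frame features
    (a sketch of the idea referenced as [3], not the exact method)."""
    gmm_a = GaussianMixture(n_components, random_state=seed).fit(frames_a)
    gmm_b = GaussianMixture(n_components, random_state=seed).fit(frames_b)
    samples_a, _ = gmm_a.sample(n_samples)
    samples_b, _ = gmm_b.sample(n_samples)
    # How much worse does each model explain the other's samples?
    d_ab = gmm_a.score(samples_a) - gmm_b.score(samples_a)
    d_ba = gmm_b.score(samples_b) - gmm_a.score(samples_b)
    return d_ab + d_ba   # >= 0 in expectation; smaller means more similar

# Placeholder feature frames (e.g. MFCC vectors) for two pieces.
rng = np.random.default_rng(0)
piece_x = rng.normal(0.0, 1.0, size=(300, 8))
piece_y = rng.normal(0.2, 1.0, size=(300, 8))
print(gmm_distance(piece_x, piece_y))
```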
The main advantage is the significant reduction in the computation needed to generate music similarity lists for a large database, without affecting the similarity ranking performance from a perceptual point of view.
The invention will now be explained based on preferred embodiments thereof and by taking reference to the accompanying schematic figures.
In the following, functionally and structurally similar or equivalent elements will be denoted with the same reference symbols. A detailed description will not be repeated for each of their occurrences.
The mood space M shown in the figure is two-dimensional and is spanned by the stress and energy coordinates S( ) and E( ). For a first sample of audio data x with measurement values S(x) and E(x), the location LADx in said mood space M is given by:
LADx := LAD(S(x), E(x)) = (S(x), E(x)).
The same may hold for second and third audio data y and z with measurement values S(y), E(y) and S(z), E(z), respectively. According to the general properties for the locations or positions LADy and LADz in said mood space M the following expressions are given:
LADy := LAD(S(y), E(y)) = (S(y), E(y))
and
LADz := LAD(S(z), E(z)) = (S(z), E(z)).
Additionally, certain regions of the complete mood space M can be assigned to certain characteristic moods such as contentment, depression, exuberance, and anxiousness.
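In a two-dimensional stress/energy mood space such regions can, for instance, be the four quadrants; the sign convention and the quadrant-to-mood assignment below are assumptions of this sketch, and real regions need not be axis-aligned:

```python
def mood_region(stress, energy):
    """Map a (stress, energy) location to one of four mood regions
    (a sketch; the quadrant assignment is an assumption and real
    regions need not be axis-aligned quadrants)."""
    if stress >= 0:
        return "exuberance" if energy >= 0 else "contentment"
    else:
        return "anxiousness" if energy >= 0 else "depression"

print(mood_region(0.7, 0.8))    # positive and energetic -> exuberance
print(mood_region(-0.6, -0.5))  # negative and calm      -> depression
```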
After an initialization step START, a sample of audio data AD is received as input I in a first method step S1.
Then, in a following step S2, information is provided with respect to the mood space underlying the inventive method. Therefore, in step S2 respective mood space data MSD are provided which define and/or which are descriptive or representative for said mood space M according to which audio data AD, AD′ can be classified and compared.
A step S3 follows wherein a mood space location LAD for said given audio data AD within said mood space M is generated. This step contains a substep S3a for analyzing said audio data AD, e.g. with respect to a given feature set FS which might be obtained from a respective database. In the following substep S3b, the mood space location LAD for said audio data AD is generated as a function of said audio data AD:
LAD := LAD(AD).
In the following step S4, a comparison mood space location CL is received, for instance also from a database. Said comparison mood space location CL might depend on one or a plurality of additional audio data AD′ to which the given audio data AD shall be compared. Additionally, in this case the comparison mood space location CL might also depend on the feature set FS underlying the present classification scheme.
In the following step S5, the location LAD for the given sample of audio data AD and the comparison location CL are compared in order to generate respective comparison data CD. Said comparison data CD might, for example, be realized as a distance between said locations LAD and CL.
In the following step S6, the comparison data CD are given as an output O.
Finally, the process demonstrated in the figure ends.
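Putting steps S1 to S6 together, a minimal end-to-end sketch might look as follows; the feature extractor, the mood model, and all names are placeholders (assumptions), and only the control flow mirrors the steps described above:

```python
import math

def classify(audio_data, mood_model, comparison_location):
    # S1: audio data AD received as input I (here: raw samples).
    # S2: mood space data MSD are given implicitly by mood_model.
    # S3: generate the mood space location LAD for AD.
    features = extract_features(audio_data)        # S3a (placeholder analysis)
    lad = mood_model(features)                     # S3b
    # S4: comparison mood space location CL (e.g. from a database).
    cl = comparison_location
    # S5: compare LAD and CL, e.g. by their distance.
    cd = math.dist(lad, cl)
    # S6: return the comparison data CD as output O.
    return cd

def extract_features(audio_data):
    # Placeholder feature extraction (assumption for the sketch).
    mean = sum(audio_data) / len(audio_data)
    return [mean, max(audio_data) - min(audio_data)]

# Hypothetical usage with a trivial mood model mapping features to (S, E):
cd = classify([0.1, 0.4, -0.2, 0.3], lambda f: (f[0], f[1]), (0.0, 0.5))
print(cd)
```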
Number | Date | Country | Kind
---|---|---|---
05005994.8 | Mar 2005 | EP | regional

Filing Document | Filing Date | Country | Kind | 371(c) Date
---|---|---|---|---
PCT/EP2006/002398 | 3/15/2006 | WO | 00 | 8/25/2008