The invention relates to an apparatus and method for modelling layers in a music signal. The invention also relates to an apparatus and method for modelling chords of a music signal. The invention also relates to an apparatus and method for modelling music region content of a music signal. The invention also relates to an apparatus and method for tokenizing a segmented music signal. The invention also relates to an apparatus and method for deriving a vector for a frame of a tokenized music signal. The invention also relates to an apparatus and method for determining a similarity between a query music segment and a stored music segment.
In recent years, increasingly powerful technology has made it easier to compress, distribute and store digital media content. There is an increasing demand for the development of tools for automatic indexing and retrieval of music recordings. One task of music retrieval is to rank a collection of music signals according to the relevance of each music signal to a query. One common format for popular songs in a music information retrieval (MIR) application is raw audio. The challenges of an MIR system include effective indexing of music information that supports quick run-time search, accurate query representation as the music descriptor, and robust retrieval modelling that ranks the music database by relevance score.
Many MIR systems have been reported; two such examples are references [1][2]. The MIR community initially focused on developing text-based MIR systems in which both the database music and the music query portions were in MIDI format, and the information was retrieved by matching the melody of the query portion with the database portions as in, for example, references [5][6][7][11][24]. Since the melody information of both query portions and song database portions is text based (MIDI), efforts in this area were devoted to database organization of the music information (monophonic and/or polyphonic nature) and to text-based retrieval models. The retrieval models in those systems included dynamic programming (DP) [8][12][24] and n-gram-based matching [6][11][24].
Recently, with the advances in information technologies, the MIR community has started looking into developing MIR systems for music in raw audio format. One popular example of such systems is the query-by-humming system [5][12], which allows a user to input the query by humming a melody line via a microphone. To do so, research efforts have been made to extract the pitch contour from the hummed audio, and to build a retrieval model that measures the relevance between the pitch contour of the query and the melody contours of the intended music signals. Autocorrelation [5], harmonic analysis [12] and statistical modelling via audio feature extraction [13] are some of the techniques that have been employed for extracting pitch contours from hummed queries. In [4][9][10], fixed-length audio segmentation and spectral and pitch-contour-sensitive features are discussed for measuring similarity between music clips.
However, the melody-based retrieval model is insufficient for MIR because it is highly possible that different songs share an identical melody contour. Hitherto, existing MIR systems simply have not addressed this and other issues.
The invention is defined in the independent claims. Some optional features of the invention are defined in the dependent claims.
A system in accordance with one or more of the independent claims provides a novel framework for music content indexing and retrieval. In this framework, a piece of music, such as a popular song, can be characterised by layered music structure information including timing, harmony/melody and music region content. These properties will be discussed in more detail below. Such a system uses, for example, chord and acoustic events in the layered music information as indexing terms to model the piece of music in vector space.
In general, a system in accordance with one or more of the independent claims may provide musicians and scholars with tools to search for and study different musical pieces having similar music structures (rhythmic structure, melody/harmony structure, music descriptions, etc.), and may help entertainment service providers index and retrieve songs of similar tone and semantics in response to user queries that are in the form of music clips, referred to as query-by-example.
The present invention will now be described, by way of example only, and with reference to the accompanying drawings in which:
A challenge for MIR of music in raw audio format, addressed by the techniques disclosed herein, is to represent the music content including harmony/melody, vocal and song structure information holistically.
The framework 100 of
The first layer 102 is the foundation of the pyramidal music structure of
It is noted that popular songs are similar in many ways; for example, they may have a similar beat cycle (common beat patterns), similar harmony/melody (common chord patterns), similar vocals (similar lyrics) and similar semantic content (music pieces or excerpts that create similar auditory scenes or sensations). Using the musical structure of
Musical signals representing songs/pieces of music are indexed in a database using vectors of the event models in layers 102, 104, 106. The retrieval process is implemented using vectors of n-gram statistics of the vectors of these layers of a query music segment. An overall architecture for an apparatus for modelling layers in a music signal is illustrated in
A more detailed description of rhythm modelling module 202 of
In one implementation, the frequency transient analysis module 312 performs frequency transient analysis for the first to fourth octave sub-bands. The reason for this is discussed below. In one implementation, the energy transient analysis module 314 performs energy transient analysis for the fifth to eighth octave sub-bands. Again, the reason for this is discussed below. Apparatus 202 also comprises a segmentation module 318 for deriving a frame of the music signal. The music signal frame has a length corresponding to the smallest note length. The segmentation module 318 also designates a reference point in the music signal corresponding to a first dominant onset of the music signal.
Apparatus 202 also comprises a tempo rhythm cluster (TRC) module (not shown) which derives a tempo rhythm cluster of the music signal from the smallest note length and multiples thereof. The inventors have found that pieces of popular music usually have a tempo of 60 to 200 BPM (beats per minute). In one implementation, this range is divided into clusters 20 BPM wide. Thus songs of 60-80 BPM are grouped into a corresponding cluster 1, cluster 2 is a group of songs with tempos in the range of 81-100 BPM, and so on. For a given query clip, the clip's tempo is computed after detection of the smallest note length.
The search space pointer is then set not only to the cluster in which the query tempo falls but also to the clusters in which integer multiples of the query tempo fall. This is discussed in more detail below with respect to
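By way of illustration, the following Python sketch shows how such tempo/rhythm clusters might be assigned and how the search space could be expanded to the clusters of integer multiples of the query tempo. The 20 BPM cluster boundaries follow the description above; the function names, the conversion from note length to BPM and the maximum multiple considered are illustrative assumptions.

```python
import numpy as np

# Hypothetical cluster boundaries: 60-80, 81-100, ..., 181-200 BPM.
CLUSTER_EDGES = [(60, 80), (81, 100), (101, 120), (121, 140),
                 (141, 160), (161, 180), (181, 200)]

def bpm_from_note_length(smallest_note_sec, beats_per_note=1):
    """Convert the detected smallest note length (seconds) to a tempo in BPM."""
    return 60.0 / (smallest_note_sec * beats_per_note)

def cluster_index(bpm):
    """Return the TRC index the given tempo falls into, or None if out of range."""
    for idx, (lo, hi) in enumerate(CLUSTER_EDGES, start=1):
        if lo <= bpm <= hi:
            return idx
    return None

def candidate_clusters(query_bpm, max_multiple=3):
    """Clusters to search: the query tempo's own cluster plus the clusters
    of its integer multiples (an assumption for how many multiples to try)."""
    clusters = set()
    for m in range(1, max_multiple + 1):
        idx = cluster_index(query_bpm * m)
        if idx is not None:
            clusters.add(idx)
    return sorted(clusters)

# Example: a query at 70 BPM also points to the cluster containing 140 BPM;
# 210 BPM falls outside the 60-200 BPM range and is ignored.
print(candidate_clusters(70))   # -> [1, 4]
```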
Additionally, apparatus 202 comprises a silence detection module 320 for detecting that a frame of the music signal is a silent frame from a short-time energy calculation of the frame. Alternatively, silence detection module 320 is provided as a separate module distinct from apparatus 202. After apparatus 202 segments the music into smallest-note-size signal frames, silence detection module 320 performs a calculation of short-time energy (STE), such as in [14], for the or each frame of the music signal. If the normalised STE is below a predefined threshold (say, less than 0.1 on the normalised scale), then that frame is denoted a silent (S) frame and is excluded from any of the processing of the music signal described below. This may be done by, for example, tokenizing module 902 of
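A minimal sketch of the silence detection step described above, assuming the beat-space frames are supplied as arrays of samples. The 0.1 normalised threshold is taken from the text, while the normalisation by the maximum frame energy is an assumption.

```python
import numpy as np

def detect_silent_frames(frames, threshold=0.1):
    """Mark beat-space frames as silent when their normalised short-time
    energy (STE) falls below the threshold (0.1 on the normalised scale)."""
    # Short-time energy of each frame: sum of squared samples.
    ste = np.array([np.sum(f.astype(np.float64) ** 2) for f in frames])
    # Normalise so that the most energetic frame has STE = 1.0 (an assumption).
    ste_norm = ste / (ste.max() + 1e-12)
    return ste_norm < threshold  # boolean mask: True -> silent (S) frame

# Example with three synthetic frames: loud noise, near-silence, a quieter tone.
rng = np.random.default_rng(0)
frames = [rng.normal(0, 1.0, 1024),
          rng.normal(0, 0.01, 1024),
          0.8 * np.sin(2 * np.pi * 220 * np.arange(1024) / 44100)]
print(detect_silent_frames(frames))  # e.g. [False  True False]
```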
The fundamental step of audio content analysis is signal segmentation, where the signal within a frame can be considered quasi-stationary. With quasi-stationary music frames, apparatus 202 extracts features to describe the content and models the features with statistical techniques. The adequacy of the signal segmentation has an impact on the system-level performance of music information extraction, modelling and retrieval.
Earlier music content analysis approaches [4][9][10] use fixed-length signal segmentation only.
A music note can be considered as the smallest measuring unit of the music flow. Usually smaller notes (1/8, 1/16 or 1/32 notes) are played by one or more musicians in the bars to align the melody with the rhythm of the lyrics and to fill in the gaps between lyrics. Therefore the information within the duration of a music note can be considered quasi-stationary. The disclosed techniques segment a music signal into frames of the smallest note length instead of the fixed-length frames used previously. Since the inter-beat interval of a song is equal to integer multiples of the smallest note, this music framing strategy is called Beat Space Segmentation (BSS). BSS captures the timing (rhythm) information (the first structural layer of
BSS provides a means to detect music onsets and perform the smallest note length calculation.
Apparatus 202 then segments the sub-band signals into 60 ms frames with 50% overlap. Both the frequency and energy transients are analyzed using a method similar to that in [20]. Frequency transient analysis module 312 measures the frequency transients in terms of progressive distances in octave sub-bands O1 to O4, because the fundamental frequencies (F0s) and harmonics of music notes in popular music are strong in these sub-bands. The energy transient analysis module measures the energy transients in sub-bands O5 to O8, as the energy transients are found to be stronger in these sub-bands.
Equation 1 describes the computation of the final (dominant) onset at time t, On(t), which is the weighted summation of the sub-band onsets SO_r(t):

On(t) = Σ_{r=1}^{8} w(r) · SO_r(t)   (1)
The output of moving threshold calculation module 316 is supplied to the octave sub-band onset determination modules 310. Summation module 302 derives a composite onset of the music signal from a weighted summation of the octave sub-band onsets of the music signal output by modules 310. It has been found that the weights w(1), w(2), . . . , w(8) of the weight vector w with elements {0.6, 0.9, 0.7, 0.9, 0.7, 0.5, 0.8, 0.6} provide the best set of weightings for calculating the dominant onsets in the music signal.
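The following sketch applies Equation 1 with the weight values given above; the array shapes and the toy onset curves are illustrative.

```python
import numpy as np

# Weights reported above for combining the eight octave sub-band onsets.
W = np.array([0.6, 0.9, 0.7, 0.9, 0.7, 0.5, 0.8, 0.6])

def composite_onset(sub_band_onsets, weights=W):
    """Equation 1: On(t) = sum_r w(r) * SO_r(t).

    sub_band_onsets: array of shape (8, T), one onset strength curve per
    octave sub-band (rows 0-3 from frequency transients, rows 4-7 from
    energy transients)."""
    sub_band_onsets = np.asarray(sub_band_onsets, dtype=float)
    return weights @ sub_band_onsets  # shape (T,)

# Toy example: 8 sub-bands, 5 time steps.
so = np.zeros((8, 5))
so[:, 2] = 1.0               # all sub-bands agree on an onset at t = 2
print(composite_onset(so))   # -> [0. 0. 5.7 0. 0.]
```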
The output of summation module 302 is supplied to autocorrelation module 304, where an autocorrelation of the composite onset is performed to derive an estimated inter-beat-proportional note length. Interval length determination module 304 varies this estimated note length to check for patterns of equally spaced intervals between dominant onsets On(.). In one implementation, the interval length determination module uses a dynamic programming module, employing known dynamic programming techniques, to check for these patterns. A repeating interval length (the most commonly found smallest interval that is also an integer fraction of the other, longer intervals) is taken as the smallest note length by note length determination module 308. A segmentation module 318 is provided to segment the music signal into one or more music frames according to the smallest note length. Segmentation module 318 also designates a reference point in the music signal corresponding to a first dominant onset of the music signal as determined by summation module 302.
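A simplified sketch of this note-length estimation step: autocorrelation of the composite onset curve followed by a peak search over a plausible lag range. The dynamic-programming check for equally spaced intervals is omitted, and the lag range and subdivision limit are assumptions.

```python
import numpy as np

def estimate_smallest_note_lag(onset_curve, fps, min_bpm=60, max_bpm=200,
                               max_subdivision=8):
    """Estimate the smallest note length (in onset-curve frames) by
    autocorrelation of the composite onset curve.

    fps: onset-curve frames per second; the search range assumes the
    smallest note is a subdivision (up to 1/8) of a 60-200 BPM beat."""
    x = np.asarray(onset_curve, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]   # non-negative lags
    max_lag = int(fps * 60.0 / min_bpm)                 # longest beat period
    min_lag = max(1, int(fps * 60.0 / max_bpm / max_subdivision))
    lag = min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))
    return lag, lag / fps   # (frames, seconds)

# Example: a synthetic onset curve with impulses every 25 frames (0.25 s at 100 fps).
fps = 100
curve = np.zeros(1000)
curve[::25] = 1.0
print(estimate_smallest_note_lag(curve, fps))   # -> (25, 0.25)
```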
The processing of the music signal is illustrated with reference to
In the apparatus 202, an assumption is made that the tempo of the song is constant. Therefore the starting point of the song is used as the reference point for BSS. This is illustrated in
The smallest note length and its multiples form the tempo/rhythm cluster (TRC). By comparing the TRC of the query clip with the TRCs of the songs in the database, the search space is narrowed down.
Silence is defined as a segment of imperceptible music, including unnoticeable noise and very short clicks. Apparatus 202 calculates the short-time energy function to detect the silent frames.
Referring again to
The progression of music chords describes the harmony of music. A chord is constructed by playing a set of notes (more than two) simultaneously. Typically there are 4 chord types (Major, Minor, Diminished and Augmented) and 12 chords per chord type in western music. For efficient chord detection, the tonal characteristics (the fundamental frequencies (F0s), the harmonics and the sub-harmonics) of the music notes which comprise a chord should be well characterized by the feature. Goldstein (1973) [17] and Terhardt (1974) [18] proposed two psycho-acoustical approaches for complex tones: harmonic representation and sub-harmonic representation, respectively. It is noted that the harmonics and sub-harmonics of a music note are closely related to the F0 of another note. For example, the third and sixth harmonics of note C4 are close to (related to) the fundamental frequencies F0 of G5 and G6. Similarly, the fifth and seventh sub-harmonics of note E7 are close to the F0 of C5 and F#4 respectively.
A more detailed view of the harmony modelling module 204 of
Referring to
In the chord modelling/detection system of
The reasons for using filters to extract tonal characteristics of notes are primarily two-fold:
It has been found that the tonal characteristics in an individual octave can effectively represent the music chord. The two-layer hierarchical model for music chord modelling of
PCP_n^{OC}(α) = [S(.) · W(OC, α)]^2,   OC = 1 . . . 7,   α = 1 . . . 12   (3)

W(OC, α) is the filter whose position and pass-band frequency range vary with both the octave index OC and the αth note in that octave. If the octave index is 1, then the respective octave is C2-B2.
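A sketch of the per-octave PCP computation of Equation 3, assuming a simple rectangular quarter-tone filter as a stand-in for W(OC, α) (the exact filter shape is not reproduced); the helper names are illustrative.

```python
import numpy as np

A4 = 440.0

def note_freq(octave_index, alpha):
    """Fundamental frequency of the alpha-th semitone (1=C ... 12=B) of the
    octave with index 1..7, where index 1 corresponds to C2-B2."""
    midi = 12 * (octave_index + 2) + (alpha - 1)    # octave 1, alpha 1 -> C2 (MIDI 36)
    return A4 * 2.0 ** ((midi - 69) / 12.0)

def pcp_vector(spectrum, freqs, octave_index):
    """Sketch of Equation 3: PCP_n^OC(alpha) = [S(.) . W(OC, alpha)]^2, using a
    rectangular +/- quarter-tone filter around each note as a stand-in for W."""
    pcp = np.zeros(12)
    for alpha in range(1, 13):
        f0 = note_freq(octave_index, alpha)
        lo, hi = f0 * 2 ** (-1 / 24), f0 * 2 ** (1 / 24)
        w = ((freqs >= lo) & (freqs < hi)).astype(float)
        pcp[alpha - 1] = float(np.dot(spectrum, w)) ** 2
    return pcp

# Toy example: a spectral peak at C4 (~261.6 Hz) seen from octave index 3 (C4-B4).
freqs = np.linspace(0, 4000, 8192)
spectrum = np.exp(-0.5 * ((freqs - 261.63) / 3.0) ** 2)
print(np.argmax(pcp_vector(spectrum, freqs, octave_index=3)))   # -> 0 (note C)
```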
Seven respective statistical models 508 are trained with the PCP vectors 506 in the first layer of the model using the training data set. The same training data is then fed to the first layer as test data, and the outputs given by the seven models 508 in the first layer are stored in a memory (not shown). Seven multi-dimensional probabilistic vectors 510 are constructed from the outputs of the layer-one models 508, and these are then used to train the second layer model 512 of the chord model.
That is, the second layer model 512 is trained with the probabilistic feature vector outputs 510 of the first layer models 508. In one implementation, four Gaussian mixtures are used for each model in the first and second layers 508, 512. This two-layer modelling can be visualized as first transforming the feature-space representation of the tonal characteristics of the music chord into a probabilistic space at the first layer 508, and then modelling that space at the second layer 512. This two-layer representation is able to model 48 music chords in the chord detection system 204 of
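The two-layer chord model described above might be sketched as follows for a single chord, using scikit-learn Gaussian mixtures as a stand-in for models 508 and 512. The class and method names are illustrative; in the described system, 48 such models would be trained, one per chord.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class TwoLayerChordModel:
    """Minimal sketch of the two-layer chord model for ONE chord: seven
    first-layer GMMs (one per octave, trained on 12-dim PCP vectors) and a
    second-layer GMM trained on the 7-dim vector of first-layer scores.
    Four mixture components per GMM, as described above."""

    def __init__(self, n_mix=4, seed=0):
        self.layer1 = [GaussianMixture(n_components=n_mix, covariance_type='diag',
                                       random_state=seed) for _ in range(7)]
        self.layer2 = GaussianMixture(n_components=n_mix, covariance_type='diag',
                                      random_state=seed)

    def _layer1_scores(self, pcp_frames):
        # pcp_frames: (n_frames, 7, 12) -> (n_frames, 7) log-likelihood vectors
        return np.stack([self.layer1[oc].score_samples(pcp_frames[:, oc, :])
                         for oc in range(7)], axis=1)

    def fit(self, pcp_frames):
        for oc in range(7):                        # train the 7 octave models
            self.layer1[oc].fit(pcp_frames[:, oc, :])
        # feed the same training data back through layer 1, then train layer 2
        self.layer2.fit(self._layer1_scores(pcp_frames))
        return self

    def score(self, pcp_frames):
        """Frame-level log-likelihood of this chord given 7-octave PCP input."""
        return self.layer2.score_samples(self._layer1_scores(pcp_frames))

# Example with random stand-in PCP data.
rng = np.random.default_rng(0)
train = rng.random((200, 7, 12))
model = TwoLayerChordModel().fit(train)
print(model.score(rng.random((3, 7, 12))).shape)   # -> (3,)
```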
The training process 1720 of the first layer GMMs 508 is also illustrated in
As discussed above, PV, PI, IMV and S are the regions that can be seen in a song (third layer 106 of
An apparatus 700 for modelling of music region content of layer three of
The apparatus comprises, principally, octave scale filter banks (the octave scale/frequency transformation of which is illustrated in
Optionally, apparatus 700 also includes first and second Gaussian mixture modules 708, 710, which are trained on the OSCC feature vectors constructed per frame, for use in the tokenization process described below with respect to
A sung vocal line carries more descriptive information about the song than other regions. In the PI regions, extracted features must be able to capture the information generated by the lead instruments, which typically define the tune/melody. To this end, the apparatus of
The output Y(b) of the bth filter is computed according to Equation 4, where S(.) is the frequency spectrum in decibels (dB), H_b(.) is the bth filter, and m_b and n_b are the boundaries of the bth filter.
Equation 5 describes the computation of the βth cepstral coefficient, where k_b, N_f and F_n are the centre frequency of the bth filter, the number of frequency sampling points and the number of filters respectively (F_n = 12 in the present case).
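Since Equations 4 and 5 are not reproduced here, the following is only a hedged sketch: the filter output Y(b) is computed as the dB spectrum weighted by an octave-scale filter, and a standard type-II DCT stands in for the cepstral step of Equation 5. The filter boundaries, filter shape and DCT formulation are assumptions.

```python
import numpy as np

def octave_scale_filterbank(freqs, f_low=62.5, f_high=16000.0, n_filters=12):
    """Rectangular filters spaced uniformly on a log2 (octave) frequency axis;
    an illustrative stand-in for the octave-scale filters H_b (the exact
    filter shapes and boundaries m_b, n_b are not reproduced here)."""
    edges = np.logspace(np.log2(f_low), np.log2(f_high), n_filters + 1, base=2.0)
    return np.stack([((freqs >= edges[b]) & (freqs < edges[b + 1])).astype(float)
                     for b in range(n_filters)])

def oscc_like_coeffs(power_spectrum, freqs, n_coeffs=12):
    """OSCC-style feature sketch: Y(b) is the dB spectrum weighted by the bth
    filter (per the description of Equation 4); a type-II DCT over the filter
    outputs stands in for the cepstral step of Equation 5 (an assumption)."""
    s_db = 10.0 * np.log10(np.asarray(power_spectrum) + 1e-12)   # spectrum in dB
    H = octave_scale_filterbank(freqs)                           # (12, n_bins)
    Y = H @ s_db                                                 # filter outputs Y(b)
    Fn = len(Y)                                                  # Fn = 12 filters
    b = np.arange(Fn)
    return np.array([np.sum(Y * np.cos(np.pi * beta * (b + 0.5) / Fn))
                     for beta in range(1, n_coeffs + 1)])        # cepstral coefficients

# Example on a synthetic 2048-sample frame of 44.1 kHz audio.
sr, n = 44100, 2048
t = np.arange(n) / sr
frame = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1320 * t)
spec = np.abs(np.fft.rfft(frame * np.hanning(n))) ** 2
freqs = np.fft.rfftfreq(n, d=1.0 / sr)
print(oscc_like_coeffs(spec, freqs).round(2))
```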
Singular values (SVs) indicate the variance of the corresponding structure. The comparatively high singular values of the diagonal matrix of the SVD describe the number of dimensions with which the structure can be represented orthogonally. SVD is a technique for checking the level of correlation among the feature coefficients for groups of information classes. Higher singular values in the diagonal matrix resulting from the decomposition indicate less correlation between the coefficients of a particular feature for an information class. If the feature coefficients are less correlated, then modelling of the information using that feature is more successful. That is, smaller singular values indicate correlated information in the structure and are considered to be noise. Singular value decomposition (SVD) is performed over feature matrices extracted from PI and V regions with respect to the process 1900 of
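A minimal sketch of the SVD check described above: decompose a frames-by-coefficients feature matrix and inspect its singular value profile. The centering step and the toy data are assumptions.

```python
import numpy as np

def singular_value_profile(feature_matrix):
    """Return the singular values of a (frames x coefficients) feature matrix.
    A flatter profile (more comparatively high singular values) indicates less
    correlated coefficients, i.e. a feature better suited to modelling that
    region class; a single dominant value indicates strong correlation."""
    # Center the coefficients so the singular values reflect variance structure.
    X = feature_matrix - feature_matrix.mean(axis=0, keepdims=True)
    _, s, _ = np.linalg.svd(X, full_matrices=False)
    return s

# Toy comparison: nearly uncorrelated coefficients vs. strongly correlated ones.
rng = np.random.default_rng(0)
uncorrelated = rng.normal(size=(500, 20))
base = rng.normal(size=(500, 1))
correlated = base + 0.05 * rng.normal(size=(500, 20))
print(singular_value_profile(uncorrelated)[:5].round(1))
print(singular_value_profile(correlated)[:5].round(1))   # one dominant value
```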
As shown in
An apparatus or apparatus module for tokenizing the music signal, constructing a vector of the tokenized music signal and comparing vectors of stored and query music segments is illustrated in
Vector space modelling has been used previously in, for example, text document analysis. It has not hitherto been used in music signal analysis. Perhaps the principal reason for this is that, unlike the modelling of text documents, which uses words or phrases as indexing terms, a music signal is a running digital signal without obvious anchors for indexing. Thus, the primary challenges for indexing music signals are two-fold. First, good indexing anchors must be determined. Secondly, a good representation of the music content for search and retrieval must be derived. The apparatus for tokenizing the music signal and the apparatus for deriving a vector representation of the music signal again make use of the multi-layer model of
As discussed above with respect to
In general terms, an apparatus for tokenizing the music signal—that is, deriving the “vocabulary” of the song for vector representation of the song—comprises a tokenizing module 902 to receive a frame of a segmented music signal (segmented by the apparatus of
In one implementation, the token symbol comprises a chord event and the token library comprises a library of modelled chords, the tokenizing module being configured to determine a probability the frame of the music signal corresponds with a chord event.
Suppose an apparatus comprises 48 trained frame-based chord models 508, 512 as shown in
The chord events represent the harmony of music. Note that a music signal is characterised by both harmony sequence and the vocal/instrumental patterns. To describe the music content, the vocal and instrumental events of third layer 106 are defined. Thus, in one implementation, the token symbol comprises an acoustic event and the token library comprises a library of acoustic events, and the tokenizing module determines a probability the frame of the music signal corresponds with an acoustic event. The acoustic event may comprise at least one of a voice event or an instrumental event.
Pure instrumental (PI) and vocal (V) regions contain the descriptive information about the music content of a song. A song can be thought of as a sequence of interweaved PI and V events, called acoustic events. Two Gaussian mixture models (GMMs), each with 64 Gaussian components, are trained to model these two regions with the 20 OSCC features extracted from each frame, as described above with respect to the music region modelling apparatus of
The contents in silence regions (S) are indexed with zero observation. Thus, the disclosed techniques use the events as indexing terms to design a vector for a music segment.
The chord and acoustic decoders serve as the tokenizers for the music signal. The tokenization process results in two synchronized streams of events, a chord sequence and an acoustic sequence, for each music signal. An event is represented by a tokenization symbol, in a text-like format. It is noted that n-gram statistics have been used in natural language processing tasks to capture short-term substring constraints, such as letter n-grams in language identification [22] and spoken language identification [23]. If one thinks of the chord and acoustic tokens as the letters of music, then a music signal is an article of chord/acoustic transcripts. Similar to letter n-grams in text, it has been found that it is possible to use token n-grams of music as the indexing terms, which aims at capturing the short-term syntax of the musical signal. The statistics of the tokens themselves represent the token unigram. Thus, a vector defining a music segment (which can be a music frame or a multiple thereof) can be derived. Thus, an apparatus for deriving a vector for a frame of a tokenized music signal comprises a vector construction module configured to construct a vector having a vector element defining a token symbol (e.g. a chord or an acoustic event) score for the frame of the tokenized music signal.
A more detailed view of the tokenization process is illustrated in
Vector space modelling (VSM) has become a standard tool in text-based IR systems since its introduction decades ago [21]. It uses a vector to represent a text document. One of the advantages of the method is that it makes partial matching possible. Known systems derive the distance between documents easily as long as the vector attributes are well defined characteristics of the documents. Each coordinate in the vector reflects the presence of the corresponding attribute, which is typically a term. The novel techniques disclosed herein define chord/acoustic tokens in a music signal. These are used as terms in an article. Thus it has been found that it is now possible to use a vector to represent a music segment. If a music segment is thought of as an article of chord/acoustic tokens, then the statistics of the presence of the tokens or token n-grams describe the content of the music. A vector construction module 904 constructs a vector having a vector element defining a token symbol score for the frame of the tokenized music signal.
Suppose a music token sequence t1 t2 t3 t4 is defined. The tokenizing module 902 derives the unigram statistics from the token sequence itself. Module 902 derives the bigram statistics from t1(t2) t2(t3) t3(t4) t4(#), where the acoustic vocabulary is expanded over each token's right context. The # sign is a placeholder for free context. In the interest of manageability, the present technique only uses statistics up to bigrams, but it is also possible to derive trigram statistics from t1(#,t2) t2(t1,t3) t3(t2,t4) t4(t3,#) to account for both left and right contexts.
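A small sketch of the unigram and right-context bigram counting described above; the chord token names are illustrative.

```python
from collections import Counter

def ngram_counts(tokens):
    """Unigram and right-context bigram counts for a token sequence, following
    the t1(t2) t2(t3) t3(t4) t4(#) expansion described above ('#' marks the
    free right context of the final token)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:] + ['#']))
    return unigrams, bigrams

# Example with a short chord-token transcript (token names are illustrative).
seq = ['Cmaj', 'Cmaj', 'Gmaj', 'Amin']
uni, bi = ngram_counts(seq)
print(uni)   # Counter({'Cmaj': 2, 'Gmaj': 1, 'Amin': 1})
print(bi)    # Counter({('Cmaj','Cmaj'):1, ('Cmaj','Gmaj'):1, ('Gmaj','Amin'):1, ('Amin','#'):1})
```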
Thus, for an acoustic vocabulary of |c| = 48 token entries in the chord stream, we have 48 unigram frequency items f_n^i in the chord unigram vector f_n = {f_n^1, . . . , f_n^i, . . . , f_n^{48}} as in
Referring to
Alternatively, the vector construction module defines the vector element(s) as a probability score of whether the frame corresponds with a token symbol (chord event). This is illustrated in
In the same way, vector construction module 904 constructs an acoustic vector of two unigram frequency items for the acoustic stream. For simplicity, only the chord vector f_n is formulated next.
To capture the short-term dynamics, the vector construction module is also configured to derive a bigram representation for two consecutive frames. As such, a chord bigram vector of [48×48 = 2304] dimensions is built, f_n′ = {f_n^{1,1}, . . . , f_n^{i,j}, . . . , f_n^{48,48}}, where if both f_n^i = 1 and f_{n+1}^j = 1, then f_n^{i,j} = 1; otherwise f_n^{i,j} = 0. Similarly, an acoustic bigram vector of [2×2 = 4] dimensions is formed.
Thus, for a music segment of N frames, a chord unigram vector f_N = {f_N^1, . . . , f_N^i, . . . , f_N^{48}} is constructed by aggregating the frame vectors, with the ith element as

f_N^i = Σ_{n=1}^{N} f_n^i   (8)
The chord bigram vector of [48×48 = 2304] dimensions, f_N′ = {f_N^{1,1}, . . . , f_N^{i,j}, . . . , f_N^{48,48}}, is constructed in a similar way, with the (i, j)th element as

f_N^{i,j} = Σ_{n=1}^{N} f_n^{i,j}   (9)
The acoustic vector can be formulated in a similar way with a two-dimensional vector for unigram and [2×2=4] dimensional vector for bigram.
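The hard-indexing aggregation of Equations 8 and 9 might be sketched as follows for the chord stream; the function name and the toy token indices are illustrative.

```python
import numpy as np

def hard_index_vectors(token_ids, vocab_size=48):
    """Hard-indexing sketch of Equations 8 and 9: per-frame 1-of-48 unigram
    indicators and 48x48 bigram indicators are aggregated over the N frames
    of a segment by summation."""
    N = len(token_ids)
    f_uni = np.zeros(vocab_size)                    # f_N^i, Eq. (8)
    f_bi = np.zeros((vocab_size, vocab_size))       # f_N^{i,j}, Eq. (9)
    for n, i in enumerate(token_ids):
        f_uni[i] += 1.0
        if n + 1 < N:
            f_bi[i, token_ids[n + 1]] += 1.0
    return f_uni, f_bi.ravel()                      # bigram treated as a 2304-dim array

# Example: a 6-frame segment tokenized to chord indices (0..47).
ids = [3, 3, 10, 10, 10, 7]
u, b = hard_index_vectors(ids)
print(u[3], u[10], u[7])          # -> 2.0 3.0 1.0
print(b.reshape(48, 48)[10, 10])  # -> 2.0 (chord 10 followed by chord 10 twice)
```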
The hard-indexing scheme above provides acceptable results. Although it would be convenient to derive the term count from the token sequences of a query or a music segment, it is found that tokenization is affected by many factors and does not always produce identical token sequences for two similar music segments. The difference could be due to variation in beat detection or variation in music production between the query and the intended music. The inconsistency between the tokenization of the query and that of the intended music presents an undesired mismatch as far as MIR is concerned. Assuming that the numbers of beats in the query and the music are detected correctly, the inconsistency is characterized by substitutions of tokens between the desired label and the tokenization results. If a token is substituted, then it presents a mismatch between the query and the intended music segment. To address this problem, the soft-indexing scheme uses the tokenizers as probabilistic machines that generate a posteriori probability for each of the chord and acoustic events. If the n-gram counting is thought of as integer counting, then the posteriori probabilities can be seen as soft-hits of the events. In one implementation, the soft-hits are formulated for both the chord and acoustic vectors, although it is possible to do this only for the chord vector. Thus, according to Bayes' rule, we have
where p(c_i) is the prior probability of the event. Assuming no prior knowledge about the events, p(c_i) can be dropped from Eq. (12), which is then simplified as
Let P(c_i|o_n) be denoted as p_n^i. It can be interpreted as the expected frequency of event c_i at the nth frame, with the following properties: (a) 0 ≤ p_n^i ≤ 1, (b) Σ_{i=1}^{48} p_n^i = 1. A frame is represented by a vector of continuous values as illustrated in
Assuming the music frames are independent of each other, the joint posteriori probability of two events i and j over two consecutive frames, the nth and the (n+1)th, can be estimated as

p_n^{i,j} = p_n^i × p_{n+1}^j   (14)
where p_n^{i,j} has similar properties to p_n^i: (a) 0 ≤ p_n^{i,j} ≤ 1, (b) Σ_{i=1}^{48} Σ_{j=1}^{48} p_n^{i,j} = 1. For a query of N frames, the expected frequencies of the unigram and bigram can be estimated as
E{f_N^i} = Σ_{n=1}^{N} p_n^i   (15)

E{f_N^{i,j}} = Σ_{n=1}^{N} p_n^{i,j}   (16)
Thus the soft-indexing vectors are obtained for the query, E{f_N(q)}, and for a music segment, E{f_N(d)}. Replacing f_N(q) with E{f_N(q)} and f_N(d) with E{f_N(d)} in Equation 12 and Equation 13, the same relevance scores can be used for soft-indexing ranking.
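A sketch of the soft-indexing computation of Equations 14 to 16, assuming the per-frame posteriors are available as a matrix; names are illustrative.

```python
import numpy as np

def soft_index_vectors(posteriors):
    """Soft-indexing sketch of Equations 14-16: posteriors is an (N, 48)
    matrix whose n-th row holds p_n^i = P(c_i | o_n) for the n-th frame.
    Returns the expected unigram counts E{f_N^i} and, assuming independent
    frames, the expected bigram counts E{f_N^{i,j}} = sum_n p_n^i * p_{n+1}^j."""
    P = np.asarray(posteriors, dtype=float)
    e_uni = P.sum(axis=0)                            # Eq. (15)
    e_bi = np.einsum('ni,nj->ij', P[:-1], P[1:])     # Eqs. (14) and (16)
    return e_uni, e_bi.ravel()

# Example: random posteriors for a 5-frame query over 48 chord events.
rng = np.random.default_rng(0)
P = rng.random((5, 48))
P /= P.sum(axis=1, keepdims=True)    # each row sums to 1, as required of p_n^i
e_uni, e_bi = soft_index_vectors(P)
print(e_uni.sum().round(3), e_bi.sum().round(3))   # -> 5.0 and 4.0 (N and N-1 frame pairs)
```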
This processing is carried out by vector comparison module 906 of
Although a two-dimensional coordinate is used for the bigram count, the vector can be treated as a one-dimensional array. The process of deriving unigram and bigram vectors for a music segment involves minimal computation. In practice, those vectors are computed at run-time directly from the chord/acoustic transcripts resulting from the tokenization. Note that the tokenization process evaluates a music frame against all the chord/acoustic models at a higher computational cost; this can be done off-line.
The MIR process evaluates the similarity between a query music segment and all the candidate music segments. For simplicity, f_N(q) denotes the chord unigram vector (48 dimensions) and f_N′(q) denotes the chord bigram vector (2,304 dimensions) for a query of N frames. Similarly, a chord unigram vector f_N(d) and a chord bigram vector f_N′(d) can be obtained from any segment of N frames in the music database.
The similarity between two n-gram vectors is determined from a comparison of the two unigram vectors and of the two bigram vectors respectively, as follows:
Ranking module 908 then ranks the stored music segments according to their relevance to the query music segment from the similarity comparison. This is done as a measure of the distance between the respective vectors. The relevance can be defined by the fusion of the unigram and bigram similarity scores. The fusion can be made, for example, as a simple addition of the unigram and bigram scores or, alternatively, as an average of them.
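Since the exact similarity measures (Equations 12 and 13) are not reproduced here, the following sketch uses cosine similarity as a stand-in and fuses the unigram and bigram scores by addition or averaging, as described above; the function names and database layout are illustrative.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, used here as a stand-in for the unigram/bigram
    similarity measures of Equations 12 and 13 (an assumption)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def rank_segments(query_uni, query_bi, db_segments, fusion='add'):
    """Rank stored segments by fused unigram + bigram similarity to the query.
    db_segments: list of (segment_id, unigram_vector, bigram_vector)."""
    scored = []
    for seg_id, seg_uni, seg_bi in db_segments:
        s_uni = cosine(query_uni, seg_uni)
        s_bi = cosine(query_bi, seg_bi)
        score = s_uni + s_bi if fusion == 'add' else 0.5 * (s_uni + s_bi)
        scored.append((score, seg_id))
    return sorted(scored, reverse=True)    # most relevant first

# Example with two toy database segments and a 48-dim / 2304-dim query.
rng = np.random.default_rng(0)
q_uni, q_bi = rng.random(48), rng.random(48 * 48)
db = [('song_A_seg_3', q_uni + 0.1 * rng.random(48), q_bi + 0.1 * rng.random(48 * 48)),
      ('song_B_seg_7', rng.random(48), rng.random(48 * 48))]
print(rank_segments(q_uni, q_bi, db)[0][1])   # -> 'song_A_seg_3'
```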
In a simulation, chord and acoustic modelling performance was studied first, followed by MIR experiments. The apparatus used in the simulation was apparatus 1600 of
A song database 1602 is processed by apparatus 1600. The rhythm extraction, beat segmentation and silence detection process of
Similar processes are carried out for a query music clip (segment) 1620 to derive a query music vector 1622.
Tokenization, indexing and vector comparison (distance calculation) are carried out by module 1616. The indexed music content database is illustrated at 1618. The n-gram relevance ranking is carried out at 1624 between the query music clip vector 1622 and the indexed music content databases 1618. A list of results of possible matches is returned at 1624. In one implementation, the most likely candidate for the query is returned as a single result.
A song database DB1 for the MIR experiments was established, extracted from original music CDs and digitized at a 44.1 kHz sampling rate with 16 bits per sample in mono channel format. The retrieval database comprises 300 songs by 20 artists as listed in Table 2, each artist on average contributing 15 songs. The tempos of the songs are in the range of 60-180 beats per minute.
Each of the 48 chord models is a two-layer representation of Gaussian mixtures as in
G-PCP_n(α) = Σ_{OC=1}^{7} PCP_n^{OC}(α)   (17)
It was noted that the proposed TLM with features extracted from BSS outperformed the SLM approach by 5% in absolute accuracy.
The performance of OSCCs and MFCCs for modelling regions PI and V was compared. The SVD analysis depicted in
Table 3 shows the correct region detection accuracies for an optimized number of both the filters and the coefficients of the MFCC and OSCC features. The correct detection accuracies for the PI region and V region are reported both when the frame size is equal to the beat space and when the frame size is fixed at 30 ms. Both OSCC and MFCC performed better when the frame size is the beat space. OSCC generally outperformed MFCC, and is therefore particularly useful for modelling acoustic events.
In DB1, 4 clips of 30-second music were selected as queries from each artist in the database, totaling 80 clips. Out of the 4 clips per artist, two clips belong to the V region and the other two belong mainly to the PI region. For a given query, the relevance score between a song and the query is defined as the sum of the similarity scores between the top K most similar indexing vectors and the query vector. Typically, K is set to 30.
After computing the smallest note length of the query, the tempo/rhythm clusters of the songs in the database are checked. For song relevance ranking, only those songs whose smallest note lengths are in the same range (with ±30 ms tolerance) as the smallest note length of the query, or as integer multiples of it, are considered. The surviving songs in DB1 are then ranked according to their respective relevance scores.
Table 4 shows the effect of chord events alone and the combined effect of chord and acoustic events in terms of retrieval accuracy.
The simulations show that the vector space modelling is effective in representing the layered music information, achieving 82.5% top-5 retrieval accuracy using 15-sec music clips as the queries.
It can be seen that soft-indexing outperforms hard-indexing (see Equations 8 and 9). In general, combining acoustic events and chord events yields better performance. This can be understood from the fact that similar chord patterns are likely to occur in different songs; the acoustic content helps differentiate one song from another.
Thus, in summary, the disclosed techniques provide a novel framework for MIR. The contributions of these techniques include:
It has been found that octave scale music information modelling followed by the inter-beat interval proportion segmentation is more efficient than known fixed length music segmentation techniques. In addition, the disclosed soft-indexing retrieval model may be more effective than the disclosed hard-indexing one, and may be able to index greater details of music information.
The fusion of chord model and acoustic model statistics improves retrieval accuracy effectively. Further, music information in different layers complements each other in achieving improved MIR performance. The robustness in this retrieval modelling framework depends on how well the information is captured.
Even though music retrieval is the prime application of this framework, the proposed vector space music modelling framework is useful for developing many other applications such as music summarization, streaming, music structure analysis, and the creation of multimedia documentaries using music semantics. Thus, the disclosed techniques have application in other relevant areas.
It will be appreciated that the invention has been described by way of example only and that various modifications may be made in detail without departing from the spirit and scope of the appended claims. It will also be appreciated that features presented in combination in one aspect of the invention may be freely combined in other aspects of the invention.