1. Technical Field of the Invention
The present invention relates to a technology for analyzing features of sound.
2. Description of the Related Art
A technology for analyzing features (for example, tone) of music has been proposed in the art. For example, Jouni Paulus and Anssi Klapuri, “Measuring the Similarity of Rhythmic Patterns”, Proc. ISMIR 2002, p. 150-156 describes a technology in which an audio signal is divided into unit periods (frames) having a predetermined time length and the time sequence of feature amounts extracted for the unit periods is compared between different pieces of music. The feature amount of each unit period includes, for example, Mel-Frequency Cepstral Coefficients (MFCCs) indicating tonal features of the audio signal. A DP matching (Dynamic Time Warping (DTW)) technology, which locates corresponding positions on the time axis (i.e., corresponding time-axis locations) in pieces of music, is employed to compare the feature amounts of the pieces of music.
However, since the feature amounts of all unit periods over the entire period of an audio signal are required to represent the overall features of the audio signal, the technology of Jouni Paulus and Anssi Klapuri, “Measuring the Similarity of Rhythmic Patterns”, Proc. ISMIR 2002, p. 150-156 has a problem in that the amount of data representing the feature amounts is large, especially when the time length of the audio signal is long. In addition, since the feature amount extracted in each unit period is set regardless of the time length or tempo of the music, an audio signal extension/contraction process such as the above-mentioned DP matching must be performed to compare the features of pieces of music, which causes a high processing load.
The invention has been made in view of these circumstances and it is an object of the invention to reduce processing load required to compare tones of audio signals representing pieces of music while reducing the amount of data required to analyze tones of audio signals.
In order to solve the above problems, an audio analysis apparatus according to the invention comprises: a component acquisition part that acquires a component matrix composed of an array of component values from an audio signal which is divided into a sequence of unit periods in a time-axis direction, columns of the component matrix corresponding to the sequence of unit periods of the audio signal and rows of the component matrix corresponding to a series of unit bands of the audio signal arranged in a frequency-axis direction, the component value representing a spectrum component of the audio signal belonging to the corresponding unit period and belonging to the corresponding unit band; a difference generation part that generates a plurality of shift matrices each obtained by shifting the columns of the component matrix in the time-axis direction with a different shift amount, and that generates a plurality of difference matrices each composed of an array of element values in correspondence to the plurality of the shift matrices, the element value representing a difference between the corresponding component value of the shift matrix and the corresponding component value of the component matrix; and a feature amount extraction part that generates a tonal feature amount including a plurality of series of feature values corresponding to the plurality of difference matrices, one series of feature values corresponding to the series of unit bands of the difference matrix, one feature value representing a sequence of element values arranged in the time-axis direction at the corresponding unit band of the difference matrix.
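Purely as an illustrative sketch (not the claimed implementation), the following NumPy fragment shows how a component matrix A of M unit bands by N unit periods could be turned into a tonal feature amount F with K + 1 feature value series. The absolute difference and the per-band average are assumptions where the description allows several choices, and the weighting stage described later is omitted here.

```python
# Illustrative sketch only; assumes A is an M x N component matrix (unit bands x unit periods).
import numpy as np

def tonal_feature(A: np.ndarray, K: int) -> np.ndarray:
    """Return a tonal feature amount F of shape (M, K + 1)."""
    M, N = A.shape
    F = np.empty((M, K + 1))
    for k in range(1, K + 1):
        B_k = np.roll(A, -k, axis=1)       # shift matrix B_k: columns moved k unit periods toward the front (cyclic)
        D_k = np.abs(B_k - A)              # difference matrix D_k (absolute difference is an assumption)
        F[:, k - 1] = D_k.mean(axis=1)     # feature value series E_k: one value per unit band
    F[:, K] = A.mean(axis=1)               # feature value series E_{K+1}: average tone per unit band
    return F
```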
In this configuration, the tendency of temporal change of the tone of the audio signal is represented by a plurality of feature value series. Accordingly, it is possible to reduce the amount of data required to estimate the tone of the audio signal, compared to the prior art configuration (for example, Jouni Paulus and Anssi Klapuri, “Measuring the Similarity of Rhythmic Patterns”, Proc. ISMIR 2002, p. 150-156) in which a feature amount is extracted for each unit period. In addition, since the number of the feature value series does not depend on the time length of the audio signal, it is possible to easily compare temporal changes of the tones of audio signals without requiring a process for matching the time axis of each audio signal even when the audio signals have different time lengths. Accordingly, there is an advantage in that load of processing required to compare tones of audio signals is reduced.
A typical example of the audio signal is a signal generated by receiving vocal sound or musical sound of a piece of music. The term “piece of music” or “music” refers to a time sequence of a plurality of sounds, no matter whether it is all or part of a piece of music created as a single work. Although the bandwidth of each unit band is arbitrary, each unit band may be set to a bandwidth corresponding to, for example, one octave.
In a preferred embodiment of the invention, the difference generation part comprises: a weight generation part that generates a sequence of weights from the component matrix in correspondence to the sequence of the unit periods, the weight corresponding to a series of component values arranged in the frequency axis direction at the corresponding unit period; a difference calculation part that generates each initial difference matrix composed of an array of difference values of component values between each shift matrix and the component matrix; and a correction part that generates each difference matrix by applying the sequence of the weights to each initial difference matrix.
In this embodiment, a difference matrix, in which the distribution of difference values arranged in the time-axis direction has been corrected based on the initial difference matrix by applying the weight sequence to the initial difference matrix, is generated. Accordingly, there is an advantage in that it is possible to, for example, generate a tonal feature amount in which the difference between the component matrix and the shift matrix is emphasized for each unit period having large component values of the component matrix (i.e., a tonal feature amount which emphasizes, especially, tones of unit periods, the strength of which is high in the audio signal).
In a preferred embodiment of the invention, the feature amount extraction part generates the tonal feature amount including a series of feature values derived from the component matrix in correspondence to the series of the unit bands, each feature value corresponding to a sequence of component values of the component matrix arranged in the time-axis direction at the corresponding unit band.
In this embodiment, the advantage of ease of estimation of the tone of the audio signal is especially significant since the tonal feature amount includes a feature value series derived from the component matrix, in which the average tonal tendency (frequency characteristic) over the entirety of the audio signal is reflected, in addition to a plurality of feature value series derived from the plurality of difference matrices in which the temporal change tendency of the tone of the audio signal is reflected.
The invention may also be specified as an audio analysis apparatus that compares tonal feature amounts generated respectively for audio signals in each of the above embodiments. An audio analysis apparatus that is preferable for comparing tones of audio signals comprises a storage part that stores a tonal feature amount for each of a first audio signal and a second audio signal; and a feature comparison part that calculates a similarity index value indicating tonal similarity between the first audio signal and the second audio signal by comparing the tonal feature amounts of the first audio signal and the second audio signal with each other, wherein the tonal feature amount is derived based on a component matrix of the audio signal which is divided into a sequence of unit periods in a time-axis direction and based on a plurality of shift matrices derived from the component matrix, the component matrix being composed of an array of component values, columns of the component matrix corresponding to the sequence of unit periods of the audio signal and rows of the component matrix corresponding to a series of unit bands of the audio signal arranged in a frequency-axis direction, the component value representing a spectrum component of the audio signal belonging to the corresponding unit period and belonging to the corresponding unit band, each shift matrix being obtained by shifting the columns of the component matrix in the time-axis direction with a different shift amount, and wherein the tonal feature amount includes a plurality of series of feature values corresponding to a plurality of difference matrices which are derived from the plurality of the shift matrices, each difference matrix being composed of an array of element values each representing a difference between the corresponding component value of each shift matrix and the corresponding component value of the component matrix, one series of feature values corresponding to the series of unit bands of the difference matrix, one feature value representing a sequence of element values arranged in the time-axis direction at the corresponding unit band of the difference matrix.
In this configuration, since the amount of data of the tonal feature amount is reduced by representing the tendency of temporal change of the tone of the audio signal by a plurality of feature value series, it is possible to reduce capacity required for the storage part, compared to the prior art configuration (for example, Jouni Paulus and Anssi Klapuri, “Measuring the Similarity of Rhythmic Patterns”, Proc. ISMIR 2002, p. 150-156) in which a feature amount is extracted for each unit period. In addition, since the number of the feature value series does not depend on the time length of the audio signal, it is possible to easily compare temporal changes of the tones of audio signals even when the audio signals have different time lengths. Accordingly, there is also an advantage in that load of processing associated with the feature comparison part is reduced.
The audio analysis apparatus according to each of the above embodiments may not only be implemented by hardware (electronic circuitry) such as a Digital Signal Processor (DSP) dedicated to analysis of audio signals but may also be implemented through cooperation of a general arithmetic processing unit such as a Central Processing Unit (CPU) with a program. The program according to the invention is executable by a computer to perform processes of: acquiring a component matrix composed of an array of component values from an audio signal which is divided into a sequence of unit periods in a time-axis direction, columns of the component matrix corresponding to the sequence of unit periods of the audio signal and rows of the component matrix corresponding to a series of unit bands of the audio signal arranged in a frequency-axis direction, the component value representing a spectrum component of the audio signal belonging to the corresponding unit period and belonging to the corresponding unit band; generating a plurality of shift matrices each obtained by shifting the columns of the component matrix in the time-axis direction with a different shift amount; generating a plurality of difference matrices each composed of an array of element values in correspondence to the plurality of the shift matrices, the element value representing a difference between the corresponding component value of the shift matrix and the corresponding component value of the component matrix; and generating a tonal feature amount including a plurality of series of feature values corresponding to the plurality of difference matrices, one series of feature values corresponding to the series of unit bands of the difference matrix, one feature value representing a sequence of element values arranged in the time-axis direction at the corresponding unit band of the difference matrix.
The program achieves the same operations and advantages as those of the audio analysis apparatus according to the invention. The program of the invention may be provided to a user through a computer readable storage medium storing the program and then installed on a computer and may also be provided from a server device to a user through distribution over a communication network and then installed on a computer.
The storage device 14 stores various data used by the arithmetic processing unit 12 and a program PGM executed by the arithmetic processing unit 12. Any known machine readable storage medium such as a semiconductor recording medium or a magnetic recording medium or a combination of various types of recording media may be employed as the storage device 14.
The arithmetic processing unit 12 implements a plurality of functions (including a signal analyzer 22, a display controller 24, and a feature comparator 26) required to analyze each audio signal X through execution of the program PGM stored in the storage device 14. The signal analyzer 22 generates a tonal feature amount F (F1, F2) representing the features of the tone color or timbre of the audio signal X. The display controller 24 displays the tonal feature amount F generated by the signal analyzer 22 as an image on the display device 16 (for example, a liquid crystal display). The feature comparator 26 compares the tonal feature amount F1 of the first audio signal X1 with the tonal feature amount F2 of the second audio signal X2. It is also possible to employ a configuration in which each function of the arithmetic processing unit 12 is implemented through a dedicated electronic circuit (DSP) or a configuration in which the functions of the arithmetic processing unit 12 are distributed over a plurality of integrated circuits.
The frequency analyzer 322 generates a frequency-domain spectrum PX for each of N unit periods (frames) σT[1] to σT[N] of a predetermined length into which the audio signal X is divided, where N is a natural number greater than 1.
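As a non-authoritative sketch, one common way to obtain a frequency-domain spectrum PX for each unit period is a windowed FFT per frame; the frame length and window below are illustrative assumptions, not values given in the text.

```python
# Hypothetical frame-wise spectrum computation; frame length and window are assumed values.
import numpy as np

def frame_spectra(x: np.ndarray, frame_len: int = 2048) -> np.ndarray:
    """Return magnitude spectra with one column per unit period (shape: bins x N)."""
    n_frames = len(x) // frame_len                           # N unit periods of equal length
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames * window, axis=1)).T   # rows: frequency bins, columns: unit periods
```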
The matrix generator 324 generates the component matrix A of M rows and N columns from the time sequence of the spectra PX generated by the frequency analyzer 322, where M is the number of unit bands σF[1] to σF[M] arranged in the frequency-axis direction. The component value a[m, n] at the mth row and nth column is set according to the component values x of the spectrum PX that belong to the unit band σF[m] within the unit period σT[n]; for example, the average (arithmetic average) of these component values x is calculated as the component value a[m, n]. In this embodiment, each unit band σF[m] is set to a bandwidth of one octave.
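The octave-band reduction described above could look like the following sketch; the lowest band edge f0, the number of bands M, and the use of the arithmetic average are assumptions consistent with, but not dictated by, the description.

```python
# Hypothetical aggregation of frame spectra into the component matrix A (M unit bands x N unit periods).
import numpy as np

def component_matrix(P: np.ndarray, sample_rate: float, f0: float = 27.5, M: int = 8) -> np.ndarray:
    """P: magnitude spectra (bins x N). Returns A where a[m, n] averages the components x in octave band m."""
    n_bins, N = P.shape
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)      # bin center frequencies (rfft layout assumed)
    edges = f0 * 2.0 ** np.arange(M + 1)                     # octave band edges: f0, 2*f0, 4*f0, ...
    A = np.zeros((M, N))
    for m in range(M):
        in_band = (freqs >= edges[m]) & (freqs < edges[m + 1])
        if in_band.any():
            A[m] = P[in_band].mean(axis=0)                   # arithmetic average within the unit band
    return A
```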
The difference generator 34 generates K different difference matrices D1 to DK from the component matrix A, where K is a natural number greater than 1.
The shift matrix generator 42 generates K shift matrices B1 to BK by shifting the columns of the component matrix A in the time-axis direction with mutually different shift amounts kΔ (k = 1 to K).
The unit Δ of the shift amount kΔ is set to a time length corresponding to one unit period σT[n]. That is, the shift matrix Bk is a matrix obtained by shifting each component value a[m, n] of the component matrix A by k unit periods σT[n] to the front side of the time-axis direction (i.e., backward in time). Here, the component values a[m, n] of a number of columns corresponding to the shift amount kΔ at the front edge of the component matrix A are moved to the rear edge of the shift matrix Bk.
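A minimal sketch of the cyclic shift just described, in which the leading columns corresponding to the shift amount kΔ are wrapped around to the rear edge:

```python
# Shift matrix B_k: columns of A moved k unit periods toward the front of the time axis, wrapping cyclically.
import numpy as np

def shift_matrix(A: np.ndarray, k: int) -> np.ndarray:
    return np.concatenate([A[:, k:], A[:, :k]], axis=1)
```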
The difference calculator 44 generates, for each shift matrix Bk, an initial difference matrix Ck composed of an array of difference values ck[m, n], each representing the difference between the component value of the shift matrix Bk and the corresponding component value a[m, n] of the component matrix A.
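A sketch of this element-wise difference; the absolute value is an assumption, since the description only requires each value to represent the difference between corresponding components.

```python
# Initial difference matrix C_k between the component matrix A and the shift matrix B_k (both M x N).
import numpy as np

def initial_difference(A: np.ndarray, B_k: np.ndarray) -> np.ndarray:
    return np.abs(B_k - A)
```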
The weight generator 46 generates a weight sequence W of N weights corresponding to the N unit periods σT[1] to σT[N] from the component matrix A, each weight being set according to the series of M component values a[1, n] to a[M, n] arranged in the frequency-axis direction at the corresponding unit period σT[n].
The corrector 48 generates each difference matrix Dk by applying the weight sequence W generated by the weight generator 46 to the initial difference matrix Ck generated by the difference calculator 44.
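A hedged sketch of the weighting stage: the weight for each unit period is derived from that period's M component values (the mean is an illustrative choice), and each column of the initial difference matrix is scaled by it, so unit periods with strong components are emphasized in the resulting difference matrix Dk.

```python
# Weight sequence W (one weight per unit period) and column-wise correction of C_k into D_k.
import numpy as np

def weight_sequence(A: np.ndarray) -> np.ndarray:
    return A.mean(axis=0)                 # assumed: weight grows with the component values of the period

def corrected_difference(C_k: np.ndarray, W: np.ndarray) -> np.ndarray:
    return C_k * W[np.newaxis, :]         # apply the weight of unit period n to every element of column n
```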
The feature amount extractor 36 generates a tonal feature amount F including K feature value series E1 to EK corresponding to the K difference matrices D1 to DK and one feature value series EK+1 derived from the component matrix A.
The feature value series EK+1 located at the (K+1)th column of the tonal feature amount F is a sequence of M feature values eK+1[1] to eK+1[M] corresponding to different unit bands σF[m]. The feature value eK+1[m] is set according to the N component values a[m, 1] to a[m, N] corresponding to the unit band σF[m] among the component values of the component matrix A generated by the component acquirer 32. For example, the sum or average of the N component values a[m, 1] to a[m, N] is calculated as the feature value eK+1[m]. Accordingly, the feature value eK+1[m] increases as the strength of the components of the unit band σF[m] over the entire period of the audio signal X increases. That is, the feature value eK+1[m] serves as a feature amount representing an average tone (average frequency characteristics) of the audio signal X over the entire period of the audio signal X.
Each feature value series Ek (E1 to EK) is a sequence of M feature values ek[1] to ek[M] corresponding to different unit bands σF[m]. The mth feature value ek[m] of the feature value series Ek is set according to the N element values dk[m, 1] to dk[m, N] corresponding to the unit band σF[m] among the element values of the difference matrix Dk. For example, the sum or average of the N element values dk[m, 1] to dk[m, N] is calculated as the feature value ek[m]. As can be understood from the above description, the feature value ek[m] is set to a greater value as the strength of the components in the unit band σF[m] of the audio signal X changes more significantly, from each of the unit periods σT[1] to σT[N], over a span corresponding to the shift amount kΔ from the unit period σT[n]. Accordingly, in the case where the K feature values e1[m] to eK[m] (arranged in the horizontal direction) corresponding to each unit band σF[m] in the tonal feature amount F include many large feature values ek[m], it is estimated that the components of the unit band σF[m] of the audio signal X are components of sound whose strength rapidly changes in a short time. On the other hand, in the case where the K feature values e1[m] to eK[m] corresponding to each unit band σF[m] include many small feature values ek[m], it is estimated that the components of the unit band σF[m] of the audio signal X are components of sound whose strength does not greatly change over a long time (or that the components of the unit band σF[m] are not generated). That is, the K feature value series E1 to EK included in the tonal feature amount F serve as a feature amount indicating temporal changes of the components of each unit band σF[m] of the audio signal X (i.e., temporal changes of the tone of the audio signal X).
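Putting the two kinds of series together, a sketch of the feature amount extractor 36 could look as follows; the average over the time axis is used below, the sum being the other option named in the text.

```python
# Tonal feature amount F: K series from the difference matrices plus one series from the component matrix.
import numpy as np

def extract_feature(A: np.ndarray, difference_matrices: list[np.ndarray]) -> np.ndarray:
    """A: M x N component matrix; difference_matrices: [D_1, ..., D_K]. Returns F of shape (M, K + 1)."""
    series = [D_k.mean(axis=1) for D_k in difference_matrices]   # E_1 ... E_K, one value per unit band
    series.append(A.mean(axis=1))                                # E_{K+1}: average tone over all unit periods
    return np.stack(series, axis=1)
```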
The configuration and operation of the signal analyzer 22 are as described above.
The display controller 24 displays tone images G (G1, G2), which visually represent the tonal feature amounts F (F1, F2) generated by the signal analyzer 22, on the display device 16.
The tone image G is an arrangement of unit figures u(m, κ) in M rows and (K+1) columns, the unit figure u(m, κ) corresponding to the feature value eκ[m] of the tonal feature amount F (i.e., to the mth unit band σF[m] in the κth feature value series Eκ).
Specifically, the user can easily identify the tendency of the average tone (frequency characteristics) of the audio signal X over the entire period of the audio signal X from the M unit figures u(1, K+1) to u(M, K+1) (the feature value series EK+1) of the (K+1)th column among the unit figures of the tone image G. The user can also easily identify the tendency of temporal changes of the components of each unit band σF[m] (i.e., each octave) of the audio signal X from the unit figures u(m, k) of the 1st to Kth columns among the unit figures of the tone image G. In addition, the user can easily compare the tone of the audio signal X1 and the tone of the audio signal X2 since the number M of rows and the number (K+1) of columns of the unit figures u(m, κ) are common to the tone image G1 and the tone image G2 regardless of the time length of each audio signal X.
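Purely as an illustration of such a display (the plotting library and the colour mapping are assumptions, not part of the embodiment), a tone image with M rows and (K+1) columns of unit figures could be rendered as follows:

```python
# Hypothetical rendering of a tone image G: one cell per feature value e_kappa[m].
import matplotlib.pyplot as plt

def show_tone_image(F, title: str = "Tone image G") -> None:
    fig, ax = plt.subplots()
    ax.imshow(F, aspect="auto", origin="lower", cmap="viridis")   # rows: unit bands, columns: E_1 ... E_{K+1}
    ax.set_xlabel("feature value series (E_1 ... E_{K+1})")
    ax.set_ylabel("unit band (octave)")
    ax.set_title(title)
    plt.show()
```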
The feature comparator 26 calculates a similarity index value Q indicating the tonal similarity between the audio signal X1 and the audio signal X2 by comparing the tonal feature amount F1 and the tonal feature amount F2 with each other.
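The embodiment does not fix how the similarity index value Q is computed, so the cosine measure below is purely an assumption; the point it illustrates is that, because F1 and F2 always share the same M x (K+1) shape, they can be compared directly without any time alignment.

```python
# Hypothetical similarity index Q between two tonal feature amounts of identical shape.
import numpy as np

def similarity_index(F1: np.ndarray, F2: np.ndarray) -> float:
    v1, v2 = F1.ravel(), F2.ravel()
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12))
```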
In the above embodiment, the tendency of the average tone of the audio signal X over the entire period of the audio signal X is represented by the feature value series EK+1, and the tendency of temporal changes of the tone of the audio signal X over the entire period of the audio signal X is represented by the K feature value series E1 to EK corresponding to the number of shift matrices Bk (i.e., the number of shift amounts kΔ). Accordingly, it is possible to reduce the amount of data required to estimate the tone color or timbre of a piece of music, compared to the prior art configuration (for example, Jouni Paulus and Anssi Klapuri, “Measuring the Similarity of Rhythmic Patterns”, Proc. ISMIR 2002, p. 150-156) in which a feature amount such as an MFCC is extracted for each unit period σT[n]. In addition, since the feature values eκ[m] of the tonal feature amount F are calculated using the unit bands σF[m], each including a plurality of component values x, as frequency-axis units, the amount of data of the tonal feature amount F is reduced, for example, compared to a configuration in which a feature value is calculated for each frequency corresponding to each component value x. There is also an advantage in that the user can easily identify the frequency range corresponding to each of the feature values eκ[1] to eκ[M] of the tonal feature amount F since each unit band σF[m] is set to a bandwidth of one octave.
Further, since the number K of the feature value series E1 to EK representing the temporal change of the tone of the audio signal X does not depend on the time length of the audio signal X, the user can easily estimate the tonal similarity between the tone of the audio signal X1 and the tone of the audio signal X2 by comparing the tone image G1 and the tone image G2 even when the time lengths of the audio signal X1 and the audio signal X2 are different. Furthermore, in principle, the process for locating corresponding time points between the audio signal X1 and the audio signal X2 (for example, DP matching required in the technology of Jouni Paulus and Anssi Klapuri, “Measuring the Similarity of Rhythmic Patterns”, Proc. ISMIR 2002, p. 150-156) is unnecessary since the number M of rows and the number (K+1) of columns of the tonal feature amount F do not depend on the audio signal X. Therefore, there is also an advantage in that load of processing for comparing the tones of the audio signal X1 and the audio signal X2 (i.e., load of the feature comparator 26) is reduced.
<Modifications>
Various modifications can be made to each of the above embodiments. The following are specific examples of such modifications. Two or more modifications selected from the following examples may be combined as appropriate.
(1) Modification 1
The method of calculating the component value a[m, n] of each unit band σF[m] is not limited to the above method in which an average (arithmetic average) of a plurality of component values x in the unit band σF[m] is calculated as the component value a[m, n]. For example, it is possible to employ a configuration in which the weighted sum, the sum, or the middle value (median) of the plurality of component values x in the unit band σF[m] is calculated as the component value a[m, n], or a configuration in which each component value x is directly used as the component value a[m, n] of the component matrix A. In addition, the bandwidth of the unit band σF[m] may be arbitrarily selected without being limited to one octave. For example, it is possible to employ a configuration in which each unit band σF[m] is set to a bandwidth corresponding to a multiple of one octave or to a bandwidth obtained by dividing one octave by an integer.
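A small sketch of the alternatives named above for the component value a[m, n]; which statistic to use is a design choice, and the function name below is illustrative.

```python
# Alternative reductions of the component values x inside one unit band to a single a[m, n].
import numpy as np

def band_value(x_in_band: np.ndarray, mode: str = "mean") -> float:
    if mode == "sum":
        return float(x_in_band.sum())
    if mode == "median":                      # the "middle value" option
        return float(np.median(x_in_band))
    return float(x_in_band.mean())            # default: arithmetic average, as in the embodiment
```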
(2) Modification 2
Although the initial difference matrix Ck is corrected to the difference matrix Dk using the weight sequence W in the above embodiment, it is possible to omit the correction using the weight sequence W. For example, it is possible to employ a configuration in which the feature amount extractor 36 generates the tonal feature amount F directly from the initial difference matrix Ck calculated by the difference calculator 44, in place of the difference matrix Dk.
(3) Modification 3
Although the tonal feature amount F including the K feature value series E1 to EK generated from difference matrices Dk and the feature value series EK+1 corresponding to the component matrix A is generated in the above embodiment, the feature value series EK+1 may be omitted from the tonal feature amount F.
(4) Modification 4
Although each shift matrix Bk is generated by shifting the component values a[m, n] at the front edge of the component matrix A to the rear edge in the above embodiment, the method of generating the shift matrix Bk by the shift matrix generator 42 may be modified as appropriate. For example, it is possible to employ a configuration in which a shift matrix Bk of M rows and (N−kΔ) columns is generated by eliminating a number of columns corresponding to the shift amount kΔ at the front side of the component matrix A from among the columns of the component matrix A. In this case, the difference calculator 44 generates an initial difference matrix Ck of M rows and (N−kΔ) columns by calculating difference values ck[m, n] between the component values a[m, n] of the component matrix A and the corresponding component values of the shift matrix Bk only for an overlapping portion of the component matrix A and the shift matrix Bk. Although each component value a[m, n] of the component matrix A is shifted to the front side of the time axis in the above example, it is also possible to employ a configuration in which the shift matrix Bk is generated by shifting each component value a[m, n] to the rear side of the time axis (i.e., forward in time).
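A sketch of this non-cyclic variant under the same assumptions as before (absolute difference): the leading columns are dropped instead of wrapped, and the difference is taken only over the overlapping portion.

```python
# Truncated variant: B_k of M x (N - k) columns and the matching overlap of A (k measured in unit periods).
import numpy as np

def truncated_difference(A: np.ndarray, k: int) -> np.ndarray:
    """Initial difference matrix C_k over the overlapping M x (N - k) portion only."""
    B_k = A[:, k:]                            # shift matrix without the leading k columns
    return np.abs(B_k - A[:, : A.shape[1] - k])
```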
(5) Modification 5
Although the frequency analyzer 322 of the component acquirer 32 generates the spectrum PX from the audio signal X while the matrix generator 324 generates the component matrix A from the time sequence of the spectra PX in the above embodiment, the component acquirer 32 may acquire the component matrix A using any other method. For example, it is possible to employ a configuration in which the component matrix A of the audio signal X is stored in the storage device 14 in advance (such that storage of the audio signal X may be omitted) and the component acquirer 32 acquires the component matrix A from the storage device 14. It is also possible to employ a configuration in which a time sequence of each spectrum PX of the audio signal X is stored in the storage device 14 in advance (such that storage of the audio signal X or the frequency analyzer 322 may be omitted) and the component acquirer 32 (the matrix generator 324) generates the component matrix A from the spectra PX in the storage device 14. That is, the component acquirer 32 may be any element for acquiring the component matrix A.
(6) Modification 6
Although the audio analysis apparatus 100 includes both the signal analyzer 22 and the feature comparator 26 in the above example, the invention may also be realized as an audio analysis apparatus including only one of the signal analyzer 22 and the feature comparator 26. That is, an audio analysis apparatus used to analyze the tone of the audio signal X (i.e., used to extract the tonal feature amount F) (hereinafter referred to as a “feature extraction apparatus”) may have a configuration in which the signal analyzer 22 is provided while the feature comparator 26 is omitted. On the other hand, an audio analysis apparatus used to compare the tones of the audio signal X1 and the audio signal X2 (i.e., used to calculate the similarity index value Q) (hereinafter referred to as a “feature comparison apparatus”) may have a configuration in which the feature comparator 26 is provided while the signal analyzer 22 is omitted. The tonal feature amounts F (F1, F2) generated by the signal analyzer 22 of the feature extraction apparatus are provided to the feature comparison apparatus through, for example, a communication network or a portable recording medium and are then stored in the storage device 14. The feature comparator 26 of the feature comparison apparatus calculates the similarity index value Q by comparing the tonal feature amount F1 and the tonal feature amount F2 stored in the storage device 14.
Other Publications

Paulus, J. and Klapuri, A. (2002). "Measuring the Similarity of Rhythmic Patterns," Proc. ISMIR 2002, pp. 150-156.

Tsunoo, E. et al. (Mar. 2008). "Rhythmic Features Extraction from Music Acoustic Signals using Harmonic/Non-Harmonic Sound Separation," 2008 Spring Meeting of the Acoustic Society of Japan, pp. 905-906 (with English translation).

U.S. Appl. No. 13/081,337, filed Apr. 6, 2011, by Arimoto et al.

Bello, J.P. (Oct. 2009). "Grouping Recorded Music by Structural Similarity," ISMIR Conference, Kobe, JP, Oct. 26-30, 2009.

European Search Report mailed Sep. 2, 2011, for EP Application No. 11161259.4.