The present invention relates to a technology for analyzing a sound signal of a musical piece.
Various technologies for analyzing a sound signal of a musical piece have been proposed. For example, Japanese Laid-Open Patent Application No. 2015-79110 (hereinafter referred to as Patent Document 1) describes a technology for analyzing a genre or a style of a musical piece using nonnegative matrix factorization (NMF).
A sound signal processing method in accordance with some embodiments includes acquiring a beat number per unit time period from an input sound signal, executing a normalization process for normalizing the input sound signal with the beat number per unit time period, calculating a beat spectrum of the normalized input sound signal, and calculating a rhythm similarity between the beat spectrum of the normalized input sound signal and a normalized beat spectrum calculated from a reference sound signal.
A sound signal processing apparatus in accordance with some embodiments includes an information processing apparatus having an acquisition unit, a beat number acquisition unit, a normalization unit, a beat spectrum calculation unit and a rhythm similarity calculation unit; the acquisition unit being configured to acquire an input sound signal; the beat number acquisition unit being configured to acquire a beat number per unit time period from the input sound signal; the normalization unit being configured to normalize the input sound signal with the beat number per unit time period; the beat spectrum calculation unit being configured to calculate a beat spectrum of the normalized input sound signal; and the rhythm similarity calculation unit being configured to calculate a rhythm similarity between the beat spectrum of the normalized input sound signal and a normalized beat spectrum calculated from a reference sound signal.
In conventional systems, there is a possibility that analyzing a rhythm pattern using NMF fails to capture a detailed rhythm pattern. In view of the above circumstances, it is an object of some embodiments to analyze a detailed rhythm pattern.
Selected embodiments will now be explained with reference to the drawings. It will be apparent to those skilled in the sound field from this disclosure that the following descriptions of the embodiments are provided for illustration only and not for the purpose of limiting the invention as defined by the appended claims and their equivalents.
1. Configuration
In this example, the musical piece search system 1 includes a digital musical instrument 10 and an information processing apparatus 20. The digital musical instrument 10 is an example of a musical piece storage apparatus that stores musical piece data that become a search target. The information processing apparatus 20 is an example of a user terminal that provides a user interface. The musical piece data stored in the digital musical instrument 10 are data of musical pieces for accompaniment (such data are hereinafter referred to as “accompaniment data,” and sound of a musical piece for accompaniment is referred to as “accompaniment sound”). A user inputs, to the information processing apparatus 20, information of a musical piece that the user intends to play. The information of a musical piece is, for example, a sound signal of the musical piece based on sound data of a non-compressed or compressed format (wav, mp3, or the like), but it is not limited to these. Further, the information of a musical piece may be stored in advance in a storage 203 of the information processing apparatus 20 hereinafter described or may be input from the outside of the information processing apparatus 20. The information processing apparatus 20 searches the accompaniment data stored in the digital musical instrument 10 for accompaniment data similar to the input musical piece. If accompaniment sound similar to the input musical piece is found, the information processing apparatus 20 instructs the digital musical instrument 10 to reproduce the accompaniment sound. The digital musical instrument 10 reproduces the instructed accompaniment sound. The user then plays the digital musical instrument 10 in accordance with the reproduced accompaniment.
The acquisition unit 11 acquires an input sound signal. The specification unit 12 specifies a target section that becomes a target of later processing from within the input sound signal. The database 14 has stored therein information regarding a plurality of accompaniment data. The first similarity calculation unit 13 calculates, within the target section of the input sound signal, a similarity between the input sound and the accompaniment sound using nonnegative matrix factorization (NMF). The second similarity calculation unit 15 calculates a similarity between the input sound and the accompaniment sound using a beat spectrum within the target section of the input sound signal. The integration unit 16 integrates the similarity calculated by the first similarity calculation unit 13 and the similarity calculated by the second similarity calculation unit 15. The selection unit 17 selects a musical piece similar to the input sound from within the database 14 on the basis of the integrated similarity. The outputting unit 18 outputs the selected musical piece.
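Purely as an illustration of how these units interact, the flow can be pictured with the following Python sketch. All function names, the database layout, and the passing of the units as callables are hypothetical and are not taken from the embodiment.

```python
# Illustration only: hypothetical names standing in for the units described above.
def search_similar_piece(input_signal, database,
                         specify_sections, nmf_similarity,
                         beat_similarity, integrate):
    """Select the accompaniment data most similar to the input sound signal."""
    sections = specify_sections(input_signal)                    # specification unit 12
    best_piece, best_score = None, float("-inf")
    for piece in database:                                       # database 14
        s_nmf = nmf_similarity(input_signal, sections, piece)    # first similarity calculation unit 13
        s_beat = beat_similarity(input_signal, sections, piece)  # second similarity calculation unit 15
        score = integrate(s_nmf, s_beat)                         # integration unit 16
        if score > best_score:                                   # selection unit 17
            best_piece, best_score = piece, score
    return best_piece                                            # handed to the outputting unit 18
```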
In this example, from among the functions of the musical piece search system 1 depicted in
In the information processing apparatus 20, a program for causing a computer apparatus to function as a user terminal in the musical piece search system 1 is stored in the storage 203. By the CPU 201 executing this program, the functions as the acquisition unit 11, the specification unit 12, the first similarity calculation unit 13, the database 14, the second similarity calculation unit 15, the integration unit 16, and the selection unit 17 are incorporated in the information processing apparatus 20. The CPU 201 that executes this program is an example of the acquisition unit 11, the specification unit 12, the first similarity calculation unit 13, the second similarity calculation unit 15, the integration unit 16, and the selection unit 17. The storage 203 is an example of the database 14. Further, in the digital musical instrument 10, the outputting unit 104 is an example of the outputting unit 18.
2. Operation
2-1. Overview
2-2. Target Section Specification Process
The calculation of a similarity at steps S3 and S4 may be performed for the entire input sound signal. However, if the entire input sound signal is made a target, the following problems arise. First, the calculation takes a correspondingly long time. Second, the input sound signal sometimes includes, in its so-called intro or outro (ending), a portion that includes no rhythm, and if the similarity is calculated including such a portion, the reliability of the similarity degrades. In the present embodiment, in order to cope with these problems, the portion of the input sound signal that is to be made a target of similarity calculation is restricted to part of the input sound signal.
At step S212, the specification unit 12 calculates a feature amount of a tone color (hereinafter referred to as “tone color feature amount”) from the input sound signal. As the tone color feature amount, for example, a predetermined number of (for example, 12) mel-frequency cepstral coefficients (MFCCs) are used. The MFCCs are calculated for each unit section defined at step S211.
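A minimal sketch of how such a tone color feature amount might be computed, assuming the librosa library is used; the library choice, the file path, the parameter values, and the per-section averaging are assumptions rather than details of the embodiment.

```python
import numpy as np
import librosa

y, sr = librosa.load("input_piece.wav", sr=22050, mono=True)   # file path is illustrative

# Twelve mel-frequency cepstral coefficients per short-time frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)             # shape: (12, number_of_frames)

# One way to obtain a tone color feature amount per unit section: average the frames
# belonging to each section. The frame boundaries below are hypothetical.
section_frames = [(0, 100), (100, 200)]
tone_color_features = [mfcc[:, start:end].mean(axis=1) for start, end in section_frames]
```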
At step S213, the specification unit 12 calculates a feature amount of a chord (hereinafter referred to as “chord feature amount”) from the input sound signal. The chord feature amount is calculated for each of frames (periods corresponding, for example, to an eighth note or a sixteenth note) into which a unit section is subdivided on the basis of the beat points. As the chord feature amount, for example, a so-called chroma vector is used. The chroma vector is obtained by separating the energy obtained by spectrum analysis, for example, for each semitone and adding together the energy that falls on the same semitone across octaves, so that all energy is folded into one octave. Since one octave separated by semitones yields twelve pitch classes, the chroma vector is a 12-dimensional vector. The chroma vectors calculated for the individual frames represent a temporal change of the chord, namely, a chord progression.
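A corresponding sketch for the chroma vector, again assuming librosa; the file path and parameters are illustrative.

```python
import librosa

y, sr = librosa.load("input_piece.wav", sr=22050, mono=True)   # file path is illustrative

# 12-dimensional chroma vector per frame: spectral energy separated per semitone and
# folded into a single octave, one bin per pitch class.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)               # shape: (12, number_of_frames)

# The sequence of chroma vectors over time approximates the chord progression.
print(chroma.shape)
```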
At step S214, the specification unit 12 estimates a musical piece structure of the input sound by posterior distribution estimation using a probability model. In particular, with respect to a probability model that describes the probability that a time series of feature amounts is observed under a certain musical piece structure, the specification unit 12 estimates the probability distribution (posterior distribution) of the musical piece structure when the time series of the tone color feature amount and the chord feature amount are observed.
As the probability model, for example, a musical piece structure model, a tone color observation model, and a chord observation model are used. The musical piece structure model is a model that probabilistically describes a musical piece structure. The tone color observation model is a model that probabilistically describes a generation process of a tone color feature amount. The chord observation model is a model that probabilistically describes a generation process of a chord feature amount. In the probability models, the unit sections are grouped such that unit sections that are similar or common in musical structure belong to the same structure section. The groups are identified by section codes (for example, A, B, C, ...).
The musical piece structure model is a state transition model in which, for example, a plurality of states linked to each other are arrayed in a state space, more particularly, a hidden Markov model. The tone color observation model is a probability model that follows, for example, an infinite Gaussian mixture distribution in which a normal distribution is used as the component distribution, and that depends upon the section code but does not depend upon the duration in the structure section. The chord observation model is a probability model that follows, for example, an infinite Gaussian mixture distribution in which a normal distribution is used as the component distribution, and that depends upon both the section code and the duration in the structure section. The posterior distribution in each probability model is estimated by an iterative estimation algorithm such as, for example, a variational Bayes method. The specification unit 12 estimates a musical piece structure that maximizes the posterior distribution.
At step S215, the specification unit 12 specifies the musical piece structure on the basis of a result of the estimation at step S214.
It is to be noted that the criterion for allocating a priority is not limited to that described above. Some other criterion may be used in place of or in addition to the example described above. As an example, a criterion is used by which, for example, a high priority is given to a unit section having a comparatively long time length while a low priority is given to a unit section having a comparatively short time length. In other words, at step S23 in this different example, selection of a section that is to be made a target of calculation of the rhythm similarity is performed in the descending order of the time length from among a plurality of unit sections. Although the time length in the example of
At step S232, the specification unit 12 adds, to the target sections, the section having the highest priority from among the sections that have not yet been selected as a target section (such a section is hereinafter referred to as a “non-selected section”). In the case where a plurality of sections have the same highest priority, the specification unit 12 selects one of them in accordance with a different criterion, for example, the section having the earliest section number, and adds it to the target sections.
At step S233, the specification unit 12 decides whether the cumulative time length of the target sections exceeds a threshold value. As the threshold value, for example, a predetermined ratio to the overall time length of the input sound signal, as an example, 50%, is used. In the case where it is decided that the cumulative time length of the target sections does not exceed the threshold value (S233: NO), the specification unit 12 advances its processing to step S232. In the case where it is decided that the cumulative time length of the target sections exceeds the threshold value (S233: YES), the specification unit 12 ends the flow of
In the example of
According to this example, the target of later processing can be restricted to a portion selected on the basis of the musical structure of the input sound signal, for example, a section that appears repetitively. Such a section is frequently a portion having a musically high impact, like a so-called chorus or melody A. By excluding from the target of processing a portion that may be different in rhythm or tone color from the other portions, like the intro or the outro, the load of processing can be reduced while the accuracy of the search is maintained.
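The target section selection of steps S232 and S233 described above can be pictured with the following sketch. The data layout of the sections and the priority values are assumptions; the 50% threshold and the tie-breaking by the earliest section follow the description above.

```python
def select_target_sections(sections, total_length, threshold_ratio=0.5):
    """Greedy target section selection corresponding to steps S232 and S233.

    `sections` is assumed to be a list of dicts such as
    {"index": 0, "priority": 3, "length": 12.5}; this layout is illustrative.
    """
    ordered = sorted(sections, key=lambda s: (-s["priority"], s["index"]))  # ties: earliest section first
    target, cumulative = [], 0.0
    for section in ordered:                  # step S232: add the highest-priority non-selected section
        target.append(section)
        cumulative += section["length"]
        if cumulative > threshold_ratio * total_length:   # step S233: e.g. 50% of the overall length
            break
    return target

# Example with hypothetical sections of an input sound signal 100 seconds long.
sections = [{"index": 0, "priority": 1, "length": 10.0},
            {"index": 1, "priority": 3, "length": 30.0},
            {"index": 2, "priority": 3, "length": 25.0},
            {"index": 3, "priority": 2, "length": 35.0}]
print([s["index"] for s in select_target_sections(sections, total_length=100.0)])  # [1, 2]
```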
2-3. Similarity Calculation by NMF
Now, the similarity calculation by NMF at step S3 is described. Before details of the similarity calculation are described, an overview of NMF is given first. NMF is a low-rank approximation algorithm that decomposes a nonnegative matrix into the product of two nonnegative matrices. A nonnegative matrix is a matrix whose components are all nonnegative (namely, zero or positive). Generally, NMF is represented by the following expression (1):
[Expression 1]
Y≈HU (1)
where Y indicates a given matrix, namely, an observation matrix (m rows, n columns). H is called the basis matrix (m rows, k columns), and U is called the activation (or coefficient) matrix (k rows, n columns). In other words, the NMF is a process for approximating the observation matrix Y with the product of the basis matrix H and the activation matrix U.
In order to apply the NMF to similarity calculation of a musical piece, a matrix representative of an amplitude spectrogram of a sound signal is used as the observation matrix Y. The amplitude spectrogram represents a time variation of the frequency spectrum of a sound signal and is three-dimensional information including time, frequency, and amplitude. The amplitude spectrogram is obtained, for example, by sampling a sound signal in the time domain and taking absolute values of the complex spectrogram obtained by short-time Fourier transforming the samples. Here, if the axis of abscissa is divided into n sections and the axis of ordinate is divided into m sections and the amplitude in each of the regions obtained by the division is digitized, then the amplitude spectrogram can be represented as a matrix. This matrix includes temporal information in the row direction and frequency information in the column direction, and the value of each component represents an amplitude. Since the value of the amplitude is nonnegative, this matrix is a nonnegative matrix.
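A minimal sketch of obtaining such an amplitude spectrogram, assuming librosa for the short-time Fourier transform; the file path and parameter values are illustrative.

```python
import numpy as np
import librosa

y, sr = librosa.load("input_piece.wav", sr=22050, mono=True)   # file path is illustrative

# Complex spectrogram by the short-time Fourier transform, then absolute values.
complex_spec = librosa.stft(y, n_fft=2048, hop_length=512)     # parameters are illustrative
Y = np.abs(complex_spec)    # amplitude spectrogram: m frequency bins x n time frames

print(Y.shape)              # (m, n); every component is nonnegative
```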
The NMF is used to calculate the basis matrix H and the activation matrix U when the observation matrix Y is known. In particular, the NMF is defined as a problem of minimizing a distance D between the matrix Y and the matrix product HU, as given by the following expression (2). As the distance D, for example, a Euclidean distance, a generalized KL distance, an Itakura-Saito distance, or a β divergence is used. Although a solution to the expression (2) cannot be obtained in a closed form, several effective iterative solutions are known (for example, Lee, D. D., & Seung, H. S. (2001), Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems, 13(1), V621-V624).
It is to be noted that the expression above signifies calculating the matrices H and U that minimize the distance D. This applies similarly to the expressions given hereinbelow.
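As one concrete instance of such an iterative solution, the following sketch applies the multiplicative update rules of Lee and Seung for the Euclidean distance; the rank k, the iteration count, and the random initialization are illustrative choices rather than values taken from the embodiment.

```python
import numpy as np

def nmf(Y, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate Y (m x n, nonnegative) as H @ U, with H (m x k) and U (k x n),
    using multiplicative updates for the Euclidean distance."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    H = rng.random((m, k)) + eps
    U = rng.random((k, n)) + eps
    for _ in range(n_iter):
        U *= (H.T @ Y) / (H.T @ H @ U + eps)   # update the activation matrix
        H *= (Y @ U.T) / (H @ U @ U.T + eps)   # update the basis matrix
    return H, U

# Example: factorize a small nonnegative matrix standing in for an amplitude spectrogram.
Y = np.random.default_rng(1).random((64, 128))
H, U = nmf(Y, k=8)
print(np.linalg.norm(Y - H @ U))   # reconstruction error after the iterations
```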
It is to be noted that, in the case where the musical instruments included in the input sound and the accompaniment sound are known to some extent in advance, namely, in the case where the candidates for musical instruments included in the input sound and the accompaniment sound are restricted in advance, semi-supervised NMF may be applied. Such semi-supervised NMF is described, for example, in Smaragdis, P., Raj, B., & Shashanka, M. V., Supervised and Semi-supervised Separation of Sounds from Single-Channel Mixtures, In: ICA, 2007, pp. 414-421.
At step S33, the first similarity calculation unit 13 acquires a reference basis matrix Hr (one example of a second matrix) and a reference activation matrix Ur of the reference sound signal. In the present example, the NMF is applied in advance to each of a plurality of accompaniment data to calculate a reference basis matrix and a reference activation matrix. The calculated reference basis matrix and the reference activation matrix are recorded as information relating to accompaniment data in the database 14. The first similarity calculation unit 13 successively selects accompaniment sound to be made reference sound from among the plurality of accompaniment data recorded in the database and acquires a reference basis matrix and a reference activation matrix corresponding to the selected accompaniment sound from the database 14.
It is to be noted that the reference basis matrix and the reference activation matrix recorded in the database 14 may not necessarily have been calculated using all reference sound. The NMF may be applied only to some sections specified by a process similar to the target section specification process for the input sound to calculate a reference basis matrix and a reference activation matrix.
At step S34, the first similarity calculation unit 13 calculates a combination similarity of bases in each frame. The combination of bases is a combination of basis vectors activated within a certain period from among the plurality of basis vectors included in the basis matrix.
A combination similarity of bases is obtained, for example, by extracting column vectors corresponding to a certain frame from an activation matrix in regard to input sound and reference sound individually, and calculating the inner product of the column vectors. This inner product indicates a combination similarity of bases in one frame. In other words, at step S34, the similarity of a combination of first components in the first matrix and the second matrix is calculated for each second component.
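A minimal sketch of this per-frame inner product of activation columns; it assumes, for simplicity, that the two activation matrices share the same basis ordering and the same number of frames, and the final averaging into a single value is likewise an assumption.

```python
import numpy as np

def base_combination_similarity(U_in, U_ref):
    """Per-frame similarity of the combination of activated bases.

    U_in and U_ref are activation matrices (k bases x n frames); the same basis
    ordering and the same number of frames are assumed for simplicity."""
    return np.einsum("kn,kn->n", U_in, U_ref)   # inner product of corresponding columns

# Aggregating the per-frame values (here by averaging, as an assumption) gives one value.
rng = np.random.default_rng(0)
U_in, U_ref = rng.random((8, 100)), rng.random((8, 100))
print(base_combination_similarity(U_in, U_ref).mean())
```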
According to this example, not only a rhythm similarity but also a tone color similarity is calculated. Accordingly, in comparison with an alternative case in which only the rhythm similarity is used, a musical piece can be searched out with higher accuracy.
2-4. Similarity Calculation by Beat Spectrum
At step S41, the second similarity calculation unit 15 acquires the BPM (beat number per minute) of the input sound signal. In this example, the second similarity calculation unit 15 calculates the BPM by analyzing the input sound signal. A known technique is used for the calculation of the BPM. At step S42, the second similarity calculation unit 15 calculates an amplitude spectrogram of the input sound signal. At step S43, the second similarity calculation unit 15 acquires a feature amount, in this example a spectral difference, from the amplitude spectrogram. The spectral difference is the difference in amplitude between frames that are adjacent to each other on the time axis of the amplitude spectrogram. In other words, the spectral difference is data whose axis of abscissa represents time and whose axis of ordinate represents the difference in amplitude from the preceding frame. At step S44, the second similarity calculation unit 15 normalizes the input sound signal with the beat number per unit time period. In particular, the second similarity calculation unit 15 normalizes the time axis of the spectral difference with the BPM. More particularly, the second similarity calculation unit 15 can normalize the time axis in units of 1/n of a beat by dividing the time axis of the spectral difference by n times the BPM.
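One possible realization of steps S41 to S44, assuming librosa for the BPM estimation and the spectrogram; the half-wave rectified spectral flux and the resampling to a fixed number of points per beat are assumed interpretations of the spectral difference and of the normalization described above.

```python
import numpy as np
import librosa

y, sr = librosa.load("input_piece.wav", sr=22050, mono=True)   # file path is illustrative
hop = 512

# Step S41: beat number per unit time period (BPM), here estimated with librosa.
bpm = float(librosa.beat.tempo(y=y, sr=sr, hop_length=hop)[0])

# Steps S42 and S43: amplitude spectrogram and spectral difference
# (here the half-wave rectified frame-to-frame difference, summed over frequency).
Y = np.abs(librosa.stft(y, n_fft=2048, hop_length=hop))
flux = np.maximum(Y[:, 1:] - Y[:, :-1], 0.0).sum(axis=0)

# Step S44: normalize the time axis with the BPM, here by resampling the spectral
# difference so that every beat is represented by a fixed number of points.
points_per_beat = 16                                # corresponds to a unit of 1/n of a beat
frames_per_beat = (60.0 / bpm) * sr / hop
n_beats = flux.size / frames_per_beat
grid = np.linspace(0, flux.size - 1, int(n_beats * points_per_beat))
flux_norm = np.interp(grid, np.arange(flux.size), flux)
```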
At step S45, the second similarity calculation unit 15 calculates a beat spectrum of the normalized input sound signal. In particular, the second similarity calculation unit 15 calculates the beat spectrum from the autocorrelation of the normalized spectral difference. At step S46, the second similarity calculation unit 15 acquires a normalized beat spectrum of the reference sound signal. In this example, a normalized beat spectrum is calculated in advance for each of the plurality of accompaniment data. The calculated beat spectra are recorded as information relating to the accompaniment data in the database 14. The second similarity calculation unit 15 successively selects accompaniment sound to be made the reference sound from among the plurality of accompaniment data recorded in the database 14 and acquires the beat spectrum corresponding to the accompaniment sound from the database 14. At step S47, the second similarity calculation unit 15 compares the normalized beat spectrum of the input sound signal and the normalized beat spectrum calculated from the reference sound signal with each other to calculate a rhythm similarity between the beat spectra of the input sound and the reference sound. In particular, the second similarity calculation unit 15 calculates the similarity of the beat spectra of the input sound and the accompaniment sound. The step S47 is a different example of a step for calculating a rhythm similarity with the reference sound signal for at least some of a plurality of sections included in the input sound signal.
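Continuing the previous sketch, the beat spectrum can be taken as the autocorrelation of the BPM-normalized spectral difference, and the two beat spectra can then be compared; the cosine similarity used below is an assumed choice of comparison measure, not one specified by the embodiment.

```python
import numpy as np

def beat_spectrum(flux_norm, max_lag=256):
    """Beat spectrum as the autocorrelation of the BPM-normalized spectral difference."""
    x = flux_norm - flux_norm.mean()
    acf = np.correlate(x, x, mode="full")[x.size - 1:]   # lags 0, 1, 2, ...
    acf = acf[:max_lag]
    return acf / (acf[0] + 1e-12)                        # scale by the lag-0 value

def rhythm_similarity(bs_input, bs_reference):
    """Compare two beat spectra; cosine similarity is used here as an assumed measure."""
    n = min(bs_input.size, bs_reference.size)
    a, b = bs_input[:n], bs_reference[:n]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# `flux_norm` would come from the previous sketch for the input sound, while the
# reference beat spectrum would be precomputed from accompaniment data (database 14).
rng = np.random.default_rng(0)
print(rhythm_similarity(beat_spectrum(rng.random(1024)), beat_spectrum(rng.random(1024))))
```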
In the similarity calculation using NMF, a rhythm similarity is calculated from an activation matrix. However, the NMF is generally insufficient in time resolution and cannot discriminate a difference in detailed rhythm structure, such as between a so-called even feel and a shuffle feel. Although it is possible to calculate a rhythm similarity with time analyzed more finely in the NMF, there is a problem that the calculation amount increases significantly. Further, although an example in which bases of individual musical instruments are separated clearly is depicted by the example of
In contrast, in this example, a rhythm similarity is calculated using a beat spectrum. Therefore, a detailed rhythm structure can be captured more accurately. Further, since a difference in BPM generally influences the feature amount of a beat spectrum, it is difficult to evaluate the similarity of rhythm structures if beat spectra are merely compared with each other. In this example, however, before the beat spectrum is calculated, the spectral difference is normalized with the BPM, so that the difference in BPM between the input sound and the reference sound is absorbed.
2-5. Integration of Similarities, Selection of Musical Piece
Integration of the similarities at step S5 is particularly performed in the following manner. In this example, two similarities (a tone color similarity and a rhythm similarity) are obtained by NMF and one similarity (a rhythm similarity) is obtained by a beat spectrum. Those similarities are normalized to a common scale (for example, the lowest similarity is zero and the highest similarity is one).
The integration unit 16 integrates the plurality of similarities by a weighted arithmetic operation in which the similarity by NMF and the similarity by a beat spectrum are adjusted with a predetermined weight, in the present example, adjusted so as to be 1:1. In particular, the integration unit 16 calculates an integrated similarity Di (one example of a third similarity) in accordance with the following expression (3).
Di=DtN+DrN+2·Drb (3)
Here, DtN and DrN indicate the tone color similarity and the rhythm similarity obtained by the NMF, and Drb indicates the rhythm similarity obtained by a beat spectrum. According to this example, the similarity by NMF and the similarity by a beat spectrum are evaluated with an equal weight. The integrated similarity is calculated for each of the plurality of accompaniment data.
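The normalization to a common scale and the weighted integration can be sketched as follows; the min-max normalization and the default weight values are placeholders, with the embodiment's specific weighting being the one given by expression (3).

```python
import numpy as np

def normalize_scores(scores):
    """Min-max normalize similarity values (e.g. over all accompaniment data) to [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-12)

def integrate(d_tone_nmf, d_rhythm_nmf, d_rhythm_beat, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three normalized similarities; the weights are placeholders
    standing in for the predetermined weighting of expression (3)."""
    w_t, w_r, w_b = weights
    return w_t * d_tone_nmf + w_r * d_rhythm_nmf + w_b * d_rhythm_beat

# Example: integrate the similarities of three accompaniment candidates.
d_tone = normalize_scores([0.2, 0.8, 0.5])
d_rhythm = normalize_scores([0.1, 0.7, 0.9])
d_beat = normalize_scores([0.6, 0.4, 0.3])
print(integrate(d_tone, d_rhythm, d_beat))   # one integrated similarity per candidate
```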
The selection unit 17 selects, from among the plurality of accompaniment data, the accompaniment data having the highest similarity to the input sound. In this example, since the selection unit 17 is included in the information processing apparatus 20 and the outputting unit 18 is included in the digital musical instrument 10, the information processing apparatus 20 notifies the digital musical instrument 10 of an identifier of the accompaniment data selected by the selection unit 17. In the digital musical instrument 10, the outputting unit 18 reads out the accompaniment data corresponding to the notified identifier and outputs the accompaniment data, namely, the musical piece.
3. Modifications
The present invention is not limited to the embodiment described hereinabove and allows various modifications. In the following, several modifications are described. Two or more of the modifications described below may be used in combination.
The corresponding relationship between the functional configuration and the hardware configuration in the musical piece search system 1 is not limited to the example described in the description of the embodiment. For example, the musical piece search system 1 may have all functions aggregated in the information processing apparatus 20. In this case, the musical piece that becomes a search target is not limited to accompaniment sound of a digital musical instrument. For example, the musical piece search system 1 may be applied to search for general musical piece content to be reproduced by a music player. Alternatively, the musical piece search system 1 may be applied to search for a musical piece in a karaoke apparatus. Further, some of the functions of the information processing apparatus 20 may be incorporated in a server apparatus on a network. For example, from among the functions of the musical piece search system 1, the specification unit 12, the first similarity calculation unit 13, the database 14, the second similarity calculation unit 15, the integration unit 16, and the selection unit 17 may be incorporated in a server apparatus. In this case, when the information processing apparatus 20 acquires an input sound signal, it transmits a search request including the input sound signal in the form of data to the server apparatus. The server apparatus searches for a musical piece similar to the input sound signal included in the received search request and returns a result of the search to the information processing apparatus 20.
The method by which the specification unit 12 specifies a target section from an input sound signal is not restricted to the example described in the description of the embodiment. The specification unit 12 may specify, as a target section, a section selected from among a plurality of sections obtained by the musical piece structure analysis, for example, at random or in response to an instruction of the user. Further, the specification unit 12 is not limited to a unit that performs selection of target sections until the cumulative time length of the target sections exceeds a threshold value. The specification unit 12 may perform selection of target sections, for example, until the number of sections selected as target sections exceeds a threshold value. Alternatively, the specification unit 12 may perform selection of target sections until no non-selected section having a priority higher than a threshold value remains.
The signal processing performed for a target section specified by the specification unit 12 is not limited to that performed by the first similarity calculation unit 13 and the second similarity calculation unit 15. A process other than calculation of a similarity may be performed for a target section specified by the specification unit 12.
The first similarity calculation unit 13 is not limited to a unit that calculates both a rhythm similarity and a tone color similarity. The first similarity calculation unit 13 may calculate only one of a rhythm similarity and a tone color similarity. Further, in the first similarity calculation unit 13, the reference matrix acquisition unit 132 may not acquire a basis matrix and an activation matrix corresponding to a reference sound signal from the database 14 but may instead acquire the reference sound signal itself from the database 14 and calculate a basis matrix and an activation matrix by NMF.
One of the first similarity calculation unit 13 and the second similarity calculation unit 15 may be omitted. In this case, the integration unit 16 is unnecessary, and the selection unit 17 selects a musical piece on the basis only of a similarity by one of the first similarity calculation unit 13 and the second similarity calculation unit 15.
The acquisition unit 11, the specification unit 12, the first similarity calculation unit 13, the second similarity calculation unit 15, the integration unit 16, and the selection unit 17 are not limited to units incorporated in a computer apparatus by software. At least some of them may be incorporated as hardware, for example, by a dedicated integrated circuit.
A program to be executed by the CPU 201 or the like of the information processing apparatus 20 may be provided through a recording medium such as an optical disk, a magnetic disk, a semiconductor memory or the like or may be downloaded through a communication line such as the Internet. Further, the program may not necessarily include all of the steps of
Further, the similarity calculation by a beat spectrum at step S4 may not necessarily include all of the steps. In the case where a spectral difference is not used as a feature amount, the calculation, normalization, or autocorrelation of the spectral difference may not be performed.
Foreign Application Priority Data: Japanese Patent Application No. 2016-043219, March 2016, Japan.
This application is a continuation-in-part application of International Application No. PCT/JP2017/009074, filed Mar. 7, 2017, which claims priority to Japanese Patent Application No. 2016-043219 filed in Japan on Mar. 7, 2016. The entire disclosures of International Application No. PCT/JP2017/009074 and Japanese Patent Application No. 2016-043219 are hereby incorporated herein by reference.
References Cited: U.S. Patent Documents
U.S. Pat. No. 6,542,869 B1, Foote, April 2003.
U.S. Pat. No. 9,245,508 B2, Sugano, January 2016.
U.S. Pat. No. 9,378,768 B2, Wu, June 2016.
U.S. Patent Application Publication No. 2003/0205124 A1, Foote, November 2003.
U.S. Patent Application Publication No. 2008/0072741 A1, Ellis, March 2008.
U.S. Patent Application Publication No. 2011/0271819 A1, Arimoto, November 2011.
U.S. Patent Application Publication No. 2013/0064379 A1, Pardo, March 2013.
Foreign Patent Documents
EP 2375407, October 2011.
JP 2003-330460, November 2003.
JP 2008-275975, November 2008.
JP 2011-221156, November 2011.
JP 2015-79110, April 2015.
JP 2015-114361, June 2015.
Other Publications
Paris Smaragdis et al., “Supervised and Semi-supervised Separation of Sounds from Single-Channel Mixtures,” In: ICA, 2007, pp. 414-421.
Daniel D. Lee et al., “Algorithms for non-negative matrix factorization,” Advances in Neural Information Processing Systems, 13(1), V621-V624, 2001, 7 pages.
International Search Report and Written Opinion of PCT Application No. PCT/JP2017/009074, dated May 30, 2017, 2 pages of English Translation and 7 pages of ISRWO.
Shota Kawabuchi et al., “NMF o Riyo shita Gakkyokukan Ruiji Shakudo no Kosei Hoho ni Kansuru Kento,” Report of the 2011 Spring Meeting, the Acoustical Society of Japan (CD-ROM), Mar. 2, 2011, pp. 1035-1036, 3-1-4.
Foote et al., “The Beat Spectrum: A New Approach to Rhythm Analysis,” 2001 IEEE International Conference on Multimedia and Expo, Oct. 20, 2003, pp. 1088-1091.
Patent Application Publication: US 2019/0005935 A1, January 2019, United States.
Related U.S. Application Data: parent application PCT/JP2017/009074, filed March 2017; child application U.S. Ser. No. 16/123,478.