This disclosure relates to audio signal analysis technology.
Analysis techniques for estimating beat points (beats) of a musical piece by analyzing audio signals that represent the sound of the performed musical piece have been proposed in the prior art. For example, Japanese Laid-Open Patent Application No. 2015-114361 discloses a technology for estimating the beat points of a musical piece by using a stochastic model such as a hidden Markov model.
In techniques of the prior art for estimating beat points of a musical piece, there is the possibility that upbeats of the musical piece may be incorrectly estimated as beat points, or that beat points corresponding to twice the original tempo of the musical piece may be incorrectly estimated. There is also the possibility that the result of the beat point estimation does not conform to the intention of the user, as in the case in which upbeats of a musical piece are estimated in a situation where the user is expecting the downbeats to be estimated. In consideration of these circumstances, it is important to have a configuration that allows the user to change the positions on the time axis of multiple beat points estimated from the audio signal. However, there is the problem that the workload imposed on the user in individually changing each beat point over the entire musical piece to the desired time point is excessive. In view of these circumstances, the object of one aspect of this disclosure is to obtain a time series of beat points in accordance with the intentions of the user, while reducing the burden on the user of issuing an instruction to change the position of each beat point.
In order to solve the problem described above, an audio analysis system according to one aspect of this disclosure estimates a plurality of beat points of a musical piece by analyzing an audio signal representing a performance sound of the musical piece, receives an instruction from a user to change a location of at least one beat point of the plurality of beat points, and updates a plurality of locations of the plurality of beat points in response to the instruction from the user.
An audio analysis system according to one aspect of this disclosure comprises an electronic controller including at least one processor. The electronic controller is configured to execute an analysis processing unit configured to estimate a plurality of beat points of a musical piece by analyzing an audio signal representing a performance sound of the musical piece, an instruction receiving unit configured to receive an instruction from a user to change a location of at least one beat point of the plurality of beat points, and a beat point updating unit configured to update a plurality of locations of the plurality of beat points in response to the instruction from the user.
A non-transitory computer-readable medium storing a program according to one aspect of this disclosure causes a computer system to execute a process comprising estimating a plurality of beat points of a musical piece by analyzing an audio signal representing a performance sound of the musical piece, receiving an instruction from a user to change a location of at least one beat point of the plurality of beat points, and updating a plurality of locations of the plurality of beat points in response to the instruction from the user.
Selected embodiments will now be explained in detail below, with reference to the drawings as appropriate. It will be apparent to those skilled in the art from this disclosure that the following descriptions of the embodiments are provided for illustration only and not for the purpose of limiting the invention as defined by the appended claims and their equivalents.
The control device 11 is an electronic controller that includes one or more processors that control each element of the audio analysis system 100. For example, the control device 11 is configured to comprise one or more types of processors, such as a CPU (Central Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), an ASIC (Application Specific Integrated Circuit), etc. Here, the term “electronic controller” as used herein refers to hardware, and does not include a human.
The storage device 12 includes one or more computer memories or memory units for storing a program that is executed by the control device 11 and various data that are used by the control device 11. The storage device 12 comprises a known storage medium, such as a magnetic storage medium or a semiconductor storage medium, or a combination of a plurality of various types of storage media. A portable storage medium that can be attached to or detached from the audio analysis system 100 or a storage medium (for example, cloud storage) that the control device 11 can read from or write to via a communication network such as the Internet can also be used as the storage device 12. The storage device 12 is one example of a non-transitory storage medium.
The storage device 12 stores the audio signal A. The audio signal A is a sampled sequence representing the waveform of performance sounds of a musical piece. Specifically, the audio signal A represents instrument sounds and/or singing sounds of a musical piece. The data format of the audio signal A is arbitrary. The audio signal A can be supplied to the audio analysis system 100 from a signal supply device that is separate from the audio analysis system 100. The signal supply device is, for example, a reproduction device that supplies the audio signal A stored on a storage medium to the audio analysis system 100, or a communication device that supplies audio signal A received from a distribution device (not shown) via a communication network to the audio analysis system 100.
The display device (display) 13 displays images under the control of the control device 11. For example, various display panels such as a liquid-crystal display panel or an organic EL (Electroluminescence) display panel are used as the display device 13. The display device 13, which is separate from the audio analysis system 100, can be connected to the audio analysis system 100 wirelessly or by wire. The operation device 14 is an input device (user operable input(s)) that receives instructions from a user. For example, the operation device 14 is a controller operated by the user, or a touch panel that detects contact from the user.
The sound output device 15 reproduces sound under the control of the control device 11. For example, a speaker or headphones are used as the sound output device 15. A sound output device 15 that is separate from the audio analysis system 100 can be connected to the audio analysis system 100 wirelessly or by wire.
The analysis processing unit 20 estimates a plurality of beat points in a musical piece by analyzing the audio signal A. More specifically, the analysis processing unit 20 generates beat point data B from the audio signal A. The beat point data B are data that represent each beat point in a musical piece. More specifically, the beat point data B are time-series data that specify the time of each of the plurality of beat points in a musical piece. For example, the time of each beat point with respect to the starting point of the audio signal A is specified by beat point data B. The analysis processing unit 20 of the first embodiment includes a feature extraction unit 21, a probability calculation unit 22, and an estimation processing unit 23.
The feature extraction unit 21 generates feature data F[m] for each analysis time point t[m]. The feature data F[m] corresponding to a given analysis time point t[m] are a time series of a plurality of feature values f[m] within a period of time U (hereinafter referred to as “unit period”) that includes the analysis time point t[m].
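As a concrete illustration of the feature extraction described above, the following is a minimal Python sketch that builds feature data F[m] as a window of per-frame feature values within a unit period U centered on each analysis time point t[m]; the mel-spectrogram front end, the hop length, and the window width are assumptions made only for this example and are not prescribed by the embodiment.

```python
# Illustrative sketch only: feature data F[m] built as a unit period U of
# per-frame feature values centered on each analysis time point t[m].
# The mel-spectrogram front end, hop length, and context width are assumptions.
import numpy as np
import librosa

def extract_feature_data(audio_path, hop_length=512, n_mels=40, context=15):
    y, sr = librosa.load(audio_path, sr=None, mono=True)        # audio signal A
    mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                         hop_length=hop_length, n_mels=n_mels)
    feats = librosa.power_to_db(mel).T                          # (M, n_mels): f[m] per frame
    M = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    # F[m]: 2*context + 1 frames (the unit period U) centered on t[m]
    F = np.stack([padded[m:m + 2 * context + 1] for m in range(M)])
    return F                                                    # (M, U, n_mels)
```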
The probability calculation unit 22 generates, for each analysis time point t[m], output data O[m] representing the probability P[m] that the analysis time point t[m] corresponds to a beat point of the musical piece. The probability calculation unit 22 of the first embodiment generates the output data O[m] by using an estimation model 50.
There is a correlation between the feature data F[m] at each analysis time point t[m] of the audio signal A and the likelihood that the analysis time point t[m] corresponds to a beat point. The estimation model 50 is a statistical model that has learned the above-described correlation. Specifically, the estimation model 50 is a learned model that has learned the relationship between the feature data F[m] and the output data O[m] by machine learning.
The estimation model 50 comprises a deep neural network (DNN), for example. The estimation model 50 is realized by a combination of a program that causes the control device 11 to execute a calculation for generating the output data O[m] from the feature data F[m] and a plurality of variables (specifically, weighted values and biases) that are applied to the calculation. The program and the plurality of variables that realize the estimation model 50 are stored in the storage device 12. The numerical values of each of the plurality of variables defining the estimation model 50 are set in advance by machine learning.
The plurality of intermediate layers 52 are hidden layers located between the input layer 51 and the output layer 53. The plurality of intermediate layers 52 include a plurality of intermediate layers 52a and a plurality of intermediate layers 52b. The plurality of intermediate layers 52a are located between the input layer 51 and the plurality of intermediate layers 52b. Each of the intermediate layers 52a is composed of a combination of a convolutional layer and a pooling layer, for example. Each of the intermediate layers 52b is a fully-connected layer with, for example, ReLU as the activation function. The output layer 53 outputs the output data O[m].
The estimation model 50 is divided into a first part 50a and a second part 50b. The first part 50a is the part of the estimation model 50 on the input side. Specifically, the first part 50a is the first half of the model composed of the input layer 51 and the plurality of intermediate layers 52a. The second part 50b is the part of the estimation model 50 on the output side. Specifically, the second part 50b is the second half of the model composed of the output layer 53 and the plurality of intermediate layers 52b. The first part 50a is the part that generates intermediate data D[m] according to the feature data F[m]. The intermediate data D[m] are data representing the features of the feature data F[m]. Specifically, the intermediate data D[m] are data representing features that contribute to outputting statistically valid output data O[m] with respect to the feature data F[m]. The second part 50b is the part that generates output data O[m] according to the intermediate data D[m].
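For illustration only, the following PyTorch sketch shows one possible way to organize such an estimation model into a first part (convolution and pooling layers that produce the intermediate data D[m]) and a second part (fully-connected layers with ReLU that produce the output data O[m]); the layer counts, channel sizes, sigmoid output, and the parameter d_in (which must match the flattened size of the first part's output for the chosen unit period) are assumptions for this example.

```python
# Illustrative PyTorch sketch: a first part 50a (convolution + pooling) that
# produces intermediate data D[m], and a second part 50b (fully connected + ReLU)
# that produces output data O[m].  All layer sizes are assumptions.
import torch
import torch.nn as nn

class FirstPart(nn.Module):                      # corresponds to the first part 50a
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )

    def forward(self, F_m):                      # F_m: (batch, 1, U, n_mels)
        return self.net(F_m)                     # intermediate data D[m]

class SecondPart(nn.Module):                     # corresponds to the second part 50b
    def __init__(self, d_in):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),     # probability P[m] in [0, 1]
        )

    def forward(self, D_m):
        return self.net(D_m)                     # output data O[m]
```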
A plurality of pieces of training data Z are used for the machine learning of the estimation model 50. Each of the plurality of pieces of training data Z is composed of a combination of feature data Ft for training and output data Ot for training. The feature data Ft represent feature values, at specific time points, of an audio signal prepared for training. Specifically, the feature data Ft are composed of a time series of a plurality of feature values corresponding to different time points on the time axis, similar to the above-mentioned feature data F[m]. The output data Ot for training corresponding to a specific point in time are data representing the probability that the time point corresponds to a beat point of the musical piece (that is, the correct answer value). The plurality of pieces of training data Z are prepared for a large number of known musical pieces.
The machine learning system 200 calculates an error function representing the error between the output data O[m] output by an initial or provisional model (hereinafter referred to as “provisional model”) 59 when the feature data Ft of the training data Z are input, and the output data Ot of the training data Z. The machine learning system 200 then updates the plurality of variables of the provisional model 59 such that the error function is reduced. The provisional model 59 at the point in time when the above-described process is iterated for each of the plurality of training data Z is set as the estimation model 50.
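A hedged sketch of the machine learning just described follows: the variables of the provisional model are iteratively updated so that an error function between its output for the training feature data Ft and the training output data Ot is reduced. Binary cross-entropy and the Adam optimizer are assumptions made for the example; the embodiment only requires that the error function be reduced.

```python
# Hedged training sketch: the provisional model's variables are updated so that
# the error between its output for training feature data Ft and the training
# output data Ot decreases.  BCE loss and Adam are assumptions for the example.
import torch

def train_estimation_model(first_part, second_part, training_pairs, epochs=10):
    params = list(first_part.parameters()) + list(second_part.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    loss_fn = torch.nn.BCELoss()
    for _ in range(epochs):
        for Ft, Ot in training_pairs:            # Ft: (batch, 1, U, n_mels), Ot: (batch, 1)
            O = second_part(first_part(Ft))      # output of the provisional model
            loss = loss_fn(O, Ot)                # error function between O and Ot
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                     # variables updated so the error is reduced
    return first_part, second_part               # serves as the estimation model 50
```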
Thus, the estimation model 50 outputs statistically valid output data O[m] for unknown feature data F[m] under the potential relationship between the feature data Ft and the output data Ot in the plurality of training data Z. That is, the estimation model 50 is a learned model that has learned the relationship between the feature data Ft for training corresponding to each time point on the time axis and the output data Ot for training that represents the probability that the time point corresponds to a beat point. The probability calculation unit 22 inputs the feature data F[m] of each analysis time point t[m] into the estimation model 50 established by the procedure described above, thereby generating the output data O[m] representing the probability P[m] that the analysis time point t[m] corresponds to a beat point.
When the probability calculation process Sa is started, the probability calculation unit 22 inputs the feature data F[m] corresponding to the analysis time point t[m] into the estimation model 50 (Sa1). The probability calculation unit 22 acquires the intermediate data D[m] output by the first part 50a of the estimation model 50 and stores the intermediate data D[m] in the storage device 12 (Sa2). In addition, the probability calculation unit 22 acquires the output data O[m] output by the estimation model 50 (second part 50b) and stores the output data O[m] in the storage device 12 (Sa3).
The probability calculation unit 22 determines whether the process described above has been executed for the M analysis time points t[1]˜t[M] in the musical piece (Sa4). If the determination result is negative (Sa4: NO), the probability calculation unit 22 generates the intermediate data D[m] and the output data O[m] (Sa1˜Sa3) for the unprocessed analysis time points t[m]. The probability calculation unit 22 terminates the probability calculation process Sa once the process has been executed for the M analysis time points t[1]˜t[M] (Sa4: YES). As can be understood from the foregoing explanation, as a result of the probability calculation process Sa, M pieces of intermediate data D[1]˜D[M] corresponding to different analysis time points t[m] and M pieces of output data O[1]˜O[M] corresponding to different analysis time points t[m] are stored in the storage device 12.
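The probability calculation process Sa can be pictured with the following sketch, which passes the feature data F[1]˜F[M] through the first part to obtain and retain the intermediate data D[1]˜D[M] (cf. Sa2) and through the second part to obtain the output data O[1]˜O[M] (cf. Sa3). Processing all M analysis time points in a single batch is an assumption made for brevity.

```python
# Sketch of the probability calculation process Sa: intermediate data D[m] and
# output data O[m] are obtained for every analysis time point and retained.
import torch

@torch.no_grad()
def probability_calculation(first_part, second_part, feature_data):
    # feature_data: tensor of shape (M, 1, U, n_mels) holding F[1]..F[M]
    D = first_part(feature_data)      # intermediate data D[1]..D[M]  (stored, cf. Sa2)
    O = second_part(D)                # output data O[1]..O[M], probabilities P[m]  (cf. Sa3)
    return D, O
```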
The estimation processing unit (beat point estimation unit) 23 estimates a plurality of beat points in the musical piece from the M pieces of output data O[1]˜O[M] generated by the probability calculation unit 22. The estimation processing unit 23 of the first embodiment estimates the plurality of beat points by using a state transition model 60 composed of N states Q. Each beat interval δ on the time axis is divided by a plurality of transition points Y[0]˜Y[4], and the transition point Y[0] of each beat interval δ corresponds to a beat point.
Each of the N states Q of the state transition model 60 corresponds to one of a plurality of tempos X[i] (i=1, 2, 3, . . . ). Specifically, each of the N states Q corresponds to a different combination of each of the plurality of tempos X[i] and each of the plurality of transition points Y[0]˜Y[4]. That is, for each tempo X[i], there is a time series of five states Q corresponding to different transition points Y[j]. In the following description, the state Q that corresponds to the combination of a tempo X[i] and a transition point Y[j] can be expressed as “state Q[i, j].” On the other hand, when no particular attention is paid to the distinction between the tempo X[i] and the transition point Y[j], it is simply denoted as “state Q.” The distinction of the state Q by the transition point Y[j] can be omitted. That is, an implementation in which each of a plurality of states Q corresponds to a different tempo X[i] is conceivable. In an implementation in which the transition point Y[j] is not distinguished, for example, a hidden Markov model (HMM) is used as the state transition model 60.
In the first embodiment, it is assumed that the tempo X changes only at the beat points (that is, transition point Y[0]) on the time axis. Under the assumption described above, state Q[i, j] corresponding to each transition point Y[j] other than transition point Y[0] transitions only to state Q[i, j−1] corresponding to the immediately following transition point Y[j−1]. For example, state Q[i, 4] transitions to state Q[i, 3], state Q[i, 3] transitions to state Q[i, 2], and state Q[i, 2] transitions to state Q[i, 1]. On the other hand, state Q[i, 0] which corresponds to the beat points, will have transitions from a plurality of states Q[i, 1] (Q[1, 1], Q[2, 1], Q[3, 1], . . . ) corresponding to different tempos X[i].
When the beat point estimation process Sb is started, the estimation processing unit 23 calculates an observation likelihood Λ[m] for each of the M analysis time points t[1]˜t[M] (Sb1). The observation likelihood Λ[m] for each analysis time point t[m] is set to a numerical value corresponding to the probability P[m] represented by the output data O[m] of the analysis time point t[m]. For example, the observation likelihood Λ[m] is set to the probability P[m] represented by the output data O[m] or to a numerical value calculated by a prescribed computation performed on the probability P[m].
The estimation processing unit 23 calculates a path p[i, j] and likelihood λ[i, j] for each analysis time point t[m] for each state Q [i, j] of the state transition model 60. The path p[i, j] is a path from another state Q to the state Q[i, j], and the likelihood λ[i, j] is an index of the probability that the state Q[i, j] is observed.
As described above, only unidirectional transitions occur between the plural states Q[i, 0]˜Q[i, 4] corresponding to any given tempo X[i]. Therefore, for each state Q[i, j] corresponding to a transition point Y[j] other than the transition point Y[0], the path p[i, j] is uniquely determined as the transition from the single state Q of the same tempo X[i] that can transition to the state Q[i, j], and the estimation processing unit 23 calculates the likelihood λ[i, j] in accordance with the likelihood of that transition-source state and the observation likelihood of the analysis time point t[m].
On the other hand, the tempo X[i] can change at the transition point Y[0]. Therefore, for each state Q[i, 0] corresponding to the transition point Y[0], there are a plurality of candidate transition-source states Q corresponding to different tempos X[i]; the estimation processing unit 23 selects, as the path p[i, 0], the transition from the candidate state that maximizes the likelihood, and calculates the likelihood λ[i, 0] in accordance with the likelihood of the selected state and the observation likelihood of the analysis time point t[m].
The estimation processing unit 23 generates a time series of M states Q (hereinafter referred to as “state series”) corresponding to different analysis time points t[m] (Sb3). Specifically, the estimation processing unit 23 connects paths p[i, j] from state Q[i, j] corresponding to the maximum value of N likelihoods λ[i, j] calculated for the last analysis time point t[M] of the musical piece in sequence along the reverse direction of the time axis and generates a state series from M states Q located on the series of connected paths (that is, the maximum likelihood path). That is, a state series is generated by arranging the states Q having the greatest likelihoods λ[i, j] among the N states Q at each analysis time point t[m].
The estimation processing unit 23 estimates, as a beat point, each analysis time point t[m] at which state Q corresponding to the transition point Y[0] is observed among the M states Q that constitute the state series and generates the beat point data B that specify the time of each beat point (Sb4). As can be understood from the foregoing explanation, analysis time points t[m] at which probability P[m] represented by output data O[m] is high and at which there is an acoustically natural transition of the tempo are estimated as beat points in the musical piece.
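The following simplified sketch illustrates a dynamic-programming search of the kind described for the beat point estimation process Sb, with states combining a tempo index with a position inside the beat interval (position 0 being the beat point) and with tempo changes allowed only at beat points. Using one state per analysis frame within the beat interval instead of the five transition points Y[0]˜Y[4], taking 1 - P[m] as the observation term for non-beat states, and forcing the first frame to be a beat point are all assumptions made only for this example.

```python
# Simplified Viterbi-style sketch of the beat point estimation process Sb.
# States pair a tempo index i with a position j inside the beat interval
# (j == 0 is the beat point); tempo may change only when a new beat starts.
import numpy as np

def estimate_beats(P, frames_per_beat):
    # P: probabilities P[1]..P[M]; frames_per_beat: beat length in frames per tempo X[i]
    M, n_tempi = len(P), len(frames_per_beat)
    J = max(frames_per_beat)
    log_lik = np.full((M, n_tempi, J), -np.inf)     # likelihoods per state and time point
    back = np.zeros((M, n_tempi, J, 2), dtype=int)  # paths p[i, j] (previous i, j)
    log_lik[0, :, 0] = np.log(P[0] + 1e-12)         # assume a beat at the first frame
    for m in range(1, M):
        emit_beat = np.log(P[m] + 1e-12)
        emit_rest = np.log(1.0 - P[m] + 1e-12)
        for i, beat_len in enumerate(frames_per_beat):
            # beat state: may be entered from the last position of any tempo i2
            prev = [(i2, frames_per_beat[i2] - 1) for i2 in range(n_tempi)]
            scores = [log_lik[m - 1, i2, j2] for i2, j2 in prev]
            best = int(np.argmax(scores))
            log_lik[m, i, 0] = scores[best] + emit_beat
            back[m, i, 0] = prev[best]
            # non-beat states advance deterministically within the same tempo
            for j in range(1, beat_len):
                log_lik[m, i, j] = log_lik[m - 1, i, j - 1] + emit_rest
                back[m, i, j] = (i, j - 1)
    # backtrack along the maximum-likelihood path; beat points are frames with j == 0
    i, j = np.unravel_index(int(np.argmax(log_lik[M - 1])), (n_tempi, J))
    beats = []
    for m in range(M - 1, -1, -1):
        if j == 0:
            beats.append(m)
        i, j = back[m, i, j]
    return sorted(beats)
```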
As described above, in the first embodiment, the output data O[m] for each analysis time point t[m] are generated by inputting feature data F[m] for each analysis time point t[m] into the estimation model 50, and a plurality of beat points are estimated from the output data O[m]. Therefore, statistically valid output data O[m] can be generated for unknown feature data F[m] under the potential relationship between output data Ot for training and feature data Ft for training. The foregoing is a specific example of the configuration of the analysis processing unit 20.
The display control unit 24 causes the display device 13 to display an analysis screen 70 representing the results of the analysis of the audio signal A by the analysis processing unit 20.
The analysis screen 70 includes a first region 71 and a second region 72. The first region 71 displays a waveform 711 of the audio signal A. The second region 72 displays the results of the analysis for a period 712 of the audio signal A specified in the first region 71 (hereinafter referred to as the "specified period"). The second region 72 includes a waveform region 73, a probability region 74, and a beat point region 75.
A common time axis is set for the waveform region 73, the probability region 74, and the beat point region 75. The waveform region 73 displays a waveform 731 of the audio signal A within the specified period 712 and sound generation points (onsets) 732 in the audio signal A. The probability region 74 displays a time series 741 of the probabilities P[m] represented by the output data O[m] of each analysis time point t[m]. The time series 741 of the probabilities P[m] represented by the output data O[m] can also be displayed within the waveform region 73, superimposed on the waveform 731 of the audio signal A.
A plurality of beat points in the musical piece estimated by analyzing the audio signal A are displayed in the beat point region 75. Specifically, a time series of a plurality of beat images 751 corresponding to different beat points in the musical piece is displayed in the beat point region 75. Of the plurality of beat points in the musical piece, one or more beat images 751 corresponding to one or more beat points that satisfy a prescribed condition (hereinafter referred to as "correction candidate points") are highlighted in a different display mode than the other beat images 751. The correction candidate points are beat points for which the user is likely to issue a change instruction.
The reproduction control unit 25 controls the reproduction of sound by the sound output device 15. For example, the reproduction control unit 25 supplies the audio signal A to the sound output device 15 to reproduce the performance sound of the musical piece.
It should be noted that in the process of estimating a plurality of beat points in a musical piece from the audio signal A, there is a possibility that, for example, upbeats of the musical piece are incorrectly estimated as beat points. There is also the possibility that the result of estimating beat points does not conform to the intention of the user, such as is the case when the upbeats of a musical piece are estimated in a situation in which the user is expecting the downbeats to be estimated. The user can operate the operation device 14 to issue an instruction to change one or more location(s) on the time axis of any beat point(s) of the plurality of beat points in the musical piece. Specifically, by moving any one of the plurality of beat images 751 within the beat point region 75 in the time axis direction, the user issues an instruction to change the location of the beat point corresponding to the beat image 751. For example, the user issues an instruction to change the location of the correction candidate point from among the plurality of beat points.
The instruction receiving unit 26 receives, via the operation device 14, the instruction from the user (hereinafter referred to as the "change instruction") to change the location of one or more beat points of the plurality of beat points.
The estimation model updating unit 27 executes an estimation model update process Sc for updating the estimation model 50 in response to the change instruction from the user.
In the estimation model updating process Sc, an adaptation block 55 is added between the first part 50a and the second part 50b of the estimation model 50. The adaptation block 55 comprises, for example, an attention in which the activation function has been initialized to an identity function. Therefore, the initial adaptation block 55 supplies intermediate data D[m] output from the first part 50a to the second part 50b without change.
The estimation model updating unit 27 sequentially inputs feature data F[m1] at analysis time point t[m1] where the beat point before the change is located and feature data F[m2] at analysis time point t[m2] where the beat point after the change is located to the first part 50a (input layer 51). The first part 50a generates intermediate data D[m1] corresponding to feature data F[m1] and intermediate data D[m2] corresponding to feature data F[m2]. Each piece of intermediate data D[m1] and intermediate data D[m2] is sequentially input to the adaptation block 55.
The estimation model updating unit 27 also sequentially provides each of the M pieces of intermediate data D[1]˜D[M] calculated in the immediately preceding probability calculation process Sa (Sa2) to the adaptation block 55. That is, the intermediate data D[m] (D[m1], D[m2]) corresponding to some of the analysis time points t[m] among the M analysis time points t[1]˜t[M] in the musical piece pertaining to the change instruction and the M pieces of intermediate data D[1]˜D[M] throughout the entire musical piece are input to the adaptation block 55. The adaptation block 55 calculates the degree of similarity between the intermediate data D[m] (D[m1], D[m2]) corresponding to the analysis time points t[m] pertaining to the change instruction and the intermediate data D[m] supplied from the estimation model updating unit 27.
As described above, the analysis time point t[m2] is a time point that was estimated not to correspond to a beat point in the immediately preceding probability calculation process Sa, but that was instructed to be a beat point because of the change instruction. That is, the probability P[m2] represented by the output data O[m2] of the analysis time point t[m2] is set to a small numerical value in the immediately preceding probability calculation process Sa but should be set to a numerical value close to 1 under the change instruction from the user. Further, not only for analysis time point t[m2], but also for each analysis time point t[m] in which intermediate data D[m] that are similar to the intermediate data D[m2] of the analysis time point t[m2] are observed among the M analysis time points t[1]˜t[M] in the musical piece, the probability P[m] represented by output data O[m] of analysis time point t[m] should also be set to a numerical value close to 1. Thus, the estimation model updating unit 27 updates the plurality of variables of the estimation model 50 so that the probability P[m] of output data O[m] approaches a sufficiently large numerical value (for example, 1) when the degree of similarity between intermediate data D[m] and intermediate data D[m2] exceeds a prescribed threshold value. Specifically, the estimation model updating unit 27 updates the coefficients that define each of the first part 50a, the adaptation block 55, and the second part 50b, so that the error between probability P[m] of output data O[m] generated by the estimation model 50 from each piece of intermediate data D[m], whose degree of similarity to intermediate data D[m2] exceeds the threshold value, and the numerical value indicating a beat point (i.e., 1) is reduced.
On the other hand, the analysis time point t[m1] is a time point that was estimated to correspond to a beat point in the immediately preceding probability calculation process Sa, but that was instructed not to correspond to a beat point due to the change instruction. That is, probability P[m1] represented by output data O[m1] of analysis time point t[m1] is set to a large numerical value in the immediately preceding probability calculation process Sa but should be set to a numerical value close to zero under the change instruction from the user. Further, not only for analysis time point t[m1], but also for each analysis time point t[m] in which intermediate data D[m] that are similar to intermediate data D[m1] of the analysis time point t[m1] are observed among the M analysis time points t[1]˜t[M] in the musical piece, the probability P[m] represented by output data O[m] of the analysis time point t[m] should also be set to a numerical value close to zero. Thus, the estimation model updating unit 27 updates the plurality of variables of the estimation model 50 so that probability P[m] of output data O[m] approaches a sufficiently small numerical value (for example, zero) when the degree of similarity between intermediate data D[m] and intermediate data D[m1] exceeds a prescribed threshold value. Specifically, the estimation model updating unit 27 updates the coefficients that define each of the first part 50a, the adaptation block 55, and the second part 50b, so that the error between the probability P[m] of output data O[m] generated by the estimation model 50 from each piece of the intermediate data D[m], whose degree of similarity to intermediate data D[m1] exceeds the threshold value, and the numerical value indicating that it does not correspond to a beat point (i.e., zero) is reduced.
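The estimation model update just described can be sketched as follows: output probabilities at analysis time points whose intermediate data are similar to D[m1] (the removed beat point) are pushed toward 0, and those similar to D[m2] (the added beat point) toward 1, by updating the variables of the first part, the adaptation block, and the second part. Cosine similarity, a squared-error loss, the threshold value, and recomputing all intermediate data at every step are assumptions for the example; adapt_block is assumed to be a module that initially passes D[m] through unchanged.

```python
# Hedged sketch of the estimation model update process Sc (assumptions noted above).
import torch
import torch.nn.functional as F

def update_estimation_model(first_part, adapt_block, second_part,
                            feature_data, m1, m2, threshold=0.8, lr=1e-4, steps=50):
    params = (list(first_part.parameters()) + list(adapt_block.parameters())
              + list(second_part.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        D_all = first_part(feature_data)                            # D[1]..D[M]
        sim_removed = F.cosine_similarity(D_all, D_all[m1:m1 + 1])  # similarity to D[m1]
        sim_added = F.cosine_similarity(D_all, D_all[m2:m2 + 1])    # similarity to D[m2]
        O = second_part(adapt_block(D_all)).squeeze(-1)             # O[1]..O[M]
        losses = []
        if (sim_added > threshold).any():                 # these P[m] should approach 1
            losses.append(((O[sim_added > threshold] - 1.0) ** 2).mean())
        if (sim_removed > threshold).any():               # these P[m] should approach 0
            losses.append((O[sim_removed > threshold] ** 2).mean())
        if not losses:
            break
        loss = sum(losses)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```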
As can be understood from the foregoing explanation, in the first embodiment, in addition to intermediate data D[m1] and intermediate data D[m2] directly related to the change instruction, intermediate data D[m] that are similar to intermediate data D[m1] or intermediate data D[m2] among the M pieces of intermediate data D[1]˜D[M] throughout the entire musical piece, are also used to update the estimation model 50. Therefore, even though the beat point(s) for which the user issues a change instruction are only a part of the beat points in the musical piece, the estimation model 50, following execution of the estimation model update process Sc, can generate M pieces of output data O[1]˜O[M] that reflect the change instruction throughout the entire musical piece.
As discussed above, in the first embodiment, both intermediate data D[m1] and intermediate data D[m2] are used to update the estimation model 50. However, only one of intermediate data D[m1] and intermediate data D[m2] can be used to update the estimation model 50.
If the estimation model update process Sc is started, the estimation model updating unit 27 determines whether the adaptation block 55 has already been added to the estimation model 50 (Sc1). If the adaptation block 55 has not been added to the estimation model 50 (Sc1: NO), the estimation model updating unit 27 adds a new initial adaptation block 55 between the first part 50a and the second part 50b of the estimation model 50 (Sc2). On the other hand, if the adaptation block 55 has already been added in a previous estimation model update process Sc (Sc1: YES), the addition of the adaptation block 55 (Sc2) is not performed.
If a new adaptation block 55 is added, the estimation model 50 including the new adaptation block 55 is updated by the following process, and if the adaptation block 55 has already been added, the estimation model 50 including the existing adaptation block 55 is also updated by the following process. In other words, in a state in which the adaptation block 55 is added to the estimation model 50, the estimation model updating unit 27 performs additional training (Sc3 and Sc4) to which are applied the locations of beat points before and after the change according to the change instruction from the user, thereby updating the plurality of variables of the estimation model 50. If the user has issued an instruction to change the locations of two or more beat points, the additional training (Sc3 and Sc4) is performed for each beat point pertaining to the change instruction.
The estimation model updating unit 27 uses feature data F[m1] at analysis time point t[m1] where the beat point is located before the change according to the change instruction to update the plurality of variables of the estimation model 50 (Sc3). Specifically, the estimation model updating unit 27 sequentially supplies each of the M pieces of intermediate data D[1]˜D[M] to the adaptation block 55 in parallel with the supply of feature data F[m1] to the estimation model 50 and updates the plurality of variables of the estimation model 50 so that the probability P[m] of output data O[m] generated from each piece of intermediate data D[m] similar to intermediate data D[m1] of feature data F[m1] approaches zero. Thus, the estimation model 50 is trained to produce output data O[m] representing a probability P[m] close to zero when feature data F[m] similar to feature data F[m1] at the analysis time point t[m1] are input.
The estimation model updating unit 27 also updates the plurality of variables of the estimation model 50 using feature data F[m2] at analysis time point t[m2] where the beat point is located after the change according to the change instruction (Sc4). Specifically, the estimation model updating unit 27 sequentially supplies each of the M pieces of intermediate data D[1]˜D[M] to the adaptation block 55 in parallel with the supply of feature data F[m2] to the estimation model 50 and updates the plurality of variables of the estimation model 50 so that the probability P[m] of output data O[m] generated from each piece of intermediate data D[m] similar to intermediate data D[m2] of feature data F[m2] approaches 1. Therefore, the estimation model 50 is trained to generate output data O[m] representing a probability P[m] close to one when feature data F[m] similar to feature data F[m2] at analysis time point t[m2] are input.
In addition to the estimation model 50 being updated in accordance with a change instruction by the estimation model update process Sc as described above, in the first embodiment, the plurality of updated beat points are estimated by performing the beat point estimation process Sb under the constraint condition according to the change instruction.
As described above, of the five transition points Y[0]˜Y[4] in the beat interval δ, the transition point Y[0] corresponds to a beat point and the remaining four transition points Y[1]˜Y[4] do not correspond to beat points. The analysis time point t[m2] on the time axis corresponds to a beat point after the change according to the change instruction. Therefore, from the N likelihoods λ[i, j] corresponding to different states Q at the analysis time point t[m2], the estimation processing unit 23 forcibly sets the likelihoods λ[i, j′] corresponding to the transition points Y[j′] (j′=1˜4) other than the transition point Y[0] to zero. In addition, from the N likelihoods λ[i, j] at the analysis time point t[m2], the estimation processing unit 23 maintains the likelihood λ[i, 0] corresponding to the transition point Y[0] at a numerical value calculated by the method described above. Therefore, in the generation of the state series (Sb3), a maximum likelihood path that necessarily passes through the state Q of the transition point Y[0] at the analysis time point t[m2] is estimated. That is, the analysis time point t[m2] is estimated to correspond to a beat point. As can be understood from the foregoing explanation, the beat point estimation process Sb is performed under the constraint condition that the state Q of the transition point Y[0] is observed at the analysis time point t[m2] of the beat point after the change according to the change instruction from the user.
On the other hand, the analysis time point t[m1] on the time axis does not correspond to a beat point after the change according to the change instruction. Thus, from among the N likelihoods λ[i, j] corresponding to different states Q at the analysis time point t[m1], the estimation processing unit 23 forcibly sets the likelihood λ[i, 0] corresponding to the transition point Y[0] to zero. In addition, from the N likelihoods λ[i, j] at the analysis time point t[m1], the estimation processing unit 23 maintains the likelihoods λ[i, j′] corresponding to the transition points Y[j′] other than the transition point Y[0] at the significant numerical values calculated by the method described above. Therefore, in the generation of the state series (Sb3), the maximum likelihood path that does not pass through the state Q of the transition point Y[0] at analysis time point t[m1] is estimated. That is, the analysis time point t[m1] is estimated not to correspond to a beat point. As can be understood from the foregoing explanation, the beat point estimation process Sb is executed under the constraint condition that the state Q of the transition point Y[0] is not observed at the analysis time point t[m1] before the change according to the change instruction from the user.
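In the log-domain Viterbi sketch shown earlier, the constraint conditions just described could be imposed as follows, with -inf playing the role of a zero likelihood; in an actual forward recursion the assignment would be applied at each analysis time point as its likelihoods are computed. The array layout follows the earlier sketch and is an assumption.

```python
# Sketch of the constraint conditions, expressed against the log-likelihood
# array of the earlier Viterbi sketch (-inf standing in for a zero likelihood).
import numpy as np

def apply_change_constraints(log_lik, m1, m2):
    # log_lik: (M, n_tempi, J) likelihoods; j == 0 is the beat state (transition point Y[0])
    log_lik[m2, :, 1:] = -np.inf   # t[m2] must be observed as a beat point
    log_lik[m1, :, 0] = -np.inf    # t[m1] must not be observed as a beat point
    return log_lik
```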
As described above, the likelihood λ[i, 0] of the transition point Y[0] at analysis time point t[m1] is set to zero, and the likelihood λ[i, j′] of the transition points Y[j′] other than the transition point Y[0] at analysis time point t[m2] is set to zero, thereby changing the maximum likelihood path throughout the entire musical piece. That is, even though the beat points for which the user instructs a change are only a part of the beat points in the musical piece, the change instruction is reflected in the plurality of beat points throughout the entire musical piece.
The control device 11 (as probability calculation unit 22) executes the probability calculation process Sa described above, and the control device 11 (as estimation processing unit 23) executes the beat point estimation process Sb described above, thereby estimating a plurality of beat points in the musical piece.
The control device 11 (as display control unit 24) identifies one or more correction candidate points among the plurality of beat points estimated by the beat point estimation process Sb (S14). Specifically, a beat point for which the beat interval δ between the beat point and the immediately preceding or immediately following beat point deviates from the average value in the musical piece, or a beat point for which the time length of the beat interval δ differs significantly from the time length(s) of a beat interval(s) δ before and/or after the beat interval δ, is identified as a correction candidate point. In addition, from the plurality of beat points, a beat point with a probability P[m] less than a prescribed value can be identified as a correction candidate point. The control device 11 (display control unit 24) then causes the display device 13 to display the analysis screen 70 described above.
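One plausible way to identify the correction candidate points described above is sketched below: beat points whose beat interval δ deviates strongly from the average interval, and beat points whose probability P[m] falls below a prescribed value. The deviation ratio and the probability threshold are assumptions for this example.

```python
# Plausible sketch of identifying correction candidate points.
import numpy as np

def find_correction_candidates(beat_frames, P, deviation=0.25, min_prob=0.5):
    beat_frames = np.asarray(beat_frames)
    intervals = np.diff(beat_frames)                      # beat intervals δ (in frames)
    mean_interval = intervals.mean()
    candidates = set()
    for k, delta in enumerate(intervals):
        if abs(delta - mean_interval) > deviation * mean_interval:
            candidates.update((beat_frames[k], beat_frames[k + 1]))
    for m in beat_frames:
        if P[m] < min_prob:                               # low estimated probability
            candidates.add(m)
    return sorted(int(c) for c in candidates)
```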
When the initial analysis process illustrated above is executed, the control device 11 (as instruction receiving unit 26) waits until a change instruction from the user pertaining to a part of the beat points from among the plurality of beat points in the musical piece is received. When the change instruction is received, the control device 11 (as estimation model updating unit 27) executes the estimation model update process Sc described above.
By using the estimation model 50 after the update by the estimation model update process Sc to execute the probability calculation process Sa, the control device 11 (as probability calculation unit 22) generates M pieces of output data O[1]˜O[M] that reflect the change instruction throughout the entire musical piece. The control device 11 (as estimation processing unit 23) then executes the beat point estimation process Sb using the generated output data O[m] under the constraint conditions described above, thereby estimating a plurality of updated beat points.
As can be understood from the foregoing explanation, a plurality of updated beat points are estimated by the estimation model update process Sc for updating the estimation model 50, the probability calculation process Sa that uses the updated estimation model 50, and the beat point estimation process Sb that uses the output data O[m] generated by the probability calculation process Sa. In other words, an element (beat point updating unit) that updates the locations of the estimated plurality of beat points is realized by the estimation model updating unit 27, the probability calculation unit 22, and the estimation processing unit 23.
The control device 11 (display control unit 24) identifies one or more correction candidate points from the plurality of beat points estimated by the beat point estimation process Sb (S34), in the same manner as in Step S14 described above. The control device 11 (display control unit 24) then causes the display device 13 to display the analysis screen 70 in which the updated plurality of beat points and the correction candidate points are reflected.
When the beat point update process illustrated above is executed, the control device 11 determines whether the user has issued an instruction to terminate the process. If the instruction to terminate the process has not been issued, the control device 11 again waits for a change instruction from the user.
As described above, in the first embodiment, in accordance with a user change instruction pertaining to a part of the plurality of beat points estimated by analyzing the audio signal A, the locations of a plurality of beat points in the musical piece including beat points other than the aforesaid part of beat points are updated. That is, the change instruction for a part of the musical piece is reflected on the entire musical piece. Therefore, compared to a configuration in which the user must issue an instruction to change the location of all of the beat points in the musical piece, a time series of beat points in accordance with the intentions of the user can be obtained, while reducing the burden on the user to issue instructions to change the location of each beat point.
With an adaptation block 55 added between the first part 50a and the second part 50b of the estimation model 50, the estimation model 50 is updated by additional training that applies the locations of beat points before and after the change according to the change instruction from the user. Therefore, the estimation model 50 can be specialized to a state in which the estimated beat points are in accordance with the intentions or preferences of the user.
In addition, the state transition model 60 comprising a plurality of states Q corresponding to any of the plurality of tempos X[i] is used to estimate the plurality of beat points. Therefore, a plurality of beat points can be estimated so that the tempo X[i] transitions in a natural manner. Particularly, in the first embodiment, the plurality of states Q of the state transition model 60 correspond to different combinations of each of the plurality of tempos X[i] and each of the plurality of transition points Y[j] in the beat interval δ, and the beat point estimation process Sb is executed under the constraint condition that the state Q corresponding to transition point Y[0] is observed at analysis time point t[m] of the beat point after the change according to a change instruction from the user. Therefore, a plurality of beat points can be estimated that include time points after the change according to the change instruction from the user.
The second embodiment will now be described. In each of the embodiments described below, elements that have the same functions as in the first embodiment have been assigned the same reference numerals as those used to describe the first embodiment and their detailed descriptions have been appropriately omitted.
The analysis processing unit 20 of the second embodiment estimates the tempo T[m] of a musical piece in addition to estimating a plurality of beat points in the musical piece. That is, the analysis processing unit 20 analyzes the audio signal A to estimate a time series of M tempos T[1]˜T[M] corresponding to different analysis time points t[m] on the time axis.
As can be understood from the foregoing explanation, for each analysis time point t[m], the analysis processing unit 20 estimates tempo T[m] of the musical piece within a range (hereinafter referred to as the “restricted range”) R[m] between maximum tempo H[m] and minimum tempo L[m]. Therefore, the estimated tempo curve CT is located between maximum tempo curve CH and minimum tempo curve CL. The position and range width of restricted range R[m] changes with time.
The curve setting unit 28 sets the maximum tempo curve CH, which represents a temporal change of the maximum tempo H[m], and the minimum tempo curve CL, which represents a temporal change of the minimum tempo L[m], in accordance with instructions from the user via the operation device 14.
In the second embodiment, since the waveform 731 of the audio signal A and maximum tempo curve CH and minimum tempo curve CL are displayed on the same time axis, the user can easily visually ascertain the relationship between the waveform 731 of the audio signal A and the temporal change of maximum tempo H[m] or minimum tempo L[m]. In addition, since estimated tempo curve CT is displayed together with maximum tempo curve CH and minimum tempo curve CL, the user can visually ascertain the estimated temporal change of tempo T[m] of the musical piece between maximum tempo curve CH and minimum tempo curve CL.
The estimation processing unit 23 generates a state series in the same way as in the first embodiment (Sb3). That is, from the N states Q, a series in which the states Q having high likelihoods λ[i, j] are arranged for each analysis time point t[m] is generated as the state series. As described above, the likelihood λ[i, j] of a state Q[i, j] corresponding to a tempo X[i] outside of the restricted range R[m] at the analysis time point t[m] is set to zero. Therefore, a state Q corresponding to a tempo X[i] outside of the restricted range R[m] is not selected as an element of the state series. As can be understood from the foregoing explanation, the invalid state of each state Q means that the state Q in question is not selected.
The estimation processing unit 23 generates beat point data B in the same manner as in the first embodiment (Sb4) and identifies the tempo T[m] of each analysis time point t[m] from the state series (Sb5). That is, the tempo X[i] of the state Q corresponding to analysis time point t[m] of the state series is set as the tempo T[m]. As described above, since a state Q corresponding to a tempo X[i] outside of the restricted range R[m] is not selected as an element of the state series, the tempo T[m] is limited to a numerical value inside of restricted range R[m].
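A minimal sketch of the restriction used in the second embodiment follows: at each analysis time point t[m], every state Q whose tempo X[i] lies outside the restricted range R[m] between the minimum tempo L[m] and the maximum tempo H[m] is set to an invalid state, represented here by a log likelihood of -inf so that it can never be selected for the state series. The array layout follows the earlier Viterbi sketch and is an assumption.

```python
# Sketch of the second embodiment's restriction: states Q whose tempo X[i] lies
# outside the restricted range R[m] are invalidated so they never appear
# in the state series.
import numpy as np

def invalidate_out_of_range_states(log_lik, tempos, min_tempo, max_tempo):
    # log_lik: (M, n_tempi, J); tempos: tempo X[i] for each state index i
    # min_tempo, max_tempo: per-time-point values L[m] and H[m] of curves CL and CH
    tempos = np.asarray(tempos)
    for m in range(log_lik.shape[0]):
        out_of_range = (tempos < min_tempo[m]) | (tempos > max_tempo[m])
        log_lik[m, out_of_range, :] = -np.inf    # invalid states at time point t[m]
    return log_lik
```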
As described above, in the second embodiment, the maximum tempo curve CH and the minimum tempo curve CL are set in accordance with an instruction from the user. The tempo T[m] of the musical piece is then estimated within the restricted range R[m] between the maximum tempo H[m] represented by the maximum tempo curve CH and the minimum tempo L[m] represented by the minimum tempo curve CL. Therefore, the possibility that a tempo that deviates excessively from the tempo intended by the user (for example, a tempo that is twice or half the tempo assumed by the user) will be estimated is reduced. That is, the tempo T[m] of the musical piece represented by the audio signal A can be estimated with high accuracy.
In addition, in the second embodiment, the state transition model 60 comprising a plurality of states Q corresponding to any of a plurality of tempos X[i] is used to estimate the plurality of beat points. Therefore, tempos T[m] that transition naturally over time are estimated. Moreover, tempos T[m] that are confined to the restricted range R[m] can be estimated by the simple process of setting, from among the plurality of states Q, those states Q that correspond to tempos X[i] outside of the restricted range R[m] to invalid states.
In the first embodiment, an example was described in which the output data O[m] representing the probability P[m] calculated by the probability calculation unit 22 using the estimation model 50 are applied to the beat point estimation process Sb executed by the estimation processing unit 23. In the third embodiment, the probability P[m] calculated by the estimation model 50 (hereinafter referred to as "probability P1[m]") is adjusted in accordance with a user operation of the operation device 14, and the output data O[m] representing the adjusted probability P2[m] are applied to the beat point estimation process Sb.
The probability calculation unit 22 sets a unit distribution W for each operation time point. The unit distribution W is the distribution of weighted values w[m] on the time axis. For example, a probability distribution in which the variance is set to a prescribed value, such as a normal distribution, is used as the unit distribution W. In each unit distribution W, weighted value w[m] is maximum at operation time points T, and weighted value w[m] decreases with increasing distance from operation time point T.
The probability calculation unit 22 multiplies probability P1[m] generated by the estimation model 50 with respect to the analysis time point t[m] by weighted value w[m] at the analysis time point t[m], thereby calculating adjusted probability P2[m]. Therefore, even for an analysis time point t[m] at which probability P1[m] generated by the estimation model 50 is small, if the analysis time point t[m] is close to operation time point T, adjusted probability P2[m] is set to a large numerical value. The probability calculation unit 22 supplies output data O[m] representing adjusted probability P2[m] to the estimation processing unit 23. The procedure of the beat point estimation process Sb in which the estimation processing unit 23 uses output data O[m] to estimate a plurality of beat points is the same as that in the first embodiment.
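The adjustment of the third embodiment can be sketched as follows: a unit distribution W (here a normal distribution, as in the example above) is placed at each operation time point T, and the probability P1[m] is multiplied by the weighted value w[m] to obtain the adjusted probability P2[m]. The variance, the peak scaling that lets w[m] exceed 1 near an operation time point, and the final clipping to [0, 1] are assumptions made for this example.

```python
# Sketch of the third embodiment's adjustment of P1[m] into P2[m].
import numpy as np

def adjust_probabilities(P1, operation_frames, sigma=4.0, peak=5.0):
    M = len(P1)
    frames = np.arange(M)
    w = np.zeros(M)
    for T in operation_frames:                       # operation time points T
        # unit distribution W: weighted value largest at T, decreasing with distance
        w = np.maximum(w, peak * np.exp(-0.5 * ((frames - T) / sigma) ** 2))
    return np.clip(np.asarray(P1) * w, 0.0, 1.0)     # adjusted probabilities P2[m]
```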
The same effects as those of the first embodiment are realized by the third embodiment. In the third embodiment, since the probability P1[m] is multiplied by the weighted value w[m] of the unit distribution W set at the operation time point T by the user, there is the advantage that the beat points can be estimated in a manner that sufficiently reflects the intentions or preferences of the user. It is also possible to apply the configuration of the second embodiment to the third embodiment.
Specific modifications that can be added to each of the embodiments described above are illustrated below. Two or more modes arbitrarily selected from the following examples can be appropriately combined insofar as they are not mutually contradictory.
For example, the following configurations can be understood from the above-mentioned embodiments used as examples.
An audio analysis method according to one aspect (Aspect 1) of this disclosure comprises estimating a plurality of beat points of a musical piece by analyzing an audio signal representing the performance sound of the musical piece, receiving an instruction from a user to change the locations of some beat points of the plurality of beat points, and updating the locations of the plurality of beat points in accordance with the instruction from the user. In the aspect described above, in accordance with the instruction to change the locations of some beat points of the plurality of beat points estimated by analyzing the audio signal, the locations of a plurality of beat points including beat points other than the aforesaid some beat points are updated. Therefore, compared to a configuration in which the user must change the locations of all of the plurality of beat points, a time series of beat points can be obtained that accords with the intentions of the user, while reducing the burden on the user to issue instructions to change the location of each beat point.
In a specific example (Aspect 2) of Aspect 1, the estimation of the beat points includes a feature extraction process for generating feature data including feature values of the audio signal for each of a plurality of analysis time points on a time axis; a probability calculation process for inputting the feature data generated by the feature extraction process with respect to each of the analysis time points to an estimation model that has learned a relationship between training feature data corresponding to time points on a time axis and training output data representing the probability that the time points correspond to beat points, thereby generating output data representing the probability that the analysis time points correspond to beat points; and a beat point estimation process for estimating the plurality of beat points from the output data generated by the probability calculation process. By the aspect described above, statistically valid output data can be generated for unknown feature data under the potential relationship between the training output data and the training feature data.
In a specific example (Aspect 3) of Aspect 2, in updating the locations of the plurality of beat points, in a state in which an adaptation block is added between a first part on the input side and a second part on the output side of the estimation model, the estimation model is updated by additional training to which are applied the locations of beat points before or after a change according to an instruction from the user, and a plurality of updated beat points are estimated by the probability calculation process that uses the updated estimation model, and the beat point estimation process that uses the output data generated by the probability calculation process. By the aspect described above, the estimation model is updated by additional training to which are applied the locations of beat points before or after a change according to an instruction from the user. Therefore, the estimation model can be specialized to a state in which the beat points can be estimated in accord with the intentions or preferences of the user.
An adaptation block is a block that generates a degree of similarity between first intermediate data generated by the first part from feature data corresponding to locations of beat points before or after a change according to an instruction from the user and second intermediate data corresponding to feature data in each of the plurality of analysis time points in the musical piece. The entire estimation model including the adaptation block is updated such that the output data of the analysis time point corresponding to the second intermediate data that are similar to the first intermediate data of the locations of the beat points before a change according to an instruction from the user approaches a numerical value indicating a lack of correspondence to a beat point, and such that the output data of the analysis time point corresponding to the second intermediate data that are similar to the first intermediate data of the locations of the beat points after the change approach a numerical value indicating correspondence to a beat point.
In a specific example (Aspect 4) of Aspect 2 or 3, in the beat point estimation process, the plurality of beat points are estimated using a state transition model consisting of a plurality of states corresponding to any of a plurality of tempos. By the aspect described above, a plurality of beat points are estimated using a state transition model consisting of a plurality of states corresponding to any of a plurality of tempos. Therefore, a plurality of beat points can be estimated such that the tempo transitions over time in a natural manner.
In a specific example (Aspect 5) of Aspect 4, the plurality of states of the state transition model correspond to different combinations of each of the plurality of tempos and each of a plurality of transition points within a beat interval; in the beat point estimation process, a time point at which a state corresponding to an end point of the beat interval, from among the plurality of transition points, is observed is estimated as a beat point; and in updating the locations of the plurality of beat points, the beat point estimation process is executed under a constraint condition that a state corresponding to the end point of the beat interval is observed at a time point of a beat point after a change according to an instruction from the user, to estimate a plurality of updated beat points. By the aspect described above, a plurality of beat points can be estimated that include beat points at time points after the change according to a change instruction from the user.
An audio analysis system according to one aspect (Aspect 6) of this disclosure comprises an analysis processing unit for estimating a plurality of beat points of a musical piece by analyzing an audio signal representing the performance sound of the musical piece, an instruction receiving unit for receiving an instruction from a user to change the locations of some of the beat points of the plurality of beat points, and a beat point updating unit for updating the locations of the plurality of beat points in accordance with the instruction from the user.
A program according to one aspect (Aspect 7) of this disclosure causes a computer system to function as an analysis processing unit for estimating a plurality of beat points of a musical piece by analyzing an audio signal representing the performance sound of the musical piece, an instruction receiving unit for receiving an instruction from a user to change the locations of some of the beat points of the plurality of beat points, and a beat point updating unit for updating the locations of the plurality of beat points in accordance with the instruction from the user.
“Tempo” in the present Specification is an arbitrary numerical value representing the speed of the performance and is not limited to tempo in the narrow sense, meaning the number of beats within a unit time (BPM: Beats Per Minute).
By the audio analysis method, the audio analysis system, and the program of this disclosure, a time series of beat points can be obtained in accord with the intentions of a user, while reducing the burden on the user to issue instructions to change the location of each beat point.
This application is a continuation application of International Application No. PCT/JP2022/006601, filed on Feb. 18, 2022, which claims priority to Japanese Patent Application No. 2021-028539 filed in Japan on Feb. 25, 2021 and Japanese Patent Application No. 2021-028549 filed in Japan on Feb. 25, 2021. The entire disclosures of International Application No. PCT/JP2022/006601 and Japanese Patent Application Nos. 2021-028539 and 2021-028549 are hereby incorporated herein by reference.