1. Field of the Invention
The present invention relates to a musical score position estimating device, a musical score position estimating method, and a musical score position estimating robot.
2. Description of Related Art
In recent years, thanks to remarkable developments in the physical functions of robots, attempts have been made to support humans doing housework or nursing. For the purpose of coexistence of humans and robots, there is a need for natural interaction between robots and humans.
One example of interaction between a human and a robot is communication through music. Music plays an important role in communication between humans; for example, persons who do not share a language can share a friendly and joyful time through music. Accordingly, being able to interact with humans through music is essential for robots to live in harmony with humans.
Conceivable situations in which robots communicate with humans through music include, for example, robots singing along with accompaniments or singing voices, or moving their bodies to the music.
Regarding such a robot, techniques of analyzing musical score information and causing the robots to move on the basis of the analysis result are known.
As a technique of recognizing what musical note is described in a musical score, a technique of converting image data of a musical score into musical note data and automatically recognizing the musical score has been suggested (for example, JP Patent No. 3147846). As a technique of analyzing a metrical structure of tune data on the basis of musical score data and structure analysis data grouped in advance and estimating tempos from audio signals in performance, a beat tracking method has been suggested (for example, see JP-A-2006-201278).
In the technique of analyzing the metrical structure described in JP-A-2006-201278, only the structure based on the musical score is analyzed. Accordingly, when a robot tries to sing to audio signals collected by the robot and a piece of music is started from the middle part thereof, it is not clear what portion of the music is currently performed, and thus the robot fails to extract the beat time or tempo of the piece in performance. In addition, when a human performs a piece of music, the tempo of the performance may vary and thus there is a problem in that the robot may fail to extract the beat time or tempo of the piece in performance.
In the past, the metrical structure or the beat time or the tempo of the piece of music was extracted on the basis of the musical score data. Accordingly, when a piece of music is actually performed, it is not possible to detect what portion of the musical score is currently performed with high precision.
The invention is made in consideration of the above-mentioned problems and it is an object of the invention to provide a musical score position estimating device, a musical score position estimating method, and a musical score position estimating robot, which can estimate a position of a portion in a musical score in performance.
According to a first aspect of the invention, there is provided a musical score position estimating device including: an audio signal acquiring unit; a musical score information acquiring unit acquiring musical score information corresponding to an audio signal acquired by the audio signal acquiring unit; an audio signal feature extracting unit extracting a feature amount of the audio signal; a musical score feature extracting unit extracting a feature amount of the musical score information; a beat position estimating unit estimating a beat position of the audio signal; and a matching unit matching the feature amount of the audio signal with the feature amount of the musical score information using the estimated beat position to estimate a position of a portion in the musical score information corresponding to the audio signal.
According to a second aspect of the invention, the musical score feature extracting unit may calculate rareness which is an appearance frequency of a musical note from the musical score information, and the matching unit may make a match using rareness.
According to a third aspect of the invention, the matching unit may make a match on the basis of the product of the calculated rareness, the extracted feature amount of the audio signal, and the extracted feature amount of the musical score information.
According to a fourth aspect of the invention, rareness may be the lowness in appearance frequency of a musical note in the musical score information.
According to a fifth aspect of the invention, the audio signal feature extracting unit may extract the feature amount of the audio signal using a chroma vector, and the musical score feature extracting unit may extract the feature amount of the musical score information using a chroma vector.
According to a sixth aspect of the invention, the audio signal feature extracting unit may weight a high-frequency component in the extracted feature amount of the audio signal and calculate an onset time of a musical note on the basis of the weighted feature amount, and the matching unit may make a match using the calculated onset time of a musical note.
According to a seventh aspect of the invention, the beat position estimating unit may estimate the beat position by switching a plurality of different observation error models using a switching Kalman filter.
According to another aspect of the invention, there is provided a musical score position estimating method including: an audio signal acquiring step of causing an audio signal acquiring unit to acquire an audio signal; a musical score information acquiring step of causing a musical score information acquiring unit to acquire musical score information corresponding to the acquired audio signal; an audio signal feature extracting step of causing an audio signal feature extracting unit to extract a feature amount of the audio signal; a musical score information feature extracting step of causing a musical score feature extracting unit to extract a feature amount of the musical score information; a beat position estimating step of causing a beat position estimating unit to estimate a beat position of the audio signal; and a matching step of causing a matching unit to match the feature amount of the audio signal with the feature amount of the musical score information using the estimated beat position to estimate a position of a portion in the musical score information corresponding to the audio signal.
According to another aspect of the invention, there is provided a musical score position estimating robot including: an audio signal acquiring unit; an audio signal separating unit extracting an audio signal corresponding to a performance by performing a suppression process on the audio signal acquired by the audio signal acquiring unit; a musical score information acquiring unit acquiring musical score information corresponding to the audio signal extracted by the audio signal separating unit; an audio signal feature extracting unit extracting a feature amount of the audio signal extracted by the audio signal separating unit; a musical score feature extracting unit extracting a feature amount of the musical score information; a beat position estimating unit estimating a beat position of the audio signal extracted by the audio signal separating unit; and a matching unit matching the feature amount of the audio signal with the feature amount of the musical score information using the estimated beat position to estimate a position of a portion in the musical score information corresponding to the audio signal.
According to the first aspect of the invention, the feature amount and the beat position are extracted from the acquired audio signal and the feature amount is extracted from the acquired musical score information. By matching the feature amount of the audio signal with the feature amount of the musical score information using the extracted beat position, the position of a portion in the musical score information corresponding to the audio signal is estimated. As a result, it is possible to accurately estimate a position of a portion in a musical score on the basis of the audio signal.
According to the second aspect of the invention, since rareness which is the lowness in appearance frequency of a musical note is calculated from the musical score information and the match is made using the calculated rareness, it is possible to accurately estimate a position of a portion in a musical score on the basis of the audio signal with high precision.
According to the third aspect of the invention, since the match is made on the basis of the product of rareness, the feature amount of the audio signal, and the feature amount of the musical score information, it is possible to accurately estimate a position of a portion in a musical score on the basis of the audio signal with high precision.
According to the fourth aspect of the invention, since the lowness in appearance frequency of a musical note is used as rareness, it is possible to accurately estimate a position of a portion in a musical score on the basis of the audio signal with high precision.
According to the fifth aspect of the invention, since the feature amount of the audio signal and the feature amount of the musical score information are extracted using the chroma vector, it is possible to accurately estimate a position of a portion in a musical score on the basis of the audio signal with high precision.
According to the sixth aspect of the invention, since the high-frequency component in the feature amount of the audio signal is weighted and the match is made using the onset time of a musical note on the basis of the weighted feature amount, it is possible to accurately estimate a position of a portion in a musical score on the basis of the audio signal with high precision.
According to the seventh aspect of the invention, the beat position is estimated by switching plural different observation error models using the switching Kalman filter. Accordingly, when the performance starts to differ from the tempo of the musical score, it is possible to accurately estimate a position of a portion in a musical score on the basis of the audio signal with high precision.
Hereinafter, exemplary embodiments of the invention will be described in detail with reference to the accompanying drawings. The invention is not limited to the embodiments, but can be modified in various forms without departing from the technical spirit of the invention.
The microphone 30 collects sounds in which sounds of performance (accompaniment) and voice signals (singing voice) output from the speaker 20 of the robot 1 are mixed, converts the collected sounds into audio signals, and outputs the audio signals to the audio signal separating unit 110.
The audio signals collected by the microphone 30 and the voice signals generated from the singing voice generating unit 130 are input to the audio signal separating unit 110. The self-generated sound suppressing filter unit 111 of the audio signal separating unit 110 performs an independent component analysis (ICA) process on the input audio signals and suppresses reverberated sounds included in the generated voice signals and the audio signals. Accordingly, the audio signal separating unit 110 separates and extracts the audio signals based on the performance. The audio signal separating unit 110 outputs the extracted audio signals to the musical score position estimating unit 120.
The audio signals separated by the audio signal separating unit 110 are input to the musical score position estimating unit 120 (the musical score information acquiring unit, the audio signal feature extracting unit, the musical score feature extracting unit, the beat position estimating unit, and the matching unit). The tune position estimating unit 122 of the musical score position estimating unit 120 calculates an audio chroma vector as a feature amount and an onset time from the input audio signals. The tune position estimating unit 122 reads musical score data of a piece of music in performance from the musical score database 121 and calculates a musical score chroma vector as a feature amount from the musical score data and rareness as the appearance frequency of a musical note. The tune position estimating unit 122 performs a beat tracking process from the input audio signals and detects a rhythm interval (tempo). The tune position estimating unit 122 estimates the outlier of the tempo or a noise using a switching Kalman filter (SKF) on the basis of the extracted rhythm interval (tempo) and extracts a stable rhythm interval (tempo). The tune position estimating unit 122 (the audio signal feature extracting unit, the musical score feature extracting unit, the beat position estimating unit, and the matching unit) matches the audio signals based on the performance with the musical score using the extracted rhythm interval (tempo), the calculated audio chroma vector, the calculated onset time information, the musical score chroma vector, and rareness. That is, the tune position estimating unit 122 estimates at what portion of a musical score the tune being performed is located. The musical score position estimating unit 120 outputs the musical score position information representing the estimated musical score position to the singing voice generating unit 130.
It has been stated that the musical score data is stored in advance in the musical score database 121, but the musical score position estimating unit 120 may write and store input musical score data in the musical score database 121.
The estimated musical score position information is input to the singing voice generating unit 130. The voice generating unit 132 of the singing voice generating unit 130 generates a voice signal of a singing voice in accordance with the performance by the use of a known technique on the basis of the input musical score position information and using the information stored in the word and melody database 131. The singing voice generating unit 130 outputs the generated voice signal of a singing voice through the speaker 20.
Next, the outline of an operation will be described in which the audio signal separating unit 110 suppresses reverberated sounds included in the generated voice signals and the audio signals using an independent component analysis. In the independent component analysis, a separation process is performed by assuming statistical independence (in terms of probability density) between sound sources. The audio signals acquired by the robot 1 through the microphone 30 are signals in which the signals of sounds of performance and the voice signals output by the robot 1 using the speaker 20 are mixed. Among the mixed signals, the voice signals output by the robot 1 using the speaker 20 are known because the signals are generated by the voice generating unit 132. Accordingly, the audio signal separating unit 110 carries out an independent component analysis in the frequency domain to suppress the voice signals of the robot 1 included in the mixed signals, thereby separating the sounds of performance.
Next, the outline of the method employed in the musical score position estimating device 100 according to this embodiment will be described. When the beat or tempo is extracted from the music (accompaniment) being performed to estimate what portion of a musical score is being performed, there are generally three technologies.
A first technology is how to distinguish various instrument sounds included in the audio signal being performed.
When complex musical notes are performed at the same time with various instruments, in other words, when chordal sounds are treated, it is even more difficult to detect basic frequencies of the musical notes or to recognize the stabilized sounds.
Accordingly, in this embodiment, attention is paid to the onset time (205, 215), which is the starting portion of a waveform in performance.
The musical score position estimating unit 120 extracts a feature amount in a frequency domain using 12-step chroma vectors (audio feature amount). The musical score position estimating unit 120 calculates the onset time which is a feature amount in a time domain on the basis of the extracted feature amount in the frequency domain. The chroma vector has the advantages of being robust against variations in spectrum shape of various instruments, and being effective with respect to chordal sound signals. In the chroma vector, powers of 12 pitch names such as C, C#, . . . , and B are extracted instead of the basic frequencies. In this embodiment, as indicated by the starting portion 205 in part (a) of
A second technology is estimating a difference between the audio signals in performance and the musical score.
As shown in part (a) and part (b) of
In the musical score, the volumes of the musical notes are not clearly described.
As described above, in this embodiment, on the basis of the thought that the musical note of a rarely-used pitch name is markedly expressed in the audio signals at some times, the difference between the audio signals and the musical score is reduced. First, the musical score of the piece of music in performance is acquired in advance and is registered in the musical score database 121. The tune position estimating unit 122 analyzes the musical score of the piece in performance and calculates the appearance frequencies of the musical notes. The appearance frequency of each pitch name in the musical score is defined as rareness. The definition of rareness is similar to that of information entropy. In part (a) of
The tune position estimating unit 122 weights the pitch names calculated in this way on the basis of the calculated rareness.
By weighting the pitch names, a musical note with a low appearance frequency can be more easily extracted from the chordal audio signals than a musical note with a high appearance frequency.
A third technology is estimating a variation in tempo of the audio signals in performance. Stable tempo estimation is essential for the robot 1 to sing in accurate synchronization with the musical score and for the robot 1 to output smooth and pleasant singing voices in accordance with the piece of music in performance. When a human performs a piece of music, the tempo may depart from the tempo indicated by the musical score. A tempo error may also be introduced when the tempo is estimated using a known beat tracking process.
Accordingly, in this embodiment, the tune position estimating unit 122 employs the switching Kalman filter (SKF) for the tempo estimation. The SKF allows the estimation of a next tempo from a series of tempos including errors.
Next, the process performed by the musical score position estimating unit 120 will be described in detail with reference to
Extraction of Feature from Audio Signal
The audio signals separated by the audio signal separating unit 110 are input to the audio signal feature extracting unit 410. The audio signal feature extracting unit 410 extracts the audio chroma vector and the onset time from the input audio signals, and outputs the extracted audio chroma vector and the onset time information to the matching unit 440.
The audio signal feature extracting unit 410 calculates a spectrum from the input audio signal using a short-time Fourier transformation (STFT). The short-time Fourier transformation is a technique of multiplying the input audio signal by a window function such as a Hanning window and calculating a spectrum while shifting an analysis position within a finite period. In this embodiment, the Hanning window is set to 4096 points, the shift interval is set to 512 points, and the sampling rate is set to 44.1 kHz. Here, the power is expressed by p(t,ω), where t represents a frame time and ω represents a frequency.
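As an illustration only, the short-time Fourier transformation described above may be sketched in Python with NumPy as follows. The function name power_spectrogram and the synthetic 440 Hz test tone are assumptions introduced for this sketch; only the window length (4096 points), the shift interval (512 points), and the sampling rate (44.1 kHz) are taken from this embodiment.

```python
import numpy as np

def power_spectrogram(x, n_fft=4096, hop=512):
    """Return p[t, w]: power at frame index t and frequency bin w."""
    window = np.hanning(n_fft)                       # Hanning window
    n_frames = 1 + (len(x) - n_fft) // hop           # only complete frames
    frames = np.stack([x[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # power per bin

sr = 44100                                  # sampling rate in Hz
time = np.arange(sr) / sr                   # one second of samples
x = np.sin(2 * np.pi * 440.0 * time)        # hypothetical test tone (A4)
p = power_spectrogram(x)
peak_hz = np.argmax(p[0]) * sr / 4096       # frequency of the strongest bin
```

With these settings one frequency bin spans about 10.8 Hz, so the strongest bin of the test tone lies within one bin of 440 Hz.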
The chroma vector c(t)=[c(1,t), c(2,t), . . . , c(12,t)]T (where T represents a transposition of a vector) is calculated at every frame time t. As shown in
In Expression 1, BPFi,h represents the band-pass filter for pitch name i in the h-th octave. OctL and OctH are the lower and upper limit octaves to consider, respectively. The peak of the band is the fundamental frequency of the note. The edges of the band are the frequencies of neighboring notes. For example, the BPF for note “A4” (note “A” at the fourth octave) of which the fundamental frequency is 440 Hz has a peak at 440 Hz. The edges of the band are “G#4” (note “G#” at the fourth octave) at 415 Hz, and “A#4” at 466 Hz. In this embodiment, OctL=3 and OctH=7 are set. In other words, the lowest note is “C3” at 131 Hz and the highest note is “B7” at 3951 Hz.
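The band layout described above can be checked with the following minimal Python sketch. It assumes equal temperament with A4 at 440 Hz and a triangular filter shape; the text only fixes the peak and the edges of each band, so the triangular shape and the names note_freq and bpf_weight are assumptions of this sketch rather than part of Expression 1.

```python
PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_freq(i, h):
    """Fundamental frequency in Hz of pitch name index i (0 = C) in octave h."""
    semitones_from_a4 = (h - 4) * 12 + (i - 9)   # A is index 9, A4 = 440 Hz
    return 440.0 * 2.0 ** (semitones_from_a4 / 12.0)

def bpf_weight(freq, i, h):
    """Assumed triangular band: 1 at the fundamental, 0 at neighboring notes."""
    lo, mid, hi = note_freq(i - 1, h), note_freq(i, h), note_freq(i + 1, h)
    if lo < freq <= mid:
        return (freq - lo) / (mid - lo)
    if mid < freq < hi:
        return (hi - freq) / (hi - mid)
    return 0.0

lowest = note_freq(0, 3)     # C3
highest = note_freq(11, 7)   # B7
```

Rounded to the nearest hertz, this gives 131 Hz for C3, 3951 Hz for B7, and 415 Hz and 466 Hz for the band edges of A4, in agreement with the values stated above.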
To emphasize the pitch name, the audio signal feature extracting unit 410 applies the convolution of Expression 2 to Expression 1.
The audio signal feature extracting unit 410 periodically processes the convolution of Expression 2 for index i. For example, when i=1 (pitch name “C”), c(i-1, t) is substituted with c(12, t) (pitch name “B”).
By the convolution of Expression 2, the neighboring pitch name power is subtracted and thus a component with more power than others can be emphasized, which may be analogous to edge extraction in image processing. By subtracting the power of the previous time frame, the increase in power is emphasized.
The audio signal feature extracting unit 410 extracts a feature amount by calculating the audio chroma vector csig(i,t) from the audio signal using Expression 3.
The audio signal feature extracting unit 410 extracts the onset time from the input audio signal using an onset extracting method (method 1) proposed by Rodet et al.
Reference 1 (method 1): X. Rodet and F. Jaillet. Detection and modeling of fast attack transients. In International Computer Music Conference, pages 30-33, 2001.
The onset is extracted using the increase in power at the onset time, which is located particularly in the high frequency region. The onsets of sounds of pitched instruments are centered in a higher frequency region than those of percussive instruments such as drums. Accordingly, this method is particularly effective in detecting the onset times of pitched instruments.
First, the audio signal feature extracting unit 410 calculates the power known as a high-frequency component using Expression 4.
The high-frequency component is a weighted power where the weight increases linearly with the frequency. The audio signal feature extracting unit 410 determines the onset time tn by selecting the peaks of h(t) using a median filter, as shown in
The audio signal feature extracting unit 410 outputs the extracted audio chroma vectors and the extracted onset time information to the matching unit 440.
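The onset extraction described above may be sketched as follows. The high-frequency component is computed as a power sum weighted linearly by the frequency bin index, as stated for Expression 4. The peak selection rule (a local maximum exceeding a multiple of the local median) is only one plausible way of using a median filter; the parameters half_width and scale and the synthetic spectrogram are hypothetical values chosen for this sketch.

```python
import numpy as np

def high_frequency_content(p):
    """h[t] = sum over bins w of w * p[t, w] (weight grows linearly with w)."""
    w = np.arange(p.shape[1])
    return p @ w

def pick_onsets(h, half_width=3, scale=1.5):
    """Frames that are local maxima of h and exceed scale * local median."""
    onsets = []
    for t in range(1, len(h) - 1):
        lo, hi = max(0, t - half_width), min(len(h), t + half_width + 1)
        is_peak = h[t] > h[t - 1] and h[t] >= h[t + 1]
        if is_peak and h[t] > scale * np.median(h[lo:hi]):
            onsets.append(t)
    return onsets

p = np.ones((12, 8))   # hypothetical flat power spectrogram, 12 frames
p[4] *= 10.0           # power burst at frame 4
p[9] *= 10.0           # power burst at frame 9
h = high_frequency_content(p)
onsets = pick_onsets(h)
```

In this synthetic example the two power bursts are the only frames reported as onsets.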
Feature Extraction from Musical Score
The musical score feature extracting unit 420 reads necessary musical score data from a musical score stored in the musical score database 121. In this embodiment, it is assumed that music titles to be performed are input to the robot 1 in advance, and the musical score feature extracting unit 420 selects and reads the musical score data of the designated piece of music.
The musical score feature extracting unit 420 divides the read musical score data into frames such that the length of one frame is equal to one-48th of a bar, as shown in part (b) of
In Expression 5, fm represents the m-th onset time in the musical score.
Then, the musical score feature extracting unit 420 calculates rareness r(i,m) of each pitch name i at frame fm from the extracted chroma vectors using Expression 7.
Here, M represents a frame range of which the length is two bars with its center at frame fm. Therefore, n(i,m) represents the distribution of pitch names around frame fm.
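A minimal sketch of the rareness calculation is given below. Because rareness is described as being similar to information entropy, the sketch assumes the form r(i,m) = -log2(n(i,m) / Σ n(j,m)), with the count n(i,m) taken over one bar on each side of frame fm. This form, the value 0 assigned to pitch names that do not appear in the window, and the function name rareness are assumptions of the sketch and should not be read as the exact content of Expression 7.

```python
import numpy as np

def rareness(score_chroma, m, frames_per_bar=48):
    """score_chroma: array [n_frames, 12] of 0/1 note onsets; returns r[i]."""
    lo = max(0, m - frames_per_bar)                      # one bar before frame m
    hi = min(len(score_chroma), m + frames_per_bar + 1)  # one bar after frame m
    n = score_chroma[lo:hi].sum(axis=0).astype(float)    # n(i, m)
    r = np.zeros(12)
    present = n > 0
    if present.any():
        r[present] = -np.log2(n[present] / n.sum())      # rarer means larger
    return r

score = np.zeros((4, 12), dtype=int)   # hypothetical four-frame score
score[0:3, 0] = 1                      # pitch name C sounds in three frames
score[3, 7] = 1                        # pitch name G sounds once
r = rareness(score, m=1)
```

In this example the pitch name G, which appears once among four onsets, receives a rareness of 2, whereas the frequently used pitch name C receives a smaller value.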
The musical score feature extracting unit 420 outputs the extracted musical score chroma vectors and rareness to the matching unit 440.
As shown in part (c) of
Beat Tracking
The beat interval (tempo) calculating unit 430 calculates the beat interval (tempo) from the input audio signal using a beat tracking method (method 2) developed by Murata et al.
Reference 2 (method 2): K. Murata, K. Nakadai, K. Yoshii, R. Takeda, T. Torii, H. G. Okuno, Y. Hasegawa, and H. Tsujino, “A robot uses its own microphone to synchronize its steps to musical beats while scatting and singing”, in 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2459-2464.
First, the beat interval (tempo) calculating unit 430 transforms a spectrogram p(t,ω) of which the frequency is in linear scale into pmel(t,φ) of which the frequency is in 64-dimensional Mel-scale using Expression 9. The beat interval (tempo) calculating unit 430 calculates an onset vector d(t,φ) using Expression 8.
Expression 9 means the onset emphasis with a Sobel filter.
Then, the beat interval (tempo) calculating unit 430 estimates the beat interval (tempo). The beat interval (tempo) calculating unit 430 calculates beat interval reliability R(t,k) using normalized cross-correlation by the use of Expression 10.
In Expression 10, Pw represents the window length for reliability calculation and k represents the time shift parameter. The beat interval (tempo) calculating unit 430 determines the beat interval I(t) on the basis of the time shift value k at which the beat interval reliability R(t,k) takes a local peak.
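The idea of the reliability calculation may be sketched as follows. The sketch assumes that R(t,k) is the normalized cross-correlation between the most recent Pw onset vectors and the onset vectors k frames earlier, and it simply takes the shift with the largest reliability in a search range instead of examining local peaks. The exact form of Expression 10, the function names, and the synthetic onset vectors are assumptions of this sketch.

```python
import numpy as np

def interval_reliability(d, t, k, p_w):
    """Normalized cross-correlation of d over the last p_w frames, shifted by k."""
    a = d[t - p_w + 1 : t + 1]
    b = d[t - k - p_w + 1 : t - k + 1]
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def beat_interval(d, t, p_w, k_min, k_max):
    """Shift k in [k_min, k_max] with the highest reliability (simplification)."""
    scores = [interval_reliability(d, t, k, p_w) for k in range(k_min, k_max + 1)]
    return k_min + int(np.argmax(scores))

d = np.zeros((200, 4))   # hypothetical onset vectors, 4 frequency bands
d[::20] = 1.0            # an onset in every band once per 20 frames
interval = beat_interval(d, t=199, p_w=60, k_min=10, k_max=30)
```

For this periodic input the reliability equals 1 only at a shift of 20 frames, which is returned as the beat interval.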
The beat interval (tempo) calculating unit 430 outputs the calculated beat interval (tempo) information to the tempo estimating unit 450.
Matching between Audio Signal and Musical Score
The audio chroma vectors and the onset time information extracted by the audio signal feature extracting unit 410, the musical score chroma vectors and rareness extracted by the musical score feature extracting unit 420, and the stabilized tempo information estimated by the tempo estimating unit 450 are input to the matching unit 440. The matching unit 440 lets (tn,fm) be the last matching pair. Here, tn represents the time in the audio signal and fm represents the frame index of the musical score. When a new onset time of the audio signal detected at time tn+1 and the tempo at that time are considered, the number of frames F to go forward in the musical score is estimated by the matching unit 440 using Expression 11.
Expression 11
$$F = A\,(t_{n+1} - t_n) \tag{11}$$
In Expression 11, coefficient A corresponds to the tempo. The faster the music is, the larger coefficient A becomes. The weight for musical score frame fm+k is defined as Expression 12.
In Expression 12, k represents the number of onset times in the musical score to go forward and σ represents the variance for the weight. In this embodiment, σ=24 is set, which corresponds to the length of a half note. Here, it should be noted that k may have a negative value. When k is a negative number, a matching such as (tn+1,fm−1) is considered, which means that the matching moves backward in the musical score.
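As a worked illustration of Expressions 11 and 12, the following sketch computes the expected advance F and a weight for each candidate advance k. The Gaussian form of the weight is an assumption suggested by the description of σ; the numerical value A=24 frames per second is likewise hypothetical and corresponds to 48 frames per bar with a bar lasting two seconds.

```python
import math

def frames_forward(t_prev, t_next, a):
    """Expression 11: F = A * (t_{n+1} - t_n)."""
    return a * (t_next - t_prev)

def frame_weight(k, f, sigma=24.0):
    """Assumed Gaussian weight centered on the expected advance f."""
    return math.exp(-((k - f) ** 2) / (2.0 * sigma ** 2))

f = frames_forward(10.0, 10.5, a=24.0)   # onsets 0.5 s apart
w_expected = frame_weight(12, f)         # advance equal to the expectation
w_far = frame_weight(36, f)              # advance 24 frames beyond it
w_back = frame_weight(-12, f)            # backward move, 24 frames short
```

Here F is 12 frames, the weight is 1 at k=12, and it falls to about 0.61 at a distance of σ frames in either direction, so a backward move is permitted but penalized.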
The matching unit 440 calculates the similarity between the pair (tn,fm) using Expression 13.
In Expression 13, i represents a pitch name, r(i,m) represents rareness, and csco and csig represent the chroma vectors generated from the musical score and the audio signal, respectively. That is, the matching unit 440 calculates the similarity between the pair (tn,fm) on the basis of the product of rareness, the audio chroma vector, and the musical score chroma vector.
When the last matching pair is (tn,fm), the new matching is (tn+1,fm+k) where the number of onset times k in the musical score to go forward is expressed by Expression 14.
In this embodiment, the search range of the number of onset times k in the musical score to go forward for each matching step performed by the matching unit 440 is limited to two bars to reduce the computational cost.
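One matching step may be sketched as follows. The similarity is the sum over pitch names of the product of rareness, the musical score chroma vector, and the audio chroma vector, as described for Expression 13. The selection of k as the maximizer of the weighted similarity, the Gaussian weight, the reading of the two-bar limit as 96 frames in each direction, and all names and test data are assumptions of this sketch.

```python
import numpy as np

def similarity(r_m, c_sco_m, c_sig_n):
    """Sum over pitch names of rareness * score chroma * audio chroma."""
    return float(np.sum(r_m * c_sco_m * c_sig_n))

def next_score_frame(m, f, c_sig, score_chroma, rare, search=96, sigma=24.0):
    """Return the frame m + k that maximizes weight(k) * similarity."""
    best_k, best_score = 0, -np.inf
    for k in range(-search, search + 1):
        j = m + k
        if not 0 <= j < len(score_chroma):
            continue                      # candidate lies outside the score
        w = np.exp(-((k - f) ** 2) / (2.0 * sigma ** 2))
        s = w * similarity(rare[j], score_chroma[j], c_sig)
        if s > best_score:
            best_k, best_score = k, s
    return m + best_k

score_chroma = np.zeros((100, 12))   # hypothetical score, 100 frames
score_chroma[10, 0] = 1.0            # C at frame 10 (last matched frame)
score_chroma[22, 7] = 1.0            # G at frame 22
score_chroma[34, 0] = 1.0            # C at frame 34
rare = np.ones((100, 12))            # uniform rareness for simplicity
c_sig = np.zeros(12)
c_sig[7] = 1.0                       # the new audio onset sounds like G
m_next = next_score_frame(10, 12.0, c_sig, score_chroma, rare)
```

With the last match at frame 10 and an expected advance of 12 frames, the audio onset dominated by G is matched to frame 22 of the score.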
The matching unit 440 calculates the last matching pair (tn,fm) using Expressions 11 to 14 and outputs the calculated last matching pair (tn,fm) to the singing voice generating unit 130.
Tempo Estimation using Switching Kalman Filter
The tempo estimating unit 450 estimates the tempo using switching Kalman filters (SKF) (method 3) to cope with the matching result and two types of errors in the tempo estimation using the beat tracking method.
Reference 3 (method 3): K. P. Murphy. Switching kalman filters. Technical report, 1998.
The two types of errors to be coped with by the tempo estimating unit 450 are “small errors caused by slight changes of the performance speed” and “errors due to the outliers of the tempo estimation using the beat tracking method”. The tempo estimating unit 450 includes the switching Kalman filters and employs two models, namely a small observation error model 451 and a large observation error model 452 for outliers.
The switching Kalman filter is an extension of a Kalman filter (KF). The Kalman filter is a linear prediction filter with a state transition model and an observation model. The KF estimates the state from observed values including errors in a discrete time series when the state is unobservable. The switching Kalman filter has multiple state transition models and observation models. Every time the switching Kalman filter obtains an observation value, the model is automatically switched on the basis of the likelihood of each model.
In this embodiment, the two models of the switching Kalman filter, namely the small observation error model 451 and the large observation error model 452 for outliers, share the other modeling elements such as the state transition model.
In this embodiment, the SKF model (method 4) proposed by Cemgil et al. is used to estimate the beat time and the beat interval.
Reference 4 (method 4): A. T. Cemgil, B. Kappen, P. Desain, and H. Honing. On tempo tracking: Tempogram representation and Kalman filtering, Journal of New Music Research, 28:4:259-273, 2001.
Suppose that the k-th beat time is bk and the beat interval at that time is Δk and that the tempo is constant. The next beat time is represented as bk+1=bk+Δk and the beat interval is represented as Δk+1=Δk. Here, by assuming that vector xk=[bk, Δk]T, the state transition is expressed as Expression 15.
In Expression 15, Fk represents a state transition matrix, vk represents a transition error vector derived from a normal distribution with mean 0 and covariance matrix Q. When it is assumed that the most recent state is xk, the tempo estimating unit 450 estimates the next beat time bk+1 as the first component of xk+1 expressed by Expression 16.
Expression 16
$$x_{k+1} = F_k x_k \tag{16}$$
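For a constant tempo, the relations bk+1=bk+Δk and Δk+1=Δk imply a state transition matrix with rows (1, 1) and (0, 1). The following sketch applies Expression 16 with that matrix; the function name predict and the numerical state are hypothetical.

```python
import numpy as np

# Constant-tempo transition: b_{k+1} = b_k + delta_k, delta_{k+1} = delta_k
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])

def predict(x):
    """Expression 16: x_{k+1} = F_k x_k (error-free prediction)."""
    return F @ x

x_k = np.array([12.0, 0.5])   # beat at 12.0 s, beat interval 0.5 s
x_next = predict(x_k)         # next beat expected at 12.5 s
```

The first component of the result is the predicted next beat time, and the beat interval is carried over unchanged.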
Here, let the observation vector be zk=[bk′, Δk′]T, where bk′ represents the beat time calculated from the matching result of the matching unit 440 and Δk′ represents the beat interval calculated by the beat interval (tempo) calculating unit 430 using the beat tracking. The tempo estimating unit 450 calculates the observation vector using Expression 17.
In Expression 17, Hk represents an observation matrix and wk represents the observation error vector derived from a normal distribution with mean 0 and covariance matrix R. In this embodiment, the tempo estimating unit 450 causes the SKF to switch observation error covariance matrices Ri (where i=1, 2), where i represents a model number. Through preliminary experiments, Ri is set as follows in this embodiment. The small error model is R1=diag(0.02, 0.005) and the outlier model is R2=diag(1, 0.125), where diag(a1, . . . , an) represents n×n diagonal matrix of which elements are a1, . . . , an from the top-left side to the bottom-right side.
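The switching between the two observation error models may be illustrated by the simplified sketch below. It is not the full switching Kalman filter of Reference 3: it performs a hard selection of the model with the higher observation likelihood and then an ordinary Kalman update. The observation matrix H (identity), the transition error covariance Q, the initial state, and the function names are assumptions of this sketch; only R1 and R2 are taken from this embodiment.

```python
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-tempo transition
H = np.eye(2)                            # assumed: beat time and interval observed directly
Q = np.diag([1e-4, 1e-4])                # hypothetical transition error covariance
R_MODELS = [np.diag([0.02, 0.005]),      # R1: small error model
            np.diag([1.0, 0.125])]       # R2: outlier model

def gaussian_loglik(resid, cov):
    """Log density of a zero-mean Gaussian with covariance cov at resid."""
    maha = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (maha + np.log(np.linalg.det(cov)) + len(resid) * np.log(2 * np.pi))

def skf_step(x, P, z):
    """Predict, pick the likelier observation model, then update."""
    x_pred, P_pred = F @ x, F @ P @ F.T + Q
    resid = z - H @ x_pred
    covs = [H @ P_pred @ H.T + R for R in R_MODELS]
    model = int(np.argmax([gaussian_loglik(resid, S) for S in covs]))
    K = P_pred @ H.T @ np.linalg.inv(covs[model])   # Kalman gain
    return x_pred + K @ resid, (np.eye(2) - K @ H) @ P_pred, model

x0, P0 = np.array([10.0, 0.5]), np.diag([0.01, 0.001])
_, _, model_small = skf_step(x0, P0, np.array([10.51, 0.5]))       # slight deviation
x_out, _, model_outlier = skf_step(x0, P0, np.array([10.5, 1.0]))  # doubled interval
```

In this example the slightly deviating observation is explained by the small error model, whereas the observation with a doubled beat interval selects the outlier model and changes the estimated beat interval only marginally.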
In part (b) of
Observation of Beat Time
As described with reference to part (b) of
The tempo estimating unit 450 outputs the calculated beat time bk′ and the beat interval information to the matching unit 440.
Procedure of Musical Score Position Estimating Process
The procedure of the musical score position estimating process performed by the musical score position estimating device 100 will be described with reference to
First, the musical score feature extracting unit 420 reads the musical score data from the musical score database 121. The musical score feature extracting unit 420 calculates the musical score chroma vector and rareness from the read musical score data using Expressions 5 to 7, and outputs the calculated musical score chroma vector and rareness to the matching unit 440 (step S1).
Then, the musical score position estimating unit 122 determines whether the performance is continued on the basis of the audio signal collected by the microphone 30 (step S2). In this determination, the musical score position estimating unit 122 determines that the piece of music is still being performed when the audio signal continues, or when the position of the piece of music which is being performed has not reached the end of the musical score.
When it is determined in step S2 that the piece of music is not continuously performed (NO in step S2), the musical score position estimating process is ended.
When it is determined in step S2 that the piece of music is continuously performed (YES in step S2), the audio signal separating unit 110 stores the audio signal collected by the microphone 30 in a buffer of the audio signal separating unit 110, for example, for 1 second (step S3).
Then, the audio signal separating unit 110 extracts the audio signal by performing an independent component analysis on the input audio signal and the voice signal generated by the singing voice generating unit 130, thereby suppressing the reverberated sound and the singing voice, and outputs the extracted audio signal to the musical score position estimating unit 120.
The beat interval (tempo) calculating unit 430 estimates the beat interval (tempo) using the beat tracking method and Expressions 8 to 10 on the basis of the input musical signal, and outputs the estimated beat interval (tempo) to the matching unit 440 (step S4).
The audio signal feature extracting unit 410 detects the onset time information from the input audio signal using Expression 4, and outputs the detected onset time information to the matching unit 440 (step S5).
The audio signal feature extracting unit 410 extracts the audio chroma vector using Expressions 1 to 3 on the basis of the input audio signal, and outputs the extracted audio chroma vector to the matching unit 440 (step S6).
The audio chroma vector and the onset time information extracted by the audio signal feature extracting unit 410, the musical score chroma vector and rareness extracted by the musical score feature extracting unit 420, and the stable tempo information estimated by the tempo estimating unit 450 are input to the matching unit 440. The matching unit 440 sequentially matches the input audio chroma vector and musical score chroma vector using Expressions 11 to 14, and estimates the last matching pair (tn, fm). The matching unit 440 outputs the last matching pair (tn, fm) corresponding to the estimated musical score position to the tempo estimating unit 450 and the singing voice generating unit 130 (step S7).
On the basis of the beat interval (tempo) information input from the beat interval (tempo) calculating unit 430, the tempo estimating unit 450 calculates the beat time bk′ and the beat interval information using Expressions 15 to 17 and outputs the calculated beat time bk′ and the calculated beat interval information to the matching unit 440 (step S8).
The last matching pair (tn, fm) is input to the tempo estimating unit 450 from the matching unit 440. The tempo estimating unit 450 interpolates the calculated beat time bk by the matching result in the matching unit 440 when no note exists in the k-th beat frame.
The matching unit 440 and the tempo estimating unit 450 sequentially perform the matching process and the tempo estimating process, and the matching unit 440 estimates the last matching pair (tn, fm).
The voice generating unit 132 of the singing voice generating unit 130 generates a singing voice of words and melodies corresponding to the musical score position with reference to the word and melody database 131 on the basis of the input last matching pair (tn, fm). Here, the “singing voice” is voice data output through the speaker 20 from the musical score position estimating device 100. That is, since the sound is output through the speaker 20 of the robot 1 having the musical score position estimating device 100, it is called a “singing voice” for the purpose of convenience. In this embodiment, the voice generating unit 132 generates the singing voice using VOCALOID (registered trademark (VOCALOID2)). VOCALOID (registered trademark (VOCALOID2)) is an engine that synthesizes a singing voice from sampled human voices when melodies and words are input. In this embodiment, the musical score position is added as information so that the singing voice does not depart from the actual performance.
The voice generating unit 132 outputs the generated voice signal from the speaker 20.
After the last matching pair (tn, fm) is estimated, the processes of steps S2 to S8 are sequentially performed until the performance of a piece of music is finished.
In this way, by estimating the musical score position, generating a voice (singing voice) corresponding to the estimated musical score position, and outputting the generated voice from the speaker 20, the robot 1 can sing to the performance. According to this embodiment, since the position of a portion in the musical score is estimated on the basis of the audio signal in performance, it is possible to accurately estimate the position of a portion in the musical score even when a piece of music is started from the middle part thereof.
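The control flow of steps S1 to S8 can be sketched as a loop. Every callable passed in below is a hypothetical stand-in for one of the units described above (it is not an interface defined in this specification), and signalling the end of the performance by returning None is an assumption made only for this sketch.

```python
def run_score_following(read_chunk, extract_score_features, separate,
                        track_beat, detect_onsets, extract_chroma,
                        match, estimate_tempo, sing):
    """Control-flow sketch of steps S1 to S8; all callables are hypothetical."""
    score_features = extract_score_features()    # S1: score chroma vector and rareness
    tempo_state = None                           # no stable tempo before the first pass
    positions = []
    while True:
        chunk = read_chunk()                     # S2/S3: buffered audio, None when finished
        if chunk is None:
            break
        audio = separate(chunk)                  # suppress reverberation and own voice
        beat_interval = track_beat(audio)        # S4: beat interval (tempo)
        onsets = detect_onsets(audio)            # S5: onset times
        chroma = extract_chroma(audio)           # S6: audio chroma vector
        position = match(chroma, onsets, score_features, tempo_state)  # S7
        tempo_state = estimate_tempo(beat_interval, position)          # S8
        sing(position)                           # generate voice for this score position
        positions.append(position)
    return positions
```

The tempo state produced in step S8 of one pass is fed to the matching in step S7 of the next pass, which mirrors the mutual exchange of information between the matching unit 440 and the tempo estimating unit 450.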
Evaluation Result
The evaluation results obtained using the musical score position estimating device 100 according to this embodiment will be described. First, the test conditions will be described. The pieces of music used in the evaluation were 100 pieces of popular music in the RWC research music database (RWC-MDB-P-2001; http://staff.aist.go.jp/m.goto/RWC-MDB/index-j.html) prepared by Goto et al. The full-length versions of the pieces of music, including the singing parts or the performance parts, were used.
The answer data of musical score synchronization was generated from MIDI files of the pieces of music by an evaluator. The MIDI files are accurately synchronized with the actual performance. The error is defined as the absolute difference between the beat times extracted per second in this embodiment and the answer data. The errors are averaged for each piece of music.
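The error measure described above can be sketched as follows. The function name is hypothetical, and the sketch assumes that the estimated beat times and the answer data are already paired one to one; the numbers in the example are illustrative, not evaluation results.

```python
def mean_absolute_beat_error(estimated_times, answer_times):
    """Average absolute difference (in seconds) between paired beat times for one piece."""
    if not estimated_times or len(estimated_times) != len(answer_times):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(e - a) for e, a in zip(estimated_times, answer_times))
    return total / len(estimated_times)


# Illustrative values only: errors of 0.0 s, 0.1 s and 0.3 s average to about 0.133 s.
print(round(mean_absolute_beat_error([1.0, 2.1, 3.3], [1.0, 2.0, 3.0]), 3))
```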
The following four types of methods were evaluated and the evaluation results were compared.
(i) Method according to this embodiment: SKF and rareness are used.
(ii) Without SKF: Tempo estimation is not modified.
(iii) Without rareness: All notes have equal rareness.
(iv) Beat tracking method: This method determines the musical score position by counting the beats from the beginning of the music.
Furthermore, two types of music signals were used to evaluate what influence the reverberation in the room environment has on the sound collected by the microphone 30 of the musical score position estimating device 100.
(v) Clean music signal: music signal without reverberation
(vi) Reverberated music signal: music signal with reverberation.
The reverberation was simulated by impulse response convolution.
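Simulating reverberation by impulse response convolution can be illustrated with a direct convolution. This is a minimal sketch; the short impulse response below is a made-up example, not the one used in the evaluation, and the function name is hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct (full) convolution of a dry signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            # Each input sample contributes a scaled, delayed copy of the response.
            out[i + j] += s * h
    return out


dry = [1.0, 0.0, 0.0, 0.5]          # illustrative clean signal
room = [1.0, 0.5, 0.25]             # illustrative decaying impulse response
print(convolve(dry, room))          # [1.0, 0.5, 0.25, 0.5, 0.25, 0.125]
```

For signals of realistic length, an FFT-based convolution would normally be preferred over this quadratic-time loop.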
Since the magnitude of error when using the method (ii) without the SKF is larger than the magnitude of error when using the method (iii) without rareness, it can be said that the SKF is more effective than rareness. This is because rareness often causes a high similarity between the frames in the musical score and the incorrect onset times such as drum sounds. If drum sounds accompany high rareness and have high power in the chroma vector component, this causes incorrect matching. To avoid this problem, the musical score position estimating device 100 can consider rareness of combined pitch names, not a single pitch name.
Regarding the reverberated signal, the number of pieces of music having an error of 2 seconds or less was 36 in the method (i) according to this embodiment, but was 12 in the method (iv) using only the beat tracking method. In this way, since the position of a portion in the musical score can be estimated with smaller errors, the method according to this embodiment is better than the beat tracking method. This is essential to the generation of natural singing voices to the music.
In the classification using the method according to this embodiment, there is no great difference between the clean signal and the reverberated signal, but the method according to this embodiment has greater errors in the reverberated signal, as shown in
Accordingly, in this embodiment, since the audio signal having been subjected to the independent component analysis to suppress the reverberation sounds by the audio signal separating unit 110 is used to estimate the musical score position, it is possible to reduce the influence of the reverberation in this case, thereby synchronizing the musical score with high precision.
Accordingly, by comparing the errors of the pieces of music having drum sounds with those of the pieces having no drum sound, it was tested whether the precision of the method according to this embodiment depends on the playing of a drum in the musical score. The number of pieces of music having a drum sound and the number of pieces of music having no drum sound are 89 and 11, respectively. The average of the cumulative absolute errors of the pieces of music having a drum sound is 7.37 seconds and the standard deviation thereof is 9.4 seconds. On the other hand, the average of cumulative errors of the pieces of music having no drum sound is 22.1 seconds and the standard deviation thereof is 14.5 seconds. The tempo estimation using the beat tracking method can easily vary greatly when there is no drum sound. This is one reason for inaccurate matching, which causes a high cumulative error.
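The per-group comparison above can be sketched as follows. The function name and the input values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which form was used.

```python
import statistics


def summarize_by_drum(error_by_piece, has_drum):
    """Return {has_drum_flag: (mean, population std dev)} of per-piece errors."""
    groups = {True: [], False: []}
    for piece, error in error_by_piece.items():
        groups[has_drum[piece]].append(error)
    return {flag: (statistics.mean(values), statistics.pstdev(values))
            for flag, values in groups.items() if values}


# Made-up errors in seconds, for illustration only.
errors = {"a": 2.0, "b": 4.0, "c": 10.0, "d": 20.0}
drums = {"a": True, "b": True, "c": False, "d": False}
print(summarize_by_drum(errors, drums))  # {True: (3.0, 1.0), False: (15.0, 5.0)}
```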
In this embodiment, to reduce the influence of a low-pitched sound region like a drum, the high-frequency component is weighted and the onset time is detected from the weighted power, as shown in
In this embodiment, it has been stated that the musical score position estimating device 100 is applied to the robot 1 and the robot 1 sings to the performance (singing voices are output from the speaker 20). However, on the basis of the estimated musical score position information, the control unit of the robot 1 may control the robot 1 to move its movable parts to the performance, as if the robot 1 were moving its body to the performance and rhythms.
In this embodiment, it has been stated that the musical score position estimating device 100 is applied to the robot 1, but the musical score position estimating device may be applied to other apparatuses. For example, the device may be applied to a mobile phone or the like or may be applied to a singer apparatus singing to a performance.
In this embodiment, it has been stated that the matching unit 440 performs the weighting using rareness, but the weighting may be carried out using different factors. Even when it is determined that the appearance frequency of a musical note is low, that musical note may have a high appearance frequency in the frames before and after a specific frame. In this case, the musical note having the high appearance frequency or the musical note having the average appearance frequency may be used.
In this embodiment, it has been stated that the beat interval (tempo) calculating unit 430 divides a musical score into frames with a length corresponding to a 48th note, but the frames may have a different length. It has been stated that the buffering time is 1 second, but the buffering time need not be 1 second, and data covering a time longer than the processing time may be included.
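As a small worked example of the frame length, the following sketch converts a beat interval into the duration of a frame corresponding to a 48th note. It assumes that one beat is a quarter note, so that one beat spans 12 such frames; that assumption and the function name are not taken from the text.

```python
def frame_duration_seconds(beat_interval_s, frame_note=48, beat_note=4):
    """Duration of one frame when a beat is a 1/beat_note note (assumed quarter note)."""
    return beat_interval_s * beat_note / frame_note


# At a beat interval of 0.5 s (120 beats per minute), a frame lasts about 0.0417 s.
print(round(frame_duration_seconds(0.5), 4))
```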
The above-mentioned operations of the units according to the embodiment of the invention shown in
The “computer system” includes a homepage providing environment (or display environment) in using a WWW system.
Examples of the “computer-readable recording medium” include portable media such as a flexible disk, a magneto-optical disk, a ROM (Read Only Memory), and a CD-ROM, a USB memory connected via a USB (Universal Serial Bus) I/F (Interface), and a hard disk built in the computer system. The “computer-readable recording medium” may include a recording medium dynamically storing a program for a short time, like a transmission medium when the program is transmitted via a network such as the Internet or a communication line such as a phone line, and a recording medium storing a program for a predetermined time, like a volatile memory in a computer system serving as a server or a client in that case. The program may embody a part of the above-mentioned functions. The program may embody the above-mentioned functions in cooperation with a program previously recorded in the computer system.
While preferred embodiments of the invention have been described and illustrated above, it should be understood that these are exemplary of the invention and are not to be considered as limiting. Additions, omissions, substitutions, and other modifications can be made without departing from the spirit or scope of the present invention. Accordingly, the invention is not to be considered as being limited by the foregoing description, and is only limited by the scope of the appended claims.
This application claims benefit from U.S. Provisional application Ser. No. 61/234,076, filed Aug. 14, 2009, the contents of which are incorporated herein by reference.
Number | Name | Date | Kind |
---|---|---|---|
5952597 | Weinstock et al. | Sep 1999 | A |
6107559 | Weinstock et al. | Aug 2000 | A |
7179982 | Goto | Feb 2007 | B2 |
7838755 | Taub et al. | Nov 2010 | B2 |
7966327 | Li et al. | Jun 2011 | B2 |
8035020 | Taub et al. | Oct 2011 | B2 |
8076566 | Yamashita et al. | Dec 2011 | B2 |
8178770 | Kobayashi | May 2012 | B2 |
8296390 | Wood | Oct 2012 | B2 |
20020172372 | Tagawa et al. | Nov 2002 | A1 |
20050182503 | Lin et al. | Aug 2005 | A1 |
20080002549 | Copperwhite et al. | Jan 2008 | A1 |
20090056526 | Yamashita et al. | Mar 2009 | A1 |
20090139389 | Bowen | Jun 2009 | A1 |
20090228799 | Verbeeck et al. | Sep 2009 | A1 |
20090288546 | Takeda | Nov 2009 | A1 |
20100126332 | Kobayashi | May 2010 | A1 |
20100212478 | Taub et al. | Aug 2010 | A1 |
20100313736 | Lenz | Dec 2010 | A1 |
20110036231 | Nakadai et al. | Feb 2011 | A1 |
20110214554 | Nakadai et al. | Sep 2011 | A1 |
20120031257 | Saino | Feb 2012 | A1 |
20120101606 | Miyajima | Apr 2012 | A1 |
20120132057 | Kristensen | May 2012 | A1 |
20130226957 | Ellis et al. | Aug 2013 | A1 |
Number | Date | Country |
---|---|---|
3147846 | Mar 2001 | JP |
2006-201278 | Aug 2006 | JP |
Entry |
---|
Cont, Arshia, “Realtime Audio to Score Alignment for Polyphonic Music Instruments Using Sparse Non-negative Constraints and Hierarchical HMMS,” IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2006 Proceedings, vol. 5:V-245-V-248 (2006). |
Dannenberg, Roger B. et al., “Polyphonic Audio Matching for Score Following and Intelligent Audio Editors,” Proceedings of the 2003 International Computer Music Conference, pp. 27-33 (2003). |
Murata, Kazumasa et al., “A Robot Uses Its Own Microphone to Synchronize Its Steps to Musical Beats While Scatting and Singing,” IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2459-2464 (2008). |
Orio, Nicola et al., “Score Following: State of the Art and New Developments,” Proceedings of the 2003 Conference on New Interfaces for Musical Expression, pp. 36-41 (2003). |
Bello, Juan Pablo et al., “Techniques for Automatic Music Transcription,” International Symposium on Music Information Retrieval, pp. 1-8 (2000). |
Otsuka, Takuma et al., “Real-time Synchronization Method between Audio Signal and Score Using Beats, Melodies, and Harmonies for Singer Robots,” 71st National Convention of Information Processing Society of Japan, pp. 2-243-2-244 (2009) X. |
Japanese Office Action for Application No. 2010-177968, 6 pages, dated Mar. 4, 2014. |
Cemgil, Ali Taylan et al., “On tempo tracking: Tempogram Representation and Kalman filtering,” Journal of New Music Research, vol. 28(4), 19 pages, (2001). |
Number | Date | Country | |
---|---|---|---|
20110036231 A1 | Feb 2011 | US |
Number | Date | Country | |
---|---|---|---|
61234076 | Aug 2009 | US |