The present invention relates to a dereverberation apparatus, a dereverberation method and a dereverberation program and a recording medium for removing a reverberation signal from an observation signal.
In the following description, a signal emitted from a sound source is referred to as an audio signal, and an audio signal produced in a reverberant room and collected by a plurality of sound collecting means (microphones, for example) is referred to as an observation signal. The observation signal is the audio signal on which a reverberation signal is superimposed. It is difficult to extract characteristics of the original audio signal from the observation signal, and the resulting sound has reduced clarity. Dereverberation processing removes the superimposed reverberation signal from the observation signal to facilitate extraction of the characteristics of the original audio signal and to recover the sound clarity. This technique can be applied as a constituent technology to various audio signal processing systems to improve the overall performance of the system. Such audio signal processing systems include:
(1) a speech recognition system that uses the reverberation signal removal as preprocessing;
(2) a communication system, such as a teleconference system, that uses the reverberation signal removal to improve the sound clarity;
(3) a playing system that removes a reverberation signal in recorded speech to improve the clarity of the recorded sound;
(4) a hearing aid that removes a reverberation signal to improve the listenability;
(5) a machine-controlled interface and a human-machine interactive system that issue a command to a machine in response to a human voice;
(6) a post-production system that improves the sound quality of acoustic contents containing reverberation signals recorded during production; and
(7) an acoustic effecter that performs an acoustic control of music contents by removing or adding a reverberation signal.
In the following description, input observation signals in the time domain are denoted by xt(1), . . . , xt(q), . . . , xt(Q). The subscript “t” represents a discrete time index, and the superscript “(q)” (q=1, . . . , Q) represents a sound collecting means index (a microphone index, for example). In the following, a microphone with an index q is referred to as a microphone for a q-th channel.
When the observation signal xt(q) is input, the estimating section 104 estimates a dereverberation filter using the observation signal xt(q) and the optimization function described above. More specifically, the estimating section 104 estimates the dereverberation filter by determining a parameter that maximizes the value of the optimization function. The removing section 106 convolves the observation signal with the estimated dereverberation filter to remove the reverberation signal from the observation signal and outputs the resulting signal. The resulting signal is referred to as a target signal.
The dividing section 202 divides the observation signal into subband signals for the U frequency bands. The resulting subband signals are time-domain signals. When the observation signal is divided into the subband signals, down-sampling (thinning out of the samples) may be performed. In the following description, a subband signal is denoted by x′n,u(q). In this expression, n represents a sample index after down-sampling, and u represents a frequency band index (u=0, . . . , U−1). In the following, a subband signal x′n,u(q) in a u-th frequency band of the observation signal xt(q) collected by a microphone for a q-th channel will be described.
As described above, the removing section 206u (u=0, . . . , U−1) and the storage section 204u are provided for each of the U frequency bands. The storage section 204u stores the dereverberation filter. By using a previously determined room transfer function from a sound source to each microphone, a coefficient of the dereverberation filter is previously determined on the basis of the least square error criterion so that the input/output function of the entire system, which is obtained by applying the room transfer function, the subband division processing by the dividing section 202, the dereverberation processing by the removing section 206u and the integration processing by the integrating section 208 in order, is as close as possible to a unit impulse function.
The removing section 206u removes the reverberation signal from the subband signal by convolving the subband signal x′n,u(q) with the dereverberation filter. The subband signal for each frequency band from which the reverberation signal is removed is referred to as a frequency-specific target signal sn,u˜. Then, the integrating section 208 integrates the frequency-specific target signals sn,u˜ (u=0, . . . , U−1) to determine a target signal st˜.
Details of the dereverberation apparatuses 100 and 200 are described in Non-Patent literatures 1, 2 and 3.
In order to optimally use time-varying characteristics of an audio signal, the dereverberation apparatus 100 according to the related art 1 described above has to calculate an extremely large covariance matrix to achieve the calculation to maximize the value of the optimization function. Thus, the maximization of the value of the optimization function requires an enormous amount of calculation time. The reason why the covariance matrix has such a large size will be described below. A covariance matrix H(r) for the observation signal handled in the related art 1 is expressed by the following formula (1).
In the formula (1), assuming that two microphones collect one audio signal, Xt−1=[x−t−1(1), . . . , x−t−K(1), x−t−1(2), . . . , x−t−K(2)], where x−t(1) is a column vector composed of short-time frames of xt(1) having a length of N (x−t(1)=[xt(1), xt+1(1), . . . , xt+N−1(1)]T), and xt(1) and xt(2) are observation signals collected by microphones for the first channel and the second channel, respectively. T represents transposition of a matrix or a vector. K represents the length of a prediction filter (estimated dereverberation filter). rt represents a covariance matrix E{s−ts−tT} for a column vector s−t=[st, st+1, . . . , st+N−1]T composed of short time frames of the audio signal (rt=E{s−ts−tT}), where E{·} represents an expected value function. In general, the covariance matrix rt is not known, and therefore, an estimated value determined by the estimating section 104 on the basis of the sound source model stored in the sound source model storage section 108 is used.
In general, at least theoretically, the length K of the prediction filter has to be equal to the length of the room impulse response. Therefore, the size of the covariance matrix H(r) is extremely large. If it is assumed that the audio signal is a stationary signal, the covariance matrix approximates to a correlation matrix, and therefore, a fast calculation method, such as the fast Fourier transform, can be used. However, if this assumption is applied to a time-varying signal, such as a voice signal, the precision of the dereverberation disadvantageously decreases. As described above, the dereverberation apparatus 100 requires an enormous amount of calculation time to achieve dereverberation with high precision and cannot achieve the dereverberation in a shorter time without deteriorating the precision of the dereverberation in the case where the audio signal is a time-varying signal.
The dereverberation apparatus 200 according to the related art 2 described above has to previously estimate the dereverberation filter (an inverse filter of the room transfer function) and previously determine the room transfer function. In addition, the dereverberation using the inverse filter of the room transfer function is highly sensitive to an error of the room transfer function. If the room transfer function has a certain level of error, the dereverberation processing increases the distortion of the audio signal. In addition, the room transfer function is sensitive to a change of the position of the sound source or the room temperature. Thus, if the position of the sound source or the room temperature cannot be precisely determined in advance, the room transfer function cannot be precisely determined. As described above, the dereverberation apparatus 200 has to previously prepare the precise room transfer function, and a room transfer function determined under a certain condition can be applied to dereverberation only under extremely limited conditions.
Thus, the present invention performs dereverberation as described below. A storage section stores a sound source model that represents an audio signal as a probability density function. An observation signal obtained by collecting an audio signal is converted into frequency-specific observation signals associated with a plurality of frequency bands. Then, on the basis of the sound source model and a reverberation model that represents a relationship for each frequency band among the audio signal, the observation signal and a dereverberation filter, a dereverberation filter for each frequency band is estimated using the corresponding frequency-specific observation signal. Each dereverberation filter is applied to the corresponding frequency-specific observation signal to determine a frequency-specific target signal for the frequency band, and then, the frequency-specific target signals are integrated.
In the following, best modes for carrying out the present invention will be described. Components having the same functions or steps of performing the same processings are denoted by the same reference numerals, and redundant descriptions thereof will be omitted.
The dividing section 302 divides the observation signal into individual frequency bands and down-samples the observation signals to output frequency-specific observation signals. The dividing section 302 according to the embodiment 1 divides the observation signal on a frequency band basis by applying a short-time analysis window to the observation signal while temporally shifting the window and converting each windowed segment into a frequency-domain signal.
The sound source model storage section 304 stores a sound source model that represents a characteristic of a frequency-specific observation signal for each frequency band.
The estimating section 306u is provided for each frequency band and estimates a dereverberation filter from the frequency-specific observation signal on the basis of an optimization function for the observation signal defined in association with the sound source model.
The removing section 308u is also provided for each frequency band and determines a frequency-specific target signal for each frequency band by using the frequency-specific observation signal and the dereverberation filter. The removing section 308u according to the embodiment 1 determines the frequency-specific target signal by convolving the frequency-specific observation signal with the dereverberation filter.
The integrating section 310 integrates the frequency-specific target signals to output a target signal described later. The integrating section 310 according to the embodiment 1 outputs the target signal described later by integrating the frequency-specific target signals and thereafter by converting it into a single time-domain signal for the entire frequency band.
First, a relationship between an audio signal st and an observation signal xt(q) will be described. In the following description, it is assumed that room transfer functions from the sound source to the microphones have no common zero, and the microphone closest to the sound source is denoted by q=1 (referred to as a microphone for a first channel). The relationship between the audio signal and the observation signal can be expressed by the formula (11) below. For more details, see M. Miyoshi, “Estimating AR parameter-sets for linear-recurrent signals in convolutive mixtures,” Proc. ICA-2003, pp. 585-589, 2003.
In this formula, h0(1) represents the first tap value of a room impulse response from the sound source to the microphone q=1, cτ(q) represents a prediction coefficient of the dereverberation filter estimated by the estimating section 306u, τ represents a discrete time index, and K represents a prediction filter length (size of the dereverberation filter estimated in the related art 1) as described earlier.
If the gain of the audio signal is ignored, the second term h0(1)st of the right side represents the audio signal st multiplied by a constant and thus can be regarded as the audio signal st to be estimated. Therefore, the formula (11) can be rewritten as the following formula (12).
According to the formula (12), the current observation signal xt(q) is predicted from a time series xt−τ(q) of previous observation signals, and the audio signal st is regarded as a prediction residual signal. Although the formula (12) is based on the assumption that the microphone for the first channel (q=1) is the microphone closest to the sound source, the relationship between the observation signal and the audio signal can be expressed by the same formula (12) even when the assumption does not hold. That is, if an adequate delay is introduced to the observation signals of the microphones other than the microphone (q=1) for the first channel, the microphone (q=1) for the first channel can be virtually regarded as the first microphone that receives the sound from the sound source and thus can be handled as the microphone closest to the sound source. Thus, for example, if it is assumed that the delay time introduced to a microphone q is d(q) taps, it can be considered that a fixed value 0 is substituted into the first d(q) taps of the prediction coefficients {c1(q), c2(q), . . . , cK(q)} for the microphones other than the microphone q=1, so that the relationship between the observation signal and the audio signal can be expressed by the formula (12).
When the observation signals xt(q) are input to the dividing section 302, the dividing section 302 divides each observation signal into individual frequency bands and down-samples the observation signals to output frequency-specific observation signals (step S2). The dividing section 302 according to the embodiment 1 divides the observation signal on a frequency band basis by applying a short-time analysis window to the observation signal while temporally shifting the window and converting each windowed segment into a frequency-domain signal. For example, the dividing section 302 performs a short-time Fourier transform. In the following specific description, it is assumed that the dividing section 302 performs a short-time Fourier transform.
Next, the formula (12) described above is generalized into the following formula (12′).
In this formula, d represents a constant to introduce a delay to a previous observation signal used to predict the current observation signal. When d=1, the formula (12′) is the same as the formula (12). When d>1, the formula (12′) cannot strictly express the relationship between the observation signal and the audio signal. The previous signal series of the right side of the formula (12′) does not include signals derived from the audio signals for the previous d taps from the current time t, and therefore, reverberation signals derived from the audio signals in that time period and contained in the current observation signal cannot be expressed by a linear combination of previous observation signals. These reverberation signals correspond to an initial reflected sound for the first d taps of the room impulse response. Therefore, the formula (12′) is based on the assumption that the residual signal contains the initial reflected sound in addition to the audio signal. In order to make this clear, the residual signal is denoted by st˜. In this specification, a symbol A˜ represents a combination of a symbol A and a symbol ˜ directly above the symbol A.
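For illustration only, the delayed prediction described above can be sketched in Python as follows. The function name `prediction_residual`, the coefficient layout (`c[q][k]` weights the sample `x[q][t - d - k]`) and the toy data are assumptions introduced for this sketch and are not taken from the formula (12′) itself. The sketch builds a single-channel observation that obeys the model and confirms that the prediction residual recovers the driving signal.

```python
def prediction_residual(x, c, d):
    """Residual of delayed linear prediction of channel 0.

    x: list of Q channels, each a list of samples (channel 0 is the reference).
    c: list of Q coefficient lists; c[q][k] weights x[q][t - d - k] (assumed layout).
    d: prediction delay in samples (d >= 1).
    """
    residual = []
    for t in range(len(x[0])):
        predicted = 0.0
        for q, coeffs in enumerate(c):
            for k, coeff in enumerate(coeffs):
                past = t - d - k
                if past >= 0:  # samples before the start are treated as zero
                    predicted += coeff * x[q][past]
        residual.append(x[0][t] - predicted)
    return residual


# Toy single-channel example (hypothetical values).
s = [1.0, 0.0, 0.5, -0.25, 0.0, 0.0, 0.75, 0.0]
true_c = [[0.5, -0.2]]
d = 2

# Build an observation that obeys the model: x_t = s_t + sum_k c_k * x_{t-d-k}.
x0 = []
for t, s_t in enumerate(s):
    tail = sum(true_c[0][k] * x0[t - d - k]
               for k in range(len(true_c[0])) if t - d - k >= 0)
    x0.append(s_t + tail)

recovered = prediction_residual([x0], true_c, d)
print(recovered)
```

With the true coefficients, the residual equals the driving signal up to rounding. In practice the coefficients are unknown and have to be estimated.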
<Convolution Operation of Frequency Signal>
Next, a method of performing, on a frequency-domain signal, an operation corresponding to the convolution in the time domain included in the first term of the right side of the formula (12′) will be described. First, a signal resulting from convolving an observation signal xt with a dereverberation filter ct having a filter length of K in the time domain is denoted by yt. A signal in a short time frame extracted from the signal yt beginning at a time t0 by a time window of a window function is expressed by the following formula (13) in a z transform domain.
$$W_N\left(y(z)\,z^{t_0}\right)=W_N\left(c(z)\cdot x(z)\,z^{t_0}\right) \qquad (13)$$
In this formula, y(z)=c(z)·x(z), the symbol · represents convolution, and WN( ) represents a function corresponding to a window function having a length of N in the time domain. WN(c(z)) means extracting (−N+1)-th order to 0-th order terms from c(z), changing the respective coefficients in proportion to the shape of the window, and removing the terms outside the window. zt0 represents a time shift operator to shift the short time frame beginning at the time t0 into the window function.
Extraction of a frame having a length of M from the filter coefficient ct at the time t is represented as ct,M(z)=WMR(c(z)zt), where WMR( ) represents a short time analysis window (rectangular window) having a length of M. Then, obviously, c(z)=ΣτcτM,M(z)z−τM. The formula (13) described above can be transformed as follows.
ΣτcτM,M(z)z−τM in the formula (14) corresponds to c(z) (see the formula (13)), and xt0−M+1−τM,M+N−1(z) in the formula (16) corresponds to x(z) (see the formula (13)).
In addition, KR=<K/M>, where <K/M> represents the smallest integer not less than K/M. KR is a filter length (number of taps) of the dereverberation filter estimated by the estimating section 306u. The formula (16) is derived from the formula (15) by removing the terms outside the window from the terms included in the argument of the window function of the formula (15).
The term cτM,M(z)xt0−M+1−τM,M+N−1(z) in the formula (16) is a product in a z domain of a frame having a length of M extracted from the τM-th tap of the filter coefficient ct in the time domain and a frame having a length of M+N−1 extracted from the observation signal xt in the time domain at a time t0−M+1−τM. Since multiplication in the z domain is equivalent to a convolution operation in the time domain, the term represents a convolution operation in the time domain of the observation signal xt in the frame and the filter coefficient ct in the frame. In addition, the frame length of cτM,M(z) is M, and the frame length of xt0−M+1−τM,M+N−1(z) is M+N−1. Thus, when the number of points (number of frequency bands) U of the short time Fourier transform is equal to or more than 2M+N−2 (U≧2M+N−2), the convolution in the time domain is strictly represented by the product in the short time Fourier transform domain. Here, an approximation commonly used in audio signal processing is adopted. That is, the convolution of the signal included in the short time analysis window with the filter approximates to the product of the signal and the filter in the short time Fourier transform domain, if the length M of the filter is adequately shorter than the length N of the short time analysis window. Using this approximation, the formula (16) can be transformed into the following formula (17) on a unit circle in the z domain (which corresponds to the short time Fourier transform domain).
In the short-time Fourier transform representation, the formula (17) can be transformed into the following formula (18).
In this formula, n and τ represent short time frame indices, Yn, Cn and Xn represent vectors whose elements are values of signals for each frequency band extracted with a time window from time-domain signals corresponding to y(z), c(z) and x(z) and subjected to the short time Fourier transform, respectively, and diag(X) represents a diagonal matrix having the components of the vector X as the diagonal components. In this specification, the short time Fourier transform is expressed as follows. In the following formulas, tτ represents a discrete time index of the first sample in a frame τ.
According to the formula (18), the convolution operation in the time domain can be performed as a convolution operation of the frequency-specific observation signal for each frequency band. In the formula (17), M is a value corresponding to frame shifting, and therefore, the frame shift M has to be adequately small compared with the window length of N of the window function WN( ) in this approximate calculation.
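The condition U≧2M+N−2 stated above can be checked numerically. The following Python sketch, which uses a plain discrete Fourier transform and hypothetical frame contents, compares a direct convolution of a filter frame of length M and a signal frame of length M+N−1 with the inverse transform of the product of their U-point transforms. The helper names are assumptions made for this illustration.

```python
import cmath


def dft(v, U):
    """U-point DFT of v, zero-padded to length U (U must be >= len(v))."""
    padded = list(v) + [0.0] * (U - len(v))
    return [sum(padded[t] * cmath.exp(-2j * cmath.pi * u * t / U) for t in range(U))
            for u in range(U)]


def idft(V):
    """Inverse DFT, returning complex samples."""
    U = len(V)
    return [sum(V[u] * cmath.exp(2j * cmath.pi * u * t / U) for u in range(U)) / U
            for t in range(U)]


def linear_convolution(a, b):
    """Direct time-domain convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def product_in_dft_domain(a, b, U):
    """Multiply the U-point DFTs of a and b and transform back."""
    A, B = dft(a, U), dft(b, U)
    return [v.real for v in idft([p * q for p, q in zip(A, B)])]


M, N = 3, 6
c_frame = [0.5, -0.25, 0.125]                          # filter frame, length M
x_frame = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0, -2.0, 1.5]   # signal frame, length M+N-1

exact = linear_convolution(c_frame, x_frame)           # length 2M+N-2 = 10

# U = 2M+N-2: the product in the DFT domain reproduces the convolution.
via_dft = product_in_dft_domain(c_frame, x_frame, 2 * M + N - 2)
max_error = max(abs(p - q) for p, q in zip(via_dft, exact))

# U = 8 < 2M+N-2: the tail of the convolution wraps around (time aliasing).
aliased = product_in_dft_domain(c_frame, x_frame, 8)
aliasing_error = max(abs(aliased[i] - exact[i]) for i in range(8))

print(max_error, aliasing_error)
```

The first error is at the level of rounding, whereas the second is of the order of the signal itself, which is why the exact equivalence requires a sufficiently large U.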
This is the end of the supplementary explanation of <Convolution Operation of Frequency Signal>.
Performing the short-time Fourier transform on both sides of the formula (12′) by using the formula (16) results in the following formula (22).
The formula (22) is equivalent to the formula (22a).
In this formula, D corresponds to the delay d in the formula (12′) and represents the delay introduced to previous observation signals in the frequency domain in the form of the number of frames. Frequency signals in adjacent frames overlap with each other in the time domain. Therefore, part of the audio signal included in the observation signal (the left side Xn(1) of the formula (22)) in the frame n is also included in the observation signal corresponding to the immediately-previous frame. Therefore, if Xn(1) is predicted using the previous observation signal including the immediately-previous frame according to the formula (22), part of the audio signal can also be predicted. Since the predictable part of the observation signal is not included in the residual signal, this means that the part of the audio signal is removed by the dereverberation. To avoid this, according to the present invention using the frequency signal, the observation signal in the immediately-previous frame is not used to predict the current observation signal, but only a previous observation signal spaced away by a certain delay D or more is used as shown in the formula (22). When d=DM, the formula (12′) agrees with the formula (22). In the following, this embodiment will be described using the formula (22) as a formula that represents a relationship between the observation signal and the audio signal. In the formula (22), Xn(q) corresponds to the short time Fourier transform for a time-domain signal collected by a microphone for a q-th channel. The short time Fourier transform follows the formulas (19) and (20). Here, n represents the frame identification number. The frequency-specific observation signal in a frequency band u (u=0, . . . , U−1) is represented by Xn,u(q). 
In order to determine the frequency-specific observation signal Xn,u(q), the dividing section 302 applies the short time analysis window by temporally shifting the window in steps of M samples and performs conversion into the frequency domain. In this way, the frequency-specific observation signal Xn,u(q) for each frequency band is obtained.
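As a minimal sketch of this analysis step, the following Python code extracts frames of N samples every M samples, applies a window and transforms each frame. The Hann window, the choice U=N and the test signal are assumptions made for this illustration; the dividing section 302 is not limited to them.

```python
import cmath
import math


def stft(x, N, M):
    """Frames of length N taken every M samples, Hann-windowed, then an N-point DFT.

    Returns frames[n][u], which plays the role of X_{n,u} for one channel.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * k / N) for k in range(N)]
    frames = []
    for start in range(0, len(x) - N + 1, M):
        seg = [x[start + k] * window[k] for k in range(N)]
        spectrum = [sum(seg[k] * cmath.exp(-2j * cmath.pi * u * k / N) for k in range(N))
                    for u in range(N)]
        frames.append(spectrum)
    return frames


# Hypothetical input: a sinusoid that falls exactly on frequency band u = 2.
N, M = 8, 2
x = [math.cos(2 * math.pi * 2 * t / N) for t in range(32)]
frames = stft(x, N, M)

# Index of the strongest band among the non-negative frequencies of the first frame.
peak_band = max(range(N // 2 + 1), key=lambda u: abs(frames[0][u]))
print(len(frames), peak_band)
```

Because the frames advance by M samples, the frame rate is lower than the sample rate by a factor of M, which corresponds to the down-sampling mentioned above.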
The estimating section 306u described in detail later estimates the dereverberation filter for removing a reverberation from the frequency-specific observation signal Xn,u(q). Once the prediction coefficient Cτ(q), which is a coefficient of the dereverberation filter, is obtained, the target signal (the audio signal containing the initial reflected sound) S˜n can be estimated as follows.
The formula (23) can be transformed into the following formula (24) to express the element for each frequency band of the target signal Sn˜=[Sn,0˜, Sn,1˜, . . . , Sn, U−1˜].
The formula (24) can be transformed into the formula (29) using the formulas (25) to (28).
$$C_u=\left[C_u^{(1)},C_u^{(2)},\ldots,C_u^{(Q)}\right] \qquad (25)$$
Cu(q)=[CD,u(q),CD+1,u(q) . . . CK
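$$B_{n-D,u}=\left[B_{n-D,u}^{(1)},B_{n-D,u}^{(2)},\ldots,B_{n-D,u}^{(Q)}\right] \qquad (27)$$
$$B_{n-D,u}^{(q)}=\left[X_{n-D,u}^{(q)},X_{n-D-1,u}^{(q)},\ldots,X_{n-K,u}^{(q)}\right] \qquad (28)$$
$$\tilde{S}_{n,u}=X_{n,u}^{(1)}-B_{n-D,u}C_u^{T} \qquad (29)$$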
T represents transposition of a vector or a matrix. In this embodiment, Cu represents the dereverberation filter for the u-th frequency band. The term Bn−D, uCuT of the formula (29) corresponds to the signals obtained by convolution of Bn,u(q) with Cu(q) for each channel added to each other for all the values of the index q. The estimating section 306u estimates the dereverberation filter Cu, and the removing section 308u removes the reverberation signal according to the formula (29).
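The operation of the formula (29) in one frequency band can be sketched in Python as follows. The function name `dereverberate_band`, the data layout (`X[q][n]` for channel q and frame n, `C[q][k]` weighting the frame n−D−k) and the numerical values are assumptions made for this illustration. Frames before the start of the signal are treated as zero.

```python
def dereverberate_band(X, C, D):
    """Subtract the predicted reverberation from channel 0 in one frequency band.

    X: list of Q channels; X[q][n] is the complex band value of frame n.
    C: list of Q coefficient lists; C[q][k] weights X[q][n - D - k] (assumed layout).
    D: prediction delay in frames.
    Returns one target value per frame.
    """
    target = []
    for n in range(len(X[0])):
        reverb = 0j
        for q in range(len(X)):
            for k, coeff in enumerate(C[q]):
                past = n - D - k
                if past >= 0:  # frames before the start are treated as zero
                    reverb += coeff * X[q][past]
        target.append(X[0][n] - reverb)
    return target


# Hypothetical two-channel band signal and one-tap filters.
X = [[1 + 0j, 2j, 1 - 1j, 0.5 + 0j],
     [0.5j, 1 + 0j, -1 + 0j, 2 + 0j]]
C = [[0.5 + 0j], [0.25j]]
D = 2

target = dereverberate_band(X, C, D)
print(target)
```

The first D frames pass through unchanged because no sufficiently old observation exists for them.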
Assuming that 0D−1 represents a (D−1)-dimensional row vector all the elements of which are 0, the dereverberation filter Wu can also be defined as follows.
$$W_u=\left[1,0_{D-1},C_u^{(1)},0,0_{D-1},C_u^{(2)},\ldots,0,0_{D-1},C_u^{(Q)}\right]$$
In this case, the removing section 308u removes the reverberation signal according to the following formulas.
$$\tilde{S}_{n,u}=\zeta_{n,u}W_u^{T}$$
$$\zeta_{n,u}=\left[\zeta_{n,u}^{(1)},\zeta_{n,u}^{(2)},\ldots,\zeta_{n,u}^{(Q)}\right]$$
ζn,u(q)=[Xn,u(q)Xn−1,u(q) . . . Xn−K
As described above, if the estimating section 306u can estimate the dereverberation filter Cu or Wu, the removing section 308u can remove the reverberation signal according to the formula (29) or (30). Next, the sound source model will be described before describing the estimation of the dereverberation filter.
The sound source model storage section 304 stores a sound source model that represents a characteristic of a frequency-specific observation signal for each frequency band.
The sound source model according to this embodiment represents the tendency of the possible values of the audio signal in the form of a probability distribution. The optimization function is defined on the basis of the probability distribution. A useful example of the sound source model is a time-varying normal distribution, and the probability density function of the frequency-specific signal Sn˜ to be determined is defined as follows.
$$p(\tilde{S}_n)=N(\tilde{S}_n;0,\Psi_n) \qquad (31)$$
$$\Psi_n\in\Omega_\Psi \qquad (32)$$
N(Sn˜; 0, Ψn) represents a multidimensional complex normal distribution with a mean of 0 and a covariance matrix of the sound source model of Ψn=E(Sn˜(Sn˜)*T), and Ψn assumes a different or common value for each short time frame n. In the following description, Ψn is referred to as a model covariance matrix, and it is assumed that the model covariance matrix Ψn is a diagonal matrix that has a different value for each short time frame n. The symbol * represents complex conjugate. ΩΨ represents a set of all the possible values of Ψn (in other words, a parametric space of Ψn). Assuming that ψn,u2=E(Sn,u˜Sn,u˜*T) represents a u-th diagonal element of Ψn, the probability density function is defined as follows independently for each frequency band, since Ψn is a diagonal matrix.
$$p(\tilde{S}_{n,u})=N(\tilde{S}_{n,u};0,\psi_{n,u}^{2}) \qquad (33)$$
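To illustrate why a time-varying variance is useful, the following Python sketch evaluates the log of a zero-mean complex normal density for each frame of one frequency band and sums the values. It assumes a circularly symmetric complex normal distribution, for which the density is exp(−|S|²/ψ²)/(πψ²); the function names and the sample values are assumptions made for this illustration.

```python
import math


def log_density(S, psi2):
    """Log of a zero-mean circular complex normal density with variance psi2."""
    return -math.log(math.pi * psi2) - abs(S) ** 2 / psi2


def band_log_likelihood(S_band, psi2_band):
    """Sum of per-frame log densities for one frequency band."""
    return sum(log_density(S, p) for S, p in zip(S_band, psi2_band))


# Hypothetical band signal whose power changes strongly from frame to frame.
S_band = [1 + 1j, 0.1j, 2 + 0j]

# Time-varying model: one variance per frame, matched to the frame power.
matched = [abs(S) ** 2 for S in S_band]

# Stationary model: a single variance shared by all frames (the mean power).
mean_power = sum(matched) / len(matched)
stationary = [mean_power] * len(S_band)

ll_matched = band_log_likelihood(S_band, matched)
ll_stationary = band_log_likelihood(S_band, stationary)
print(ll_matched, ll_stationary)
```

For a fixed frame value, the log density is largest when the variance equals the frame power, so the time-varying model attains a higher likelihood than any single shared variance when the power fluctuates.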
The estimating section 306u provided for each frequency band estimates the dereverberation filter from the frequency-specific observation signal on the basis of the optimization function of the observation signal defined in association with the sound source model (step S4). Next, the estimation of the dereverberation filter will be described in detail.
As shown by the formula (25), the dereverberation filter Cu is represented by a vector composed of the prediction coefficients Cu(q) of the observation signal for all the microphones. The prediction coefficients Cu(q) are prediction coefficients in the frequency domain. ψu2 represents a time series of u-th diagonal elements of the model covariance matrix, and ψu2={ψn,u2}. In addition, θu={Cu, ψu2} represents a set of estimation parameters. In addition, a set of all the estimation parameters for all the frequency bands is represented by θ={θ0, θ1, . . . , θU−1}. A log likelihood function Lu(θu) as the optimization function for each frequency band and a log likelihood function L(θ) as the optimization function for all the frequency bands are defined as follows.
On the basis of the formulas (29) and (33), the formula (34) can be transformed into the following formula (36).
By estimating a parameter that maximizes the left side of the formula (35), the prediction coefficients Cu(q) of the dereverberation filters can be determined. Maximization of the formula (35) can be achieved by the optimization algorithm described below.
In the above description of the algorithm, an operation to update the value of a parameter A to B is expressed as “A→B”. Furthermore, the symbol “+” represents a Moore-Penrose pseudo inverse matrix. A covariance matrix H′(ψn,u2) for the observation signal that has to be calculated in the algorithm described above is expressed by the following formula (40).
On the basis of the optimization algorithm, the dereverberation filter is constructed from Cu finally obtained. The removing section 308u determines the frequency-specific target signals Sn,u˜ by removing the reverberation signal from the frequency-specific observation signals Xn,u(q) by convolving the frequency-specific observation signals Xn,u(q) with the dereverberation filter Cu or Wu (step S12).
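The following Python sketch is a deliberately simplified illustration of an alternating estimation of this kind and is not the algorithm of the embodiment itself. It assumes a single channel and a single filter tap, so that the covariance matrix reduces to a scalar and the pseudo inverse to a division. The variance step (setting the variance to the floored power of the current residual) and the filter step (a variance-weighted least squares solution) are assumptions made for this sketch, as are all names and values.

```python
def estimate_single_tap(X, D, iterations=5, floor=1e-8):
    """Alternate variance and filter updates for a one-tap, one-channel filter.

    X: complex band values, one per frame.  D: prediction delay in frames.
    """
    frames = range(D, len(X))
    C = 0j
    for _ in range(iterations):
        # Variance step: power of the current residual, floored to stay positive.
        psi2 = {n: max(abs(X[n] - C * X[n - D]) ** 2, floor) for n in frames}
        # Filter step: weighted least squares with weights 1 / psi2.
        num = sum(X[n - D].conjugate() * X[n] / psi2[n] for n in frames)
        den = sum(abs(X[n - D]) ** 2 / psi2[n] for n in frames)
        C = num / den if den > 0 else 0j
    return C


# Hypothetical band signal: a short burst followed by a decaying reverberant tail.
c_true = 0.6 + 0.2j
D = 2
S = [1 + 0j, 0.5j] + [0j] * 10
X = []
for n, s_n in enumerate(S):
    X.append(s_n + (c_true * X[n - D] if n >= D else 0j))

C_hat = estimate_single_tap(X, D)
tail = max(abs(X[n] - C_hat * X[n - D]) for n in range(D, len(X)))
print(C_hat, tail)
```

In this toy case the tail is explained entirely by the past observation, so the estimate coincides with the true coefficient and the residual tail vanishes up to rounding. With several channels and taps, the scalar division is replaced by a matrix (pseudo) inversion.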
Then, the integrating section 310 integrates the frequency-specific target signals Sn,u˜ for the frequency bands, converts the signals into the time domain, and outputs the target signal st˜ (step S14). More specifically, a common method of converting a time series of short time Fourier transform frames into a time-domain signal can be used. That is, a short time inverse Fourier transform is applied to Sn˜=[Sn,0˜, Sn,1˜, . . . , Sn,U−1˜] for each frame n to determine a time-domain signal for each frame, and the signals for the frames are overlap-added to determine the target signal st˜. The short time inverse Fourier transform for a frame is expressed by the formula (40a). The overlap add operation is performed by applying a time window to the time-domain signals for the frames obtained by the application of the short time inverse Fourier transform and adding the signals with the same frame shift width M as that used by the dividing section. A specific calculation formula is expressed by the formula (40b). In this formula, wt1 represents a time window having a length of N, and floor(a) represents the maximum integer equal to or less than a.
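The overlap-add integration can be sketched in Python as follows. This sketch assumes a periodic Hann analysis window with a frame shift M=N/2 and no additional synthesis window, a combination for which the shifted windows sum to one; it does not reproduce the formulas (40a) and (40b), and all names and values are assumptions made for this illustration.

```python
import cmath
import math


def hann(N):
    """Periodic Hann window; copies shifted by N/2 sum to one."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / N) for k in range(N)]


def dft(v):
    N = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * u * k / N) for k in range(N))
            for u in range(N)]


def idft_real(V):
    N = len(V)
    return [(sum(V[u] * cmath.exp(2j * cmath.pi * u * k / N) for u in range(N)) / N).real
            for k in range(N)]


def analyze(x, N, M):
    """Windowed frames every M samples, each transformed to the frequency domain."""
    w = hann(N)
    return [dft([x[s + k] * w[k] for k in range(N)])
            for s in range(0, len(x) - N + 1, M)]


def overlap_add(frames, N, M):
    """Inverse-transform each frame and add it at its original position."""
    out = [0.0] * ((len(frames) - 1) * M + N)
    for n, spectrum in enumerate(frames):
        segment = idft_real(spectrum)
        for k in range(N):
            out[n * M + k] += segment[k]
    return out


N, M = 8, 4
x = [math.sin(0.3 * t) + 0.1 * t for t in range(24)]   # hypothetical signal
y = overlap_add(analyze(x, N, M), N, M)

# Samples covered by two overlapping frames are reconstructed exactly.
interior_error = max(abs(y[t] - x[t]) for t in range(M, len(y) - M))
print(interior_error)
```

The first and last M samples are covered by only one frame and are therefore attenuated by the window; in practice the signal is padded or the edges are discarded.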
Next, advantages of the dereverberation apparatus 300 according to the embodiment 1 will be described. The dereverberation processing of the observation signals xt(q) (q=1, . . . , Q) by the dereverberation apparatus 300 can be performed as an approximate calculation for each frequency band. Since conversion into the frequency-domain signal is performed by applying the short time analysis window having a length of N while temporally shifting it in steps of M samples, the length of the dereverberation filter for each frequency band can be reduced. Thus, the size of the covariance matrix required to estimate the dereverberation filter can be reduced. The reason for this is as follows. In general, the size of the dereverberation filter is equal to the size of the covariance matrix used to determine the dereverberation filter. The conversion into the frequency domain is performed by extracting N samples while temporally shifting in steps of M samples (by applying a short time analysis window having a length of N), so that the size of the dereverberation filter to be convolved decreases compared with the related art 1. Thus, the size of the covariance matrix also decreases. This is apparent from the formulas (1) and (40). Comparing the size of the covariance matrix H(r) expressed by the formula (1) and the size of the covariance matrix H′(ψn,u2) expressed by the formula (40), the size of the covariance matrix H(r) according to the related art 1 depends on the prediction filter length (the length of the room impulse response) K, whereas the covariance matrix H′(ψn,u2) used in this embodiment 1 depends on KR (that is, <K/M>). This is because the number of elements (number of taps) of Bn−D,u(q) forming the covariance matrix H′(ψn,u2) is KR−D, as shown by the formula (35). It will thus be understood that the size of the covariance matrix used in this embodiment 1 can be reduced compared with the related art 1.
The estimation of the dereverberation filter involves not only calculation of the covariance matrix but also calculation of the inverse matrix thereof, and the calculation cost of these calculations accounts for most of the calculation cost of the entire dereverberation processing. The calculation cost of these calculations can be reduced by reducing the size of the covariance matrix. Thus, according to this embodiment, the calculation cost of the entire dereverberation processing can be significantly reduced.
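The size argument above can be made concrete with a short calculation. The following Python sketch is only illustrative: it assumes that KR is K/M rounded up, that the cost of inverting a matrix grows with the cube of its dimension, and it uses hypothetical values (K=4000, M=64, D=1) that are not taken from the specification.

```python
import math

def taps_per_channel(K, M, D):
    """Return (time-domain taps, per-band taps) for one channel.
    Assumption: K_R is the ceiling of K / M; the per-band tap count is K_R - D."""
    K_R = math.ceil(K / M)
    return K, K_R - D

def relative_inversion_cost(K, M, D):
    """Ratio of per-band to time-domain matrix inversion cost,
    assuming the cost is cubic in the matrix dimension."""
    full, reduced = taps_per_channel(K, M, D)
    return (reduced / full) ** 3

# Hypothetical example: a 4000-tap time-domain filter versus 64-sample frame shifts.
sizes = taps_per_channel(4000, 64, 1)
ratio = relative_inversion_cost(4000, 64, 1)
```

The ratio applies to a single frequency band. One such matrix has to be handled for every frequency band, so the overall saving is smaller than the per-band ratio by a factor equal to the number of bands.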
In the embodiment 1, the observation signal is convolved with the dereverberation filter estimated for each frequency band to achieve dereverberation. However, as is known, dereverberation carried out by estimating the reverberation signal and determining a difference signal that is the difference between the energy of the observation signal and the energy of the reverberation signal is less susceptible to the estimation error of the dereverberation filter than the dereverberation method according to the embodiment 1. For example, such a method is described in K. Kinoshita, T. Nakatani, and M. Miyoshi, “Spectral subtraction steered by multi-step forward linear prediction for single channel speech dereverberation,” Proc. ICASSP-2006, vol. I, pp. 817-820, May, 2006. An embodiment 2 is based on this concept.
A dereverberation apparatus 400 according to the embodiment 2 will be described.
The dividing section 302 divides the observation signal into frequency bands (step S2), and the estimating section 306u estimates the dereverberation filter for the frequency band (step S4). Then, the reverberation signal generating means 408u generates a frequency-specific reverberation signal Rn,u by using a dereverberation filter and a frequency-specific observation signal Xn,u(q) (step S22). More specifically, the frequency-specific reverberation signal Rn,u is determined according to the following formula (41).
The reverberation signal frequency specific power determining means 410u determines a frequency-specific power |Rn,u|2 of the frequency-specific reverberation signal Rn,u (step S24). In addition, the observation signal frequency specific power determining means 412u determines a frequency-specific power |X(1)n,u|2 of the frequency-specific observation signal collected by the microphone for the first channel, for example (step S26). Then, the subtracting means 414u determines a difference signal |X(1)n,u|2−|Rn,u|2 by calculating the difference between the frequency-specific power of the frequency-specific observation signal and the frequency-specific power of the frequency-specific reverberation signal, and determines a frequency-specific target signal on the basis of the difference signal and the frequency-specific observation signal X(1)n,u used for calculation of the difference signal (step S28). For example, the frequency-specific target signals Sn,u˜ are determined according to the following formulas.
In the formula, max {A, B} represents a function that chooses a larger one of A and B, and G0 represents a flooring coefficient that determines the lower limit of suppression of the signal energy in power subtraction and is greater than 0 (G0>0). Then, the integrating section 416 converts the frequency-specific target signals into the time domain to determine the target signal st˜ (step S30).
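The power subtraction with flooring described above can be sketched as follows. The exact formulas are not reproduced in this text, so the gain used here (the square root of the power ratio, limited from below by G0) is an assumption based on a common form of power subtraction; the function name and the small constant eps are also hypothetical.

```python
import numpy as np

def power_subtract(X1, R, G0=0.1, eps=1e-12):
    """Apply a real gain to the first-channel observation X1 so that its power
    approaches |X1|^2 - |R|^2, never attenuating by more than the floor G0."""
    obs_power = np.abs(X1) ** 2
    rev_power = np.abs(R) ** 2
    # Negative differences are clipped to zero before taking the square root.
    diff = np.maximum(obs_power - rev_power, 0.0)
    gain = np.sqrt(diff / np.maximum(obs_power, eps))
    gain = np.maximum(gain, G0)          # max{A, B} with the flooring coefficient
    return gain * X1                     # the phase of the observation is kept

# Usage: no reverberation in the first bin, over-estimated reverberation in the second.
X1 = np.array([2.0 + 0.0j, 1.0 + 0.0j])
R = np.array([0.0 + 0.0j, 2.0 + 0.0j])
S = power_subtract(X1, R, G0=0.1)
```

Because only the magnitude is modified, an error in the estimated reverberation signal changes the gain but cannot add a wrongly phased component, which is one way to read the robustness claim for the embodiment 2.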
Even if the dereverberation filter has an estimation error, the dereverberation apparatus 400 can achieve dereverberation with less sound quality deterioration than the dereverberation apparatus 300 according to the embodiment 1.
According to the related art, the dereverberation processing can be achieved only in the time domain. However, the dereverberation apparatuses 300 and 400 according to the embodiments 1 and 2 can operate in the frequency domain and thus can be combined with other various useful sound enhancing techniques that operate in the frequency domain, such as the blind source separation and Wiener filter.
A signal resulting from the subband division is referred to as a subband signal, the number of subbands is represented by V, and a subband index is represented by v (v=0, . . . , V−1). An estimating section 506v estimates a dereverberation filter for each subband signal, and a removing section 508v removes a reverberation from each subband signal. An integrating section 510 integrates the resulting signals to determine a target signal st˜. The subband division processing by the dividing section 502 and the integration processing by the integrating section 510 are described in M. R. Portnoff, “Implementation of the digital phase vocoder using the fast Fourier transform”, IEEE Trans. ASSP, vol. 24, No. 3, pp. 243-248, 1976 (referred to as Non-patent literature A, hereinafter), and J. P. Reilly, M. Wilbur, M. Seibert, and N. Ahmadvand, “The complex subband decomposition and its application to the decimation of large adaptive filtering problems”, IEEE Trans. Signal Processing, vol. 50, no. 11, pp. 2730-2743, November 2002, for example. In the following description, the technique according to Non-patent literature A will be used. The formula (50) described later in this specification is described in Non-patent literature A. The general flow of the processing is the same as shown in
First, a relationship between the audio signal and the observation signal will be described. The dividing section 502 divides the observation signal into V frequency bands (subbands) by performing subband division on the observation signal. According to the definition described in Non-patent literature A, the division can be expressed by the following formula (50).
In this formula, t represents a sample index of a signal obtained by applying frequency shift and a low-pass filter to the observation signal in each subband (t is the same as the discrete time of the observation signal yet to be subjected to the subband processing), and the t-th sample in the v-th subband (v=0, . . . , V−1) of the observation signal collected by a microphone for the q-th channel is denoted by xt,v(q). In addition, e−j2πvτ/V represents a frequency shift operator corresponding to the v-th subband, and hτ represents a coefficient of a low-pass filter having a length of 2Nh+1. Applying the formula (50) to both sides of the formula (12′) results in the following formula.
The term st,v˜ on the right side of the formula (51) represents a signal derived from the audio signal including an initial reflected sound by application of the subband division processing. In this embodiment, the signal st,v˜ is handled as a target signal to be determined. The dividing section 502 performs down-sampling of each subband signal in addition to the subband division. For example, b represents a sample index of a signal derived from the time series of the observation signal xt,v(1) collected by the microphone for the first channel and the audio signal st,v˜ by down-sampling at intervals of γ samples (thinning out of samples), and the subband signal obtained as a result of the down-sampling is denoted by xb,v′(q) or sb,v˜′. tb represents a sample index of a signal yet to be subjected to the down-sampling that corresponds to the sample index b of the signal subjected to the down-sampling. Then, the following formula (52) results.
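The subband division with down-sampling can be sketched as follows. The formula (50) is not reproduced in this text, so the sketch assumes the usual structure of frequency shift, low-pass filtering and decimation; the function name, the centred ("same") convolution and the three-tap filter in the usage example are assumptions for illustration only.

```python
import numpy as np

def subband_divide(x, V, h, gamma):
    """Split x into V subbands: shift subband v down to baseband, low-pass filter
    with h (odd length 2*Nh+1), then keep every gamma-th sample."""
    t = np.arange(len(x))
    subbands = []
    for v in range(V):
        shifted = x * np.exp(-2j * np.pi * v * t / V)      # frequency shift operator
        filtered = np.convolve(shifted, h, mode="same")    # centred FIR low-pass filter
        subbands.append(filtered[::gamma])                 # down-sampling (thinning out)
    return np.array(subbands)

# Usage: a cosine located at the centre of subband 2 becomes a constant there.
t = np.arange(16)
x = np.cos(2 * np.pi * 2 * t / 8)
h = np.array([0.25, 0.5, 0.25])   # toy low-pass filter; a real design would be much longer
sub = subband_divide(x, V=8, h=h, gamma=2)
```

In practice the low-pass filter has to be designed so that its cut-off frequency is compatible with the chosen γ; the toy filter above is chosen only so that the result can be checked by hand.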
On the other hand, hτ relates to the low-pass filter, and thus, the signal yet to be subjected to the down-sampling can be precisely recovered by up-sampling in the case where the down-sampling is performed at a sampling frequency equal to or higher than twice the cut-off frequency of the low-pass filter. The up-sampling is performed in the following procedure, for example.
In step 2, a finite length impulse response filter is commonly used. This means that a signal recovered by up-sampling can be expressed as a linear combination of down-sampled signals.
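The statement that an up-sampled signal is a linear combination of down-sampled samples can be illustrated with a short sketch. It assumes that the up-sampling consists of zero insertion followed by a finite length impulse response low-pass filter, which is the conventional procedure; the function name and the three-tap interpolation filter are hypothetical.

```python
import numpy as np

def upsample(y, gamma, g):
    """Insert gamma-1 zeros between the samples of y and smooth with the FIR filter g.
    Every output sample is a weighted sum of a few input samples."""
    y = np.asarray(y)
    g = np.asarray(g, dtype=float)
    z = np.zeros(len(y) * gamma, dtype=np.result_type(y.dtype, g.dtype))
    z[::gamma] = y                       # zero insertion
    return np.convolve(z, g, mode="same")  # centred FIR low-pass filter

# Usage: with gamma = 2 and a triangular filter, the result is linear interpolation.
up = upsample(np.array([0.0, 2.0, 4.0]), 2, np.array([0.5, 1.0, 0.5]))
```

Here each interpolated sample is the average of its two neighbours, that is, a linear combination with coefficients 0.5 and 0.5. Longer filters give coefficients that play the role of βτ,k in the text.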
Using this relationship, the expression xtb−τ,v(q) in the right side of the formula (52) can be transformed into the following formula (53).
βτ,k represents a coefficient depending on the coefficient of the low-pass filter used for up-sampling, k0 represents a delay of filtering by the low-pass filter used for up-sampling, and k0+k1+1 corresponds to a filter length of the low-pass filter used for up-sampling. Substituting the formula (53) into the formula (52) and rearranging the resulting formula results in the following formula (54).
In this formula, αk,v(q) represents a coefficient of the term x′b−k,v(q) of the formula resulting from substituting the formula (53) into the formula (52) and rearranging the resulting formula. d′ represents a delay of filtering for αk,v(q), and K′ represents a filter length of filtering for αk,v(q). On the basis of the formulas (52) and (53) and the sampling interval γ, relationships d′≈d/γ−k0 and K′≈K/γ+k1 can be defined. When d′≧1, the formula (54) represents a relationship that a residual signal of the prediction of the current observation signal from a previous observation signal using αk,v(q) as a prediction coefficient (a coefficient of a dereverberation filter estimated by the estimating section 506v) for each subband signal is the audio signal including the initial reflected sound. In the following description, the formula (54) is handled as a formula that represents a relationship between the observation signal and the audio signal for each subband signal.
Formulas (55) to (58) are defined as follows.
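$$\alpha_v = [\alpha_v^{(1)} \ \cdots \ \alpha_v^{(q)} \ \cdots \ \alpha_v^{(Q)}] \tag{55}$$
$$\alpha_v^{(q)} = [\alpha_{d',v}^{(q)},\ \alpha_{d'+1,v}^{(q)},\ \ldots,\ \alpha_{K',v}^{(q)}] \tag{56}$$
$$F_{b-d',v} = [F_{b-d',v}^{(1)} \ \cdots \ F_{b-d',v}^{(q)} \ \cdots \ F_{b-d',v}^{(Q)}] \tag{57}$$
$$F_{b-d',v}^{(q)} = [x'^{(q)}_{b-d',v},\ x'^{(q)}_{b-d'-1,v},\ \ldots,\ x'^{(q)}_{b-K',v}] \tag{58}$$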
In this case, the formula (54) can be transformed into the following formula (59).
$$\tilde{s}'_{b,v} = x'^{(1)}_{b,v} - F_{b-d',v} \cdot \alpha_v^{T} \tag{59}$$
In this embodiment 3, αv represents a dereverberation filter for a v-th subband signal, and the removing section 508v removes a reverberation signal according to the formula (59). Assuming that 0d′−1 represents a (d′−1)-dimensional row vector all the elements of which are 0, a dereverberation filter wv can also be expressed by the following formula (60).
$$w_v = [1,\ 0_{d'-1},\ -\alpha_v^{(1)},\ \ldots,\ 0,\ 0_{d'-1},\ -\alpha_v^{(q)},\ \ldots,\ 0,\ 0_{d'-1},\ -\alpha_v^{(Q)}] \tag{60}$$
In this case, the removing section 508v removes the reverberation signal according to the following formula (61).
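$$\tilde{s}'_{b,v} = \xi_{b,v}\, w_v^{T}$$
$$\xi_{b,v} = [\xi_{b,v}^{(1)} \ \cdots \ \xi_{b,v}^{(q)} \ \cdots \ \xi_{b,v}^{(Q)}]$$
$$\xi_{b,v}^{(q)} = [x'^{(q)}_{b,v},\ x'^{(q)}_{b-1,v},\ \ldots,\ x'^{(q)}_{b-K',v}] \tag{61}$$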
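The removal step of the formula (59) amounts to a delayed multichannel linear prediction within one subband. The following Python sketch assumes that samples before the start of the signal are zero and that the coefficients are ordered channel by channel with delays d′ to K′, as in the formulas (55) to (58); the function names and the array layout are hypothetical.

```python
import numpy as np

def delayed_observations(x, d, K):
    """Stack the delayed subband observations: row b holds
    [x[q, b-d], ..., x[q, b-K]] for q = 0..Q-1 (zeros before the signal start)."""
    Q, B = x.shape
    F = np.zeros((B, Q * (K - d + 1)), dtype=complex)
    col = 0
    for q in range(Q):
        for k in range(d, K + 1):
            if k < B:                      # delays longer than the signal contribute nothing
                F[k:, col] = x[q, :B - k]
            col += 1
    return F

def remove_reverberation(x, alpha, d, K):
    """Return (target, reverberation) for one subband: the reverberation is the
    prediction F @ alpha and the target is the first channel minus that prediction."""
    reverberation = delayed_observations(x, d, K) @ alpha
    return x[0] - reverberation, reverberation

# Usage: one channel, a direct sound followed by an echo three samples later.
x = np.array([[1.0, 0.0, 0.0, 0.5, 0.0]])
target, reverberation = remove_reverberation(x, np.array([0.5]), d=3, K=3)
```

The same prediction F @ alpha is the quantity that a power-subtraction variant would use as its reverberation estimate instead of subtracting it directly.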
Next, a method of estimating a dereverberation filter performed by the estimating section 506v will be described. The sound source model stored in a sound source model storage section 504 in this embodiment represents the possible tendency of the audio signal in the form of a probability distribution as in the embodiments 1 and 2, and the optimization function is based on the probability distribution. A useful example of the sound source model is a time-varying normal distribution. In the following description, as the simplest sound source model, a model in which signals in each subband are independent of the signals in the other subbands is introduced. In addition, it is assumed that each subband signal is a time-varying white normal process that has a flat frequency spectrum and temporally varies only in signal energy.
As with the formulas (31) and (32) described earlier, a probability density function of a signal sb˜′=[sb,0˜′, . . . , sb,V−1˜′]T and its parametric space are defined as follows.
$$p(\tilde{s}'_b) = \mathcal{N}(\tilde{s}'_b;\ 0,\ \Psi'_b) \tag{31'}$$
$$\Psi'_b \in \Omega_{\Psi'} \tag{32'}$$
In this formula, N(sb˜′, 0, Ψb′) represents a multidimensional complex normal distribution with an average being 0 and a covariance matrix of the sound source model being Ψb′=E(sb˜′(sb˜′)*T), and Ψb′ assumes a different or common value for each sample b. In the following description, Ψb′ is referred to as a model covariance matrix, and it is assumed that the model covariance matrix Ψb′ is a diagonal matrix that has a different value for each sample. ΩΨ′ represents a set of all the possible values of Ψb′ (in other words, a parametric space of Ψb′). ψb,v′2=E(sb,v˜′(sb,v˜′)*) represents a v-th diagonal element of Ψb′. Since Ψb′ is a diagonal matrix, the probability density function can be defined as p(sb,v˜′)=N(sb,v˜′; 0, ψb,v′2) independently for each subband. ψv′2 represents a time series of v-th diagonal elements of the model covariance matrix, and ψv′2={ψb,v′2}. In addition, θv={αv, ψv′2} represents a set of estimation parameters for the subband v. In addition, a set of all the estimation parameters for all the subbands is represented by θ′={θ0, θ1, . . . , θV−1}. A log likelihood function Lv(θv) as the optimization function for each subband and a log likelihood function L′(θ′) as the optimization function for all the subbands are defined as follows.
The formula (63) can be transformed into the following formula (64) on the basis of the formulas (59) and (31′).
By estimating a parameter that maximizes the formula (64), an estimated value of the coefficient of the dereverberation filter can be determined. Maximization of the formula (64) can be achieved by the optimization algorithm described below.
The estimating section 506v constructs a dereverberation filter on the basis of αv finally obtained, and the removing section 508v removes the reverberation signal using the dereverberation filter according to the formulas (59) or (61) to determine a frequency-specific target signal sb,v˜′. Then, the integrating section 510 integrates the frequency-specific target signals sb,v˜′ while performing up-sampling to determine the target signal st˜.
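The estimation for one subband can be sketched as an alternation between two updates. The update formulas themselves are not reproduced in this text, so the sketch assumes the form that follows from a zero-mean complex normal model with a time-varying variance: the variance is set to the power of the current residual, and the filter is then obtained by weighted least squares. Function names, the variance floor and the fixed number of iterations are assumptions.

```python
import numpy as np

def delayed_matrix(x, d, K):
    """Columns are the subband observations delayed by d..K samples, channel by channel
    (requires K < number of samples)."""
    Q, B = x.shape
    cols = [np.concatenate([np.zeros(k, dtype=complex), x[q, :B - k]])
            for q in range(Q) for k in range(d, K + 1)]
    return np.stack(cols, axis=1)

def estimate_filter(x, d, K, n_iter=3, var_floor=1e-8):
    """Alternate between the model variance and the prediction coefficients."""
    F = delayed_matrix(x, d, K)
    target = x[0].astype(complex)
    var = np.maximum(np.abs(target) ** 2, var_floor)     # initial variance: observed power
    alpha = np.zeros(F.shape[1], dtype=complex)
    residual = target.copy()
    for _ in range(n_iter):
        Fw = F.conj() / var[:, None]                     # rows weighted by 1 / variance
        H = Fw.T @ F                                     # weighted covariance matrix
        h = Fw.T @ target                                # weighted cross-correlation
        alpha = np.linalg.lstsq(H, h, rcond=None)[0]     # tolerates a singular H
        residual = target - F @ alpha                    # current target estimate
        var = np.maximum(np.abs(residual) ** 2, var_floor)
    return alpha, residual

# Usage: an impulse passed through the recursion x[b] = s[b] + 0.5 * x[b-3].
x = np.array([[1.0, 0, 0, 0.5, 0, 0, 0.25, 0, 0, 0.125]])
alpha, residual = estimate_filter(x, d=3, K=3)
```

The matrix H has one row and column per filter tap, which is why a shorter filter K′ directly shrinks the most expensive step of the loop.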
As described above, in the subband processing, the observation signal is divided into time-domain signals for the respective frequency bands, and the time-domain signals are then down-sampled at intervals of γ samples, so that the sampling frequency of the time-domain signal for each frequency band is reduced to 1/γ of the original sampling frequency.
According to this embodiment, the dereverberation processing is separately performed for the time-domain signal for each frequency band, and the resulting signals are integrated to achieve the dereverberation for all the frequency bands. Comparing the case where down-sampling of the time-domain signal is performed and the case where the down-sampling is not performed, the size of the covariance matrix used for estimating the dereverberation filter can be reduced in the case where the down-sampling is performed. This is because the size of the covariance matrix depends on the filter length of the dereverberation filter, the filter length K of the dereverberation filter depends on the number of taps of the room impulse response, and the number of taps of the impulse response decreases as the sampling frequency decreases if the temporal length of the impulse response is physically fixed. In other words, since down-sampling in steps of γ samples is performed, the filter length of the dereverberation filter is reduced to K′(=K/γ+k1), which is shorter than the filter length K of the dereverberation filter according to the related art.
Since the size of the covariance matrix used to estimate the dereverberation filter decreases as the filter length of the dereverberation filter decreases as described above, the calculation cost of the estimation of the dereverberation filter is reduced.
Furthermore, in the case where the down-sampling is performed at a sampling frequency equal to or higher than twice the cut-off frequency of the low-pass filter, the subband signal determined by the subband division processing performed with the down-sampling can be precisely reconstructed by up-sampling. Therefore, the target signal is not deteriorated by the up-sampling performed when the integrating section 510 performs the integration processing.
The reverberation signal generating means 608v determines a frequency-specific reverberation signal rb,v using a dereverberation filter αv and an observation signal xt,v(q). More specifically, the frequency-specific reverberation signal rb,v is determined according to the following formula (70).
$$r_{b,v} = F_{b-d',v} \cdot \alpha_v^{T} \tag{70}$$
Then, the reverberation signal frequency specific power determining means 610v determines a frequency-specific power |rb,v|2 of the frequency-specific reverberation signal. In addition, the observation signal frequency specific power determining means 612v determines a frequency-specific power |xb,v(1)|2 of the observation signal xb,v(1) collected by the microphone for the first channel. Then, the subtracting means 614v determines a difference signal |xb,v(1)|2−|rb,v|2 by calculating the difference between the frequency-specific power of the frequency-specific observation signal and the frequency-specific power of the frequency-specific reverberation signal, and determines a frequency-specific target signal on the basis of the difference signal and the frequency-specific observation signal xb,v(1) used for calculation of the difference signal (step S28). For example, the frequency-specific target signals sb,v˜′ are determined according to the following formulas.
In the formula, max {A, B} represents a function that chooses a larger one of A and B, and G0 represents a flooring coefficient that determines the lower limit of suppression of the signal energy in power subtraction and is greater than 0 (G0>0).
Then, the integrating section 510 integrates the frequency-specific target signals sb,v′˜ (v=0, . . . , V−1) and outputs the resulting target signal st˜.
The dereverberation apparatus 600 thus configured is less susceptible to the estimation error of the dereverberation filter when removing the reverberation signal than the dereverberation apparatus 500.
The dereverberation apparatuses 300 to 600 described above with regard to the embodiments 1 to 4 are configured for a batch processing in which all the signals are obtained in advance. However, as described with regard to an embodiment 5, reverberation signals may be sequentially removed from observation signals collected by a microphone. For example, a dereverberation filter estimated by an estimating section is configured to be (sequentially) estimated and updated at predetermined time intervals. When the update is performed, the optimization algorithm described above is applied to part or all of the observation signals obtained before that point in time to estimate a dereverberation filter. In combination with the estimation, the estimating section 306u of the dereverberation apparatus 300 (see
[Specific Example of Sound Source Model]
In the following, specific examples of the sound source model according to the embodiments 1 to 5 will be described with reference to examples of sets ΩΨ and ΩΨ′. The embodiments 1, 2 and 5 will be essentially described. Descriptions of the embodiments 3 and 4 will be omitted, because specific examples thereof can be constructed by replacing the symbols in the following description of the embodiments 1, 2 and 5 as follows.
Then, the parameter to be estimated in the iteration algorithm described above is the value of the index, rather than the covariance matrix. In the following, the state at the time n is denoted by in, the covariance matrix corresponding to the state in is denoted by Ψ(in), and the diagonal element of the covariance matrix Ψ(in) is denoted by ψu2(in). The state in of the sound source model at each time is not a value specific to each frequency band but a value common to all the frequency bands. Therefore, the optimization function determined on the basis of the log likelihood function can be defined by the following formula (81) for all the frequency bands.
In this formula, the estimation parameter θ={C, I} is composed of a time series I={i1, i2 . . . } of states in and prediction coefficients C={C0, C1, . . . , CU−1} for the respective frequency bands. On the basis of the optimization function, the update formula (38) of the optimization algorithm can be replaced with the following update formula (82) for all the frequency bands. The update formula (39) is not modified.
The replacement of the formula (38) with the formula (82) allows the estimating section 306u to estimate the dereverberation filter with higher precision.
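The state update for a finite set of model covariance matrices can be sketched as a per-frame search over the states. The formula (82) is not reproduced in this text, so the sketch assumes that each state is scored by the complex normal log likelihood summed over all frequency bands; the function name, the array layout and the variance floor are hypothetical.

```python
import numpy as np

def select_states(S, codebook, floor=1e-12):
    """For each frame, pick the state whose diagonal covariance best explains
    the current target estimate across all frequency bands.

    S        : (n_frames, U) complex target estimates
    codebook : (n_states, U) positive diagonal elements, one row per state
    """
    power = np.abs(S) ** 2
    var = np.maximum(codebook, floor)
    # Log likelihood up to a constant: -sum_u [ log(var) + power / var ]
    loglik = -(np.log(var)[None, :, :] + power[:, None, :] / var[None, :, :]).sum(axis=2)
    return np.argmax(loglik, axis=1)

# Usage: a quiet state and a loud state, and one frame matching each of them.
codebook = np.array([[1.0, 1.0], [100.0, 100.0]])
S = np.array([[1.0 + 0j, 1.0 + 0j], [10.0 + 0j, 10.0 + 0j]])
states = select_states(S, codebook)
```

Because one index is chosen per frame rather than one variance per frame and band, the number of free parameters of the sound source model is much smaller in this setting.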
The estimation parameter θ in the optimization function expressed by the formula (83) is the same as the estimation parameter defined by the finite state machine. The optimization function of the formula (83) can be readily maximized by simply replacing the update formula (38) in the optimization algorithm described above with the following update formula.
The calculation to maximize the formula (84) can be efficiently achieved by known dynamic programming.
In the description of the embodiments 1 to 5, it is assumed that room transfer functions for different microphones have no common zero point in the formula (12′) that expresses the relationship between the observation signal and the audio signal, and that two or more microphones are used. However, it has been experimentally confirmed that the dereverberation methods according to the embodiments 1 to 5 of the present invention can remove the reverberation with high quality even if these assumptions are not satisfied.
An experimental result that demonstrates the effect of the dereverberation apparatus according to the embodiment 4 using a single microphone will be described. The subject sound is a sound signal composed of a voice sequence of five words produced by a woman. The observation signal is synthesized by convolution with a single-channel room impulse response measured in a reverberant room. The reverberation time (RT60) is 0.5 seconds.
Therefore, the present invention can be applied to a case where the number Q of microphones is one (Q=1) or a case where the room transfer functions for different microphones have a common zero point. Although it is assumed that the microphone closest to the sound source is known and is the microphone for the first channel in the related art 1, it is experimentally confirmed that the present invention does not need the assumption that the microphone closest to the sound source is known.
In the embodiments 1 to 5 described above, the processing of the dividing section involves the short-time Fourier transform and the subband division. As an alternative method of dividing into frequency bands, the wavelet transform or the discrete cosine transform may be used as long as the number of samples of the observation signal is reduced. Even if these transforms cause signals in different frequency bands to correlate with each other, the correlation can be ignored by approximation to achieve the same advantages.
Furthermore, as an alternative to calculating the formula (39) (in the case of estimating Cu) or the formula (67) (in the case of estimating αv) to optimize the dereverberation filter Cu or αv, a sequential estimation algorithm often used in adaptive filters may be used. As such an optimization method, the least mean square (LMS) method, the recursive least squares (RLS) method, the steepest descent method, and the conjugate gradient method are known, for example. Such a method can substantially reduce the amount of calculation required for one repetition. As a result, at least one estimation can be repeated in real time with a reduced calculation cost. Thus, the real time processing can be achieved with a relatively inexpensive digital signal processor (DSP). Although one repetition is not always sufficient to provide a precise dereverberation filter, the estimation precision can be gradually improved with time.
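A sequential update of the kind mentioned above can be sketched with a normalized LMS step. This is a simplified stand-in rather than the method of the specification: it is a normalized variant of LMS, it handles a single channel in a single subband, and it ignores the time-varying variance of the sound source model. All names and the step size are hypothetical.

```python
import numpy as np

def nlms_update(alpha, f, x_now, mu=0.5, eps=1e-12):
    """One normalized LMS step: predict x_now from the delayed samples f,
    then move alpha along the conjugate of f scaled by the prediction error."""
    error = x_now - f @ alpha
    alpha = alpha + mu * error * f.conj() / (np.vdot(f, f).real + eps)
    return alpha, error

def run_nlms(x, d, K, mu=0.5):
    """Process the samples one by one; out[b] is the prediction error computed
    before the coefficients are updated with sample b."""
    B = len(x)
    alpha = np.zeros(K - d + 1, dtype=complex)
    out = np.zeros(B, dtype=complex)
    for b in range(B):
        f = np.array([x[b - k] if b - k >= 0 else 0.0 for k in range(d, K + 1)],
                     dtype=complex)
        alpha, out[b] = nlms_update(alpha, f, x[b], mu)
    return alpha, out

# Usage: an impulse passed through the recursion x[b] = s[b] + 0.5 * x[b-3].
x = np.array([1.0, 0, 0, 0.5, 0, 0, 0.25, 0, 0, 0.125])
alpha, out = run_nlms(x, d=3, K=3, mu=1.0)
```

Each step costs a number of operations proportional to the filter length and needs no matrix inversion, which is the property that makes real time operation on a small processor plausible.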
<Hardware Configuration>
The dereverberation apparatuses that operate under the control of a program according to the embodiments described above have a central processing unit (CPU), an input section, an output section, an auxiliary storage device, a random access memory (RAM), a read only memory (ROM) and a bus (these components are not shown).
The CPU performs various calculations according to various loaded programs. The auxiliary storage device is a hard disk drive, a magneto-optical (MO) disc, or a semiconductor memory, for example. The RAM is a static random access memory (SRAM) or a dynamic random access memory (DRAM), for example. The bus connects the CPU, the input section, the output section, the auxiliary storage device, the RAM and the ROM to each other in such a manner that these components can communicate with each other.
<Cooperation Between Hardware and Software>
The dereverberation apparatuses according to the present invention are implemented by loading a predetermined program to the hardware described above and making the CPU execute the program. In the following, a functional configuration of each apparatus thus implemented will be described.
The input section and the output section of the dereverberation apparatus are communication devices, such as a LAN card or a modem, that operate under the control of the CPU to which a predetermined program is loaded. The dividing section, the estimating section and the processing section are calculating sections implemented by loading a predetermined program to the CPU and executing the program by the CPU. The auxiliary storage device described above serves as the sound source model storage section.
[Experimental Result]
An experimental result that demonstrates the effect of the dereverberation apparatuses according to the embodiments will be described. In this experiment, the dereverberation apparatus 300 according to the embodiment 1 and the dereverberation apparatus 100 according to the related art are compared. The subject sounds are sound signals of two voice series of five words produced by a man and a woman. The observation signal is synthesized by convolution with a two-channel room impulse response measured in a reverberant room. The reverberation time (RT60) is 0.5 seconds. The dereverberation is performed for each voice series, and the dereverberation performance is evaluated in terms of cepstrum distortion (abbreviated as CD hereinafter) of the signal after dereverberation and real time factor (abbreviated as RTF hereinafter) of the dereverberation processing. CD is defined as follows.
In this formula, ck^ and ck are cepstrum coefficients of the sound signal to be evaluated and a clean sound signal, respectively, and D=12. With this evaluation measure, a signal distortion can be evaluated for both the energy time pattern and the spectral envelope. RTF is defined as (time required for dereverberation processing)/(time of observation signal). Every dereverberation method used in the experiment is implemented in the MATLAB programming language on a Linux computer. The sampling frequency is 8 kHz, and the length N of the short time analysis window is 256.
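The two evaluation measures can be sketched as follows. The CD formula is not reproduced in this text, so the expression below is an assumption: it is a commonly used definition in decibels that includes the zeroth coefficient, which matches the remark that the energy time pattern is evaluated as well. The function names are hypothetical.

```python
import math

def cepstral_distortion(c_hat, c_ref, D=12):
    """Cepstral distortion in dB between coefficient lists c_hat and c_ref
    (indices 0..D), using an assumed standard definition."""
    d0 = c_hat[0] - c_ref[0]
    rest = sum((c_hat[k] - c_ref[k]) ** 2 for k in range(1, D + 1))
    return (10.0 / math.log(10.0)) * math.sqrt(d0 ** 2 + 2.0 * rest)

def real_time_factor(processing_seconds, signal_seconds):
    """RTF as defined in the text: processing time divided by signal duration."""
    return processing_seconds / signal_seconds

# Usage: identical cepstra give zero distortion; an RTF below 1 means
# the processing is faster than the duration of the signal.
clean = [0.0] * 13
shifted = [1.0] + [0.0] * 12
cd_same = cepstral_distortion(clean, clean)
cd_shift = cepstral_distortion(shifted, clean)
rtf = real_time_factor(2.0, 4.0)
```

In an evaluation the distortion would normally be averaged over all frames of the signal; the sketch computes it for a single pair of coefficient vectors.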
As can be seen from
According to the present invention, the observation signal is converted into a frequency-domain observation signal corresponding to one of a plurality of frequency bands, and a dereverberation filter corresponding to each frequency band is estimated using the frequency-specific observation signal corresponding to the frequency band. The order of the dereverberation filter corresponding to each frequency band is smaller than the order of the dereverberation filter in the case where the observation signal is used directly. Accordingly, the size of the covariance matrix decreases, so that the calculation cost involved in estimation of the dereverberation filter is reduced. In addition, since the dereverberation filter is estimated by using each frequency-specific observation signal, the room transfer function does not have to be known in advance.
Number | Date | Country | Kind |
---|---|---|---|
2008-052175 | Mar 2008 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2009/054231 | 2/27/2009 | WO | 00 | 9/2/2010 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2009/110578 | 9/11/2009 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
5774562 | Furuya et al. | Jun 1998 | A |
20020059065 | Rajan | May 2002 | A1 |
Number | Date | Country |
---|---|---|
9 321860 | Dec 1997 | JP |
2004 274234 | Sep 2004 | JP |
2006 243676 | Sep 2006 | JP |
Entry |
---|
Tomohiro Nakatani, etc., “Importance of Energy and Spectral Features in Gaussian Source Model for Speech Dereverberation”, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 21-24, 2007, p. 299-302. |
Tomohiro, Nakatani et al., “Study on Speech Dereverberation With Autocorrelation Codebook”, Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, (ICASSP-2007), vol. I, p. 193-196, (Apr. 2007). |
Tomohiro, Nakatani et al., “Importance of Energy and Spectral Features in Gaussian Source Model for Speech Dereverberation”, IEEE Workshop on Application of Signal Processing to Audio and Acoustics (WASPAA-2007), p. 299-302, (2007). |
Gaubitch, D. Nikolay et al., “Subband Method for Multichannel Least Squares Equalization of Room Transfer Functions”, Proceedings IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, (WASPAA-2007), p. 14-17, (2007). |
Miyoshi, Masato: “Estimating AR Parameter-Sets for Linear-Recurrent Signals in Convolutive Mixtures”, 4th International Symposium on Independent Component Analysis and Blind Signal Separation, (ICA-2003), p. 585-589, (Apr. 2003). |
Kinoshita, Keisuke et al., “Spectral Subtraction Steered by Multi-Step Forward Linear Prediction for Single Channel Speech Dereverberation”, Proc., ICASSP-2006, vol. I, p. 817-820, (May 2006). |
Portnoff, R. Michael: “Implementation of the Digital Phase Vocoder Using the Fast Fourier Transform”, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, No. 3, pp. 243-248, (Jun. 1976). |
Reilly, P. James et al., “The Complex Subband Decomposition and its Application to the Decimation of Large Adaptive Filtering Problems” IEEE Transactions on Signal Processing, vol. 50, No. 11, pp. 2730-2743, (Nov. 2002). |
Number | Date | Country | |
---|---|---|---|
20110002473 A1 | Jan 2011 | US |