The present invention relates to a signal processing technique for an acoustic signal.
NPL 1 and NPL 2 disclose a method of suppressing noise and reverberation from an observation signal in the frequency domain. In this method, reverberation and noise are suppressed by receiving an observation signal in the frequency domain and a steering vector representing the direction of a sound source or an estimated vector thereof, estimating an instantaneous beamformer for minimizing the power of the frequency-domain observation signal under a constraint condition that sound reaching a microphone from the sound source is not distorted, and applying the instantaneous beamformer to the frequency-domain observation signal (conventional method 1).
PTL 1 and NPL 3 disclose a method of suppressing reverberation from an observation signal in the frequency domain. In this method, reverberation in an observation signal in the frequency domain is suppressed by receiving an observation signal in the frequency domain and the power of a target sound at each time, or an estimated value thereof, estimating a reverberation suppression filter for suppressing reverberation in the target sound on the basis of a weighted power minimization reference of a prediction error, and applying the reverberation suppression filter to the frequency-domain observation signal (conventional method 2).
NPL 4 discloses a method of suppressing noise and reverberation by cascade-connecting conventional method 2 and conventional method 1. In this method, at a prior stage, an observation signal in the frequency domain and the power of a target sound at each time are received and reverberation is suppressed using conventional method 2, and then, at a later stage, a steering vector is received and reverberation and noise are further suppressed using conventional method 1 (conventional method 3).
In the conventional methods, it may be impossible to sufficiently suppress reverberation and noise. Conventional method 1 is a method originally developed for the purpose of suppressing noise and may not always be capable of sufficiently suppressing reverberation. With conventional method 2, noise cannot be suppressed. Conventional method 3 can suppress more noise and reverberation than when conventional method 1 or conventional method 2 is used alone. With conventional method 3, however, conventional method 2 serving as the prior stage and conventional method 1 serving as the later stage are viewed as independent systems and optimization is performed in the respective systems. Therefore, when conventional method 2 is applied at the prior stage, it may not always be possible to sufficiently suppress reverberation due to the effects of noise. Further, when conventional method 1 is applied at the later stage, it may not always be possible to sufficiently suppress noise and reverberation due to the effects of residual reverberation.
The present invention has been designed in consideration of these points, and an object thereof is to provide a technique with which noise and reverberation can be sufficiently suppressed.
In the present invention, a convolutional beamformer that calculates, at each time, a weighted sum of a current signal and a past signal sequence having a predetermined delay and a length of 0 or more is acquired. The convolutional beamformer is acquired such that estimation signals, which are acquired by applying the convolutional beamformer to frequency-divided observation signals corresponding respectively to a plurality of frequency bands of observation signals acquired by picking up acoustic signals emitted from a sound source, increase a probability expressing a speech-likeness of the estimation signals based on a predetermined probability model. Target signals are then acquired by applying the acquired convolutional beamformer to the frequency-divided observation signals.
In the present invention, the convolutional beamformer is acquired such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the probability model, and therefore noise suppression and reverberation suppression can be optimized as a single system, with the result that noise and reverberation can be sufficiently suppressed.
Embodiments of the present invention will be described below.
[Definitions of Symbols]
First, symbols used in the embodiments will be defined.
M: M is a positive integer expressing a number of microphones. For example, M≥2.
m: m is a positive integer expressing the microphone number, and satisfies 1≤m≤M. The microphone number is represented by upper right superscript in round parentheses. In other words, a value or a vector based on a signal picked up by a microphone having the microphone number m is represented by a symbol having the upper right superscript “(m)” (for example, xf, t(m)).
N: N is a positive integer expressing the total number of time frames of signals. For example, N≥2.
t, τ: t and τ are positive integers expressing the time frame number, and t satisfies 1≤t≤N. The time frame number is represented by lower right subscript. In other words, a value or a vector corresponding to a time frame having the time frame number t is represented by a symbol having the lower right subscript “t” (for example, xf, t(m)). Similarly, a value or a vector corresponding to a time frame having the time frame number τ is represented by a symbol having the lower right subscript “τ”.
P: P is a positive integer expressing a total number of frequency bands (discrete frequencies). For example, P≥2.
f: f is a positive integer expressing the frequency band number, and satisfies 1≤f≤P. The frequency band number is represented by lower right subscript. In other words, a value or a vector corresponding to a frequency band having the frequency band number f is represented by a symbol having the lower right subscript “f” (for example, xf, t(m)).
T: T expresses a non-conjugated transpose of a matrix or a vector. α0T represents a matrix or a vector acquired by non-conjugated transposition of α0.
H: H expresses a conjugated transpose of a matrix or a vector. α0H represents a matrix or a vector acquired by conjugated transposition of α0.
|α0|: |α0| expresses the absolute value of α0.
∥α0∥: ∥α0∥ expresses the norm of α0.
|α0|γ: |α0|γ expresses a weighted absolute value γ|α0| of α0.
∥α0∥γ: ∥α0∥γ expresses a weighted norm γ∥α0∥ of α0.
In this specification, a “target signal” denotes a signal corresponding to a direct sound and an initial reflected sound, within a signal (for example, a frequency-divided observation signal) corresponding to a sound emitted from a target sound source and picked up by a microphone. The initial reflected sound denotes a reverberation component derived from the sound emitted from the target sound source that reaches the microphone at a delay of no more than several tens of milliseconds following the direct sound. The initial reflected sound typically acts to improve the clarity of the sound, and in this embodiment, a signal corresponding to the initial reflected sound is also included in the target signal. Here, the signal corresponding to the sound picked up by the microphone also includes, in addition to the target signal described above, late reverberation (a component acquired by excluding the initial reflected sound from the reverberation) derived from the sound emitted from the target sound source, and noise derived from a source other than the target sound source. In a signal processing method, the target signal is estimated by suppressing late reverberation and noise from a frequency-divided observation signal corresponding to a sound recorded by the microphone, for example. In this specification, unless specified otherwise, “reverberation” is assumed to refer to “late reverberation”.
[Principles]
Next, principles will be described.
<Prerequisite Method 1>
Method 1 serving as a prerequisite of the method according to the embodiments will now be described. In method 1, noise and reverberation are suppressed from an M-dimensional observation signal (frequency-divided observation signals) in the frequency domain:
x_{f,t} = [x_{f,t}^{(1)}, x_{f,t}^{(2)}, ..., x_{f,t}^{(M)}]^T   (1)
The frequency-divided observation signals xf, t are acquired by transforming M observation signals, which are acquired by picking up acoustic signals emitted from one or a plurality of sound sources using M microphones, to the frequency domain. The observation signals are acquired by picking up the acoustic signals emitted from the sound sources in an environment where noise and reverberation exist. xf, t(m) is acquired by transforming the observation signal picked up by the microphone having the microphone number m to the frequency domain. xf, t(m) corresponds to the frequency band having the frequency band number f and the time frame having the time frame number t. In other words, the frequency-divided observation signals xf, t are time series signals.
In method 1, an instantaneous beamformer wf, 0 for minimizing a cost function C1 (wf, 0) below is determined for each frequency band under the constraint condition in which “the target signals are not distorted as a result of applying an instantaneous beamformer (for example, a minimum power distortionless response beamformer) wf, 0 for calculating the weighted sum of the signals at the current time to the frequency-divided observation signals xf, t at each time”.
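A standard cost function of this type is, for example, the following (this specific form is a reconstruction consistent with the above description):
C1(w_{f,0}) = Σ_{t=1}^{N} |w_{f,0}^H x_{f,t}|^2   (cf. (2))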
Note that the lower right subscript “0” of wf, 0 does not represent the time frame number, wf, 0 being independent of the time frame. The constraint condition is a condition in which, for example, wf, 0Hνf, 0 is a constant (1, for example). Here,
ν_{f,0} = [ν_{f,0}^{(1)}, ν_{f,0}^{(2)}, ..., ν_{f,0}^{(M)}]^T   (4)
is a steering vector having, as an element, a transfer function νf, 0(m) relating to the direct sound and the initial reflected sound from the sound source to each microphone (the sound pickup position of the acoustic signal), or an estimated vector (an estimated steering vector) thereof. In other words, νf, 0 is expressed by an M-dimensional (the dimension of the number of microphones) vector having, as an element, the transfer function νf, 0(m), which corresponds to the direct sound and initial reflected sound parts of an impulse response from the sound source position to each microphone (i.e. the reverberation that arrives at a delay of no more than several tens of milliseconds (for example, within 30 milliseconds) following the direct sound). When it is difficult to estimate the gain of the steering vector, a normalized vector acquired by normalizing the transfer function of each element so that the gain of a microphone having one of the microphone numbers m0∈{1, . . . , M} becomes a constant g (g≠0) may be used as νf, 0. In other words, as illustrated below, a normalized vector may be used as νf, 0.
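For example, with m0 as the selected microphone number, a normalized vector of the following form may be used (this specific form is an assumption consistent with the above description, where ν′_{f,0}^{(m)}, a symbol introduced here, denotes the unnormalized transfer function of each element):
ν_{f,0} = (g / ν′_{f,0}^{(m0)}) [ν′_{f,0}^{(1)}, ν′_{f,0}^{(2)}, ..., ν′_{f,0}^{(M)}]^T   (cf. (5))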
By applying the instantaneous beamformer wf, 0 acquired as described above to the frequency-divided observation signal xf, t of each frequency band in the manner illustrated below, a target signal yf, t in which noise and reverberation have been suppressed from the frequency-divided observation signal xf, t is acquired.
y_{f,t} = w_{f,0}^H x_{f,t}   (6)
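As an illustration only, the processing of method 1 for a single frequency band can be sketched as follows in Python with numpy; the function name and array shapes are assumptions introduced here, not part of the method itself.

    import numpy as np

    def mpdr_beamformer(X, v):
        # X: (M, N) frequency-divided observation signals x_{f,t} of one band f.
        # v: (M,) steering vector nu_{f,0} or an estimated vector thereof.
        M, N = X.shape
        R = X @ X.conj().T / N                 # spatial covariance of the observations
        Rinv_v = np.linalg.solve(R, v)
        w = Rinv_v / (v.conj() @ Rinv_v)       # minimize output power subject to w^H v = 1
        y = w.conj() @ X                       # equation (6): y_{f,t} = w_{f,0}^H x_{f,t}
        return y, w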
<Prerequisite Method 2>
Method 2 serving as a prerequisite of the method according to the embodiments will now be described. In method 2, reverberation is suppressed from the frequency-divided observation signal xf, t. In method 2, a reverberation suppression filter Ff, τ for minimizing a cost function C2 (Ff) below is determined for τ=d, d+1, . . . , d+L−1 in each frequency band.
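A weighted prediction error cost of this type can be written, for example, as follows (this specific form is a reconstruction consistent with the description and with equation (8) below):
C2(F_f) = Σ_{t=1}^{N} ∥x_{f,t} − Σ_{τ=d}^{d+L−1} F_{f,τ}^H x_{f,t−τ}∥^2 / σ_{f,t}^2   (cf. (7))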
Here, the reverberation suppression filter Ff, τ is an M×M-dimensional matrix filter for suppressing reverberation from the frequency-divided observation signal xf, t. d is a positive integer expressing a prediction delay. L is a positive integer expressing the filter length. σf, t2 is the power of the target signal at the time frame t.
∥x∥γ relating to the frequency-divided observation signal x is the weighted norm ∥x∥γ = γ(x^H x).
By applying the reverberation suppression filter Ff, t acquired as described above to the frequency-divided observation signal xf, t of each frequency band in the manner illustrated below, a target signal zf, t in which reverberation has been suppressed from the frequency-divided observation signal xf, t is acquired.
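For example (this form is a reconstruction consistent with the description):
z_{f,t} = x_{f,t} − Σ_{τ=d}^{d+L−1} F_{f,τ}^H x_{f,t−τ}   (cf. (8))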
Here, the target signal zf, t is an M-dimensional column vector, as shown below.
z_{f,t} = [z_{f,t}^{(1)}, z_{f,t}^{(2)}, ..., z_{f,t}^{(M)}]^T
<Method of Embodiments>
The method of the embodiments will now be described. A target signal yf, t acquired by suppressing noise and reverberation from the frequency-divided observation signal xf, t by using a method integrating methods 1 and 2 can be modeled as follows.
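A model of this type can be written, for example, as follows (a reconstruction consistent with equations (9A) and (16) below; in the notation defined below, the right side equals w̄_f^H x̄_{f,t}):
y_{f,t} = w_{f,0}^H x_{f,t} + Σ_{τ=d}^{d+L−1} w_{f,τ}^H x_{f,t−τ}   (cf. (9))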
Here, with respect to τ≠0, w_{f,τ} = F_{f,τ} w_{f,0}, and w_{f,τ} corresponds to a filter for performing noise suppression and reverberation suppression simultaneously. w−f is a convolutional beamformer that calculates a weighted sum of a current signal and a past signal sequence having a predetermined delay at each time. Note that the “−” of “w−f” should be written directly above the “w” (that is, w̄_f), but due to notation limitations may also be written to the upper right of “w”.
The convolutional beamformer w−f calculates the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time point. The convolutional beamformer w−f is expressed as shown below, for example,
w̄_f = [w̄_f^{(1)T}, w̄_f^{(2)T}, ..., w̄_f^{(M)T}]^T   (10)
where the following is satisfied.
w̄_f^{(m)} = [w_{f,0}^{(m)}, w_{f,d}^{(m)}, w_{f,d+1}^{(m)}, ..., w_{f,d+L−1}^{(m)}]^T   (10A)
Further, x−f, t is expressed as follows.
x̄_{f,t} = [x̄_{f,t}^{(1)T}, x̄_{f,t}^{(2)T}, ..., x̄_{f,t}^{(M)T}]^T   (11)
x̄_{f,t}^{(m)} = [x_{f,t}^{(m)}, x_{f,t−d}^{(m)}, x_{f,t−d−1}^{(m)}, ..., x_{f,t−d−L+1}^{(m)}]^T   (11A)
Note that throughout this specification, cases in which L=0 in equations (9) to (11A) are also assumed to be included in the convolutional beamformer of the present invention. In other words, even cases in which the length of the past signal sequence used by the convolutional beamformer to calculate the weighted sum is 0 are treated as examples of realization of the convolutional beamformer. At this time, the summation term in equation (9) becomes 0, and therefore equation (9) becomes equation (9A), shown below. Further, the respective right sides of equations (10A) and (11A) become vectors constituted respectively by only one first element (i.e., scalars), and therefore become equations (10AA) and (11AA), respectively.
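That is, in this case the following hold:
y_{f,t} = w_{f,0}^H x_{f,t}   (9A)
w̄_f^{(m)} = w_{f,0}^{(m)}   (10AA)
x̄_{f,t}^{(m)} = x_{f,t}^{(m)}   (11AA)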
Note that the convolutional beamformer w−f of equation (9A) is a beamformer that calculates, at each time point, the weighted sum of the current signal and a signal sequence having a predetermined delay and a length of 0, and therefore the convolutional beamformer calculates the weighted value of the current signal at each time point. Further, as will be described below, even when L=0, the signal processing device of the present invention can acquire the target signal by determining a convolutional beamformer on the basis of a probability expressing a speech-likeness and applying the convolutional beamformer to the frequency-divided observation signals.
Here, assuming that yf, t in equation (9) preferably conforms to a speech probability density function p ({yf, t}t=1:N; w−f) (a probability model), the signal processing device determines the convolutional beamformer w−f such that it increases the probability p ({yf, t}t=1:N; w−f) (in other words, a probability expressing the speech-likeness of yf, t) of yf, t based on the speech probability density function. Preferably, the convolutional beamformer w−f which maximizes the probability expressing the speech-likeness of yf, t is determined. For example, the signal processing device determines the convolutional beamformer w−f such that it increases log p ({yf, t}t=1:N; w−f), and preferably determines the convolutional beamformer w−f which maximizes log p ({yf, t}t=1:N; w−f).
A complex normal distribution having an average of 0 and a variance matching the power σf, t2 of the target signal can be cited as an example of a speech probability density function. The “target signal” is a signal corresponding to the direct sound and the initial reflected sound, within a signal corresponding to a sound emitted from a target sound source and picked up by a microphone. Further, the signal processing device determines the convolutional beamformer w−f under the constraint condition in which “the target signals are not distorted as a result of applying the convolutional beamformer w−f to the frequency-divided observation signals xf, t”, for example. This constraint condition is a condition in which, for example, wf, 0Hνf, 0 is a constant (1, for example). On the basis of this constraint condition, for example, the signal processing device determines w−f which maximizes log p ({yf, t}t=1:N; w−f), which is determined as shown below, for each frequency band.
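Under a zero-mean complex normal model whose variance is σ_{f,t}^2, this log probability takes, for example, the following form (a reconstruction consistent with the description):
log p({y_{f,t}}_{t=1:N}; w̄_f) = −Σ_{t=1}^{N} |y_{f,t}|^2 / σ_{f,t}^2 + const.   (cf. (12))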
Here, “const.” expresses a constant.
The following function, which is acquired by subtracting the constant term (const.) from log p ({yf, t}t=1:N; w−f) in equation (12) and reversing the plus/minus sign, is set as a cost function C3 (w−f).
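With y_{f,t} = w̄_f^H x̄_{f,t}, this cost function is, for example (a reconstruction consistent with the description):
C3(w̄_f) = Σ_{t=1}^{N} |w̄_f^H x̄_{f,t}|^2 / σ_{f,t}^2 = w̄_f^H R_f w̄_f   (cf. (13))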
Here, R is a weighted space-time covariance matrix determined as shown below.
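For example (a reconstruction consistent with the description):
R_f = Σ_{t=1}^{N} x̄_{f,t} x̄_{f,t}^H / σ_{f,t}^2   (cf. (14))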
The signal processing device may determine w−f which minimizes the cost function C3 (w−f) of equation (13) under the constraint condition described above (in which, for example, wf, 0Hνf, 0 is a constant), for example.
The analytical solution of w−f for minimizing the cost function C3 (w−f) under the constraint condition described above (in which, for example, wf, 0Hνf, 0=1) is as shown below.
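Under the constraint condition w̄_f^H v̄_f = 1, the standard solution of this constrained quadratic minimization is (a reconstruction consistent with the description):
w̄_f = R_f^{−1} v̄_f / (v̄_f^H R_f^{−1} v̄_f)   (cf. (15))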
Here, v̄_f is a vector acquired by disposing the elements νf, 0(m) of the steering vector νf, 0 as follows.
v̄_f = [ṽ_f^{(1)T}, ṽ_f^{(2)T}, ..., ṽ_f^{(M)T}]^T
ṽ_f^{(m)} = [ν_{f,0}^{(m)}, 0, ..., 0]^T
Here, ṽ_f^{(m)} is an (L+1)-dimensional column vector having ν_{f,0}^{(m)} and L zeros as elements.
The signal processing device acquires the target signal yf, t by applying the determined convolutional beamformer w−f to the frequency-divided observation signal xf, t as follows.
y_{f,t} = w̄_f^H x̄_{f,t}   (16)
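As an illustration only, the batch processing of equations (10) to (16) for one frequency band can be sketched as follows in Python with numpy; the function name, the array layouts, and the small diagonal loading term are assumptions introduced here.

    import numpy as np

    def wpd_convolutional_beamformer(X, v, sigma2, d=3, L=10):
        # X: (M, N) frequency-divided observation signals of one band f.
        # v: (M,) steering vector nu_{f,0} or an estimate; sigma2: (N,) target power.
        M, N = X.shape
        K = M * (L + 1)
        Xbar = np.zeros((K, N), dtype=complex)   # stacked vectors of (11)/(11A)
        for t in range(N):
            taps = np.zeros((M, L + 1), dtype=complex)
            taps[:, 0] = X[:, t]                 # current frame
            for l in range(L):                   # past frames delayed by d
                if t - d - l >= 0:
                    taps[:, l + 1] = X[:, t - d - l]
            Xbar[:, t] = taps.reshape(-1)        # microphone-major ordering as in (10A)
        R = (Xbar / sigma2) @ Xbar.conj().T      # weighted covariance, cf. (14)
        R += 1e-6 * np.abs(np.trace(R)) / K * np.eye(K)   # diagonal loading (assumption)
        vbar = np.zeros(K, dtype=complex)
        vbar[0::L + 1] = v                       # nu_{f,0}^{(m)} followed by L zeros
        Rinv_v = np.linalg.solve(R, vbar)
        w = Rinv_v / (vbar.conj() @ Rinv_v)      # analytical solution, cf. (15)
        y = w.conj() @ Xbar                      # target signals, equation (16)
        return y, w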
Next, a first embodiment will be described.
As illustrated in the figures, a signal processing device 1 of this embodiment includes an estimation unit 11 and a suppression unit 12, and executes the processing of steps S11 and S12 described below.
<Step S11>
As illustrated in the figures, the frequency-divided observation signals xf, t are first input into the estimation unit 11.
The estimation unit 11 acquires and outputs the convolutional beamformer w−f for calculating the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the predetermined probability model where the estimation signals are acquired by applying the convolutional beamformer w−f to the frequency-divided observation signals xf, t in respective frequency bands. For example, the estimation unit 11 determines the convolutional beamformer w−f such that it increases the probability expressing speech-likeness of yf, t based on the probability density function p ({yf, t}t=1:N; w−f) (such that log p ({yf, t}t=1:N; w−f) is increased, for example). The estimation unit 11 preferably determines the convolutional beamformer w−f which maximizes the probability (maximizes log p ({yf, t}t=1:N; w−f), for example).
<Step S12>
The frequency-divided observation signal xf, t and the convolutional beamformer w−f acquired in step S11 are input into the suppression unit 12. The suppression unit 12 acquires and outputs the target signal yf, t (the estimation signal) by applying the convolutional beamformer w−f to the frequency-divided observation signal xf, t in each frequency band. For example, the suppression unit 12 acquires and outputs the target signal yf, t by applying w−f to x−f, t as shown in equation (16).
<Features of this Embodiment>
In this embodiment, the convolutional beamformer w−f for calculating the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the predetermined probability model is determined where the estimation signals are acquired by applying the convolutional beamformer w−f to the frequency-divided observation signals xf, t. This corresponds to optimizing noise suppression and reverberation suppression as a single system. In this embodiment, therefore, noise and reverberation can be suppressed more adequately than with the conventional methods.
Next, a second embodiment will be described. Hereafter, processing units and steps described heretofore will be cited using identical reference numerals, and description thereof will be simplified.
As illustrated in the figures, a signal processing device 2 of this embodiment includes an estimation unit 21, which has a matrix estimation unit 211 and a convolutional beamformer estimation unit 212, and the suppression unit 12 described in the first embodiment.
The estimation unit 21 of this embodiment acquires and outputs the convolutional beamformer w−f which minimizes a sum of values (the cost function C3 (w−f) of equation (13), for example) acquired by weighting the power of the estimation signals at each time belonging to a predetermined time interval by the reciprocal of the power σf, t2 of the target signals or the reciprocal of the estimated power σf, t2 of the target signals under the constraint condition in which “the target signals are not distorted as a result of applying the convolutional beamformer w−f to the frequency-divided observation signals xf, t”. As illustrated in equation (9), the convolutional beamformer w−f is equivalent to a beamformer acquired by integrating a reverberation suppression filter Ff, t for suppressing reverberation from the frequency-divided observation signal xf, t and the instantaneous beamformer wf, 0 for suppressing noise from a signal acquired by applying the reverberation suppression filter Ff, t to the frequency-divided observation signal xf, t. Further, the constraint condition is a condition in which, for example, “a value acquired by applying an instantaneous beamformer to a steering vector having, as an element, transfer functions relating to the direct sound and the initial reflected sound from the sound source to the pickup position of the acoustic signals, or an estimated steering vector, which is an estimated vector of the steering vector, is a constant (wf, 0Hνf, 0 is a constant)”. The processing will be described in detail below.
<Step S211>
As illustrated in the figures, the frequency-divided observation signal xf, t and the power σf, t2 of the target signals or the estimated power σf, t2 thereof are input into the matrix estimation unit 211. The matrix estimation unit 211 acquires and outputs the weighted space-time covariance matrix Rf in each frequency band.
<Step S212>
The steering vector or estimated steering vector νf, 0 (equation (4) or (5)) and the weighted space-time covariance matrix Rf acquired in step S211 are input into the convolutional beamformer estimation unit 212. The convolutional beamformer estimation unit 212 acquires and outputs the convolutional beamformer w−f on the basis of the weighted space-time covariance matrix Rf and the steering vector or estimated steering vector νf, 0. For example, the convolutional beamformer estimation unit 212 acquires and outputs the convolutional beamformer w−f in accordance with equation (15).
<Step S12>
This step is identical to the first embodiment, and therefore description thereof has been omitted.
<Features of this Embodiment>
In this embodiment, the weighted space-time covariance matrix Rf is acquired, and on the basis of the weighted space-time covariance matrix Rf and the steering vector or estimated steering vector νf, 0, the convolutional beamformer w−f is acquired. This corresponds to optimizing noise suppression and reverberation suppression as a single system. In this embodiment, therefore, noise and reverberation can be suppressed more adequately than with the conventional methods.
Next, a third embodiment will be described. In this embodiment, an example of a method of generating σf, t2 and νf, 0 will be described.
As illustrated in the figures, a signal processing device 3 of this embodiment includes the estimation unit 21 and the suppression unit 12 described above, and further includes a parameter estimation unit 33 that generates the estimated power σf, t2 of the target signal and the estimated steering vector νf, 0.
Hereafter, only the processing executed by the parameter estimation unit 33, which differs from the second embodiment, will be described. The processing performed by the other processing units is as described in the first and second embodiments.
<Step S330>
The frequency-divided observation signal xf, t is input into the initial setting unit 330. Using the frequency-divided observation signal xf, t, the initial setting unit 330 generates and outputs a provisional power σf, t2, which is a provisional value of the estimated power σf, t2 of the target signal. For example, the initial setting unit 330 generates and outputs the provisional power σf, t2 as follows.
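For example, the average power over the microphones may be used; this specific choice is an assumption consistent with the description:
σ_{f,t}^2 = (1/M) Σ_{m=1}^{M} |x_{f,t}^{(m)}|^2   (cf. (17))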
<Step S332>
The frequency-divided observation signals xf, t and the newest provisional powers σf, t2 are input into the reverberation suppression filter estimation unit 332. The reverberation suppression filter estimation unit 332 determines and outputs a reverberation suppression filter Ff, τ for minimizing the cost function C2 (Ff) of equation (7) with respect to τ=d, d+1, . . . , d+L−1 in each frequency band.
<Step S333>
The frequency-divided observation signal xf, t and the newest reverberation suppression filter Ff, t acquired in step S332 are input into the reverberation suppression filter application unit 333. The reverberation suppression filter application unit 333 acquires and outputs an estimation signal y′f, t by applying the reverberation suppression filter Ff, t to the frequency-divided observation signal xf, t in each frequency band. For example, the reverberation suppression filter application unit 333 sets zf, t, acquired in accordance with equation (8), as y′f, t and outputs y′f, t.
<Step S334>
The newest estimation signal y′f, t acquired in step S333 is input into the steering vector estimation unit 334. Using the estimation signal y′f, t, the steering vector estimation unit 334 acquires and outputs a provisional steering vector νf, 0, which is a provisional vector of the estimated steering vector, in each frequency band. For example, the steering vector estimation unit 334 acquires and outputs the provisional steering vector νf, 0 for the estimation signal y′f, t in accordance with a steering vector estimation method described in NPL 1 and NPL 2. For example, as the provisional steering vector νf, 0, the steering vector estimation unit 334 outputs a steering vector estimated using y′f, t as yf, t according to NPL 2. Further, as noted above, a normalized vector acquired by normalizing the transfer function of each element so that the gain of a microphone having any one of the microphone numbers m0∈{1, . . . , M} becomes a constant g may be used as νf, 0 (equation (5)).
<Step S335>
The newest estimation signal y′f, t acquired in step S333 and the newest provisional steering vector νf, 0 acquired in step S334 are input into the instantaneous beamformer estimation unit 335. The instantaneous beamformer estimation unit 335 acquires and outputs an instantaneous beamformer wf, 0 for minimizing C1 (wf, 0) shown below in equation (18), which is acquired by setting xf, t=y′f, t in equation (2), in each frequency band on the basis of the constraint condition that “wf, 0Hνf, 0 is a constant”.
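That is, for example (obtained by substituting y′f, t for xf, t in the reconstruction of equation (2) given above):
C1(w_{f,0}) = Σ_{t=1}^{N} |w_{f,0}^H y′_{f,t}|^2   (cf. (18))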
<Step S336>
The newest estimation signal y′f, t acquired in step S333 and the newest instantaneous beamformer wf, 0 acquired in step S335 are input into the instantaneous beamformer application unit 336. The instantaneous beamformer application unit 336 acquires and outputs an estimation signal y″f, t by applying the instantaneous beamformer wf, 0 to the estimation signal y′f, t in each frequency band. For example, the instantaneous beamformer application unit 336 acquires and outputs the estimation signal y″f, t as follows.
y″_{f,t} = w_{f,0}^H y′_{f,t}   (19)
<Step S331>
The newest estimation signal y″f, t acquired in step S336 is input into the power estimation unit 331. The power estimation unit 331 outputs the power of the estimation signal y″f, t as the provisional power σf, t2 in each frequency band. For example, the power estimation unit 331 generates and outputs the provisional power σf, t2 as follows.
σ_{f,t}^2 = |y″_{f,t}|^2 = y″_{f,t}^H y″_{f,t}   (20)
<Step S337a>
The control unit 337 determines whether or not a termination condition is satisfied. There are no limitations on the termination condition, but for example, the termination condition may be satisfied when the number of repetitions of the processing of steps S331 to S336 exceeds a predetermined value, when the variation in σf, t2 or νf, 0 falls to or below a predetermined value after the processing of steps S331 to S336 is performed once, and so on. When the termination condition is not satisfied, the processing returns to step S332. When the termination condition is satisfied, on the other hand, the processing advances to step S337b.
<Step S337b>
In step S337b, the power estimation unit 331 outputs σf, t2 acquired most recently in step S331 as the estimated power of the target signal, and the steering vector estimation unit 334 outputs νf, 0 acquired most recently in step S334 as the estimated steering vector. As illustrated in the figures, the estimated power σf, t2 and the estimated steering vector νf, 0 are input into the estimation unit 21, whereupon the processing described in the second embodiment is executed.
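As an illustration only, one possible realization of the loop of steps S330 to S337b for a single frequency band can be sketched as follows in Python with numpy; the closed-form filter estimation and the eigenvector-based steering vector step are simplifications standing in for the methods of the cited literature, and all function names, shapes, and parameter values are assumptions.

    import numpy as np

    def stack_past(X, d, L):
        # Stack L past frames [x_{t-d}, ..., x_{t-d-L+1}] below one another for each t.
        M, N = X.shape
        Xp = np.zeros((L * M, N), dtype=complex)
        for l in range(L):
            Xp[l * M:(l + 1) * M, d + l:] = X[:, :N - d - l]
        return Xp

    def estimate_parameters(X, d=3, L=10, n_iter=5, eps=1e-6):
        # X: (M, N) frequency-divided observation signals of one band f.
        M, N = X.shape
        sigma2 = np.mean(np.abs(X) ** 2, axis=0)        # step S330: provisional power
        Xp = stack_past(X, d, L)
        for _ in range(n_iter):                         # step S337a: fixed iteration count
            # Step S332: reverberation suppression filter minimizing C2 (cf. (7)).
            G = np.linalg.solve((Xp / sigma2) @ Xp.conj().T + eps * np.eye(L * M),
                                (Xp / sigma2) @ X.conj().T)
            Y1 = X - G.conj().T @ Xp                    # step S333: cf. equation (8)
            # Step S334: provisional steering vector as the principal eigenvector of
            # the dereverberated covariance, normalized at microphone 1; this is a
            # simplification standing in for the methods of NPL 1 and NPL 2.
            _, V = np.linalg.eigh(Y1 @ Y1.conj().T / N)
            v = V[:, -1] / V[0, -1]
            # Steps S335 and S336: instantaneous beamformer (cf. (18)) and its output.
            R1 = Y1 @ Y1.conj().T / N + eps * np.eye(M)
            w0 = np.linalg.solve(R1, v)
            w0 /= v.conj() @ w0
            y2 = w0.conj() @ Y1                         # cf. equation (19)
            sigma2 = np.abs(y2) ** 2 + eps              # step S331: cf. equation (20)
        return sigma2, v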
As described above, the steering vector is estimated on the basis of the frequency-divided observation signal xf, t. Here, when the steering vector is estimated after suppressing (preferably, removing) reverberation from the frequency-divided observation signal xf, t, the estimation precision improves. In other words, by acquiring a frequency-divided reverberation-suppressed signal in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed, and acquiring the estimated steering vector from the frequency-divided reverberation-suppressed signal, the precision of the estimated steering vector can be improved.
As illustrated in the figures, a signal processing device of this embodiment includes a parameter estimation unit 43 having a reverberation suppression unit 431 and a steering vector estimation unit 432.
The fourth embodiment differs from the first to third embodiments in that before generating the estimated steering vector, the reverberation component of the frequency-divided observation signal xf, t is suppressed. Hereafter, only a method for generating the estimated steering vector will be described.
<Processing of Reverberation Suppression Unit 431 (Step S431)>
The frequency-divided observation signal xf, t is input into the reverberation suppression unit 431 of the parameter estimation unit 43. The reverberation suppression unit 431 acquires and outputs a frequency-divided reverberation-suppressed signal uf, t in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed.
<Processing of Steering Vector Estimation Unit 432 (Step S432)>
The frequency-divided reverberation-suppressed signal uf, t acquired by the reverberation suppression unit 431 is input into the steering vector estimation unit 432. Using the frequency-divided reverberation-suppressed signal uf, t as input, the steering vector estimation unit 432 generates and outputs an estimated steering vector serving as an estimated vector of the steering vector. A steering vector estimation processing method of acquiring an estimated steering vector using a frequency-divided time series signal as input is well-known. The steering vector estimation unit 432 acquires and outputs the estimated steering vector νf, 0 by using the frequency-divided reverberation-suppressed signal uf, t as the input of a desired type of steering vector estimation processing. There are no limitations on the steering vector estimation processing method, and for example, the method described above in NPL 1 and NPL 2, methods described in reference documents 2 and 3, and so on may be used.
The estimated steering vector νf, 0 acquired by the steering vector estimation unit 432 is input into the convolutional beamformer estimation unit 212. The convolutional beamformer estimation unit 212 performs the processing of step S212, described in the second embodiment, using the estimated steering vector νf, 0 and the weighted space-time covariance matrix Rf acquired in step S211. All other processing is as described in the first and second embodiments.
In a fifth embodiment, a method of executing steering vector estimation by successive processing will be described. In so doing, the estimated steering vector of each time frame number t can be calculated from frequency-divided observation signals xf, t input successively online, for example.
As illustrated in the figures, a signal processing device of this embodiment includes a steering vector estimation unit 532 having an observation signal covariance matrix updating unit 532a, a main component vector updating unit 532b, a steering vector updating unit 532c, an inverse noise covariance matrix updating unit 532d, and a noise covariance matrix updating unit 532e.
<Processing of Steering Vector Estimation Unit 532 (Step S532)>
The frequency-divided observation signal xf, t, which is a frequency-divided time series signal, is input into the steering vector estimation unit 532.
<<Processing of Observation Signal Covariance Matrix Updating Unit 532a (Step S532a)>>
Using the frequency-divided observation signal xf, t as input, the observation signal covariance matrix updating unit 532a acquires and outputs a spatial covariance matrix ψx, f, t of the frequency-divided observation signal belonging to the first time interval, which is based on the frequency-divided observation signal xf, t belonging to the first time interval and a spatial covariance matrix ψx, f, t−1 of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval. For example, the observation signal covariance matrix updating unit 532a acquires and outputs the spatial covariance matrix ψx, f, t in accordance with equation (21) shown below.
ψ_{x,f,t} = β ψ_{x,f,t−1} + x_{f,t} x_{f,t}^H   (21)
Here, β is an oblivion coefficient, and is a real number belonging to a range of 0<β<1, for example. An initial matrix ψx, f, 0 of the spatial covariance matrix ψx, f, t−1 may be set as desired. For example, an M×M-dimensional unit matrix may be set as the initial matrix ψx, f, 0 of the spatial covariance matrix ψx, f, t−1.
<Processing of Inverse Noise Covariance Matrix Updating Unit 532d (Step S532d)>
The frequency-divided observation signal xf, t and mask information γf, t(n) are input into the inverse noise covariance matrix updating unit 532d. The mask information γf, t(n) is information expressing the ratio of the noise component included in the frequency-divided observation signal xf, t at a time-frequency point corresponding to the time frame number t and the frequency band number f. In other words, the mask information γf, t(n) expresses the occupancy probability of the noise component included in the frequency-divided observation signal xf, t at a time-frequency point corresponding to the time frame number t and the frequency band number f. There are no limitations on the method of estimating the mask information γf, t(n). Methods of estimating the mask information γf, t(n) are well-known, and include, for example, an estimation method using a complex Gaussian mixture model (CGMM) (reference document 4, for example), an estimation method using a neural network (reference document 5, for example), an estimation method integrating these methods (reference document 6 and reference document 7, for example), and so on.
The mask information γf, t(n) may be estimated in advance and stored in a storage device, not illustrated in the figures, or may be estimated successively. Note that the upper right superscript “(n)” of “γf, t(n)” should be written directly above the lower right subscript “f, t”, but due to notation limitations has been written to the upper right of “f, t”.
The inverse noise covariance matrix updating unit 532d acquires and outputs an inverse noise covariance matrix ψ−1n, f, t (an inverse noise covariance matrix of the frequency-divided observation signal belonging to the first time interval) on the basis of the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval), the mask information γf, t(n) (mask information belonging to the first time interval), and an inverse noise covariance matrix ψ−1n, f, t−1 (an inverse noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval). For example, the inverse noise covariance matrix updating unit 532d acquires and outputs the inverse noise covariance matrix ψ−1n, f, t in accordance with equation (22), shown below, using the Woodbury formula.
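A form consistent with the rank-one (Woodbury) update of equation (25) is, for example (a reconstruction consistent with the description):
ψ_{n,f,t}^{−1} = (1/α) [ψ_{n,f,t−1}^{−1} − γ_{f,t}^{(n)} ψ_{n,f,t−1}^{−1} x_{f,t} x_{f,t}^H ψ_{n,f,t−1}^{−1} / (α + γ_{f,t}^{(n)} x_{f,t}^H ψ_{n,f,t−1}^{−1} x_{f,t})]   (cf. (22))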
Here, α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example. An initial matrix ψ−1n, f, 0 of the inverse noise covariance matrix ψ−1n, f, t−1 may be set as desired. For example, an M×M-dimensional unit matrix may be set as the initial matrix ψ−1n, f, 0 of the inverse noise covariance matrix ψ−1n, f, t−1. Note that the upper right superscript “−1” of “ψ−1n, f, t” should be written directly above the lower right subscript “n, f, t”, but due to notation limitations has been written to the upper left of “n, f, t”.
<Processing of Main Component Vector Updating Unit 532b (Step S532b)>
The spatial covariance matrix ψx, f, t acquired by the observation signal covariance matrix updating unit 532a and the inverse noise covariance matrix ψ−1n, f, t acquired by the inverse noise covariance matrix updating unit 532d are input into the main component vector updating unit 532b. The main component vector updating unit 532b acquires and outputs a main component vector v˜f, t (a main component vector of the first time interval) relating to ψ−1n, f, tψx, f, t (the product of an inverse matrix of the noise covariance matrix of the frequency-divided observation signal and the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval) by using a power method on the basis of the inverse noise covariance matrix ψ−1n, f, t (the inverse matrix of the noise covariance matrix of the frequency-divided observation signal), the spatial covariance matrix ψx, f, t (the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval), and a main component vector v˜f, t−1 (a main component vector of the second time interval). For example, the main component vector updating unit 532b acquires and outputs a main component vector v˜f, t based on ψ−1n, f, tψx, f, tv˜f, t−1. The main component vector updating unit 532b acquires and outputs the main component vector v˜f, t in accordance with equations (23) and (24) shown below, for example. Note that the upper right superscript “˜” of “v˜f, t” should be written directly above the “v”, but due to notation limitations has been written to the upper right of “v”.
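For example (a reconstruction consistent with the description in the next paragraph):
ṽ′_{f,t} = ψ_{n,f,t}^{−1} ψ_{x,f,t} ṽ_{f,t−1}   (cf. (23))
ṽ_{f,t} = ṽ′_{f,t} / ṽ′_{f,t}^{ref}   (cf. (24))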
Here, v˜f, tref expresses an element corresponding to a predetermined microphone (a reference microphone ref) serving as a reference, among the M elements of the vector v˜′f, t acquired from equation (23). In other words, in the example of equations (23) and (24), the main component vector updating unit 532b sets a vector acquired by normalizing the respective elements of v˜′f, t = ψ−1n, f, tψx, f, tv˜f, t−1 by v˜f, tref as the main component vector v˜f, t. Note that the upper right superscript “˜” of “v˜′f, t” should be written directly above the “v”, but due to notation limitations has been written to the upper right of “v”.
<Noise Covariance Matrix Updating Unit 532e (Step S532e)>
The noise covariance matrix updating unit 532e, using the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval) and the mask information γf, t(n) (the mask information of the first time interval) as input, acquires and outputs a noise covariance matrix ψn, f, t of the frequency-divided observation signal xf, t (a noise covariance matrix of the frequency-divided observation signal belonging to the first time interval), which is based on the frequency-divided observation signal xf, t, the mask information γf, t(n), and a noise covariance matrix ψn, f, t−1 (a noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval). For example, the noise covariance matrix updating unit 532e acquires and outputs the linear sum of a product γf, t(n)xf, txf, tH of the covariance matrix xf, txf, tH of the frequency-divided observation signal xf, t and the mask information γf, t(n), and the noise covariance matrix ψn, f, t−1 as the noise covariance matrix ψn, f, t of the frequency-divided observation signal xf, t. For example, the noise covariance matrix updating unit 532e acquires and outputs the noise covariance matrix ψn, f, t in accordance with equation (25) shown below.
ψ_{n,f,t} = α ψ_{n,f,t−1} + γ_{f,t}^{(n)} x_{f,t} x_{f,t}^H   (25)
Here, α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example.
<Steering Vector Updating Unit 532c (Step S532c)>
The steering vector updating unit 532c, using the main component vector v˜f, t (the main component vector of the first time interval) acquired by the main component vector updating unit 532b and the noise covariance matrix ψn, f, t (the noise covariance matrix of the frequency-divided observation signal) acquired by the noise covariance matrix updating unit 532e as input, acquires and outputs an estimated steering vector νf, t (an estimated steering vector of the first time interval) on the basis thereof. For example, the steering vector updating unit 532c acquires and outputs an estimated steering vector νf, t based on ψn, f, tv˜f, t. The steering vector updating unit 532c acquires and outputs the estimated steering vector νf, t in accordance with equations (26) and (27) shown below, for example.
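For example (a reconstruction consistent with the description in the next paragraph):
v′_{f,t} = ψ_{n,f,t} ṽ_{f,t}   (cf. (26))
ν_{f,t} = v′_{f,t} / v′_{f,t}^{ref}   (cf. (27))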
Here, vf, tref expresses an element corresponding to the reference microphone ref, among the M elements of a vector v′f, t acquired from equation (26). In other words, in the example of equations (26) and (27), the steering vector updating unit 532c sets a vector acquired by normalizing the respective elements of v′f, t=ψn, f, tv˜f, t by vf, tref as the estimated steering vector νf, t.
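As an illustration only, the successive updates of steps S532a to S532e for one frequency band can be sketched as follows in Python with numpy; the class name, the coefficient values, the initial matrices, and the use of microphone 1 (index 0) as the reference microphone ref are assumptions.

    import numpy as np

    class OnlineSteeringVector:
        # Successive steering vector estimation for one frequency band, following
        # steps S532a to S532e (cf. equations (21) to (27)). Microphone 1 (index 0)
        # is taken as the reference microphone ref; this choice is an assumption.
        def __init__(self, M, alpha=0.95, beta=0.95):
            self.alpha, self.beta = alpha, beta
            self.psi_x = np.eye(M, dtype=complex)       # spatial covariance
            self.psi_n = np.eye(M, dtype=complex)       # noise covariance
            self.psi_n_inv = np.eye(M, dtype=complex)   # inverse noise covariance
            self.v_main = np.ones(M, dtype=complex)     # main component vector

        def update(self, x, gamma_n):
            # x: (M,) observation x_{f,t}; gamma_n: noise mask value in [0, 1].
            a, b = self.alpha, self.beta
            self.psi_x = b * self.psi_x + np.outer(x, x.conj())            # (21)
            u = self.psi_n_inv @ x                      # rank-one (Woodbury) update, cf. (22)
            self.psi_n_inv = (self.psi_n_inv
                              - gamma_n * np.outer(u, u.conj())
                              / (a + gamma_n * (x.conj() @ u))) / a
            v = self.psi_n_inv @ self.psi_x @ self.v_main   # power method step, cf. (23)
            self.v_main = v / v[0]                          # normalization, cf. (24)
            self.psi_n = a * self.psi_n + gamma_n * np.outer(x, x.conj())  # (25)
            vp = self.psi_n @ self.v_main                   # cf. (26)
            return vp / vp[0]                               # estimated steering vector, cf. (27)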
The estimated steering vector νf, t acquired by the steering vector estimation unit 532 is input into the convolutional beamformer estimation unit 212. The convolutional beamformer estimation unit 212 treats the estimated steering vector νf, t as νf, 0, and performs the processing of step S212, described in the second embodiment, using the estimated steering vector νf, t and the weighted space-time covariance matrix Rf acquired in step S211. All other processing is as described in the first and second embodiments. Further, as σf, t2 input into the matrix estimation unit 211, either the provisional power generated as illustrated in equation (17) or the estimated power σf, t2 generated as described in the third embodiment, for example, may be used.
In step S532d of the fifth embodiment, the inverse noise covariance matrix updating unit 532d adaptively updates the inverse noise covariance matrix ψ−1n, f, t at each time point corresponding to the time frame number t by using the frequency-divided observation signal xf, t and the mask information γf, t(n). However, the inverse noise covariance matrix updating unit 532d may acquire and output the inverse noise covariance matrix ψ−1n, f, t by using a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant, without using the mask information γf, t(n). For example, the inverse noise covariance matrix updating unit 532d may output, as the inverse noise covariance matrix ψ−1n, f, t, an inverse matrix of the temporal average of xf, txf, tH with respect to a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant. The inverse noise covariance matrix ψ−1n, f, t acquired in this manner is used continuously in the frames having the respective time frame numbers t.
In step S532e of the fifth embodiment, the noise covariance matrix updating unit 532e may acquire and output the noise covariance matrix ψn, f, t of the frequency-divided observation signal xf, t using a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant, without using the mask information γf, t(n). For example, the noise covariance matrix updating unit 532e may output, as the noise covariance matrix ψn, f, t, the temporal average of xf, txf, tH with respect to a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant. The noise covariance matrix ψn, f, t acquired in this manner is used continuously in the frames having the respective time frame numbers t.
In the fifth embodiment and the modified example thereof, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto. A frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.
In the fifth embodiment, the steering vector estimation unit 532 acquires and outputs the estimated steering vector νf, t by successive processing using the frequency-divided observation signal xf, t as input. As noted in the fourth embodiment, however, by estimating the steering vector after suppressing reverberation from the frequency-divided observation signal xf, t, the estimation precision is improved. In the sixth embodiment, an example in which the steering vector estimation unit acquires and outputs the estimated steering vector νf, t by successive processing, as described in the fifth embodiment, after reverberation has been suppressed from the frequency-divided observation signal xf, t will be described.
As illustrated in the figures, a signal processing device of this embodiment includes the reverberation suppression unit 431 described in the fourth embodiment and a steering vector estimation unit 632.
<Processing of Reverberation Suppression Unit 431 (Step S431)>
As described in the fourth embodiment, the reverberation suppression unit 431 uses the frequency-divided observation signal xf, t as input to acquire and output the frequency-divided reverberation-suppressed signal uf, t, in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed.
<Processing of Steering Vector Estimation Unit 632 (Step S632)>
The frequency-divided reverberation-suppressed signal uf, t is input into the steering vector estimation unit 632. The processing of the steering vector estimation unit 632 is identical to the processing of the steering vector estimation unit 532 of the fifth embodiment except that the frequency-divided reverberation-suppressed signal uf, t, rather than the frequency-divided observation signal xf, t, is input into the steering vector estimation unit 632, and the steering vector estimation unit 632 uses the frequency-divided reverberation-suppressed signal uf, t instead of the frequency-divided observation signal xf, t. In other words, in the processing performed by the steering vector estimation unit 632, the frequency-divided observation signal xf, t used in the processing of the steering vector estimation unit 532 is replaced by the frequency-divided reverberation-suppressed signal uf, t. All other processing is identical to the fifth embodiment and the modified example thereof. More specifically, the frequency-divided reverberation-suppressed signal uf, t, which is a frequency-divided time series signal, is input into the steering vector estimation unit 632. The observation signal covariance matrix updating unit 532a acquires and outputs the spatial covariance matrix ψx, f, t of the frequency-divided reverberation-suppressed signal uf, t belonging to the first time interval, which is based on the frequency-divided reverberation-suppressed signal uf, t belonging to the first time interval and the spatial covariance matrix ψx, f, t−1 of a frequency-divided reverberation-suppressed signal uf, t−1 belonging to the second time interval that is further in the past than the first time interval. The main component vector updating unit 532b acquires and outputs the main component vector v˜f, t of the first time interval with respect to the product ψ−1n, f, tψx, f, t of the inverse matrix ψ−1n, f, t of the noise covariance matrix of the frequency-divided reverberation-suppressed signal and the spatial covariance matrix ψx, f, t of the frequency-divided reverberation-suppressed signal belonging to the first time interval, on the basis of the inverse matrix ψ−1n, f, t of the noise covariance matrix of the frequency-divided reverberation-suppressed signal uf, t, the spatial covariance matrix ψx, f, t of the frequency-divided reverberation-suppressed signal belonging to the first time interval, and the main component vector v˜f, t−1 of the second time interval. The steering vector updating unit 532c acquires and outputs the estimated steering vector νf, t of the first time interval on the basis of the noise covariance matrix of the frequency-divided reverberation-suppressed signal uf, t and the main component vector v˜f, t of the first time interval.
In a seventh embodiment, a method of estimating the convolutional beamformer by successive processing will be described. In so doing, the convolutional beamformer of each time frame number t can be estimated and the target signal yf, t can be acquired from frequency-divided observation signals xf, t input successively online, for example.
As illustrated in the figures, a signal processing device 7 of this embodiment includes the parameter estimation unit 53 described in the fifth embodiment, a matrix estimation unit 711, a beamformer estimation unit 712, and a suppression unit 72.
<Processing of Parameter Estimation Unit 53 (Step S53)>
The frequency-divided observation signal xf, t is input into the parameter estimation unit 53. The steering vector estimation unit 532 of the parameter estimation unit 53 acquires and outputs an estimated steering vector νf, t at each time frame number t by successive processing, as described in the fifth embodiment. Here, the estimated steering vector νf, t is expressed as follows.
ν_{f,t} = [ν_{f,t}^{(1)}, ν_{f,t}^{(2)}, ..., ν_{f,t}^{(M)}]^T
Here, νf, t(m) represents an element corresponding to the microphone having the microphone number m, among the M elements of the estimated steering vector νf, t. The estimated steering vector νf, t acquired by the steering vector estimation unit 532 is input into the convolutional beamformer estimation unit 712.
<Processing of Matrix Estimation Unit 711 (Step S711)>
The frequency-divided observation signal xf, t and the power or estimated power σf, t2 of the target signal are input into the matrix estimation unit 711. Using the frequency-divided observation signal xf, t, the power or estimated power σf, t2 of the target signal, and an inverse matrix R̆−1f, t−1 of a space-time covariance matrix (an inverse matrix of the space-time covariance matrix of the second time interval that is further in the past than the first time interval), the matrix estimation unit 711 estimates and outputs an inverse matrix R̆−1f, t of a space-time covariance matrix (an inverse matrix of the space-time covariance matrix of the first time interval). An example of the space-time covariance matrix is as follows.
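For example, an exponentially weighted space-time covariance matrix of the following form, consistent with the recursive updates below, may be considered (this specific form is an assumption):
R̆_{f,t} = Σ_{j=1}^{t} α^{t−j} x̄_{f,j} x̄_{f,j}^H / σ_{f,j}^2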
In this case, the matrix estimation unit 711 generates and outputs the inverse matrix R̆−1f, t of the space-time covariance matrix in accordance with equations (28) and (29) shown below, for example.
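Forms consistent with this recursive (Sherman-Morrison-type) update are, for example:
k_{f,t} = R̆_{f,t−1}^{−1} x̄_{f,t} / (α σ_{f,t}^2 + x̄_{f,t}^H R̆_{f,t−1}^{−1} x̄_{f,t})   (cf. (28))
R̆_{f,t}^{−1} = (1/α) (R̆_{f,t−1}^{−1} − k_{f,t} x̄_{f,t}^H R̆_{f,t−1}^{−1})   (cf. (29))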
Here, kf, t in equation (28) is an (L+1)M-dimensional vector, and the inverse matrix of equation (29) is an (L+1)M×(L+1)M matrix. α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example. Further, an initial matrix of the inverse matrix R̆−1f, t−1 of the space-time covariance matrix may be set as desired, and an example of the initial matrix is the (L+1)M-dimensional unit matrix shown below.
R̆_{f,0}^{−1} = I_{(L+1)M}
<Processing of Beamformer Estimation Unit 712 (Step S712)>
The inverse matrix R̆−1f, t of the space-time covariance matrix (the inverse matrix of the space-time covariance matrix of the first time interval) acquired by the matrix estimation unit 711, and the estimated steering vector νf, t acquired by the parameter estimation unit 53 are input into the beamformer estimation unit 712. The convolutional beamformer estimation unit 712 acquires and outputs the convolutional beamformer w−f, t (the convolutional beamformer of the first time interval) on the basis thereof. For example, the convolutional beamformer estimation unit 712 acquires and outputs the convolutional beamformer w−f, t in accordance with equation (30), shown below.
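For example, analogously to the solution of equation (15) above (the exact form being an assumption here):
w̄_{f,t} = R̆_{f,t}^{−1} v̄_{f,t} / (v̄_{f,t}^H R̆_{f,t}^{−1} v̄_{f,t})   (cf. (30))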
where
v̄_{f,t} = [ṽ_{f,t}^{(1)T}, ṽ_{f,t}^{(2)T}, ..., ṽ_{f,t}^{(M)T}]^T
and
ṽ_{f,t}^{(m)} = [g_f ν_{f,t}^{(m)}, 0, ..., 0]^T
is an (L+1)-dimensional vector. g_f is a scalar constant other than 0.
<Processing of Suppression Unit 72 (Step S72)>
The frequency-divided observation signal xf, t and the convolutional beamformer w−f, t acquired by the beamformer estimation unit 712 are input into the suppression unit 72. The suppression unit 72 acquires and outputs the target signal yf, t by applying the convolutional beamformer w−f, t to the frequency-divided observation signal xf, t in each time frame number t and frequency band number f. For example, the suppression unit 72 acquires and outputs the target signal yf, t in accordance with equation (31) shown below.
y_{f,t} = w̄_{f,t}^H x̄_{f,t}   (31)
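As an illustration only, the successive processing of steps S711, S712, and S72 for one frequency band can be sketched as follows in Python with numpy, where xbar and vbar denote the stacked vectors x̄_{f,t} and v̄_{f,t} described above; the class name, the oblivion coefficient value, and the initialization are assumptions.

    import numpy as np

    class OnlineWPD:
        # Successive convolutional beamformer estimation for one frequency band,
        # following steps S711, S712, and S72 (cf. equations (28) to (31)).
        # K = (L+1)M; the initial matrix is the unit matrix, as noted above.
        def __init__(self, K, alpha=0.99):
            self.alpha = alpha
            self.Rinv = np.eye(K, dtype=complex)

        def update(self, xbar, vbar, sigma2):
            # xbar: (K,) stacked observation; vbar: (K,) stacked steering vector;
            # sigma2: power or estimated power of the target signal at this frame.
            a = self.alpha
            u = self.Rinv @ xbar
            k = u / (a * sigma2 + xbar.conj() @ u)            # gain vector, cf. (28)
            self.Rinv = (self.Rinv - np.outer(k, u.conj())) / a   # cf. (29)
            Rinv_v = self.Rinv @ vbar
            w = Rinv_v / (vbar.conj() @ Rinv_v)               # beamformer, cf. (30)
            return w.conj() @ xbar                            # target signal, equation (31)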
The parameter estimation unit 53 of the signal processing device 7 according to the seventh embodiment may be replaced by the parameter estimation unit 63. In other words, in the seventh embodiment, the parameter estimation unit 63, rather than the parameter estimation unit 53, may acquire and output the estimated steering vector νf, t by successive processing, as described in the sixth embodiment, using the frequency-divided observation signal xf, t as input.
In the seventh embodiment and the modified example thereof, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto. A frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.
In the second embodiment, an example was described in which the analytical solution of w−f for minimizing the cost function C3 (w−f) under a constraint condition in which wf, 0Hνf, 0 is a constant is given by equation (15), and the convolutional beamformer w−f is acquired in accordance with equation (15). In an eighth embodiment, an example in which the convolutional beamformer is acquired using a different optimal solution will be described.
When an M×(M−1) block matrix corresponding to the orthogonal complement of the estimated steering vector νf, 0 is set as Bf, BfHνf, 0=0 is satisfied. An infinite number of block matrices Bf of this type exist. Equation (32) below shows an example of the block matrix Bf.
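One standard construction, assuming the element corresponding to the reference microphone ref is ordered first, is for example:
B_f = [ −ṽ_{f,0}^H / (ν_{f,0}^{ref})^* ; I_{M−1} ]   (cf. (32))
where the first row of B_f is the 1×(M−1) row vector −ṽ_{f,0}^H/(ν_{f,0}^{ref})^* and the remaining M−1 rows form I_{M−1}, so that B_f^H ν_{f,0} = 0 holds.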
Here, ν˜f, 0 is an M−1-dimensional column vector constituted by elements of the steering vector νf, 0 or the estimated steering vector νf, 0 that correspond to microphones other than the reference microphone ref, νf, 0ref is the element of νf, 0 that corresponds to the reference microphone ref, and IM−1 is an (M−1)×(M−1)-dimensional unit matrix.
gf is set as a scalar constant other than 0, af, 0 is set as an (M−1)-dimensional modified instantaneous beamformer, and the instantaneous beamformer wf, 0 is expressed as the sum of a constant multiple gfνf, 0 of the steering vector νf, 0 or a constant multiple gfνf, 0 of the estimated steering vector νf, 0 and a product Bfaf, 0 of the block matrix Bf corresponding to the orthogonal complement of the steering vector νf, 0 or the estimated steering vector νf, 0 and the modified instantaneous beamformer af, 0. In other words, the instantaneous beamformer wf, 0 is expressed as
w_{f,0} = g_f ν_{f,0} + B_f a_{f,0}   (33)
Accordingly, BfHνf, 0=0, and therefore the constraint condition that “wf, 0Hνf, 0 is a constant” is expressed as follows.
w_{f,0}^H ν_{f,0} = (g_f ν_{f,0} + B_f a_{f,0})^H ν_{f,0} = g_f^H ∥ν_{f,0}∥^2 = constant
Hence, even under the definition given in equation (33), the constraint condition that “wf, 0Hνf, 0 is a constant” is satisfied in relation to any modified instantaneous beamformer af, 0. It is therefore evident that the instantaneous beamformer wf, 0 may be defined as illustrated in equation (33). In this embodiment, the convolutional beamformer is estimated using the optimal solution of the convolutional beamformer acquired when the instantaneous beamformer wf, 0 is defined as illustrated in equation (33). This will be described in detail below.
As illustrated in the figures, a signal processing device of this embodiment includes a parameter estimation unit 83, an initial beamformer application unit 813, a block unit 814, a matrix estimation unit 811, a convolutional beamformer estimation unit 812, and a suppression unit 82.
<Processing of Parameter Estimation Unit 83 (Step S83)>
The parameter estimation unit 83 uses the frequency-divided observation signal xf, t as input to acquire and output the estimated steering vector νf, 0 and the power or estimated power σf, t2 of the target signal, in the same manner as the parameter estimation units of the above embodiments.
<Processing of Initial Beamformer Application Unit 813 (Step S813)>
The estimated steering vector νf, 0 and the frequency-divided observation signal xf, t are input into the initial beamformer application unit 813. The initial beamformer application unit 813 acquires and outputs an initial beamformer output zf, t (an initial beamformer output of the first time interval) based on the estimated steering vector νf, 0 and the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval). For example, the initial beamformer application unit 813 acquires and outputs an initial beamformer output zf, t based on the constant multiple of the estimated steering vector νf, 0 and the frequency-divided observation signal xf, t. The initial beamformer application unit 813 acquires and outputs the initial beamformer output zf, t in accordance with equation (34) shown below, for example.
z_{f,t} = (g_f ν_{f,0})^H x_{f,t}   (34)
The output initial beamformer output zf, t is transmitted to the convolutional beamformer estimation unit 812 and the suppression unit 82.
<Processing of Block Unit 814 (Step S814)>
The estimated steering vector νf, 0 and the frequency-divided observation signal xf, t are input into the block unit 814. The block unit 814 acquires and outputs a vector x=f, t based on the frequency-divided observation signal xf, t and the block matrix Bf corresponding to the orthogonal complement of the estimated steering vector νf, 0. As noted above, BfHνf, 0=0 is satisfied. Equation (32) shows an example of the block matrix Bf, but the present invention is not limited to this example, and any block matrix Bf in which BfHνf, 0=0 is satisfied may be used. The block unit 814 acquires and outputs the vector x=f, t in accordance with equations (35) and (36) shown below, for example.
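Forms consistent with this description and with equation (36A) below are, for example, the following, where x̆_{f,t} (a symbol introduced here) stacks the L past frames:
x̆_{f,t} = [x_{f,t−d}^T, x_{f,t−d−1}^T, ..., x_{f,t−d−L+1}^T]^T   (cf. (35))
x̿_{f,t} = [(B_f^H x_{f,t})^T, x̆_{f,t}^T]^T   (cf. (36))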
Note that the upper right superscript “=” of “x=f, t” should be written directly above the “x”, as shown in equation (36), but due to notation limitations may also be written to the upper right of “x”. The output vector x=f, t is transmitted to the matrix estimation unit 811, the convolutional beamformer estimation unit 812, and the suppression unit 82. Further, when L=0, the right side of equation (35) becomes a vector in which the number of elements is 0 (an empty vector), whereby equation (36) is as shown below in equation (36A).
x=_{f,t} = B_f^H x_{f,t}    (36A)
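Since equations (35) and (36) are not reproduced in this text, the following sketch rests on an assumption: x=f, t concatenates the blocked current frame BfHxf, t with the L delayed frames xf, t−d, . . . , xf, t−d−L+1 of all M channels, which yields the (LM+M−1)-dimensional vector used throughout (and reduces to equation (36A) when L=0):

```python
import numpy as np

def blocked_stacked_vector(X, B, t, d, L):
    """Step S814 sketch under the stacking assumption stated above.
    X: (M, T) observations of one frequency bin; B: (M, M-1) block matrix.
    Frames before t = 0 are taken as zero for simplicity."""
    M = X.shape[0]
    past = [X[:, t - d - i] if t - d - i >= 0 else np.zeros(M, complex)
            for i in range(L)]
    return np.concatenate([B.conj().T @ X[:, t]] + past)  # LM + M - 1 elements
```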
<Processing of Matrix Estimation Unit 811 (Step S811)>
The vector x=f, t acquired by the block unit 814 and the power or estimated power σf, t2 of the target signal are input into the matrix estimation unit 811. Either the provisional power generated as illustrated in equation (17) or the estimated power σf, t2 generated as described in the third embodiment, for example, may be used as σf, t2. Using the vector x=f, t and the power or estimated power σf, t2 of the target signal, the matrix estimation unit 811 acquires and outputs a weighted modified space-time covariance matrix R=f, which is based on the estimated steering vector νf, 0, the frequency-divided observation signal xf, t, and the power or estimated power σf, t2 of the target signal and which increases the probability expressing the speech-likeness of the estimation signals when the instantaneous beamformer wf, 0 is expressed as illustrated in equation (33). For example, the matrix estimation unit 811 acquires and outputs the weighted modified space-time covariance matrix R=f based on the vector x=f, t and the power or estimated power σf, t2 of the target signal, in accordance with equation (37) below.
The output weighted modified space-time covariance matrix R=f is transmitted to the convolutional beamformer estimation unit 812.
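Equation (37) is likewise not reproduced here; the sketch below assumes the standard power-weighted form R=f = Σt x=f, t x=f, tH/σf, t2 used by weighted-power-minimization methods:

```python
import numpy as np

def weighted_covariance(Xbb, sigma2, eps=1e-10):
    """Step S811 sketch under the assumed form of equation (37).
    Xbb: (LM+M-1, T) matrix whose columns are x=_{f,t}; sigma2: (T,)
    target-signal powers. eps guards against vanishing power."""
    w = 1.0 / np.maximum(sigma2, eps)
    return (Xbb * w) @ Xbb.conj().T  # sum_t x= x=^H / sigma^2
```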
<Processing of Convolutional Beamformer Estimation Unit 812 (Step S812)>
The initial beamformer output zf, t acquired by the initial beamformer application unit 813, the vector x=f, t acquired by the block unit 814, and the weighted modified space-time covariance matrix R=f acquired by the matrix estimation unit 811 are input into the convolutional beamformer estimation unit 812. Using these, the convolutional beamformer estimation unit 812 acquires and outputs a convolutional beamformer w=f that is based on the estimated steering vector νf, 0, the weighted modified space-time covariance matrix R=f, and the frequency-divided observation signal xf, t. For example, the convolutional beamformer estimation unit 812 acquires and outputs the convolutional beamformer w=f in accordance with equation (38) shown below.
w=_f = −(R=_f)^{−1} Σ_t x=_{f,t} z_{f,t}^H / σ_{f,t}^2    (38)
w=_f = [a_{f,0}^T, w_f^{(1)T}, . . . , w_f^{(M)T}]^T    (38A)
w_f^{(m)} = [w_{f,d}^{(m)}, w_{f,d+1}^{(m)}, . . . , w_{f,d+L−1}^{(m)}]^T    (38B)
The output convolutional beamformer w=f is transmitted to the suppression unit 82.
Note that when L=0, the right side of equation (38B) becomes a vector with zero elements (an empty vector), whereby equation (38A) reduces to the following.
w=_f = a_{f,0}
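The closed form in equation (38) is what results from minimizing the weighted power Σt|yf, t|2/σf, t2 of the output in equation (39) with respect to w=f; a sketch under the same assumptions as above:

```python
import numpy as np

def estimate_convolutional_beamformer(Xbb, z, sigma2, eps=1e-10):
    """Step S812 sketch (equation (38)): the w= minimizing
    sum_t |z_t + w^H x=_t|^2 / sigma_t^2. Setting the gradient to zero
    gives R= w = -sum_t x=_t z_t^* / sigma_t^2."""
    w = 1.0 / np.maximum(sigma2, eps)
    R = (Xbb * w) @ Xbb.conj().T      # weighted covariance, cf. (37)
    p = (Xbb * w) @ z.conj()          # weighted cross term
    return -np.linalg.solve(R, p)     # laid out as in (38A)/(38B)
```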
<Processing of Suppression Unit 82 (Step S82)>
The vector x=f, t output from the block unit 814, the initial beamformer output zf, t output from the initial beamformer application unit 813, and the convolutional beamformer w=f output from the convolutional beamformer estimation unit 812 are input into the suppression unit 82. The suppression unit 82 acquires and outputs the target signal yf, t by applying the initial beamformer output zf, t and the convolutional beamformer w=f to the vector x=f, t. This processing is equivalent to processing for acquiring and outputting the target signal yf, t by applying the convolutional beamformer w=f to the frequency-divided observation signal xf, t. For example, the suppression unit 82 acquires and outputs the target signal yf, t in accordance with equation (39) shown below.
y_{f,t} = z_{f,t} + w=_f^H x=_{f,t}    (39)
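Step S82 is then a single inner product per frame:

```python
import numpy as np

def suppress(z, Xbb, wbb):
    """Step S82 (equation (39)): y_{f,t} = z_{f,t} + w=_f^H x=_{f,t},
    evaluated for all T frames at once; returns y of shape (T,)."""
    return z + wbb.conj() @ Xbb
```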
A known steering vector νf, 0 acquired on the basis of actual measurement or the like may be input into the initial beamformer application unit 813 and the block unit 814 instead of the estimated steering vector νf, 0 acquired by the parameter estimation unit 83. In this case, the initial beamformer application unit 813 and the block unit 814 perform steps S813 and S814, described above, using the steering vector νf, 0 instead of the estimated steering vector νf, 0.
In a ninth embodiment, a method for executing convolutional beamformer estimation based on the eighth embodiment by successive processing will be described. The following processing is executed on each time frame number t in ascending order from t=1.
As illustrated in
<Processing of Parameter Estimation Unit 93 (Step S93)>
The parameter estimation unit 93 (
<Processing of Initial Beamformer Application Unit 813 (Step S813)>
The estimated steering vector νf, t (the estimated steering vector of the first time interval) and the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval) are input into the initial beamformer application unit 813, and the initial beamformer application unit 813 acquires and outputs the initial beamformer output zf, t (the initial beamformer output of the first time interval) as described in the eighth embodiment, using νf, t instead of νf, 0. The initial beamformer output zf, t thus acquired is transmitted to the suppression unit 92.
<Processing of Block Unit 814 (Step S814)>
The estimated steering vector νf, t and the frequency-divided observation signal xf, t are input into the block unit 814, and the block unit 814 acquires and outputs the vector x=f, t as described in the eighth embodiment by using νf, t instead of νf, 0. The output vector x=f, t is transmitted to the adaptive gain estimation unit 911, the matrix estimation unit 915, and the suppression unit 92.
<Processing of Suppression Unit 92 (Step S92)>
The initial beamformer output zf, t output from the initial beamformer application unit 813 and the vector x=f, t output from the block unit 814 are input into the suppression unit 92. Using these, the suppression unit 92 acquires and outputs the target signal yf, t, which is based on the initial beamformer output zf, t (the initial beamformer output of the first time interval), the estimated steering vector νf, t (the estimated steering vector of the first time interval), the frequency-divided observation signal xf, t, and a convolutional beamformer w=f, t−1 (the convolutional beamformer of the second time interval, which is further in the past than the first time interval). For example, the suppression unit 92 acquires and outputs the target signal yf, t in accordance with equation (40) below.
y_{f,t} = z_{f,t} + w=_{f,t−1}^H x=_{f,t}    (40)
Here, the initial vector w=f, 0 of the convolutional beamformer w=f, t−1 may be any (LM+M−1)-dimensional vector. An example of the initial vector w=f, 0 is an (LM+M−1)-dimensional vector in which all elements are 0.
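A minimal sketch of this initialization and of the per-frame application of equation (40) (illustrative names; L and M are example values):

```python
import numpy as np

L, M = 8, 4                        # illustrative filter length and mic count
D = L * M + M - 1                  # dimension stated in the text
wbb = np.zeros(D, dtype=complex)   # initial vector w=_{f,0}: all zeros
# Step S92 then applies the previous frame's beamformer at each frame t:
# y_t = z_t + wbb.conj() @ xbb_t   # equation (40), computed before updating wbb
```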
<Processing of Adaptive Gain Estimation Unit 911 (Step S911)>
The vector x=f, t output from the block unit 814, an inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix output from the matrix estimation unit 915, and the power or estimated power σf, t2 of the target signal are input into the adaptive gain estimation unit 911. As σf, t2 input into the adaptive gain estimation unit 911, either the provisional power generated as illustrated in equation (17) or the estimated power σf, t2 generated as described in the third embodiment, for example, may be used. Note that the “˜” of “R˜−1f, t−1” should be written directly above the “R”, but due to notation limitations may also be written to the upper right of “R”. Using these, the adaptive gain estimation unit 911 acquires and outputs an adaptive gain kf, t (the adaptive gain of the first time interval) that is based on the inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the second time interval), the estimated steering vector νf, t (the estimated steering vector of the first time interval), the frequency-divided observation signal xf, t, and the power or estimated power σf, t2 of the target signal. For example, the adaptive gain estimation unit 911 acquires and outputs the adaptive gain kf, t as an (LM+M−1)-dimensional vector in accordance with equation (41) shown below.
Here, α is a forgetting factor, for example a real number in the range 0<α<1. Further, the initial matrix of the inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix may be any (LM+M−1)×(LM+M−1)-dimensional matrix; an example is the (LM+M−1)×(LM+M−1) identity matrix.
Note that R˜f, t itself is not calculated. The output adaptive gain kf, t is transmitted to the matrix estimation unit 915 and the convolutional beamformer estimation unit 912.
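Equation (41) is not reproduced in this text; the sketch below assumes the standard recursive-least-squares gain consistent with the forgetting factor α and the weighting by σf, t2:

```python
import numpy as np

def adaptive_gain(P, xbb, sigma2, alpha):
    """Step S911 sketch under the assumed RLS form of equation (41):
    k = P x= / (alpha * sigma^2 + x=^H P x=), where P is the previous
    inverse matrix R~^{-1}_{f,t-1}."""
    Px = P @ xbb
    return Px / (alpha * sigma2 + xbb.conj() @ Px)
```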
<Processing of Matrix Estimation Unit 915 (Step S915)>
The vector x=f, t output from the block unit 814 and the adaptive gain kf, t output from the adaptive gain estimation unit 911 are input into the matrix estimation unit 915. Using these, the matrix estimation unit 915 acquires and outputs an inverse matrix R˜−1f, t of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the first time interval) that is based on the adaptive gain kf, t (the adaptive gain of the first time interval), the estimated steering vector νf, t (the estimated steering vector of the first time interval), the frequency-divided observation signal xf, t, and the inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the second time interval). For example, the matrix estimation unit 915 acquires and outputs the inverse matrix R˜−1f, t of the weighted modified space-time covariance matrix in accordance with equation (42) below.
The output inverse matrix R˜−1f, t of the weighted modified space-time covariance matrix is transmitted to the adaptive gain estimation unit 911.
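Equation (42) is likewise not reproduced; the sketch assumes the rank-one update obtained from the matrix inversion lemma, which is exactly why R˜f, t itself never needs to be formed:

```python
import numpy as np

def update_inverse_covariance(P, k, xbb, alpha):
    """Step S915 sketch under the assumed RLS form of equation (42):
    P_t = (P_{t-1} - k x=^H P_{t-1}) / alpha."""
    return (P - np.outer(k, xbb.conj() @ P)) / alpha
```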
<Processing of Convolutional Beamformer Estimation Unit 912 (Step S912)>
The target signal yf, t output from the suppression unit 92 and the adaptive gain kf, t output from the adaptive gain estimation unit 911 are input into the convolutional beamformer estimation unit 912. Using these, the convolutional beamformer estimation unit 912 acquires and outputs the convolutional beamformer w=f, t (the convolutional beamformer of the first time interval), which is based on the adaptive gain kf, t (the adaptive gain of the first time interval), the target signal yf, t (the target signal of the first time interval), and the convolutional beamformer w=f, t−1 (the convolutional beamformer of the second time interval). For example, the convolutional beamformer estimation unit 912 acquires and outputs the convolutional beamformer w=f, t in accordance with equation (43) shown below.
w=_{f,t} = w=_{f,t−1} − k_{f,t} y_{f,t}^H    (43)
The output convolutional beamformer w=f, t is transmitted to the suppression unit 92.
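Putting steps S92, S911, S915, and S912 together, one frame of the ninth embodiment's sequential processing for a single frequency bin reduces to the following sketch (same assumed RLS forms as above):

```python
import numpy as np

def online_step(wbb, P, z_t, xbb_t, sigma2_t, alpha):
    """One frame for one frequency bin: suppression (40), adaptive gain
    (41, assumed form), inverse-covariance update (42, assumed form),
    beamformer update (43). Returns the target signal and the new state."""
    y_t = z_t + wbb.conj() @ xbb_t                    # (40)
    Px = P @ xbb_t
    k = Px / (alpha * sigma2_t + xbb_t.conj() @ Px)   # (41)
    P = (P - np.outer(k, xbb_t.conj() @ P)) / alpha   # (42)
    wbb = wbb - k * np.conj(y_t)                      # (43); y^H = y* for scalar y
    return y_t, wbb, P
```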
In the ninth embodiment and the modified example thereof, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto. A frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.
A known steering vector νf, t may be input into the initial beamformer application unit 813 and the block unit 814 instead of the estimated steering vector νf, t acquired by the parameter estimation unit 93. In this case, the initial beamformer application unit 813 and the block unit 814 perform steps S813 and S814, described above, using the steering vector νf, t instead of the estimated steering vector νf, t.
The frequency-divided observation signals xf, t input into the signal processing devices 1 to 9 described above may be any signals that correspond respectively to a plurality of frequency bands of an observation signal acquired by picking up an acoustic signal emitted from a sound source. For example, as illustrated in
The target signals yf, t output from the signal processing devices 1 to 9 may either be used in other processing (speech recognition processing or the like) without being transformed into time-domain signals y(i) or be transformed into a time-domain signal y(i). For example, as illustrated in
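When a time-domain signal y(i) is required, the target signals can be resynthesized by an inverse short-time Fourier transform. A minimal sketch with SciPy (an assumption; the sampling rate, 512-sample frame, and 128-sample shift are illustrative values that must simply match the forward transform):

```python
import numpy as np
from scipy.signal import stft, istft  # assumption: SciPy is available

fs, nperseg, hop = 16000, 512, 128                 # illustrative values only
x = np.random.default_rng(0).standard_normal(fs)   # 1 s dummy observation
_, _, Y = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
# ... Y would be replaced per bin by the target signals y_{f,t} here ...
_, y_time = istft(Y, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
```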
Test results relating to the methods of the respective embodiments will be illustrated below.
Next, noise/reverberation suppression results acquired by the first embodiment and conventional methods 1 to 3 will be illustrated.
In this test, a data set of the “REVERB Challenge” was used as the observation signal. The data set contains acoustic data (Real Data) acquired by picking up English-language speech read aloud in a room with stationary noise and reverberation, using microphones disposed at positions 0.5 to 2.5 m away from the speaker, and acoustic data (Sim Data) acquired by simulating this environment. The number of microphones was M=8. The frequency-divided observation signals were determined by the short-time Fourier transform, with the frame length set at 32 milliseconds, the frame shift set at 4, and the prediction delay set at d=4. Using these data, the speech quality and speech recognition precision of signals subjected to noise/reverberation suppression in accordance with the present invention and conventional methods 1 to 3 were evaluated.
Note that the present invention is not limited to the embodiments described above. For example, in the above embodiments, d is set at the same value in all of the frequency bands, but d may be set for each frequency band. In other words, a positive integer df may be used instead of d. Similarly, in the above embodiments, L is set at the same value in all of the frequency bands, but L may be set for each frequency band. In other words, a positive integer Lf may be used instead of L.
In the first to third embodiments, examples were described in which batch processing is performed by determining the cost functions and so on (equations (2), (7), (12), (13), (14), and (18)) using a time frame corresponding to 1≤t≤N as a processing unit, but the present invention is not limited thereto. For example, rather than using a time frame corresponding to 1≤t≤N as a processing unit, the processing may be executed using a partial time frame thereof as a processing unit. Alternatively, the time frame used as the processing unit may be updated in real time, and the processing may be executed by determining the cost functions and so on in processing units of each time point. For example, when the number of the current time frame is expressed as tc, a time frame corresponding to 1≤t≤tc may be set as the processing unit, or a time frame corresponding to tc−η≤t≤tc may be set as the processing unit in relation to a positive integer constant η.
The various types of processing described above do not have to be executed in time series, as described above, and may be executed in parallel or individually either in accordance with the processing power of the device that executes the processing or in accordance with necessity. Furthermore, the processing may be modified appropriately within a scope that does not depart from the spirit of the present invention.
The devices described above are configured by, for example, having a general-purpose or dedicated computer including a processor (a hardware processor) such as a CPU (central processing unit) and a memory such as a RAM (random-access memory)/ROM (read-only memory) execute a predetermined program. The computer may include one processor and one memory, or pluralities of processors and memories. The program may be either installed in the computer or recorded in the ROM or the like in advance. Further, instead of electronic circuitry, such as a CPU, that realizes a functional configuration by reading a program, some or all of the processing units may be configured using electronic circuitry that realizes processing functions without the use of a program. Electronic circuitry constituting a single device may include a plurality of CPUs.
When the configurations described above are realized by a computer, the processing content of the functions to be included in the devices is described by the program. The computer realizes the processing functions described above by executing the program. The program describing the processing content may be recorded in advance on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of this type of recording medium include a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, and so on.
The program is distributed by, for example, selling, transferring, renting, etc. a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Further, the program may be stored in a storage device of a server computer and distributed by being transferred from the server computer to another computer over a network.
For example, the computer that executes the program first stores the program recorded on the portable recording medium or transferred from the server computer temporarily in a storage device included therein. During execution of the processing, the computer reads the program stored in the storage device included therein and executes processing corresponding to the read program. As a different form of execution of the program, the computer may read the program directly from the portable recording medium and execute processing corresponding to the program. Alternatively, every time the program is transferred to the computer from the server computer, the computer may execute processing corresponding to the received program. Instead of transferring the program from the server computer to the computer, the processing described above may be executed by a so-called ASP (Application Service Provider) type service, in which processing functions are realized only by issuing commands to execute the processing and acquiring results.
Instead of realizing the processing functions of the present device by executing a predetermined program on a computer, at least some of the processing functions may be realized by hardware.
The present invention can be used in various applications in which it is necessary to suppress noise and reverberation from an acoustic signal. For example, the present invention can be used in speech recognition, call systems, conference call systems, and so on.
Number | Date | Country | Kind
---|---|---|---
2018-234075 | Dec 2018 | JP | national
PCT/JP2019/016587 | Apr 2019 | JP | national
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2019/029921 | 7/31/2019 | WO | 00