The present invention relates to a signal processing technique for an acoustic signal.
NPL 1 and NPL 2 disclose a method of suppressing noise and reverberation from an observation signal in the frequency domain. In this method, reverberation and noise are suppressed by receiving an observation signal in the frequency domain and a steering vector representing the direction of a sound source or an estimated vector thereof, estimating an instantaneous beamformer for minimizing the power of the frequency-domain observation signal under a constraint condition that sound reaching a microphone from the sound source is not distorted, and applying the instantaneous beamformer to the frequency-domain observation signal (conventional method 1).
PTL 1 and NPL 3 disclose a method of suppressing reverberation from an observation signal in the frequency domain. In this method, reverberation in an observation signal in the frequency domain is suppressed by receiving an observation signal in the frequency domain and the power of a target sound at each time, or an estimated value thereof, estimating a reverberation suppression filter for suppressing reverberation in the target sound on the basis of a weighted power minimization reference of a prediction error, and applying the reverberation suppression filter to the frequency-domain observation signal (conventional method 2).
NPL 4 discloses a method of suppressing noise and reverberation by cascade-connecting conventional method 2 and conventional method 1. In this method, at a prior stage, an observation signal in the frequency domain and the power of a target sound at each time are received and reverberation is suppressed using conventional method 2, and then, at a later stage, a steering vector is received and reverberation and noise are further suppressed using conventional method 1 (conventional method 3).
In the conventional methods, it may be impossible to sufficiently suppress reverberation and noise. Conventional method 1 is a method originally developed for the purpose of suppressing noise and may not always be capable of sufficiently suppressing reverberation. With conventional method 2, noise cannot be suppressed. Conventional method 3 can suppress more noise and reverberation than when conventional method 1 or conventional method 2 is used alone. With conventional method 3, however, conventional method 2 serving as the prior stage and conventional method 1 serving as the later stage are viewed as independent systems and optimization is performed in the respective systems. Therefore, when conventional method 2 is applied at the prior stage, it may not always be possible to sufficiently suppress reverberation due to the effects of noise. Further, when conventional method 1 is applied at the later stage, it may not always be possible to sufficiently suppress noise and reverberation due to the effects of residual reverberation.
The present invention has been designed in consideration of these points, and an object thereof is to provide a technique with which noise and reverberation can be sufficiently suppressed.
In the present invention, a convolutional beamformer that calculates, at each time, a weighted sum of a current signal and a past signal sequence having a predetermined delay and a length of 0 or more is acquired. The convolutional beamformer is acquired such that estimation signals, which are acquired by applying the convolutional beamformer to frequency-divided observation signals corresponding respectively to a plurality of frequency bands of observation signals acquired by picking up acoustic signals emitted from a sound source, increase a probability expressing a speech-likeness of the estimation signals based on a predetermined probability model. Target signals are then acquired by applying the acquired convolutional beamformer to the frequency-divided observation signals.
In the present invention, the convolutional beamformer is acquired such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the probability model, and therefore noise suppression and reverberation suppression can be optimized as a single system, with the result that noise and reverberation can be sufficiently suppressed.
Embodiments of the present invention will be described below.
[Definitions of Symbols]
First, symbols used in the embodiments will be defined.
M: M is a positive integer expressing a number of microphones. For example, M≥2.
m: m is a positive integer expressing the microphone number, and satisfies 1≤m≤M. The microphone number is represented by upper right superscript in round parentheses. In other words, a value or a vector based on a signal picked up by a microphone having the microphone number m is represented by a symbol having the upper right superscript “(m)” (for example, xf, t(m)).
N: N is a positive integer expressing the total number of time frames of signals. For example, N≥2.
t, τ: t and τ are positive integers expressing the time frame number, and t satisfies 1≤t≤N. The time frame number is represented by lower right subscript. In other words, a value or a vector corresponding to a time frame having the time frame number t is represented by a symbol having the lower right subscript “t” (for example, xf, t(m)). Similarly, a value or a vector corresponding to a time frame having the time frame number τ is represented by a symbol having the lower right subscript “τ”.
P: P is a positive integer expressing a total number of frequency bands (discrete frequencies). For example, P≥2.
f: f is a positive integer expressing the frequency band number, and satisfies 1≤f≤P. The frequency band number is represented by lower right subscript. In other words, a value or a vector corresponding to a frequency band having the frequency band number f is represented by a symbol having the lower right subscript “f” (for example, xf, t(m)).
T: T expresses a non-conjugated transpose of a matrix or a vector. α0T represents a matrix or a vector acquired by non-conjugated transposition of α0.
H: H expresses a conjugated transpose of a matrix or a vector. α0H represents a matrix or a vector acquired by conjugated transposition of α0.
|α0|: |α0| expresses the absolute value of α0.
∥α0∥: ∥α0∥ expresses the norm of α0.
|α0|γ: |α0|γ expresses a weighted absolute value γ|α0| of α0.
∥α0∥γ: ∥α0∥γ expresses a weighted norm γ∥α0∥ of α0.
In this specification, a “target signal” denotes a signal corresponding to a direct sound and an initial reflected sound, within a signal (for example, a frequency-divided observation signal) corresponding to a sound emitted from a target sound source and picked up by a microphone. The initial reflected sound denotes a reverberation component derived from the sound emitted from the target sound source that reaches the microphone at a delay of no more than several tens of milliseconds following the direct sound. The initial reflected sound typically acts to improve the clarity of the sound, and in this embodiment, a signal corresponding to the initial reflected sound is also included in the target signal. Here, the signal corresponding to the sound picked up by the microphone also includes, in addition to the target signal described above, late reverberation (a component acquired by excluding the initial reflected sound from the reverberation) derived from the sound emitted from the target sound source, and noise derived from a source other than the target sound source. In a signal processing method, the target signal is estimated by suppressing late reverberation and noise from a frequency-divided observation signal corresponding to a sound recorded by the microphone, for example. In this specification, unless specified otherwise, “reverberation” is assumed to refer to “late reverberation”.
[Principles]
Next, principles will be described.
<Prerequisite Method 1>
Method 1 serving as a prerequisite of the method according to the embodiments will now be described. In method 1, noise and reverberation are suppressed from an M-dimensional observation signal (frequency-divided observation signals) in the frequency domain:
x_{f,t} = [x_{f,t}^{(1)}, x_{f,t}^{(2)}, ..., x_{f,t}^{(M)}]^T   (1)
The frequency-divided observation signals xf, t are acquired by transforming M observation signals, which are acquired by picking up acoustic signals emitted from one or a plurality of sound sources using M microphones, to the frequency domain. The observation signals are acquired by picking up the acoustic signals emitted from the sound sources in an environment where noise and reverberation exist. xf, t(m) is acquired by transforming the observation signal picked up by the microphone having the microphone number m to the frequency domain. xf, t(m) corresponds to the frequency band having the frequency band number f and the time frame having the time frame number t. In other words, the frequency-divided observation signals xf, t are time series signals.
In method 1, an instantaneous beamformer wf, 0 for minimizing a cost function C1 (wf, 0) below is determined for each frequency band under the constraint condition in which “the target signals are not distorted as a result of applying an instantaneous beamformer (for example, a minimum power distortionless response beamformer) wf, 0 for calculating the weighted sum of the signals at the current time to the frequency-divided observation signals xf, t at each time”.
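A standard cost function of this type is, for example, the following (this specific form is a reconstruction consistent with the above description):
C1(w_{f,0}) = Σ_{t=1}^{N} |w_{f,0}^H x_{f,t}|^2   (cf. (2))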
Note that the lower right subscript “0” of wf, 0 does not represent the time frame number, wf, 0 being independent of the time frame. The constraint condition is a condition in which, for example, wf, 0Hνf, 0 is a constant (1, for example). Here,
ν_{f,0} = [ν_{f,0}^{(1)}, ν_{f,0}^{(2)}, ..., ν_{f,0}^{(M)}]^T   (4)
is a steering vector having, as an element, a transfer function νf, 0(m) relating to the direct sound and the initial reflected sound from the sound source to each microphone (the sound pickup position of the acoustic signal), or an estimated vector (an estimated steering vector) thereof. In other words, νf, 0 is expressed by an M-dimensional (the dimension of the number of microphones) vector having, as an element, the transfer function νf, 0(m), which corresponds to the direct sound and initial reflected sound parts of an impulse response from the sound source position to each microphone (i.e. the reverberation that arrives at a delay of no more than several tens of milliseconds (for example, within 30 milliseconds) following the direct sound). When it is difficult to estimate the gain of the steering vector, a normalized vector acquired by normalizing the transfer function of each element so that the gain of a microphone having one of the microphone numbers m0∈{1, . . . , M} becomes a constant g (g≠0) may be used as νf, 0. In other words, as illustrated below, a normalized vector may be used as νf, 0.
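For example, with m0 as the selected microphone number, a normalized vector of the following form may be used (this specific form is an assumption consistent with the above description, where ν′_{f,0}^{(m)}, a symbol introduced here, denotes the unnormalized transfer function of each element):
ν_{f,0} = (g / ν′_{f,0}^{(m0)}) [ν′_{f,0}^{(1)}, ν′_{f,0}^{(2)}, ..., ν′_{f,0}^{(M)}]^T   (cf. (5))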
By applying the instantaneous beamformer wf, 0 acquired as described above to the frequency-divided observation signal xf, t of each frequency band in the manner illustrated below, a target signal yf, t in which noise and reverberation have been suppressed from the frequency-divided observation signal xf, t is acquired.
y_{f,t} = w_{f,0}^H x_{f,t}   (6)
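As an illustration only, the processing of method 1 for a single frequency band can be sketched as follows in Python with numpy; the function name and array shapes are assumptions introduced here, not part of the method itself.

    import numpy as np

    def mpdr_beamformer(X, v):
        # X: (M, N) frequency-divided observation signals x_{f,t} of one band f.
        # v: (M,) steering vector nu_{f,0} or an estimated vector thereof.
        M, N = X.shape
        R = X @ X.conj().T / N                 # spatial covariance of the observations
        Rinv_v = np.linalg.solve(R, v)
        w = Rinv_v / (v.conj() @ Rinv_v)       # minimize output power subject to w^H v = 1
        y = w.conj() @ X                       # equation (6): y_{f,t} = w_{f,0}^H x_{f,t}
        return y, w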
<Prerequisite Method 2>
Method 2 serving as a prerequisite of the method according to the embodiments will now be described. In method 2, reverberation is suppressed from the frequency-divided observation signal xf, t. In method 2, a reverberation suppression filter Ff, τ for minimizing a cost function C2 (Ff) below is determined for τ=d, d+1, . . . , d+L−1 in each frequency band.
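A weighted prediction error cost of this type can be written, for example, as follows (this specific form is a reconstruction consistent with the description and with equation (8) below):
C2(F_f) = Σ_{t=1}^{N} ∥x_{f,t} − Σ_{τ=d}^{d+L−1} F_{f,τ}^H x_{f,t−τ}∥^2 / σ_{f,t}^2   (cf. (7))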
Here, the reverberation suppression filter Ff, τ is an M×M-dimensional matrix filter for suppressing reverberation from the frequency-divided observation signal xf, t. d is a positive integer expressing a prediction delay. L is a positive integer expressing the filter length. σf, t2 is the power of the target signal at the time frame t.
∥x∥γ relating to the frequency-divided observation signal x is the weighted norm ∥x∥γ = γ(x^H x).
By applying the reverberation suppression filter Ff, t acquired as described above to the frequency-divided observation signal xf, t of each frequency band in the manner illustrated below, a target signal zf, t in which reverberation has been suppressed from the frequency-divided observation signal xf, t is acquired.
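For example (this form is a reconstruction consistent with the description):
z_{f,t} = x_{f,t} − Σ_{τ=d}^{d+L−1} F_{f,τ}^H x_{f,t−τ}   (cf. (8))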
Here, the target signal zf, t is an M-dimensional column vector, as shown below.
z_{f,t} = [z_{f,t}^{(1)}, z_{f,t}^{(2)}, ..., z_{f,t}^{(M)}]^T
<Method of Embodiments>
The method of the embodiments will now be described. A target signal yf, t acquired by suppressing noise and reverberation from the frequency-divided observation signal xf, t by using a method integrating methods 1 and 2 can be modeled as follows.
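A model of this type can be written, for example, as follows (a reconstruction consistent with equations (9A) and (16) below; in the notation defined below, the right side equals w̄_f^H x̄_{f,t}):
y_{f,t} = w_{f,0}^H x_{f,t} + Σ_{τ=d}^{d+L−1} w_{f,τ}^H x_{f,t−τ}   (cf. (9))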
Here, with respect to τ≠0, w_{f,τ} = F_{f,τ} w_{f,0}, and w_{f,τ} corresponds to a filter for performing noise suppression and reverberation suppression simultaneously. w−f is a convolutional beamformer that calculates a weighted sum of a current signal and a past signal sequence having a predetermined delay at each time. Note that the “−” of “w−f” should be written directly above the “w” (that is, w̄_f), but due to notation limitations may also be written to the upper right of “w”.
The convolutional beamformer w−f calculates the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time point. The convolutional beamformer w−f is expressed as shown below, for example,
w̄_f = [w̄_f^{(1)T}, w̄_f^{(2)T}, ..., w̄_f^{(M)T}]^T   (10)
where the following is satisfied.
w̄_f^{(m)} = [w_{f,0}^{(m)}, w_{f,d}^{(m)}, w_{f,d+1}^{(m)}, ..., w_{f,d+L−1}^{(m)}]^T   (10A)
Further, x−f, t is expressed as follows.
x̄_{f,t} = [x̄_{f,t}^{(1)T}, x̄_{f,t}^{(2)T}, ..., x̄_{f,t}^{(M)T}]^T   (11)
x̄_{f,t}^{(m)} = [x_{f,t}^{(m)}, x_{f,t−d}^{(m)}, x_{f,t−d−1}^{(m)}, ..., x_{f,t−d−L+1}^{(m)}]^T   (11A)
Note that throughout this specification, cases in which L=0 in equations (9) to (11A) are also assumed to be included in the convolutional beamformer of the present invention. In other words, even cases in which the length of the past signal sequence used by the convolutional beamformer to calculate the weighted sum is 0 are treated as examples of realization of the convolutional beamformer. At this time, the summation term in equation (9) becomes 0, and therefore equation (9) becomes equation (9A), shown below. Further, the respective right sides of equations (10A) and (11A) become vectors constituted respectively by only one first element (i.e., scalars), and therefore become equations (10AA) and (11AA), respectively.
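That is, in this case the following hold:
y_{f,t} = w_{f,0}^H x_{f,t}   (9A)
w̄_f^{(m)} = w_{f,0}^{(m)}   (10AA)
x̄_{f,t}^{(m)} = x_{f,t}^{(m)}   (11AA)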
Note that the convolutional beamformer w−f of equation (9A) is a beamformer that calculates, at each time point, the weighted sum of the current signal and a signal sequence having a predetermined delay and a length of 0, and therefore the convolutional beamformer calculates the weighted value of the current signal at each time point. Further, as will be described below, even when L=0, the signal processing device of the present invention can acquire the target signal by determining a convolutional beamformer on the basis of a probability expressing a speech-likeness and applying the convolutional beamformer to the frequency-divided observation signals.
Here, assuming that yf, t in equation (9) preferably conforms to a speech probability density function p ({yf, t}t=1:N; w−f) (a probability model), the signal processing device determines the convolutional beamformer w−f such that it increases the probability p ({yf, t}t=1:N; w−f) (in other words, a probability expressing the speech-likeness of yf, t) of yf, t based on the speech probability density function. Preferably, the convolutional beamformer w−f which maximizes the probability expressing the speech-likeness of yf, t is determined. For example, the signal processing device determines the convolutional beamformer w−f such that it increases log p ({yf, t}t=1:N; w−f), and preferably determines the convolutional beamformer w−f which maximizes log p ({yf, t}t=1:N; w−f).
A complex normal distribution having an average of 0 and a variance matching the power σf, t2 of the target signal can be cited as an example of a speech probability density function. The “target signal” is a signal corresponding to the direct sound and the initial reflected sound, within a signal corresponding to a sound emitted from a target sound source and picked up by a microphone. Further, the signal processing device determines the convolutional beamformer w−f under the constraint condition in which “the target signals are not distorted as a result of applying the convolutional beamformer w−f to the frequency-divided observation signals xf, t”, for example. This constraint condition is a condition in which, for example, wf, 0Hνf, 0 is a constant (1, for example). On the basis of this constraint condition, for example, the signal processing device determines w−f which maximizes log p ({yf, t}t=1:N; w−f), which is determined as shown below, for each frequency band.
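Under a zero-mean complex normal model whose variance is σ_{f,t}^2, this log probability takes, for example, the following form (a reconstruction consistent with the description):
log p({y_{f,t}}_{t=1:N}; w̄_f) = −Σ_{t=1}^{N} |y_{f,t}|^2 / σ_{f,t}^2 + const.   (cf. (12))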
Here, “const.” expresses a constant.
The following function, which is acquired by subtracting the constant term (const.) from log p ({yf, t}t=1:N; w−f) in equation (12) and reversing the plus/minus sign, is set as a cost function C3 (w−f).
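With y_{f,t} = w̄_f^H x̄_{f,t}, this cost function is, for example (a reconstruction consistent with the description):
C3(w̄_f) = Σ_{t=1}^{N} |w̄_f^H x̄_{f,t}|^2 / σ_{f,t}^2 = w̄_f^H R_f w̄_f   (cf. (13))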
Here, R is a weighted space-time covariance matrix determined as shown below.
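For example (a reconstruction consistent with the description):
R_f = Σ_{t=1}^{N} x̄_{f,t} x̄_{f,t}^H / σ_{f,t}^2   (cf. (14))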
The signal processing device may determine w−f which minimizes the cost function C3 (w−f) of equation (13) under the constraint condition described above (in which, for example, wf, 0Hνf, 0 is a constant), for example.
The analytical solution of w−f for minimizing the cost function C3 (w−f) under the constraint condition described above (in which, for example, wf, 0Hνf, 0=1) is as shown below.
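Under the constraint condition w̄_f^H v̄_f = 1, the standard solution of this constrained quadratic minimization is (a reconstruction consistent with the description):
w̄_f = R_f^{−1} v̄_f / (v̄_f^H R_f^{−1} v̄_f)   (cf. (15))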
Here, v̄_f is a vector acquired by disposing the elements νf, 0(m) of the steering vector νf, 0 as follows.
v̄_f = [ṽ_f^{(1)T}, ṽ_f^{(2)T}, ..., ṽ_f^{(M)T}]^T
ṽ_f^{(m)} = [ν_{f,0}^{(m)}, 0, ..., 0]^T
Here, ṽ_f^{(m)} is an (L+1)-dimensional column vector having ν_{f,0}^{(m)} and L zeros as elements.
The signal processing device acquires the target signal yf, t by applying the determined convolutional beamformer w−f to the frequency-divided observation signal xf, t as follows.
y_{f,t} = w̄_f^H x̄_{f,t}   (16)
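As an illustration only, the batch processing of equations (10) to (16) for one frequency band can be sketched as follows in Python with numpy; the function name, the array layouts, and the small diagonal loading term are assumptions introduced here.

    import numpy as np

    def wpd_convolutional_beamformer(X, v, sigma2, d=3, L=10):
        # X: (M, N) frequency-divided observation signals of one band f.
        # v: (M,) steering vector nu_{f,0} or an estimate; sigma2: (N,) target power.
        M, N = X.shape
        K = M * (L + 1)
        Xbar = np.zeros((K, N), dtype=complex)   # stacked vectors of (11)/(11A)
        for t in range(N):
            taps = np.zeros((M, L + 1), dtype=complex)
            taps[:, 0] = X[:, t]                 # current frame
            for l in range(L):                   # past frames delayed by d
                if t - d - l >= 0:
                    taps[:, l + 1] = X[:, t - d - l]
            Xbar[:, t] = taps.reshape(-1)        # microphone-major ordering as in (10A)
        R = (Xbar / sigma2) @ Xbar.conj().T      # weighted covariance, cf. (14)
        R += 1e-6 * np.abs(np.trace(R)) / K * np.eye(K)   # diagonal loading (assumption)
        vbar = np.zeros(K, dtype=complex)
        vbar[0::L + 1] = v                       # nu_{f,0}^{(m)} followed by L zeros
        Rinv_v = np.linalg.solve(R, vbar)
        w = Rinv_v / (vbar.conj() @ Rinv_v)      # analytical solution, cf. (15)
        y = w.conj() @ Xbar                      # target signals, equation (16)
        return y, w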
Next, a first embodiment will be described.
As illustrated in the figures, a signal processing device 1 of this embodiment includes an estimation unit 11 and a suppression unit 12, and executes the processing of steps S11 and S12 described below.
<Step S11>
As illustrated in the figures, the frequency-divided observation signals xf, t are first input into the estimation unit 11.
The estimation unit 11 acquires and outputs the convolutional beamformer w−f for calculating the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the predetermined probability model where the estimation signals are acquired by applying the convolutional beamformer w−f to the frequency-divided observation signals xf, t in respective frequency bands. For example, the estimation unit 11 determines the convolutional beamformer w−f such that it increases the probability expressing speech-likeness of yf, t based on the probability density function p ({yf, t}t=1:N; w−f) (such that log p ({yf, t}t=1:N; w−f) is increased, for example). The estimation unit 11 preferably determines the convolutional beamformer w−f which maximizes the probability (maximizes log p ({yf, t}t=1:N; w−f), for example).
<Step S12>
The frequency-divided observation signal xf, t and the convolutional beamformer w−f acquired in step S11 are input into the suppression unit 12. The suppression unit 12 acquires and outputs the target signal yf, t (the estimation signal) by applying the convolutional beamformer w−f to the frequency-divided observation signal xf, t in each frequency band. For example, the suppression unit 12 acquires and outputs the target signal yf, t by applying w−f to x−f, t as shown in equation (16).
<Features of this Embodiment>
In this embodiment, the convolutional beamformer w−f for calculating the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the predetermined probability model is determined where the estimation signals are acquired by applying the convolutional beamformer w−f to the frequency-divided observation signals xf, t. This corresponds to optimizing noise suppression and reverberation suppression as a single system. In this embodiment, therefore, noise and reverberation can be suppressed more adequately than with the conventional methods.
Next, a second embodiment will be described. Hereafter, processing units and steps described heretofore will be cited using identical reference numerals, and description thereof will be simplified.
As illustrated in the figures, a signal processing device 2 of this embodiment includes an estimation unit 21, which has a matrix estimation unit 211 and a convolutional beamformer estimation unit 212, and the suppression unit 12 described in the first embodiment.
The estimation unit 21 of this embodiment acquires and outputs the convolutional beamformer w−f which minimizes a sum of values (the cost function C3 (w−f) of equation (13), for example) acquired by weighting the power of the estimation signals at each time belonging to a predetermined time interval by the reciprocal of the power σf, t2 of the target signals or the reciprocal of the estimated power σf, t2 of the target signals under the constraint condition in which “the target signals are not distorted as a result of applying the convolutional beamformer w−f to the frequency-divided observation signals xf, t”. As illustrated in equation (9), the convolutional beamformer w−f is equivalent to a beamformer acquired by integrating a reverberation suppression filter Ff, t for suppressing reverberation from the frequency-divided observation signal xf, t and the instantaneous beamformer wf, 0 for suppressing noise from a signal acquired by applying the reverberation suppression filter Ff, t to the frequency-divided observation signal xf, t. Further, the constraint condition is a condition in which, for example, “a value acquired by applying an instantaneous beamformer to a steering vector having, as an element, transfer functions relating to the direct sound and the initial reflected sound from the sound source to the pickup position of the acoustic signals, or an estimated steering vector, which is an estimated vector of the steering vector, is a constant (wf, 0Hνf, 0 is a constant)”. The processing will be described in detail below.
<Step S211>
As illustrated in the figures, the frequency-divided observation signal xf, t and the power σf, t2 of the target signals or the estimated power σf, t2 thereof are input into the matrix estimation unit 211. The matrix estimation unit 211 acquires and outputs the weighted space-time covariance matrix Rf in each frequency band.
<Step S212>
The steering vector or estimated steering vector νf, 0 (equation (4) or (5)) and the weighted space-time covariance matrix Rf acquired in step S211 are input into the convolutional beamformer estimation unit 212. The convolutional beamformer estimation unit 212 acquires and outputs the convolutional beamformer w−f on the basis of the weighted space-time covariance matrix Rf and the steering vector or estimated steering vector νf, 0. For example, the convolutional beamformer estimation unit 212 acquires and outputs the convolutional beamformer w−f in accordance with equation (15).
<Step S12>
This step is identical to the first embodiment, and therefore description thereof has been omitted.
<Features of this Embodiment>
In this embodiment, the weighted space-time covariance matrix Rf is acquired, and on the basis of the weighted space-time covariance matrix Rf and the steering vector or estimated steering vector νf, 0, the convolutional beamformer w−f is acquired. This corresponds to optimizing noise suppression and reverberation suppression as a single system. In this embodiment, therefore, noise and reverberation can be suppressed more adequately than with the conventional methods.
Next, a third embodiment will be described. In this embodiment, an example of a method of generating σf, t2 and νf, 0 will be described.
As illustrated in the figures, a signal processing device 3 of this embodiment includes the estimation unit 21 and the suppression unit 12 described above, and further includes a parameter estimation unit 33 that generates the estimated power σf, t2 of the target signal and the estimated steering vector νf, 0.
Hereafter, only the processing executed by the parameter estimation unit 33, which differs from the second embodiment, will be described. The processing performed by the other processing units is as described in the first and second embodiments.
<Step S330>
The frequency-divided observation signal xf, t is input into the initial setting unit 330. Using the frequency-divided observation signal xf, t, the initial setting unit 330 generates and outputs a provisional power σf, t2, which is a provisional value of the estimated power σf, t2 of the target signal. For example, the initial setting unit 330 generates and outputs the provisional power σf, t2 as follows.
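For example, the average power over the microphones may be used; this specific choice is an assumption consistent with the description:
σ_{f,t}^2 = (1/M) Σ_{m=1}^{M} |x_{f,t}^{(m)}|^2   (cf. (17))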
<Step S332>
The frequency-divided observation signals xf, t and the newest provisional powers σf, t2 are input into the reverberation suppression filter estimation unit 332. The reverberation suppression filter estimation unit 332 determines and outputs a reverberation suppression filter Ff, τ for minimizing the cost function C2 (Ff) of equation (7) with respect to τ=d, d+1, . . . , d+L−1 in each frequency band.
<Step S333>
The frequency-divided observation signal xf, t and the newest reverberation suppression filter Ff, t acquired in step S332 are input into the reverberation suppression filter application unit 333. The reverberation suppression filter application unit 333 acquires and outputs an estimation signal y′f, t by applying the reverberation suppression filter Ff, t to the frequency-divided observation signal xf, t in each frequency band. For example, the reverberation suppression filter application unit 333 sets zf, t, acquired in accordance with equation (8), as y′f, t and outputs y′f, t.
<Step S334>
The newest estimation signal y′f, t acquired in step S333 is input into the steering vector estimation unit 334. Using the estimation signal y′f, t, the steering vector estimation unit 334 acquires and outputs a provisional steering vector νf, 0, which is a provisional vector of the estimated steering vector, in each frequency band. For example, the steering vector estimation unit 334 acquires and outputs the provisional steering vector νf, 0 for the estimation signal y′f, t in accordance with a steering vector estimation method described in NPL 1 and NPL 2. For example, as the provisional steering vector νf, 0, the steering vector estimation unit 334 outputs a steering vector estimated using y′f, t as yf, t according to NPL 2. Further, as noted above, a normalized vector acquired by normalizing the transfer function of each element so that the gain of a microphone having any one of the microphone numbers m0∈{1, . . . , M} becomes a constant g may be used as νf, 0 (equation (5)).
<Step S335>
The newest estimation signal y′f, t acquired in step S333 and the newest provisional steering vector νf, 0 acquired in step S334 are input into the instantaneous beamformer estimation unit 335. The instantaneous beamformer estimation unit 335 acquires and outputs an instantaneous beamformer wf, 0 for minimizing C1 (wf, 0) shown below in equation (18), which is acquired by setting xf, t=y′f, t in equation (2), in each frequency band on the basis of the constraint condition that “wf, 0Hνf, 0 is a constant”.
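That is, for example (obtained by substituting y′f, t for xf, t in the reconstruction of equation (2) given above):
C1(w_{f,0}) = Σ_{t=1}^{N} |w_{f,0}^H y′_{f,t}|^2   (cf. (18))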
<Step S336>
The newest estimation signal y′f, t acquired in step S333 and the newest instantaneous beamformer wf, 0 acquired in step S335 are input into the instantaneous beamformer application unit 336. The instantaneous beamformer application unit 336 acquires and outputs an estimation signal y″f, t by applying the instantaneous beamformer wf, 0 to the estimation signal y′f, t in each frequency band. For example, the instantaneous beamformer application unit 336 acquires and outputs the estimation signal y″f, t as follows.
y″_{f,t} = w_{f,0}^H y′_{f,t}   (19)
<Step S331>
The newest estimation signal y″f, t acquired in step S336 is input into the power estimation unit 331. The power estimation unit 331 outputs the power of the estimation signal y″f, t as the provisional power σf, t2 in each frequency band. For example, the power estimation unit 331 generates and outputs the provisional power σf, t2 as follows.
σ_{f,t}^2 = |y″_{f,t}|^2 = y″_{f,t}^H y″_{f,t}   (20)
<Step S337a>
The control unit 337 determines whether or not a termination condition is satisfied. There are no limitations on the termination condition, but for example, the termination condition may be satisfied when the number of repetitions of the processing of steps S331 to S336 exceeds a predetermined value, when the variation in σf, t2 or νf, 0 falls to or below a predetermined value after the processing of steps S331 to S336 is performed once, and so on. When the termination condition is not satisfied, the processing returns to step S332. When the termination condition is satisfied, on the other hand, the processing advances to step S337b.
<Step S337b>
In step S337b, the power estimation unit 331 outputs σf, t2 acquired most recently in step S331 as the estimated power of the target signal, and the steering vector estimation unit 334 outputs νf, 0 acquired most recently in step S334 as the estimated steering vector. As illustrated in the figures, the estimated power σf, t2 and the estimated steering vector νf, 0 are input into the estimation unit 21, whereupon the processing described in the second embodiment is executed.
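As an illustration only, one possible realization of the loop of steps S330 to S337b for a single frequency band can be sketched as follows in Python with numpy; the closed-form filter estimation and the eigenvector-based steering vector step are simplifications standing in for the methods of the cited literature, and all function names, shapes, and parameter values are assumptions.

    import numpy as np

    def stack_past(X, d, L):
        # Stack L past frames [x_{t-d}, ..., x_{t-d-L+1}] below one another for each t.
        M, N = X.shape
        Xp = np.zeros((L * M, N), dtype=complex)
        for l in range(L):
            Xp[l * M:(l + 1) * M, d + l:] = X[:, :N - d - l]
        return Xp

    def estimate_parameters(X, d=3, L=10, n_iter=5, eps=1e-6):
        # X: (M, N) frequency-divided observation signals of one band f.
        M, N = X.shape
        sigma2 = np.mean(np.abs(X) ** 2, axis=0)        # step S330: provisional power
        Xp = stack_past(X, d, L)
        for _ in range(n_iter):                         # step S337a: fixed iteration count
            # Step S332: reverberation suppression filter minimizing C2 (cf. (7)).
            G = np.linalg.solve((Xp / sigma2) @ Xp.conj().T + eps * np.eye(L * M),
                                (Xp / sigma2) @ X.conj().T)
            Y1 = X - G.conj().T @ Xp                    # step S333: cf. equation (8)
            # Step S334: provisional steering vector as the principal eigenvector of
            # the dereverberated covariance, normalized at microphone 1; this is a
            # simplification standing in for the methods of NPL 1 and NPL 2.
            _, V = np.linalg.eigh(Y1 @ Y1.conj().T / N)
            v = V[:, -1] / V[0, -1]
            # Steps S335 and S336: instantaneous beamformer (cf. (18)) and its output.
            R1 = Y1 @ Y1.conj().T / N + eps * np.eye(M)
            w0 = np.linalg.solve(R1, v)
            w0 /= v.conj() @ w0
            y2 = w0.conj() @ Y1                         # cf. equation (19)
            sigma2 = np.abs(y2) ** 2 + eps              # step S331: cf. equation (20)
        return sigma2, v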
As described above, the steering vector is estimated on the basis of the frequency-divided observation signal xf, t. Here, when the steering vector is estimated after suppressing (preferably, removing) reverberation from the frequency-divided observation signal xf, t, the estimation precision improves. In other words, by acquiring a frequency-divided reverberation-suppressed signal in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed, and acquiring the estimated steering vector from the frequency-divided reverberation-suppressed signal, the precision of the estimated steering vector can be improved.
As illustrated in the figures, a signal processing device of this embodiment includes a parameter estimation unit 43 having a reverberation suppression unit 431 and a steering vector estimation unit 432.
The fourth embodiment differs from the first to third embodiments in that before generating the estimated steering vector, the reverberation component of the frequency-divided observation signal xf, t is suppressed. Hereafter, only a method for generating the estimated steering vector will be described.
<Processing of Reverberation Suppression Unit 431 (Step S431)>
The frequency-divided observation signal xf, t is input into the reverberation suppression unit 431 of the parameter estimation unit 43. The reverberation suppression unit 431 acquires and outputs a frequency-divided reverberation-suppressed signal uf, t in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed.
<Processing of Steering Vector Estimation Unit 432 (Step S432)>
The frequency-divided reverberation-suppressed signal uf, t acquired by the reverberation suppression unit 431 is input into the steering vector estimation unit 432. Using the frequency-divided reverberation-suppressed signal uf, t as input, the steering vector estimation unit 432 generates and outputs an estimated steering vector serving as an estimated vector of the steering vector. A steering vector estimation processing method of acquiring an estimated steering vector using a frequency-divided time series signal as input is well-known. The steering vector estimation unit 432 acquires and outputs the estimated steering vector νf, 0 by using the frequency-divided reverberation-suppressed signal uf, t as the input of a desired type of steering vector estimation processing. There are no limitations on the steering vector estimation processing method, and for example, the method described above in NPL 1 and NPL 2, methods described in reference documents 2 and 3, and so on may be used.
The estimated steering vector νf, 0 acquired by the steering vector estimation unit 432 is input into the convolutional beamformer estimation unit 212. The convolutional beamformer estimation unit 212 performs the processing of step S212, described in the second embodiment, using the estimated steering vector νf, 0 and the weighted space-time covariance matrix Rf acquired in step S211. All other processing is as described in the first and second embodiments.
In a fifth embodiment, a method of executing steering vector estimation by successive processing will be described. In so doing, the estimated steering vector of each time frame number t can be calculated from frequency-divided observation signals xf, t input successively online, for example.
As illustrated in the figures, a signal processing device of this embodiment includes a steering vector estimation unit 532 having an observation signal covariance matrix updating unit 532a, a main component vector updating unit 532b, a steering vector updating unit 532c, an inverse noise covariance matrix updating unit 532d, and a noise covariance matrix updating unit 532e.
<Processing of Steering Vector Estimation Unit 532 (Step S532)>
The frequency-divided observation signal xf, t, which is a frequency-divided time series signal, is input into the steering vector estimation unit 532.
<<Processing of Observation Signal Covariance Matrix Updating Unit 532a (Step S532a)>>
Using the frequency-divided observation signal xf, t as input, the observation signal covariance matrix updating unit 532a acquires and outputs a spatial covariance matrix ψx, f, t of the frequency-divided observation signal belonging to the first time interval, which is based on the frequency-divided observation signal xf, t belonging to the first time interval and a spatial covariance matrix ψx, f, t−1 of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval. For example, the observation signal covariance matrix updating unit 532a acquires and outputs the spatial covariance matrix ψx, f, t in accordance with equation (21) shown below.
ψ_{x,f,t} = β ψ_{x,f,t−1} + x_{f,t} x_{f,t}^H   (21)
Here, β is an oblivion coefficient, and is a real number belonging to a range of 0<β<1, for example. An initial matrix ψx, f, 0 of the spatial covariance matrix ψx, f, t−1 may be set as desired. For example, an M×M-dimensional unit matrix may be set as the initial matrix ψx, f, 0 of the spatial covariance matrix ψx, f, t−1.
<Processing of Inverse Noise Covariance Matrix Updating Unit 532d (Step S532d)>
The frequency-divided observation signal xf, t and mask information γf, t(n) are input into the inverse noise covariance matrix updating unit 532d. The mask information γf, t(n) is information expressing the ratio of the noise component included in the frequency-divided observation signal xf, t at a time-frequency point corresponding to the time frame number t and the frequency band number f. In other words, the mask information γf, t(n) expresses the occupancy probability of the noise component included in the frequency-divided observation signal xf, t at a time-frequency point corresponding to the time frame number t and the frequency band number f. There are no limitations on the method of estimating the mask information γf, t(n). Methods of estimating the mask information γf, t(n) are well-known, and include, for example, an estimation method using a complex Gaussian mixture model (CGMM) (reference document 4, for example), an estimation method using a neural network (reference document 5, for example), an estimation method integrating these methods (reference document 6 and reference document 7, for example), and so on.
The mask information γf, t(n) may be estimated in advance and stored in a storage device, not illustrated in the figures, or may be estimated successively. Note that the upper right superscript “(n)” of “γf, t(n)” should be written directly above the lower right subscript “f, t”, but due to notation limitations has been written to the upper right of “f, t”.
The inverse noise covariance matrix updating unit 532d acquires and outputs an inverse noise covariance matrix ψ−1n, f, t (an inverse noise covariance matrix of the frequency-divided observation signal belonging to the first time interval) on the basis of the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval), the mask information γf, t(n) (mask information belonging to the first time interval), and an inverse noise covariance matrix ψ−1n, f, t−1 (an inverse noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval). For example, the inverse noise covariance matrix updating unit 532d acquires and outputs the inverse noise covariance matrix ψ−1n, f, t in accordance with equation (22), shown below, using the Woodbury formula.
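A form consistent with the rank-one (Woodbury) update of equation (25) is, for example (a reconstruction consistent with the description):
ψ_{n,f,t}^{−1} = (1/α) [ψ_{n,f,t−1}^{−1} − γ_{f,t}^{(n)} ψ_{n,f,t−1}^{−1} x_{f,t} x_{f,t}^H ψ_{n,f,t−1}^{−1} / (α + γ_{f,t}^{(n)} x_{f,t}^H ψ_{n,f,t−1}^{−1} x_{f,t})]   (cf. (22))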
Here, α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example. An initial matrix ψ−1n, f, 0 of the inverse noise covariance matrix ψ−1n, f, t−1 may be set as desired. For example, an M×M-dimensional unit matrix may be set as the initial matrix ψ−1n, f, 0 of the inverse noise covariance matrix ψ−1n, f, t−1. Note that the upper right superscript “−1” of “ψ−1n, f, t” should be written directly above the lower right subscript “n, f, t”, but due to notation limitations has been written to the upper left of “n, f, t”.
<Processing of Main Component Vector Updating Unit 532b (Step S532b)>
The spatial covariance matrix ψx, f, t acquired by the observation signal covariance matrix updating unit 532a and the inverse noise covariance matrix ψ−1n, f, t acquired by the inverse noise covariance matrix updating unit 532d are input into the main component vector updating unit 532b. The main component vector updating unit 532b acquires and outputs a main component vector v˜f, t (a main component vector of the first time interval) relating to ψ−1n, f, tψx, f, t (the product of an inverse matrix of the noise covariance matrix of the frequency-divided observation signal and the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval) by using a power method on the basis of the inverse noise covariance matrix ψ−1n, f, t (the inverse matrix of the noise covariance matrix of the frequency-divided observation signal), the spatial covariance matrix ψx, f, t (the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval), and a main component vector v˜f, t−1 (a main component vector of the second time interval). For example, the main component vector updating unit 532b acquires and outputs a main component vector v˜f, t based on ψ−1n, f, tψx, f, tv˜f, t−1. The main component vector updating unit 532b acquires and outputs the main component vector v˜f, t in accordance with equations (23) and (24) shown below, for example. Note that the upper right superscript “˜” of “v˜f, t” should be written directly above the “v”, but due to notation limitations has been written to the upper right of “v”.
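For example (a reconstruction consistent with the description in the next paragraph):
ṽ′_{f,t} = ψ_{n,f,t}^{−1} ψ_{x,f,t} ṽ_{f,t−1}   (cf. (23))
ṽ_{f,t} = ṽ′_{f,t} / ṽ′_{f,t}^{ref}   (cf. (24))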
Here, v˜f, tref expresses an element corresponding to a predetermined microphone (a reference microphone ref) serving as a reference, among the M elements of the vector v˜′f, t acquired from equation (23). In other words, in the example of equations (23) and (24), the main component vector updating unit 532b sets a vector acquired by normalizing the respective elements of v˜′f, t = ψ−1n, f, tψx, f, tv˜f, t−1 by v˜f, tref as the main component vector v˜f, t. Note that the upper right superscript “˜” of “v˜′f, t” should be written directly above the “v”, but due to notation limitations has been written to the upper right of “v”.
<Noise Covariance Matrix Updating Unit 532e (Step S532e)>
The noise covariance matrix updating unit 532e, using the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval) and the mask information γf, t(n) (the mask information of the first time interval) as input, acquires and outputs a noise covariance matrix ψn, f, t of the frequency-divided observation signal xf, t (a noise covariance matrix of the frequency-divided observation signal belonging to the first time interval), which is based on the frequency-divided observation signal xf, t, the mask information γf, t(n), and a noise covariance matrix ψn, f, t−1 (a noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval). For example, the noise covariance matrix updating unit 532e acquires and outputs the linear sum of a product γf, t(n)xf, txf, tH of the covariance matrix xf, txf, tH of the frequency-divided observation signal xf, t and the mask information γf, t(n), and the noise covariance matrix ψn, f, t−1 as the noise covariance matrix ψn, f, t of the frequency-divided observation signal xf, t. For example, the noise covariance matrix updating unit 532e acquires and outputs the noise covariance matrix ψn, f, t in accordance with equation (25) shown below.
ψ_{n,f,t} = α ψ_{n,f,t−1} + γ_{f,t}^{(n)} x_{f,t} x_{f,t}^H   (25)
Here, α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example.
<Steering Vector Updating Unit 532c (Step S532c)>
The steering vector updating unit 532c, using the main component vector v˜f, t (the main component vector of the first time interval) acquired by the main component vector updating unit 532b and the noise covariance matrix ψn, f, t (the noise covariance matrix of the frequency-divided observation signal) acquired by the noise covariance matrix updating unit 532e as input, acquires and outputs an estimated steering vector νf, t (an estimated steering vector of the first time interval) on the basis thereof. For example, the steering vector updating unit 532c acquires and outputs an estimated steering vector νf, t based on ψn, f, tv˜f, t. The steering vector updating unit 532c acquires and outputs the estimated steering vector νf, t in accordance with equations (26) and (27) shown below, for example.
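For example (a reconstruction consistent with the description in the next paragraph):
v′_{f,t} = ψ_{n,f,t} ṽ_{f,t}   (cf. (26))
ν_{f,t} = v′_{f,t} / v′_{f,t}^{ref}   (cf. (27))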
Here, vf, tref expresses an element corresponding to the reference microphone ref, among the M elements of a vector v′f, t acquired from equation (26). In other words, in the example of equations (26) and (27), the steering vector updating unit 532c sets a vector acquired by normalizing the respective elements of v′f, t=ψn, f, tv˜f, t by vf, tref as the estimated steering vector νf, t.
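As an illustration only, the successive updates of steps S532a to S532e for one frequency band can be sketched as follows in Python with numpy; the class name, the coefficient values, the initial matrices, and the use of microphone 1 (index 0) as the reference microphone ref are assumptions.

    import numpy as np

    class OnlineSteeringVector:
        # Successive steering vector estimation for one frequency band, following
        # steps S532a to S532e (cf. equations (21) to (27)). Microphone 1 (index 0)
        # is taken as the reference microphone ref; this choice is an assumption.
        def __init__(self, M, alpha=0.95, beta=0.95):
            self.alpha, self.beta = alpha, beta
            self.psi_x = np.eye(M, dtype=complex)       # spatial covariance
            self.psi_n = np.eye(M, dtype=complex)       # noise covariance
            self.psi_n_inv = np.eye(M, dtype=complex)   # inverse noise covariance
            self.v_main = np.ones(M, dtype=complex)     # main component vector

        def update(self, x, gamma_n):
            # x: (M,) observation x_{f,t}; gamma_n: noise mask value in [0, 1].
            a, b = self.alpha, self.beta
            self.psi_x = b * self.psi_x + np.outer(x, x.conj())            # (21)
            u = self.psi_n_inv @ x                      # rank-one (Woodbury) update, cf. (22)
            self.psi_n_inv = (self.psi_n_inv
                              - gamma_n * np.outer(u, u.conj())
                              / (a + gamma_n * (x.conj() @ u))) / a
            v = self.psi_n_inv @ self.psi_x @ self.v_main   # power method step, cf. (23)
            self.v_main = v / v[0]                          # normalization, cf. (24)
            self.psi_n = a * self.psi_n + gamma_n * np.outer(x, x.conj())  # (25)
            vp = self.psi_n @ self.v_main                   # cf. (26)
            return vp / vp[0]                               # estimated steering vector, cf. (27)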
The estimated steering vector νf, t acquired by the steering vector estimation unit 532 is input into the convolutional beamformer estimation unit 212. The convolutional beamformer estimation unit 212 treats the estimated steering vector νf, t as νf, 0, and performs the processing of step S212, described in the second embodiment, using the estimated steering vector νf, t and the weighted space-time covariance matrix Rf acquired in step S211. All other processing is as described in the first and second embodiments. Further, as σf, t2 input into the matrix estimation unit 211, either the provisional power generated as illustrated in equation (17) or the estimated power σf, t2 generated as described in the third embodiment, for example, may be used.
In step S532d of the fifth embodiment, the inverse noise covariance matrix updating unit 532d adaptively updates the inverse noise covariance matrix ψ−1n, f, t at each time point corresponding to the time frame number t by using the frequency-divided observation signal xf, t and the mask information γf, t(n). However, the inverse noise covariance matrix updating unit 532d may acquire and output the inverse noise covariance matrix ψ−1n, f, t by using a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant, without using the mask information γf, t(n). For example, the inverse noise covariance matrix updating unit 532d may output, as the inverse noise covariance matrix ψ−1n, f, t, an inverse matrix of the temporal average of xf, txf, tH with respect to a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant. The inverse noise covariance matrix ψ−1n, f, t acquired in this manner is used continuously in the frames having the respective time frame numbers t.
In step S532e of the fifth embodiment, the noise covariance matrix updating unit 532e may acquire and output the noise covariance matrix ψn, f, t of the frequency-divided observation signal xf, t using a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant, without using the mask information γf, t(n). For example, the noise covariance matrix updating unit 532e may output, as the noise covariance matrix ψn, f, t, the temporal average of xf, txf, tH with respect to a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant. The noise covariance matrix ψn, f, t acquired in this manner is used continuously in the frames having the respective time frame numbers t.
In the fifth embodiment and the modified example thereof, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto. A frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.
In the fifth embodiment, the steering vector estimation unit 532 acquires and outputs the estimated steering vector νf, t by successive processing using the frequency-divided observation signal xf, t as input. As noted in the fourth embodiment, however, by estimating the steering vector after suppressing reverberation from the frequency-divided observation signal xf, t, the estimation precision is improved. In the sixth embodiment, an example in which the steering vector estimation unit acquires and outputs the estimated steering vector νf, t by successive processing, as described in the fifth embodiment, after reverberation has been suppressed from the frequency-divided observation signal xf, t will be described.
As illustrated in the figures, a signal processing device of this embodiment includes the reverberation suppression unit 431 described in the fourth embodiment and a steering vector estimation unit 632.
<Processing of Reverberation Suppression Unit 431 (Step S431)>
As described in the fourth embodiment, the reverberation suppression unit 431 uses the frequency-divided observation signal xf, t as input to acquire and output the frequency-divided reverberation-suppressed signal uf, t, in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed.
<Processing of Steering Vector Estimation Unit 632 (Step S632)>
The frequency-divided reverberation-suppressed signal uf, t is input into the steering vector estimation unit 632. The processing of the steering vector estimation unit 632 is identical to the processing of the steering vector estimation unit 532 of the fifth embodiment except that the frequency-divided reverberation-suppressed signal uf, t, rather than the frequency-divided observation signal xf, t, is input into the steering vector estimation unit 632, and the steering vector estimation unit 632 uses the frequency-divided reverberation-suppressed signal uf, t instead of the frequency-divided observation signal xf, t. In other words, in the processing performed by the steering vector estimation unit 632, the frequency-divided observation signal xf, t used in the processing of the steering vector estimation unit 532 is replaced by the frequency-divided reverberation-suppressed signal uf, t. All other processing is identical to the fifth embodiment and the modified example thereof. More specifically, the frequency-divided reverberation-suppressed signal uf, t, which is a frequency-divided time series signal, is input into the steering vector estimation unit 632. The observation signal covariance matrix updating unit 532a acquires and outputs the spatial covariance matrix ψx, f, t of the frequency-divided reverberation-suppressed signal uf, t belonging to the first time interval, which is based on the frequency-divided reverberation-suppressed signal uf, t belonging to the first time interval and the spatial covariance matrix ψx, f, t−1 of a frequency-divided reverberation-suppressed signal uf, t−1 belonging to the second time interval that is further in the past than the first time interval. The main component vector updating unit 532b acquires and outputs the main component vector v˜f, t of the first time interval with respect to the product ψ−1n, f, tψx, f, t of the inverse matrix ψ−1n, f, t of the noise covariance matrix of the frequency-divided reverberation-suppressed signal and the spatial covariance matrix ψx, f, t of the frequency-divided reverberation-suppressed signal belonging to the first time interval, on the basis of the inverse matrix ψ−1n, f, t of the noise covariance matrix of the frequency-divided reverberation-suppressed signal uf, t, the spatial covariance matrix ψx, f, t of the frequency-divided reverberation-suppressed signal belonging to the first time interval, and the main component vector v˜f, t−1 of the second time interval. The steering vector updating unit 532c acquires and outputs the estimated steering vector νf, t of the first time interval on the basis of the noise covariance matrix of the frequency-divided reverberation-suppressed signal uf, t and the main component vector v˜f, t of the first time interval.
In a seventh embodiment, a method of estimating the convolutional beamformer by successive processing will be described. In so doing, the convolutional beamformer of each time frame number t can be estimated and the target signal yf, t can be acquired from frequency-divided observation signals xf, t input successively online, for example.
As illustrated in the figures, a signal processing device 7 of this embodiment includes the parameter estimation unit 53 described in the fifth embodiment, a matrix estimation unit 711, a beamformer estimation unit 712, and a suppression unit 72.
<Processing of Parameter Estimation Unit 53 (Step S53)>
The frequency-divided observation signal xf, t is input into the parameter estimation unit 53. The steering vector estimation unit 532 of the parameter estimation unit 53 acquires and outputs an estimated steering vector νf, t at each time frame number t by successive processing, as described in the fifth embodiment. Here, the estimated steering vector νf, t is expressed as follows.
ν_{f,t} = [ν_{f,t}^{(1)}, ν_{f,t}^{(2)}, ..., ν_{f,t}^{(M)}]^T
Here, νf, t(m) represents an element corresponding to the microphone having the microphone number m, among the M elements of the estimated steering vector νf, t. The estimated steering vector νf, t acquired by the steering vector estimation unit 532 is input into the convolutional beamformer estimation unit 712.
<Processing of Matrix Estimation Unit 711 (Step S711)>
The frequency-divided observation signal xf, t and the power or estimated power σf, t2 of the target signal are input into the matrix estimation unit 711. Using the frequency-divided observation signal xf, t, the power or estimated power σf, t2 of the target signal, and an inverse matrix R̆−1f, t−1 of a space-time covariance matrix (an inverse matrix of the space-time covariance matrix of the second time interval that is further in the past than the first time interval), the matrix estimation unit 711 estimates and outputs an inverse matrix R̆−1f, t of a space-time covariance matrix (an inverse matrix of the space-time covariance matrix of the first time interval). An example of the space-time covariance matrix is as follows.
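For example, an exponentially weighted space-time covariance matrix of the following form, consistent with the recursive updates below, may be considered (this specific form is an assumption):
R̆_{f,t} = Σ_{j=1}^{t} α^{t−j} x̄_{f,j} x̄_{f,j}^H / σ_{f,j}^2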
In this case, the matrix estimation unit 711 generates and outputs the inverse matrix R̆−1f, t of the space-time covariance matrix in accordance with equations (28) and (29) shown below, for example.
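Forms consistent with this recursive (Sherman-Morrison-type) update are, for example:
k_{f,t} = R̆_{f,t−1}^{−1} x̄_{f,t} / (α σ_{f,t}^2 + x̄_{f,t}^H R̆_{f,t−1}^{−1} x̄_{f,t})   (cf. (28))
R̆_{f,t}^{−1} = (1/α) (R̆_{f,t−1}^{−1} − k_{f,t} x̄_{f,t}^H R̆_{f,t−1}^{−1})   (cf. (29))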
Here, kf, t in equation (28) is an (L+1)M-dimensional vector, and the inverse matrix of equation (29) is an (L+1)M×(L+1)M matrix. α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example. Further, an initial matrix of the inverse matrix R̆−1f, t−1 of the space-time covariance matrix may be set as desired, and an example of the initial matrix is the (L+1)M-dimensional unit matrix shown below.
R̆_{f,0}^{−1} = I_{(L+1)M}
<Processing of Beamformer Estimation Unit 712 (Step S712)>
The inverse matrix R̆−1f, t of the space-time covariance matrix (the inverse matrix of the space-time covariance matrix of the first time interval) acquired by the matrix estimation unit 711, and the estimated steering vector νf, t acquired by the parameter estimation unit 53 are input into the beamformer estimation unit 712. The convolutional beamformer estimation unit 712 acquires and outputs the convolutional beamformer w−f, t (the convolutional beamformer of the first time interval) on the basis thereof. For example, the convolutional beamformer estimation unit 712 acquires and outputs the convolutional beamformer w−f, t in accordance with equation (30), shown below.
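For example, analogously to the solution of equation (15) above (the exact form being an assumption here):
w̄_{f,t} = R̆_{f,t}^{−1} v̄_{f,t} / (v̄_{f,t}^H R̆_{f,t}^{−1} v̄_{f,t})   (cf. (30))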
where
v̄_{f,t} = [ṽ_{f,t}^{(1)T}, ṽ_{f,t}^{(2)T}, ..., ṽ_{f,t}^{(M)T}]^T
and
ṽ_{f,t}^{(m)} = [g_f ν_{f,t}^{(m)}, 0, ..., 0]^T
is an (L+1)-dimensional vector. g_f is a scalar constant other than 0.
<Processing of Suppression Unit 72 (Step S72)>
The frequency-divided observation signal xf, t and the convolutional beamformer w−f, t acquired by the beamformer estimation unit 712 are input into the suppression unit 72. The suppression unit 72 acquires and outputs the target signal yf, t by applying the convolutional beamformer w−f, t to the frequency-divided observation signal xf, t in each time frame number t and frequency band number f. For example, the suppression unit 72 acquires and outputs the target signal yf, t in accordance with equation (31) shown below.
y_{f,t} = w̄_{f,t}^H x̄_{f,t}   (31)
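As an illustration only, the successive processing of steps S711, S712, and S72 for one frequency band can be sketched as follows in Python with numpy, where xbar and vbar denote the stacked vectors x̄_{f,t} and v̄_{f,t} described above; the class name, the oblivion coefficient value, and the initialization are assumptions.

    import numpy as np

    class OnlineWPD:
        # Successive convolutional beamformer estimation for one frequency band,
        # following steps S711, S712, and S72 (cf. equations (28) to (31)).
        # K = (L+1)M; the initial matrix is the unit matrix, as noted above.
        def __init__(self, K, alpha=0.99):
            self.alpha = alpha
            self.Rinv = np.eye(K, dtype=complex)

        def update(self, xbar, vbar, sigma2):
            # xbar: (K,) stacked observation; vbar: (K,) stacked steering vector;
            # sigma2: power or estimated power of the target signal at this frame.
            a = self.alpha
            u = self.Rinv @ xbar
            k = u / (a * sigma2 + xbar.conj() @ u)            # gain vector, cf. (28)
            self.Rinv = (self.Rinv - np.outer(k, u.conj())) / a   # cf. (29)
            Rinv_v = self.Rinv @ vbar
            w = Rinv_v / (vbar.conj() @ Rinv_v)               # beamformer, cf. (30)
            return w.conj() @ xbar                            # target signal, equation (31)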
The parameter estimation unit 53 of the signal processing device 7 according to the seventh embodiment may be replaced by the parameter estimation unit 63. In other words, in the seventh embodiment, the parameter estimation unit 63, rather than the parameter estimation unit 53, may acquire and output the estimated steering vector νf, t by successive processing, as described in the sixth embodiment, using the frequency-divided observation signal xf, t as input.
In the seventh embodiment and the modified example thereof, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto. A frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.
In the second embodiment, an example was described in which the analytical solution of w−f for minimizing the cost function C3 (w−f) under a constraint condition in which wf, 0Hνf, 0 is a constant is given by equation (15), and the convolutional beamformer w−f is acquired in accordance with equation (15). In an eighth embodiment, an example in which the convolutional beamformer is acquired using a different optimal solution will be described.
When an M×(M−1) block matrix corresponding to the orthogonal complement of the estimated steering vector νf, 0 is set as Bf, BfHνf, 0=0 is satisfied. An infinite number of block matrices Bf of this type exist. Equation (32) below shows an example of the block matrix Bf.
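One standard construction, assuming the element corresponding to the reference microphone ref is ordered first, is for example:
B_f = [ −ṽ_{f,0}^H / (ν_{f,0}^{ref})^* ; I_{M−1} ]   (cf. (32))
where the first row of B_f is the 1×(M−1) row vector −ṽ_{f,0}^H/(ν_{f,0}^{ref})^* and the remaining M−1 rows form I_{M−1}, so that B_f^H ν_{f,0} = 0 holds.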
Here, ν˜f, 0 is an M−1-dimensional column vector constituted by elements of the steering vector νf, 0 or the estimated steering vector νf, 0 that correspond to microphones other than the reference microphone ref, νf, 0ref is the element of νf, 0 that corresponds to the reference microphone ref, and IM−1 is an (M−1)×(M−1)-dimensional unit matrix.
gf is set as a scalar constant other than 0, af, 0 is set as an (M−1)-dimensional modified instantaneous beamformer, and the instantaneous beamformer wf, 0 is expressed as the sum of a constant multiple gfνf, 0 of the steering vector νf, 0 or a constant multiple gfνf, 0 of the estimated steering vector νf, 0 and a product Bfaf, 0 of the block matrix Bf corresponding to the orthogonal complement of the steering vector νf, 0 or the estimated steering vector νf, 0 and the modified instantaneous beamformer af, 0. In other words, the instantaneous beamformer wf, 0 is expressed as
w_{f,0} = g_f ν_{f,0} + B_f a_{f,0}   (33)
Accordingly, BfHνf, 0=0, and therefore the constraint condition that “wf, 0Hνf, 0 is a constant” is expressed as follows.
w_{f,0}^H ν_{f,0} = (g_f ν_{f,0} + B_f a_{f,0})^H ν_{f,0} = g_f^H ∥ν_{f,0}∥^2 = constant
Hence, even under the definition given in equation (33), the constraint condition that “wf, 0Hνf, 0 is a constant” is satisfied in relation to any modified instantaneous beamformer af, 0. It is therefore evident that the instantaneous beamformer wf, 0 may be defined as illustrated in equation (33). In this embodiment, the convolutional beamformer is estimated using the optimal solution of the convolutional beamformer acquired when the instantaneous beamformer wf, 0 is defined as illustrated in equation (33). This will be described in detail below.
As illustrated in the figures, a signal processing device of this embodiment includes a parameter estimation unit 83, an initial beamformer application unit 813, a block unit 814, a matrix estimation unit 811, a convolutional beamformer estimation unit 812, and a suppression unit 82.
<Processing of Parameter Estimation Unit 83 (Step S83)>
The parameter estimation unit 83 uses the frequency-divided observation signal xf, t as input to acquire and output the estimated steering vector νf, 0 and the power or estimated power σf, t2 of the target signal, in the same manner as the parameter estimation units of the above embodiments.
<Processing of Initial Beamformer Application Unit 813 (Step S813)>
The estimated steering vector νf, 0 and the frequency-divided observation signal xf, t are input into the initial beamformer application unit 813. The initial beamformer application unit 813 acquires and outputs an initial beamformer output zf, t (an initial beamformer output of the first time interval) based on the estimated steering vector νf, 0 and the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval). For example, the initial beamformer application unit 813 acquires and outputs an initial beamformer output zf, t based on the constant multiple of the estimated steering vector νf, 0 and the frequency-divided observation signal xf, t. The initial beamformer application unit 813 acquires and outputs the initial beamformer output zf, t in accordance with equation (34) shown below, for example.
z_{f,t} = (g_f ν_{f,0})^H x_{f,t}   (34)
The output initial beamformer output zf, t is transmitted to the convolutional beamformer estimation unit 812 and the suppression unit 82.
<Processing of Block Unit 814 (Step S814)>
The estimated steering vector νf, 0 and the frequency-divided observation signal xf, t are input into the block unit 814. The block unit 814 acquires and outputs a vector x=f, t based on the frequency-divided observation signal xf, t and the block matrix Bf corresponding to the orthogonal complement of the estimated steering vector νf, 0. As noted above, BfHνf, 0=0 is satisfied. Equation (32) shows an example of the block matrix Bf, but the present invention is not limited to this example, and any block matrix Bf in which BfHνf, 0=0 is satisfied may be used. The block unit 814 acquires and outputs the vector x=f, t in accordance with equations (35) and (36) shown below, for example.
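Forms consistent with this description and with equation (36A) below are, for example, the following, where x̆_{f,t} (a symbol introduced here) stacks the L past frames:
x̆_{f,t} = [x_{f,t−d}^T, x_{f,t−d−1}^T, ..., x_{f,t−d−L+1}^T]^T   (cf. (35))
x̿_{f,t} = [(B_f^H x_{f,t})^T, x̆_{f,t}^T]^T   (cf. (36))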
Note that the upper right superscript “=” of “x=f, t” should be written directly above the “x”, as shown in equation (36), but due to notation limitations may also be written to the upper right of “x”. The output vector x=f, t is transmitted to the matrix estimation unit 811, the convolutional beamformer estimation unit 812, and the suppression unit 82. Further, when L=0, the right side of equation (35) becomes a vector in which the number of elements is 0 (an empty vector), whereby equation (36) is as shown below in equation (36A).
x=_{f,t} = B_f^H x_{f,t}    (36A)
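Since equations (35) and (36) are not reproduced in this text, the following sketch rests on an assumption: x=f, t concatenates the blocked current frame BfHxf, t with the L delayed frames xf, t−d, . . . , xf, t−d−L+1 of all M channels, which yields the (LM+M−1)-dimensional vector used throughout (and reduces to equation (36A) when L=0):

```python
import numpy as np

def blocked_stacked_vector(X, B, t, d, L):
    """Step S814 sketch under the stacking assumption stated above.
    X: (M, T) observations of one frequency bin; B: (M, M-1) block matrix.
    Frames before t = 0 are taken as zero for simplicity."""
    M = X.shape[0]
    past = [X[:, t - d - i] if t - d - i >= 0 else np.zeros(M, complex)
            for i in range(L)]
    return np.concatenate([B.conj().T @ X[:, t]] + past)  # LM + M - 1 elements
```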
<Processing of Matrix Estimation Unit 811 (Step S811)>
The vector x=f, t acquired by the block unit 814 and the power or estimated power σf, t2 of the target signal are input into the matrix estimation unit 811. Either the provisional power generated as illustrated in equation (17) or the estimated power σf, t2 generated as described in the third embodiment, for example, may be used as σf, t2. Using the vector x=f, t and the power or estimated power σf, t2 of the target signal, the matrix estimation unit 811 acquires and outputs a weighted modified space-time covariance matrix R=f, which is based on the estimated steering vector νf, 0, the frequency-divided observation signal xf, t, and the power or estimated power σf, t2 of the target signal and which increases the probability expressing the speech-likeness of the estimation signals when the instantaneous beamformer wf, 0 is expressed as illustrated in equation (33). For example, the matrix estimation unit 811 acquires and outputs the weighted modified space-time covariance matrix R=f based on the vector x=f, t and the power or estimated power σf, t2 of the target signal, in accordance with equation (37) below.
The output weighted modified space-time covariance matrix R=f is transmitted to the convolutional beamformer estimation unit 812.
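Equation (37) is likewise not reproduced here; the sketch below assumes the standard power-weighted form R=f = Σt x=f, t x=f, tH/σf, t2 used by weighted-power-minimization methods:

```python
import numpy as np

def weighted_covariance(Xbb, sigma2, eps=1e-10):
    """Step S811 sketch under the assumed form of equation (37).
    Xbb: (LM+M-1, T) matrix whose columns are x=_{f,t}; sigma2: (T,)
    target-signal powers. eps guards against vanishing power."""
    w = 1.0 / np.maximum(sigma2, eps)
    return (Xbb * w) @ Xbb.conj().T  # sum_t x= x=^H / sigma^2
```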
<Processing of Convolutional Beamformer Estimation Unit 812 (Step S812)>
The initial beamformer output zf, t acquired by the initial beamformer application unit 813, the vector x=f, t acquired by the block unit 814, and the weighted modified space-time covariance matrix R=f acquired by the matrix estimation unit 811 are input into the convolutional beamformer estimation unit 812. Using these, the convolutional beamformer estimation unit 812 acquires and outputs a convolutional beamformer w=f that is based on the estimated steering vector νf, 0, the weighted modified space-time covariance matrix R=f, and the frequency-divided observation signal xf, t. For example, the convolutional beamformer estimation unit 812 acquires and outputs the convolutional beamformer w=f in accordance with equation (38) shown below.
w=_f = −(R=_f)^{−1} Σ_t x=_{f,t} z_{f,t}^H / σ_{f,t}^2    (38)
w=_f = [a_{f,0}^T, w_f^{(1)T}, . . . , w_f^{(M)T}]^T    (38A)
w_f^{(m)} = [w_{f,d}^{(m)}, w_{f,d+1}^{(m)}, . . . , w_{f,d+L−1}^{(m)}]^T    (38B)
The output convolutional beamformer w=f is transmitted to the suppression unit 82.
Note that when L=0, the right side of equation (38B) becomes a vector with zero elements (an empty vector), whereby equation (38A) reduces to the following.
w=_f = a_{f,0}
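The closed form in equation (38) is what results from minimizing the weighted power Σt|yf, t|2/σf, t2 of the output in equation (39) with respect to w=f; a sketch under the same assumptions as above:

```python
import numpy as np

def estimate_convolutional_beamformer(Xbb, z, sigma2, eps=1e-10):
    """Step S812 sketch (equation (38)): the w= minimizing
    sum_t |z_t + w^H x=_t|^2 / sigma_t^2. Setting the gradient to zero
    gives R= w = -sum_t x=_t z_t^* / sigma_t^2."""
    w = 1.0 / np.maximum(sigma2, eps)
    R = (Xbb * w) @ Xbb.conj().T      # weighted covariance, cf. (37)
    p = (Xbb * w) @ z.conj()          # weighted cross term
    return -np.linalg.solve(R, p)     # laid out as in (38A)/(38B)
```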
<Processing of Suppression Unit 82 (Step S82)>
The vector x=f, t output from the block unit 814, the initial beamformer output zf, t output from the initial beamformer application unit 813, and the convolutional beamformer w=f output from the convolutional beamformer estimation unit 812 are input into the suppression unit 82. The suppression unit 82 acquires and outputs the target signal yf, t by applying the initial beamformer output zf, t and the convolutional beamformer w=f to the vector x=f, t. This processing is equivalent to processing for acquiring and outputting the target signal yf, t by applying the convolutional beamformer w=f to the frequency-divided observation signal xf, t. For example, the suppression unit 82 acquires and outputs the target signal yf, t in accordance with equation (39) shown below.
y_{f,t} = z_{f,t} + w=_f^H x=_{f,t}    (39)
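Step S82 is then a single inner product per frame:

```python
import numpy as np

def suppress(z, Xbb, wbb):
    """Step S82 (equation (39)): y_{f,t} = z_{f,t} + w=_f^H x=_{f,t},
    evaluated for all T frames at once; returns y of shape (T,)."""
    return z + wbb.conj() @ Xbb
```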
A known steering vector νf, 0 acquired on the basis of actual measurement or the like may be input into the initial beamformer application unit 813 and the block unit 814 instead of the estimated steering vector νf, 0 acquired by the parameter estimation unit 83. In this case, the initial beamformer application unit 813 and the block unit 814 perform steps S813 and S814, described above, using the steering vector νf, 0 instead of the estimated steering vector νf, 0.
In a ninth embodiment, a method for executing convolutional beamformer estimation based on the eighth embodiment by successive processing will be described. The following processing is executed on each time frame number t in ascending order from t=1.
As illustrated in
<Processing of Parameter Estimation Unit 93 (Step S93)>
The parameter estimation unit 93 (
<Processing of Initial Beamformer Application Unit 813 (Step S813)>
The estimated steering vector νf, t (the estimated steering vector of the first time interval) and the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval) are input into the initial beamformer application unit 813, and the initial beamformer application unit 813 acquires and outputs the initial beamformer output zf, t (the initial beamformer output of the first time interval) as described in the eighth embodiment, using νf, t instead of νf, 0. The initial beamformer output zf, t thus acquired is transmitted to the suppression unit 92.
<Processing of Block Unit 814 (Step S814)>
The estimated steering vector νf, t and the frequency-divided observation signal xf, t are input into the block unit 814, and the block unit 814 acquires and outputs the vector x=f, t as described in the eighth embodiment by using νf, t instead of νf, 0. The output vector x=f, t is transmitted to the adaptive gain estimation unit 911, the matrix estimation unit 915, and the suppression unit 92.
<Processing of Suppression Unit 92 (Step S92)>
The initial beamformer output zf, t output from the initial beamformer application unit 813 and the vector x=f, t output from the block unit 814 are input into the suppression unit 92. Using these, the suppression unit 92 acquires and outputs the target signal yf, t, which is based on the initial beamformer output zf, t (the initial beamformer output of the first time interval), the estimated steering vector νf, t (the estimated steering vector of the first time interval), the frequency-divided observation signal xf, t, and a convolutional beamformer w=f, t−1 (the convolutional beamformer of the second time interval, which is further in the past than the first time interval). For example, the suppression unit 92 acquires and outputs the target signal yf, t in accordance with equation (40) below.
y_{f,t} = z_{f,t} + w=_{f,t−1}^H x=_{f,t}    (40)
Here, the initial vector w=f, 0 of the convolutional beamformer w=f, t−1 may be any (LM+M−1)-dimensional vector. An example of the initial vector w=f, 0 is an (LM+M−1)-dimensional vector in which all elements are 0.
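A minimal sketch of this initialization and of the per-frame application of equation (40) (illustrative names; L and M are example values):

```python
import numpy as np

L, M = 8, 4                        # illustrative filter length and mic count
D = L * M + M - 1                  # dimension stated in the text
wbb = np.zeros(D, dtype=complex)   # initial vector w=_{f,0}: all zeros
# Step S92 then applies the previous frame's beamformer at each frame t:
# y_t = z_t + wbb.conj() @ xbb_t   # equation (40), computed before updating wbb
```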
<Processing of Adaptive Gain Estimation Unit 911 (Step S911)>
The vector x=f, t output from the block unit 814, an inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix output from the matrix estimation unit 915, and the power or estimated power σf, t2 of the target signal are input into the adaptive gain estimation unit 911. As σf, t2 input into the adaptive gain estimation unit 911, either the provisional power generated as illustrated in equation (17) or the estimated power σf, t2 generated as described in the third embodiment, for example, may be used. Note that the “˜” of “R˜−1f, t−1” should be written directly above the “R”, but due to notation limitations may also be written to the upper right of “R”. Using these, the adaptive gain estimation unit 911 acquires and outputs an adaptive gain kf, t (the adaptive gain of the first time interval) that is based on the inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the second time interval), the estimated steering vector νf, t (the estimated steering vector of the first time interval), the frequency-divided observation signal xf, t, and the power or estimated power σf, t2 of the target signal. For example, the adaptive gain estimation unit 911 acquires and outputs the adaptive gain kf, t as an (LM+M−1)-dimensional vector in accordance with equation (41) shown below.
Here, α is a forgetting factor, for example a real number in the range 0<α<1. Further, the initial matrix of the inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix may be any (LM+M−1)×(LM+M−1)-dimensional matrix; an example is the (LM+M−1)×(LM+M−1) identity matrix.
Note that R˜f, t itself is not calculated. The output adaptive gain kf, t is transmitted to the matrix estimation unit 915 and the convolutional beamformer estimation unit 912.
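Equation (41) is not reproduced in this text; the sketch below assumes the standard recursive-least-squares gain consistent with the forgetting factor α and the weighting by σf, t2:

```python
import numpy as np

def adaptive_gain(P, xbb, sigma2, alpha):
    """Step S911 sketch under the assumed RLS form of equation (41):
    k = P x= / (alpha * sigma^2 + x=^H P x=), where P is the previous
    inverse matrix R~^{-1}_{f,t-1}."""
    Px = P @ xbb
    return Px / (alpha * sigma2 + xbb.conj() @ Px)
```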
<Processing of Matrix Estimation Unit 915 (Step S915)>
The vector x=f, t output from the block unit 814 and the adaptive gain kf, t output from the adaptive gain estimation unit 911 are input into the matrix estimation unit 915. Using these, the matrix estimation unit 915 acquires and outputs an inverse matrix R˜−1f, t of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the first time interval) that is based on the adaptive gain kf, t (the adaptive gain of the first time interval), the estimated steering vector νf, t (the estimated steering vector of the first time interval), the frequency-divided observation signal xf, t, and the inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the second time interval). For example, the matrix estimation unit 915 acquires and outputs the inverse matrix R˜−1f, t of the weighted modified space-time covariance matrix in accordance with equation (42) below.
The output inverse matrix R˜−1f, t of the weighted modified space-time covariance matrix is transmitted to the adaptive gain estimation unit 911.
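Equation (42) is likewise not reproduced; the sketch assumes the rank-one update obtained from the matrix inversion lemma, which is exactly why R˜f, t itself never needs to be formed:

```python
import numpy as np

def update_inverse_covariance(P, k, xbb, alpha):
    """Step S915 sketch under the assumed RLS form of equation (42):
    P_t = (P_{t-1} - k x=^H P_{t-1}) / alpha."""
    return (P - np.outer(k, xbb.conj() @ P)) / alpha
```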
<Processing of Convolutional Beamformer Estimation Unit 912 (Step S912)>
The target signal yf, t output from the suppression unit 92 and the adaptive gain kf, t output from the adaptive gain estimation unit 911 are input into the convolutional beamformer estimation unit 912. Using these, the convolutional beamformer estimation unit 912 acquires and outputs the convolutional beamformer w=f, t (the convolutional beamformer of the first time interval), which is based on the adaptive gain kf, t (the adaptive gain of the first time interval), the target signal yf, t (the target signal of the first time interval), and the convolutional beamformer w=f, t−1 (the convolutional beamformer of the second time interval). For example, the convolutional beamformer estimation unit 912 acquires and outputs the convolutional beamformer w=f, t in accordance with equation (43) shown below.
w=_{f,t} = w=_{f,t−1} − k_{f,t} y_{f,t}^H    (43)
The output convolutional beamformer w=f, t is transmitted to the suppression unit 92.
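Putting steps S92, S911, S915, and S912 together, one frame of the ninth embodiment's sequential processing for a single frequency bin reduces to the following sketch (same assumed RLS forms as above):

```python
import numpy as np

def online_step(wbb, P, z_t, xbb_t, sigma2_t, alpha):
    """One frame for one frequency bin: suppression (40), adaptive gain
    (41, assumed form), inverse-covariance update (42, assumed form),
    beamformer update (43). Returns the target signal and the new state."""
    y_t = z_t + wbb.conj() @ xbb_t                    # (40)
    Px = P @ xbb_t
    k = Px / (alpha * sigma2_t + xbb_t.conj() @ Px)   # (41)
    P = (P - np.outer(k, xbb_t.conj() @ P)) / alpha   # (42)
    wbb = wbb - k * np.conj(y_t)                      # (43); y^H = y* for scalar y
    return y_t, wbb, P
```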
In the ninth embodiment and the modified example thereof, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto. A frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.
A known steering vector νf, t may be input into the initial beamformer application unit 813 and the block unit 814 instead of the estimated steering vector νf, t acquired by the parameter estimation unit 93. In this case, the initial beamformer application unit 813 and the block unit 814 perform steps S813 and S814, described above, using the steering vector νf, t instead of the estimated steering vector νf, t.
The frequency-divided observation signals xf, t input into the signal processing devices 1 to 9 described above may be any signals that correspond respectively to a plurality of frequency bands of an observation signal acquired by picking up an acoustic signal emitted from a sound source. For example, as illustrated in
The target signals yf, t output from the signal processing devices 1 to 9 may either be used in other processing (speech recognition processing or the like) without being transformed into time-domain signals y(i) or be transformed into a time-domain signal y(i). For example, as illustrated in
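When a time-domain signal y(i) is required, the target signals can be resynthesized by an inverse short-time Fourier transform. A minimal sketch with SciPy (an assumption; the sampling rate, 512-sample frame, and 128-sample shift are illustrative values that must simply match the forward transform):

```python
import numpy as np
from scipy.signal import stft, istft  # assumption: SciPy is available

fs, nperseg, hop = 16000, 512, 128                 # illustrative values only
x = np.random.default_rng(0).standard_normal(fs)   # 1 s dummy observation
_, _, Y = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
# ... Y would be replaced per bin by the target signals y_{f,t} here ...
_, y_time = istft(Y, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
```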
Test results relating to the methods of the respective embodiments will be illustrated below.
Next, noise/reverberation suppression results acquired by the first embodiment and conventional methods 1 to 3 will be illustrated.
In this test, a data set of the “REVERB Challenge” was used as the observation signal. The data set contains acoustic data (Real Data) acquired by picking up English-language speech read aloud in a room with stationary noise and reverberation, using microphones disposed at positions 0.5 to 2.5 m away from the speaker, and acoustic data (Sim Data) acquired by simulating this environment. The number of microphones was M=8. The frequency-divided observation signals were determined by the short-time Fourier transform, with the frame length set at 32 milliseconds, the frame shift set at 4, and the prediction delay set at d=4. Using these data, the speech quality and speech recognition precision of signals subjected to noise/reverberation suppression in accordance with the present invention and conventional methods 1 to 3 were evaluated.
Note that the present invention is not limited to the embodiments described above. For example, in the above embodiments, d is set at the same value in all of the frequency bands, but d may be set for each frequency band. In other words, a positive integer df may be used instead of d. Similarly, in the above embodiments, L is set at the same value in all of the frequency bands, but L may be set for each frequency band. In other words, a positive integer Lf may be used instead of L.
In the first to third embodiments, examples were described in which batch processing is performed by determining the cost functions and so on (equations (2), (7), (12), (13), (14), and (18)) using a time frame corresponding to 1≤t≤N as a processing unit, but the present invention is not limited thereto. For example, rather than using a time frame corresponding to 1≤t≤N as a processing unit, the processing may be executed using a partial time frame thereof as a processing unit. Alternatively, the time frame used as the processing unit may be updated in real time, and the processing may be executed by determining the cost functions and so on in processing units of each time point. For example, when the number of the current time frame is expressed as tc, a time frame corresponding to 1≤t≤tc may be set as the processing unit, or a time frame corresponding to tc−η≤t≤tc may be set as the processing unit in relation to a positive integer constant η.
The various types of processing described above do not have to be executed in time series, as described above, and may be executed in parallel or individually either in accordance with the processing power of the device that executes the processing or in accordance with necessity. Furthermore, the processing may be modified appropriately within a scope that does not depart from the spirit of the present invention.
The devices described above are configured by, for example, having a general-purpose or dedicated computer including a processor (a hardware processor) such as a CPU (central processing unit) and a memory such as a RAM (random-access memory)/ROM (read-only memory) execute a predetermined program. The computer may include one processor and one memory, or pluralities of processors and memories. The program may be either installed in the computer or recorded in the ROM or the like in advance. Further, instead of electronic circuitry, such as a CPU, that realizes a functional configuration by reading a program, some or all of the processing units may be configured using electronic circuitry that realizes processing functions without the use of a program. Electronic circuitry constituting a single device may include a plurality of CPUs.
When the configurations described above are realized by a computer, the processing content of the functions to be included in the devices is described by the program. The computer realizes the processing functions described above by executing the program. The program describing the processing content may be recorded in advance on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of this type of recording medium include a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, and so on.
The program is distributed by, for example, selling, transferring, renting, etc. a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Further, the program may be stored in a storage device of a server computer and distributed by being transferred from the server computer to another computer over a network.
For example, the computer that executes the program first stores the program recorded on the portable recording medium or transferred from the server computer temporarily in a storage device included therein. During execution of the processing, the computer reads the program stored in the storage device included therein and executes processing corresponding to the read program. As a different form of execution of the program, the computer may read the program directly from the portable recording medium and execute processing corresponding to the program. Alternatively, every time the program is transferred to the computer from the server computer, the computer may execute processing corresponding to the received program. Instead of transferring the program from the server computer to the computer, the processing described above may be executed by a so-called ASP (Application Service Provider) type service, in which processing functions are realized only by issuing commands to execute the processing and acquiring results.
Instead of realizing the processing functions of the present device by executing a predetermined program on a computer, at least some of the processing functions may be realized by hardware.
The present invention can be used in various applications in which it is necessary to suppress noise and reverberation from an acoustic signal. For example, the present invention can be used in speech recognition, call systems, conference call systems, and so on.
Number | Date | Country | Kind
---|---|---|---
2018-234075 | Dec 2018 | JP | national
PCT/JP2019/016587 | Apr 2019 | JP | national
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2019/029921 | 7/31/2019 | WO | 00