The present invention relates to a noise suppression technology, and more particularly to a noise suppression system, a noise suppression method, and a program suitable for a system which extracts a desired signal by suppressing a noise component included in an input signal, usage thereof, and the like.
Development on a technology for acquiring a desired signal from an input signal in which the desired signal and noise are mixed has progressed. For instance, PTL 1 discloses a configuration in which temporary estimated speech is obtained by suppressing noise included in an input speech signal, and the temporary estimated speech is corrected with use of a standard pattern of speech, making it possible to remove a noise component with high accuracy without losing speech information. As a correction value of the temporary estimated speech, the technology of PTL 1 uses an expectation of the temporary estimated speech, which is obtained by an expectation calculation processing using the probabilities at which the probability distributions constituting the standard pattern output the temporary estimated speech, and using the means of the probability distributions constituting the standard pattern.
Note that PTL 2 and NPL 1 are described in example embodiments to be described later. PTL 2 discloses a method for removing noise. The noise removing method includes obtaining a first signal-to-noise ratio for each frequency first, obtaining a weight for each frequency based on the first signal-to-noise ratio, and obtaining estimated noise for each frequency based on a weighted frequency domain signal, which is obtained by applying a weight for each frequency to a frequency domain signal. The noise removing method further includes obtaining a second signal-to-noise ratio based on a frequency domain signal and estimated noise for each frequency, determining a suppression coefficient based on the second signal-to-noise ratio, and applying the suppression coefficient as a weight to the frequency domain signal.
In PTL 1, loss of speech information is prevented by correcting temporary estimated speech with use of a standard pattern of speech. However, the accuracy of noise suppression may decrease due to fluctuation of the magnitude of noise, or the like.
The present invention is made in view of the above problem, and an object of the present invention is to provide a technology for avoiding a decrease in the accuracy of noise suppression even when the magnitude of noise fluctuates with respect to an input signal in which noise is mixed in a desired signal, and suppressing a noise component with high accuracy.
In order to achieve the aforementioned object, a noise suppression system according to an aspect of the present invention, includes: a priori S/N ratio estimated value and expectation calculation means that applies correction to an estimated value of a priori S/N ratio relating to a signal and noise estimated from an input signal in which the signal and the noise are mixed, based on a priori S/N ratio model or based on a signal model and a noise model, and acquires an expectation of the a priori S/N ratio; noise suppression coefficient calculation means that calculates a noise suppression coefficient with use of the a priori S/N ratio expectation; and noise suppression means that suppresses the noise included in the input signal by multiplying the input signal by the noise suppression coefficient.
A noise suppression method according to another aspect of the present invention, includes: applying correction to an estimated value of a priori S/N ratio relating to a signal and noise estimated from an input signal in which the signal and the noise are mixed, based on a priori S/N ratio model or based on a signal model and a noise model, and acquiring an expectation of the a priori S/N ratio; calculating a noise suppression coefficient with use of the a priori S/N ratio expectation; and suppressing the noise component included in the input signal by multiplying the input signal by the noise suppression coefficient.
According to another aspect of the present invention, a program is provided which causes a computer to execute: applying correction to an estimated value of a priori S/N ratio relating to a signal and noise estimated from an input signal in which the signal and the noise are mixed, based on a priori S/N ratio model or based on a signal model and a noise model, and acquiring an expectation of the a priori S/N ratio; calculating a noise suppression coefficient with use of the a priori S/N ratio expectation; and suppressing the noise component included in the input signal by multiplying the input signal by the noise suppression coefficient. According to the present invention, a non-transitory computer readable recording medium recording the program is also provided.
According to the present invention, it is possible to avoid a decrease in the accuracy of noise suppression even when the magnitude of noise fluctuates with respect to an input signal in which noise is mixed in a desired signal, and to suppress a noise component with high accuracy.
In the following, a basic idea common to the example embodiments of the present invention is described, and then each of the example embodiments is described. Note that in the following description, the reference signs in parentheses merely illustrate an example for further clarifying the basic idea of the present invention, and are not to be construed as limiting the present invention. Further, in block diagrams illustrating the configurations of first to fourth example embodiments, directions of arrows between the blocks merely illustrate an example, and do not limit directions of signals between the blocks.
According to a preferred example embodiment of the present invention, a noise suppression system (100 in
According to another example embodiment of the present invention, a priori S/N ratio model may be estimated with use of a speech model prepared in advance and a noise model prepared in advance, in place of using a priori S/N ratio model prepared in advance. For instance, the noise suppression system (300 in
Alternatively, according to another example embodiment of the present invention, a noise suppression system (400 in
A priori S/N ratio and an after S/N ratio are distinguishably defined as follows.
A priori S/N ratio=desired signal power/noise power
After S/N ratio=(mixed signal power of desired signal and noise)/noise power
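The two definitions above can be illustrated with a small numeric sketch. The function names and the power values are hypothetical, and the sketch assumes that the desired signal and the noise are uncorrelated, so that the mixed signal power is the sum of the two powers (an assumption not stated in the definitions themselves).

```python
def a_priori_snr(signal_power, noise_power):
    """A priori S/N ratio: desired signal power divided by noise power."""
    return signal_power / noise_power


def after_snr(mixed_power, noise_power):
    """After S/N ratio: mixed signal power divided by noise power."""
    return mixed_power / noise_power


# Hypothetical powers for a single frequency bin.
signal_power = 4.0
noise_power = 1.0
# Assumption: signal and noise are uncorrelated, so their powers add.
mixed_power = signal_power + noise_power

prior = a_priori_snr(signal_power, noise_power)
after = after_snr(mixed_power, noise_power)
```

Under the stated assumption, the after S/N ratio exceeds the a priori S/N ratio by exactly one.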
The first a priori S/N ratio estimation unit 101 receives an input signal X0 in which a desired signal and noise are mixed. The first a priori S/N ratio estimation unit 101 estimates a ratio (a priori S/N ratio) Rsn1 of desired signal power and noise power, which are included in an input signal X0, and outputs an estimated a priori S/N ratio Rsn1. Note that an input signal X0 is a frequency spectrum (a frequency amplitude spectrum, a frequency power spectrum, or the like) of a mixed signal in which a desired signal and noise are mixed, and is a signal in a frequency domain (a complex signal including a real part and an imaginary part), which is obtained by applying discrete Fourier transform (DFT) or the like to a signal in a time domain. Further, an input signal X0 to be described in the following example embodiments is obtained in the same manner as described above.
The a priori S/N ratio expectation calculation unit 102 receives a priori S/N ratio Rsn1 output from the first a priori S/N ratio estimation unit 101, and a priori S/N ratio model Msn stored in advance in the storage unit 105. The a priori S/N ratio model Msn is constituted by a priori S/N ratio pattern. The a priori S/N ratio expectation calculation unit 102 compares the a priori S/N ratio Rsn1 with the a priori S/N ratio model Msn, and outputs a value obtained by correcting the a priori S/N ratio Rsn1 by the a priori S/N ratio model Msn, as a priori S/N ratio expectation RsnE.
The noise suppression coefficient calculation unit 103 receives a priori S/N ratio expectation RsnE output from the a priori S/N ratio expectation calculation unit 102. The noise suppression coefficient calculation unit 103 calculates a noise suppression coefficient W0 with use of a priori S/N ratio expectation RsnE, and outputs the noise suppression coefficient W0.
The noise suppression unit 104 receives a noise suppression coefficient W0 output from the noise suppression coefficient calculation unit 103, and an input signal X0. The noise suppression unit 104 suppresses a noise component included in an input signal X0 by multiplying the input signal X0 by a noise suppression coefficient W0, and outputs an estimated value S0 of a desired signal.
In the first example embodiment, the first a priori S/N ratio estimation unit 101, the a priori S/N ratio expectation calculation unit 102, the noise suppression coefficient calculation unit 103, the noise suppression unit 104, and the storage unit 105 may be integrally mounted in a single device. Alternatively, each of the units may be configured as a distributed system to be connected to each other via a communication means such as a network. Further, at least a part of the processes/functions of the first a priori S/N ratio estimation unit 101, the a priori S/N ratio expectation calculation unit 102, and the noise suppression coefficient calculation unit 103 may be implemented by a program to be executed on a computer. Further, at least a part of the processes/functions of the noise suppression unit 104, and the storage unit 105 (read control, write control) may be implemented by a program to be executed on a computer. The same idea as described above is also applied to the other example embodiments.
According to the first example embodiment, a priori S/N ratio Rsn1 is corrected by a priori S/N ratio model Msn taking into consideration fluctuation of the magnitude of noise. It is possible to suppress a noise component with high accuracy without removing a desired signal component even when the magnitude of noise fluctuates, by multiplying an input signal X0 by a noise suppression coefficient W0 calculated with use of a priori S/N ratio expectation RsnE.
Next, a noise suppression system 200 according to the second example embodiment of the present invention is described referring to
The noise suppression system 200 includes a first a priori S/N ratio estimation unit 201, an a priori S/N ratio expectation calculation unit 202, a noise suppression coefficient calculation unit 203, a noise suppression unit 204, and a storage unit 205 which stores a priori S/N ratio model (a priori S/N ratio pattern) Msn in advance.
The first a priori S/N ratio estimation unit 201 receives an input signal X0 in which a desired signal and noise are mixed. Then, the first a priori S/N ratio estimation unit 201 estimates a ratio (a priori S/N ratio) Rsn1 of desired signal power and noise power, which are included in the input signal X0, and outputs the estimated Rsn1.
The a priori S/N ratio expectation calculation unit 202 receives a priori S/N ratio Rsn1 output from the first a priori S/N ratio estimation unit 201, and a priori S/N ratio model Msn stored and held in advance in the storage unit 205. The a priori S/N ratio expectation calculation unit 202 compares the estimated a priori S/N ratio Rsn1 with the a priori S/N ratio model Msn, and outputs a priori S/N ratio expectation RsnE, which is a value corrected by the a priori S/N ratio model Msn.
The noise suppression coefficient calculation unit 203 receives an output RsnE from the a priori S/N ratio expectation calculation unit 202. The noise suppression coefficient calculation unit 203 calculates a noise suppression coefficient W0 with use of a priori S/N ratio expectation RsnE, and outputs W0.
The noise suppression unit 204 receives a noise suppression coefficient W0 output from the noise suppression coefficient calculation unit 203, and an input signal X0. The noise suppression unit 204 suppresses a noise component included in an input signal by multiplying the input signal X0 by a noise suppression coefficient W0, and outputs an estimated value S0 of a desired signal.
In the following, each of the units of the noise suppression system 200 in
First of all, a process of the first a priori S/N ratio estimation unit 201 in
$X_0(f,t) = S(f,t) + N(f,t)$ (Equation 1)
Note that X0(f, t) is a frequency spectrum (a frequency amplitude spectrum, a frequency power spectrum, or the like) of a mixed signal in which a desired signal and noise are mixed. The frequency spectrum is a signal in a frequency domain (a complex signal including a real part and an imaginary part), which is obtained by applying discrete Fourier transform (DFT) or the like to a signal in a time domain, for instance. An amplitude component is obtained by absolute value calculation, and a power component is obtained by a square operation, i.e. by multiplying an amplitude component by itself. The parameter f is a frequency index (the frequency index runs, for instance, from a DC (direct-current) component (index: 0) to a Nyquist frequency), and the parameter t is a time (discrete time) index. Further, X0, S, and N at the time index t are vectors, each of which has a component in a frequency direction as an element.
The parameter S on the right side is a frequency spectrum of a desired speech component.
Further, N is a frequency spectrum of a noise component.
The first noise estimation unit 2011 receives an input signal X0, estimates a noise component included in the input signal X0, and outputs first estimated noise N1.
The first speech estimation unit 2012 receives an input signal X0 and first estimated noise N1, and outputs first estimated speech S1.
The a priori S/N ratio estimation unit 2013 receives the first estimated speech S1 and the first estimated noise N1, and outputs an estimated a priori S/N ratio Rsn1(=S1/N1). Note that S1 and N1 at the time index t are vectors, each of which has a component in a frequency direction as an element.
The first noise estimation unit 2011 estimates a noise component included in an input signal X0, and outputs first estimated noise N1.
$N_1 = \mathrm{NE}[X_0]$ (Equation 2)
Note that NE[ ] denotes a noise estimator. It is possible to use a minimum statistics method, a weighted noise estimation method, or the like, all of which are well-known methods for estimation of a noise component included in an input signal X0. Note that the right side of Equation 2 is calculated for each component of the vector X0 by the noise estimator NE[ ], and is output for each component of the vector X0. In this example, the output with respect to a component of the vector X0 means: yi=NE[xi] (where yi denotes the i-th component of the output vector, and xi denotes the i-th component of the vector X0).
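The text leaves the choice of NE[ ] open. As one hedged illustration, the sketch below tracks, for each frequency bin, the minimum power over the most recent frames. This is a heavily simplified stand-in for a minimum statistics method (which in practice also involves smoothing and bias compensation); the function name, the window length, and the frame values are assumptions made for illustration.

```python
def min_tracking_noise(frames, window=3):
    """Per-bin noise estimate: minimum power over the last `window` frames.

    `frames` is a list of power spectra (one list of floats per time index).
    Simplified sketch only; not a full minimum statistics estimator.
    """
    recent = frames[-window:]
    # zip(*recent) groups the values of one frequency bin across frames.
    return [min(bin_values) for bin_values in zip(*recent)]


# Hypothetical power spectra: four time frames, two frequency bins.
frames = [
    [4.0, 1.0],
    [2.0, 3.0],
    [5.0, 2.0],
    [3.0, 6.0],
]
n1 = min_tracking_noise(frames, window=3)
```

The estimate is produced independently for each frequency bin, matching the per-component description of Equation 2.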
The first speech estimation unit 2012 estimates a speech component included in an input signal X0 by suppressing a noise component included in the input signal X0, and outputs first estimated speech S1.
$S_1 = \mathrm{NS}[X_0, N_1]$ (Equation 3)
Note that NS[ ] denotes a noise suppressor. For instance, a spectral subtraction (SS) method described in NPL 1 may be used. The right side of Equation 3 is calculated for each component of the vector X0 and for each component of the vector N1 by the noise suppressor NS[ ], and is output for each component. In this example, the output with respect to a component of the vector means: yi=NS[Xi, Ni] (where yi denotes the i-th component of the output vector, and Xi and Ni denote the i-th components of the vector X0 and the vector N1). In addition to the above, a Wiener Filter (WF) method, an MMSE STSA (Minimum Mean Square Error Short Time Spectral Amplitude) method, an MMSE LSA (Minimum Mean Square Error Log Spectral Amplitude) method, or the like may be used.
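As a rough illustration of a per-component noise suppressor NS[ ], the following sketch applies a basic power spectral subtraction with a spectral floor. It is not claimed to be the exact method of NPL 1: the flooring rule, the parameter `floor`, the function name, and the sample values are assumptions.

```python
def spectral_subtraction(x_power, n_power, floor=0.01):
    """Per-bin sketch of NS[X0, N1] on power spectra.

    Subtracts the estimated noise power from the input power. A small
    fraction of the input power is kept as a floor so that the result
    never becomes zero or negative (the floor is an assumption).
    """
    return [max(x - n, floor * x) for x, n in zip(x_power, n_power)]


# Hypothetical power spectra for three frequency bins.
x0 = [5.0, 2.0, 1.0]
n1 = [1.0, 1.0, 2.0]
s1 = spectral_subtraction(x0, n1)
```

In the third bin the noise estimate exceeds the input power, so the floored value is returned instead of a negative power.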
The a priori S/N ratio estimation unit 2013 receives first estimated speech S1 (a speech component included in an input signal X0) from the first speech estimation unit 2012, and first estimated noise N1 from the first noise estimation unit 2011, estimates an S/N ratio (=S1/N1) of a speech signal to noise, and outputs the estimated value as a priori S/N ratio Rsn1.
The right side of Equation 4 is calculated for each component of the vector S1 and for each component of the vector N1, and is output for each component. For instance, S1/N1 is output as (S11/N11, S12/N12, . . . , S1n/N1n). The output with respect to a component of the vector means: yi=xi/zi (where yi denotes the i-th component of the output vector, and xi and zi denote the i-th components of the vector S1 and the vector N1).
Note that in the a priori S/N ratio estimation unit 2013, first estimated noise N1 of the denominator on the right side of (Equation 4) may be a noise component N1′(=X0−S1), which is re-estimated with use of an input signal X0 and first estimated speech S1. In this case, a priori S/N ratio Rsn1 is given by the following (Equation 5).
The right side of Equation 5 is also calculated for each component of a vector X0 and for each component of a vector S1 in the same manner as described in paragraph [0053]. Further, when the WF method, the MMSE STSA method, or the MMSE LSA method is used in the first speech estimation unit 2012, the first speech estimation unit 2012 may obtain a priori S/N ratio. In view of the above, a priori S/N ratio estimated by the first speech estimation unit 2012 may be regarded as an output (a priori S/N ratio Rsn1) of the first a priori S/N ratio estimation unit 201. In this case, the a priori S/N ratio estimation unit 2013 in
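The two ways of forming the a priori S/N ratio described above (the ratio S1/N1, and the variant that re-estimates the noise as N1′ = X0 − S1) can be sketched per component as follows. The function names, the guard value `eps` against division by zero, and the sample spectra are assumptions added for illustration.

```python
def estimate_a_priori_snr(s1, n1, eps=1e-12):
    """Per-bin ratio S1 / N1 (eps is an assumed guard against zero noise)."""
    return [s / max(n, eps) for s, n in zip(s1, n1)]


def estimate_a_priori_snr_reestimated(x0, s1, eps=1e-12):
    """Per-bin ratio S1 / N1', with the noise re-estimated as N1' = X0 - S1."""
    return [s / max(x - s, eps) for x, s in zip(x0, s1)]


# Hypothetical spectra for two frequency bins.
x0 = [5.0, 3.0]
s1 = [4.0, 1.0]
n1 = [1.0, 2.0]

rsn1 = estimate_a_priori_snr(s1, n1)
rsn1_alt = estimate_a_priori_snr_reestimated(x0, s1)
```

Here the two variants agree because the sample values satisfy X0 = S1 + N1 exactly; with a real noise suppressor they generally differ.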
A priori S/N ratio Rsn1 may be calculated, for instance, with use of a value for each frequency index f as in the following (Equation 6), with use of a value for each frequency band B (e.g. a Mel-frequency band), which is a series of frequency indexes f, as in (Equation 7), or with use of a value obtained by summing over all the frequency indexes f as in (Equation 8). Note that the a priori S/N ratio Rsn1 at the time index t has as many values as there are frequency indexes f or frequency bands B. Therefore, the a priori S/N ratio Rsn1 at t is a vector which has a component in a frequency direction as an element.
The feature transformation unit 2021 receives a priori S/N ratio Rsn1 output from the first a priori S/N ratio estimation unit 201, and outputs a feature Fsn1 of the a priori S/N ratio Rsn1.
The expectation calculation unit 2022 receives the feature Fsn1, and a priori S/N ratio model (a priori S/N ratio pattern) Msn prepared in advance, and outputs a feature FsnE of a priori S/N ratio expectation.
The feature inverse transformation unit 2023 receives the feature FsnE, and outputs a priori S/N ratio expectation RsnE.
The feature transformation unit 2021 transforms a priori S/N ratio Rsn1 into a feature Fsn1, and outputs the feature Fsn1. As a feature, it is possible to use a logarithmic value in the following (Equation 9), a value (cepstrum) obtained by applying discrete cosine transform (DCT) to a logarithmic value, as expressed by (Equation 10), or the like, for instance.
$F_{sn1} = \log R_{sn1}$ (Equation 9)
Note that log in Equation 9 is a natural logarithm. The same definition applies to log as used hereinafter. A common logarithm may be used instead of a natural logarithm. The right side of Equation 9 is logarithmically calculated for each component of the vector Rsn1, and is output for each component of the vector Rsn1. In this example, the output with respect to a component of the vector Rsn1 means: yi=log xi (where yi denotes the i-th component of the output vector, and xi denotes the i-th component of the vector Rsn1).
$F_{sn1} = C[\log R_{sn1}]$ (Equation 10)
Note that C[ ] denotes a DCT operator. The right side of Equation 10 is subjected to cosine transform for each component of the vector log Rsn1, and is output for each component of the vector Rsn1. In this example, the output with respect to a component of the vector Rsn1 means: zi=C[xi] (where zi denotes the i-th component of the output vector, and xi denotes the i-th component of the vector Rsn1). Further, the logarithmic computation in Equation 10 is the same as the calculation in Equation 9.
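The two feature choices (the logarithm of Equation 9 and the cepstrum of Equation 10) can be sketched as below. The text does not fix a particular DCT convention, so the orthonormal DCT-II used here is an assumption, as are the function names and the sample values. Note that in this sketch the DCT is applied to the log vector as a whole.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II of a list of floats (assumed convention for C[ ])."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)
        ))
    return out


def log_feature(rsn1):
    """Equation 9 style feature: natural logarithm of each component."""
    return [math.log(r) for r in rsn1]


def cepstral_feature(rsn1):
    """Equation 10 style feature: DCT of the log vector (a cepstrum)."""
    return dct_ii(log_feature(rsn1))


# Hypothetical a priori S/N ratios for four frequency bins.
rsn1 = [1.0, math.e, math.e ** 2, math.e ** 3]
f_log = log_feature(rsn1)
f_cep = cepstral_feature(rsn1)
```

With the orthonormal convention, the transform preserves the sum of squares of the log feature, which makes the inverse transform a simple transpose.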
Note that a feature Fsn1 may be calculated for each time index t. Alternatively, a difference with respect to a feature at a past time (e.g., t−1) may be obtained, and a primary difference feature may be used. Further alternatively, a further difference may be obtained, and a secondary difference feature may be used. The feature Fsn1 at the time index t has as many elements as the number of dimensions of cepstrum, the number of primary difference features, or the number of secondary difference features. Therefore, a feature Fsn1 at the time index t is a multi-dimensional vector.
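The primary and secondary difference features mentioned above can be sketched as simple frame-to-frame differences. Setting the first frame's difference to zero and concatenating the static and difference features into one vector are assumptions made for illustration; the function name and the sample values are hypothetical.

```python
def delta(features):
    """Frame-to-frame difference d[t] = F[t] - F[t-1].

    The difference for the first frame is set to zero (an assumption,
    since no earlier frame exists).
    """
    out = [[0.0] * len(features[0])]
    for prev, cur in zip(features, features[1:]):
        out.append([c - p for c, p in zip(cur, prev)])
    return out


# Hypothetical features: three time frames, two dimensions each.
feats = [[1.0, 2.0], [1.5, 1.0], [3.0, 1.0]]
d1 = delta(feats)   # primary difference features
d2 = delta(d1)      # secondary difference features

# Assumed layout: static, primary, and secondary parts concatenated per frame.
stacked = [f + a + b for f, a, b in zip(feats, d1, d2)]
```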
The expectation calculation unit 2022 receives a feature Fsn1, and a priori S/N ratio model Msn stored in advance in the storage unit 205, and outputs a feature FsnE of a priori S/N ratio expectation. In the following, as an example, the a priori S/N ratio model Msn is described as a Gaussian mixture model (GMM), which is constituted by G Gaussian distributions. Note that it is needless to say that the present invention is not limited to the following example.
A priori S/N ratio model Msn is regarded as a Gaussian mixture model in which G (G>1) Gaussian distributions, each with an average value μsn,g and a dispersion (variance) σ²sn,g, are mixed with a weight wsn,g. Note that g is an index of a Gaussian distribution (g=0, 1, . . . , G−1).
The expectation calculation unit 2022 calculates a feature FsnE of a priori S/N ratio expectation as a weighted sum of average values μsn,g of a priori S/N ratio models Msn as expressed by the following (Equation 11).
$F_{snE} = \sum_{g=0}^{G-1} P(g \mid F_{sn1})\, \mu_{sn,g}$ (Equation 11)
In (Equation 11), P(g|Fsn1) as a weight is a posterior probability with respect to a feature Fsn1. P(g|Fsn1) is calculated as expressed by (Equation 12), for instance.
In (Equation 12), P(Fsn1|g) is a probability at which a Gaussian distribution g of a priori S/N ratio model Msn outputs a feature Fsn1, and is calculated as expressed by the following (Equation 13).
Note that both the feature Fsn1 and the average value μsn,g are D-dimensional column vectors, and the dispersion σ²sn,g is a D×D matrix. The operator det[ ] denotes a determinant. Further, T denotes transposition, and {Fsn1−μsn,g}T denotes a D-dimensional row vector. Note that the value of D representing the number of dimensions may be changed as necessary depending on the type of an input signal. When a speech signal is included, ten or more dimensions may be desirable.
A priori S/N ratio model Msn stored and held in advance in the storage unit 205 is expressed by using an average value μsn,g and a dispersion σ²sn,g. The dispersion σ²sn,g includes fluctuation of a speech signal or fluctuation of the magnitude of noise. In view of the above, in (Equation 11), the posterior probability P(g|Fsn1) to be used as a weight is a value taking into consideration fluctuation of the magnitude of noise.
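The expectation calculation with a Gaussian mixture model can be sketched as follows: the posterior probability of each Gaussian distribution is obtained by Bayes' rule from the mixture weights and the Gaussian likelihoods, and the expectation is the posterior-weighted sum of the average values as in (Equation 11). To stay short, the sketch assumes diagonal dispersions (a list of per-dimension variances) rather than full D×D matrices, and evaluates the posteriors in the log domain; the function names and the model values are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal dispersion (assumption)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """P(g | x) by Bayes' rule, computed in the log domain for stability."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)
    unnorm = [math.exp(value - peak) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def expected_feature(x, weights, means, variances):
    """Posterior-weighted sum of the component average values."""
    post = posteriors(x, weights, means, variances)
    return [
        sum(p * m[d] for p, m in zip(post, means)) for d in range(len(x))
    ]


# Hypothetical two-component model in two dimensions.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

fsn1 = [2.0, 2.0]  # equidistant from both average values
fsn_e = expected_feature(fsn1, weights, means, variances)
```

A feature lying close to one average value is pulled almost entirely to that value, while an ambiguous feature such as the one above is mapped to a blend of the average values.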
A priori S/N ratio model Msn may be generated with use of a feature Fsn1 with respect to a large amount of input signals in advance. In the case of a Gaussian mixture model, a priori S/N ratio model Msn may be learnt (generated) with use of an expectation maximization algorithm or the like, for instance. Alternatively, a priori S/N ratio model Msn may be generated by combining a speech model Ms and a noise model Mn. A method for combining a speech model Ms and a noise model Mn will be described in the next example embodiment (refer to the description on an expectation calculation unit 3062 in
The feature inverse transformation unit 2023 transforms a feature FsnE of a priori S/N ratio expectation, and outputs a priori S/N ratio expectation RsnE. When a logarithmic value in (Equation 9) is used by the feature transformation unit 2021, inverse transformation is applied by (Equation 14). When a value obtained by applying cosine transform to a logarithmic value is used as expressed by (Equation 10), inverse transformation may be applied by (Equation 15).
$R_{snE} = \exp[F_{snE}]$ (Equation 14)
$R_{snE} = \exp[C^{-1}[F_{snE}]]$ (Equation 15)
Note that exp[ ] denotes an exponential operator, and C−1[ ] denotes an inverse cosine transform operator (inverse discrete cosine transform (IDCT) operator). The right side of Equation 14, exp[FsnE], is calculated for each component of the vector FsnE, and is output for each vector component as (e^FsnE1, e^FsnE2, . . . , e^FsnEn). In this example, the output with respect to a component of the vector FsnE means: yi=e^xi (where yi denotes the i-th component of the output vector, and xi denotes the i-th component of the vector FsnE). Further, on the right side of Equation 15, exp[C−1[FsnE]], the inverse cosine transform C−1[FsnE] is calculated for the vector FsnE, and is output for each component of the vector FsnE. In this example, the output with respect to a vector component means: zi=C−1[xi] (where zi denotes the i-th component of the output vector, and xi denotes the i-th component of the vector FsnE). The exponential operation in Equation 15 is the same as the calculation in Equation 14.
In this example, substituting (Equation 11) in (Equation 15) yields the following mathematical expression.
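$R_{snE} = \exp\left[C^{-1}\left[\sum_{g=0}^{G-1} P(g \mid F_{sn1})\, \mu_{sn,g}\right]\right] = \exp\left[\sum_{g=0}^{G-1} P(g \mid F_{sn1})\, C^{-1}[\mu_{sn,g}]\right]$ (Equation 16)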
Inverse cosine transform C−1 is a linear transform. In view of the above, the value C−1[μsn,g], which is obtained by applying inverse cosine transform to the average value μsn,g of the a priori S/N ratio model Msn, is stored and held in advance in the storage unit 205. As long as the average value μsn,g of the a priori S/N ratio model Msn does not change, the inverse cosine transform operation in (Equation 16) becomes unnecessary by using the operation result C−1[μsn,g] held in the storage unit 205.
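The saving that follows from the linearity of the inverse cosine transform can be checked numerically. The sketch below compares the two sides of (Equation 16): applying the inverse transform to the weighted sum at run time, versus summing average values that were inverse-transformed once in advance. The orthonormal inverse DCT convention, the function names, and the sample values are assumptions.

```python
import math


def idct_ortho(c):
    """Inverse of the orthonormal DCT-II (assumed convention for C^-1[ ])."""
    n = len(c)
    out = []
    for i in range(n):
        total = c[0] * math.sqrt(1.0 / n)
        total += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(total)
    return out


def expectation_direct(post, means):
    """exp[C^-1[sum_g P(g) mu_g]]: one inverse transform per frame."""
    dim = len(means[0])
    mixed = [sum(p * m[d] for p, m in zip(post, means)) for d in range(dim)]
    return [math.exp(v) for v in idct_ortho(mixed)]


def expectation_precomputed(post, inv_means):
    """exp[sum_g P(g) C^-1[mu_g]]: no inverse transform at run time."""
    dim = len(inv_means[0])
    return [
        math.exp(sum(p * m[d] for p, m in zip(post, inv_means)))
        for d in range(dim)
    ]


# Hypothetical cepstral average values for a two-component model.
means = [[1.0, 0.5, -0.2], [2.0, -0.3, 0.1]]
# Computed once and stored together with the model.
inv_means = [idct_ortho(m) for m in means]

post = [0.3, 0.7]  # hypothetical posterior probabilities
direct = expectation_direct(post, means)
precomputed = expectation_precomputed(post, inv_means)
```

Both routes give the same expectation; the precomputed route only needs a weighted sum and an exponential for each frame.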
The noise suppression coefficient calculation unit 203 calculates and outputs a noise suppression coefficient W0 with use of a priori S/N ratio expectation RsnE. For instance, it is possible to calculate a noise suppression coefficient by a Wiener Filter method as expressed by the following mathematical expression, with use of a priori S/N ratio expectation RsnE.
The right side of Equation (17) is calculated for each component of the vector RsnE, and is output for each component of the vector RsnE as (RsnE1/(1+RsnE1), RsnE2/(1+RsnE2), . . . , RsnEn/(1+RsnEn)), for instance. The output with respect to a component of the vector RsnE means: yi=xi/(1+xi) (where yi denotes the i-th component of the output vector, and xi denotes the i-th component of the vector RsnE).
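The per-component rule yi=xi/(1+xi) stated above can be written directly as a short sketch. The function name and the sample values are hypothetical.

```python
def wiener_coefficient(rsn_e):
    """Per-bin Wiener-type coefficient x / (1 + x) from the S/N expectation."""
    return [r / (1.0 + r) for r in rsn_e]


# Hypothetical a priori S/N ratio expectations for three frequency bins.
w0 = wiener_coefficient([0.0, 1.0, 3.0])
```

The coefficient stays between 0 and 1: bins with a low S/N ratio expectation are attenuated strongly, and bins with a high expectation are passed almost unchanged.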
Note that it is needless to say that another noise suppression method such as the MMSE STSA method or the MMSE LSA method may be used when the noise suppression coefficient calculation unit 203 calculates a noise suppression coefficient with use of a priori S/N ratio expectation RsnE.
When a noise suppression method using an after S/N ratio (a ratio between a mixed signal including a desired signal and noise, and noise) is employed in calculating a noise suppression coefficient, the noise suppression coefficient calculation unit 203 may calculate an after S/N ratio (X0/N1) from an input signal X0 and first estimated noise N1 in the first a priori S/N ratio estimation unit 201, and may use the after S/N ratio for calculation of a noise suppression coefficient.
The noise suppression unit 204 suppresses a noise component included in an input signal X0 by multiplying the input signal X0 by a noise suppression coefficient W0, and outputs an estimated value S0 of a desired signal.
$S_0 = W_0 X_0$ (Equation 18)
Specifically, approximating a priori S/N ratio expectation RsnE by a ratio of an estimated value S0 of a desired signal to an estimated value N0 of noise yields approximation: W0≈S0/(S0+N0). Then, W0×X0 becomes an estimated value S0 of a desired signal from X0≈S0+N0.
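The multiplication of (Equation 18) can be sketched on a complex frequency spectrum as follows. Because the coefficient is real-valued in this sketch, only the magnitude of each component is scaled and its phase is left unchanged. The function name and the sample values are hypothetical.

```python
def suppress(x0, w0):
    """Per-bin product W0 * X0 of a real coefficient and a complex spectrum."""
    return [w * x for w, x in zip(w0, x0)]


# Hypothetical complex spectrum and suppression coefficients for two bins.
x0 = [complex(2.0, 2.0), complex(0.0, -4.0)]
w0 = [0.5, 0.75]
s0 = suppress(x0, w0)
```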
The first a priori S/N ratio estimation unit 201 estimates a ratio Rsn1 of a desired signal and noise, which are included in an input signal X0 in which the desired signal and noise are mixed.
The a priori S/N ratio expectation calculation unit 202 compares the a priori S/N ratio Rsn1 estimated by the first a priori S/N ratio estimation unit 201 with the a priori S/N ratio model Msn in the storage unit 205, and calculates a priori S/N ratio expectation RsnE, which is a value corrected by the a priori S/N ratio model Msn.
The noise suppression coefficient calculation unit 203 calculates a noise suppression coefficient W0 with use of a priori S/N ratio expectation RsnE.
The noise suppression unit 204 suppresses a noise component included in an input signal by multiplying the input signal X0 by a noise suppression coefficient W0, and obtains an estimated value S0 of a desired signal.
According to the example embodiment, a priori S/N ratio Rsn1 is corrected by a priori S/N ratio model Msn taking into consideration fluctuation of the magnitude of noise. By using a noise suppression coefficient calculated with use of a corrected a priori S/N ratio expectation RsnE, it is possible to suppress a noise component with high accuracy without removing a desired signal component even when the magnitude of noise fluctuates.
Next, a noise suppression system according to the third example embodiment of the present invention is described referring to
The operations of a noise suppression coefficient calculation unit 303 and a noise suppression unit 304 in
The first speech and first noise estimation unit 305 receives an input signal X0 in which a desired signal and noise are mixed. Then, the first speech and first noise estimation unit 305 outputs an estimated value S1 of a first desired signal (speech) and an estimated value N1 of first noise, which are included in the input signal X0.
The a priori S/N ratio expectation calculation unit 306 receives an estimated value S1 of a first desired signal (speech) and an estimated value N1 of first noise output from the first speech and first noise estimation unit 305, and a speech model (a speech pattern) Ms stored and held in advance in the storage unit 307. Further, the a priori S/N ratio expectation calculation unit 306 receives a noise model (a noise pattern) Mn stored and held in advance in the storage unit 308. The a priori S/N ratio expectation calculation unit 306 compares the estimated value S1 of the desired signal (speech) and the estimated value N1 of the noise with the speech model Ms and the noise model Mn, and outputs a priori S/N ratio expectation RsnE.
The first noise estimation unit 3051 receives an input signal X0, and outputs first estimated noise N1.
The first speech estimation unit 3052 receives an input signal X0 and first estimated noise N1, and outputs first estimated speech S1. The operations of the first noise estimation unit 3051 and the first speech estimation unit 3052 in
The feature transformation unit 3061s receives first estimated speech S1, and outputs a feature Fs1 of the first estimated speech S1.
The feature transformation unit 3061n receives first estimated noise N1, and outputs a feature Fn1 of the first estimated noise N1.
The expectation calculation unit 3062 receives a feature Fs1, a feature Fn1, a speech model Ms prepared in advance, and a noise model Mn prepared in advance, and outputs a feature FsnE of a priori S/N ratio expectation.
The feature inverse transformation unit 3063 receives a feature FsnE, and outputs a priori S/N ratio expectation RsnE. The operation of the feature inverse transformation unit 3063 is the same as the operation of the feature inverse transformation unit 2023 in
The feature transformation unit 3061s receives first estimated speech S1, transforms the input first estimated speech S1, and outputs a feature Fs1. As a feature, it is possible to use a logarithmic value in (Equation 19), a value (cepstrum) obtained by applying cosine transform (discrete cosine transform) to a logarithmic value as expressed by (Equation 20), or the like.
$F_{s1} = \log S_1$ (Equation 19)
Note that the right side of Equation 19 is logarithmically calculated for each component of a vector S1, and a value is output with respect to each component of the vector S1. In this example, the output with respect to the component of the vector means: yi=log xi (where yi denotes the i-th component of an output vector, and xi denotes the i-th component of a vector S1).
$F_{s1} = C[\log S_1]$ (Equation 20)
Further, the right side of Equation 20 is subjected to cosine transform for each component of a vector log S1, and is output corresponding to a component of a vector S1. In this example, the output with respect to the component of the vector S1 means: zi=C[xi] (where zi denotes the i-th component of an output vector, and xi denotes the i-th component of a vector S1). Further, the logarithmic operation of Equation 20 is the same as the calculation in Equation 19.
The feature transformation unit 3061n receives first estimated noise N1, transforms the input first estimated noise N1, and outputs a feature Fn1. As a feature, it is possible to use a logarithmic value in (Equation 21), a value (cepstrum) obtained by applying cosine transform (discrete cosine transform) to a logarithmic value as expressed by (Equation 22), or the like.
$F_{n1} = \log N_1$ (Equation 21)
Note that the right side of Equation 21 is logarithmically calculated for each component of a vector N1, and a value is output with respect to each component of the vector N1. In this example, the output with respect to the component of the vector N1 means: yi=log xi (where yi denotes the i-th component of an output vector, and xi denotes the i-th component of a vector N1).
$F_{n1} = C[\log N_1]$ (Equation 22)
Further, the right side of Equation 22 is subjected to cosine transform for each component of a vector log N1, and is output corresponding to the component of the vector N1. In this example, the output with respect to the component of the vector N1 means: zi=C[xi] (where zi denotes the i-th component of an output vector, and xi denotes the i-th component of a vector N1). Further, the logarithmic operation of Equation 22 is the same as the calculation in Equation 21.
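The feature transformations of Equations 19 to 22 can be illustrated by the following minimal sketch. It assumes that the input vectors hold strictly positive spectral magnitudes and that the cosine transform C[ ] is an unnormalized type-II discrete cosine transform; the function names `log_feature`, `dct_ii`, and `cepstrum_feature` and the sample values are hypothetical and are used for illustration only.

```python
import math


def log_feature(spectrum):
    """Component-wise logarithm, y_i = log x_i (Equations 19 and 21)."""
    return [math.log(x) for x in spectrum]


def dct_ii(values):
    """Unnormalized DCT-II, assumed here as the cosine transform C[.]."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def cepstrum_feature(spectrum):
    """Cosine transform of the log spectrum (Equations 20 and 22)."""
    return dct_ii(log_feature(spectrum))


# Hypothetical sample vectors for first estimated speech S1 and noise N1.
s1 = [4.0, 2.0, 1.0, 0.5]
n1 = [1.0, 1.0, 1.0, 1.0]

fs1 = log_feature(s1)        # logarithmic feature of S1
fn1 = log_feature(n1)        # logarithmic feature of N1
cs1 = cepstrum_feature(s1)   # cepstrum feature of S1
cn1 = cepstrum_feature(n1)   # cepstrum feature of N1
```

In practice, a small floor value may be applied before taking the logarithm in order to avoid log 0; this is not stated in the description and is merely a common precaution.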
Note that features Fs1 and Fn1 may be calculated for each time index t. Alternatively, a difference with respect to a feature at a past time (e.g., t−1) may be obtained, and a primary difference feature may be used. Further alternatively, a further difference may be obtained, and a secondary difference feature may be used. The number of features Fs1 and Fn1 at the time index t is equal to the number of dimensions of the cepstrum, the number of primary difference features, or the number of secondary difference features. Therefore, each of the features Fs1 and Fn1 at the time index t is a multi-dimensional vector.
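The primary and secondary difference features mentioned above can be sketched as follows. The sketch assumes that the difference is taken with respect to the immediately preceding frame (t−1) and that the first frame, which has no predecessor, is given a zero difference; the function name `delta` and the sample frames are hypothetical.

```python
def delta(frames):
    """Frame-to-frame difference F(t) - F(t-1); the first frame gets zeros
    (an assumption, since no earlier frame exists)."""
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for c, p in zip(cur, prev)])
    return out


# Hypothetical two-dimensional features at time indices t = 0, 1, 2.
frames = [[1.0, 2.0], [1.5, 2.0], [2.5, 1.0]]

primary = delta(frames)      # primary difference feature
secondary = delta(primary)   # secondary difference feature
```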
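The expectation calculation unit 3062 receives the feature Fs1, the feature Fn1, the speech model Ms prepared in advance, and the noise model Mn prepared in advance, and outputs a feature FsnE of a priori S/N ratio expectation.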
In the following example, the third example embodiment of the present invention is described based on the premise that:
Taking into consideration that:
it is possible to express a feature Fsn1 of a priori S/N ratio as follows with use of features Fs1 and Fn1.
$F_{sn1} = F_{s1} - F_{n1}$ (Equation 23)
As described above, in this example, a speech model Ms is a Gaussian mixture model, in which Gaussian distributions whose number is Gs with an average value μs,gs and a dispersion σ2s,gs are mixed with a weight ws,gs.
Further, a noise model Mn is a Gaussian mixture model, in which Gaussian distributions whose number is Gn with an average value μn,gn and a dispersion σ2n,gn are mixed with a weight wn,gn.
Note that gs and gn are indexes of Gaussian distribution.
In this example, when it is assumed that a speech signal and a noise signal are independent of each other, a priori S/N ratio model is a Gaussian mixture model, in which Gaussian distributions whose number is G (=Gs×Gn) with an average value μsn,g (=μs,gs−μn,gn) and a dispersion σ2sn,g (=σ2s,gs+σ2n,gn) are mixed with a weight wsn,g (=ws,gs×wn,gn).
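The combination of a speech model Ms and a noise model Mn into a priori S/N ratio model described above can be sketched as follows. The sketch assumes scalar features and represents each Gaussian distribution as a (weight, average value, dispersion) tuple; the function name `combine_models` and all numerical values are hypothetical.

```python
from itertools import product


def combine_models(speech_model, noise_model):
    """Build the Gs x Gn components of the a priori S/N ratio model.

    For each pair (gs, gn): weight = ws * wn, mean = mu_s - mu_n,
    dispersion = var_s + var_n (speech and noise assumed independent).
    """
    combined = []
    for (ws, mu_s, var_s), (wn, mu_n, var_n) in product(speech_model, noise_model):
        combined.append((ws * wn, mu_s - mu_n, var_s + var_n))
    return combined


# Hypothetical models: Gs = 3 speech components, Gn = 2 noise components.
speech_model = [(0.5, 2.0, 1.0), (0.25, 4.0, 0.5), (0.25, 6.0, 2.0)]
noise_model = [(0.5, 1.0, 0.25), (0.5, 3.0, 0.5)]

snr_model = combine_models(speech_model, noise_model)  # G = 6 components
```

Because the weights of each input model sum to one, the weights of the combined model also sum to one.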
The expectation calculation unit 3062 calculates and outputs a feature FsnE of an expectation by (Equation 11) in the same manner as the expectation calculation unit 2022 in
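(Equation 11) is not reproduced in this part of the description. The following sketch therefore assumes one common form of expectation calculation over a Gaussian mixture model, namely a sum of the component average values weighted by the posterior probability of each component given the feature Fsn1. The function names, the scalar feature, and the sample model are hypothetical, and the sketch may differ from the calculation actually defined by (Equation 11).

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def expectation(fsn1, model):
    """Posterior-weighted mean of the component means (assumed form).

    model is a list of (weight, mean, dispersion) tuples.
    """
    likelihoods = [w * gaussian_pdf(fsn1, m, v) for w, m, v in model]
    total = sum(likelihoods)
    return sum(lk / total * m for lk, (_, m, _) in zip(likelihoods, model))


# Hypothetical symmetric two-component a priori S/N ratio model.
model = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]

fsn_e_mid = expectation(0.0, model)    # feature midway between the means
fsn_e_high = expectation(10.0, model)  # feature far above both means
```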
According to the example embodiment, a speech model Ms and a noise model Mn may be held in the storage units (307, 308), in place of the a priori S/N ratio model Msn in the second example embodiment. According to this configuration, the example embodiment is advantageous in reducing a required storage capacity, as compared with the second example embodiment. This is because A+B<AB is established when the number of speech models Ms is A (A>2), and the number of noise models Mn is B (B>2). The inequality also holds, for instance, when the number of speech models Ms is three and the number of noise models Mn is two: five models (3+2) are stored, whereas the number of a priori S/N ratio models would be six (3×2). Specifically, it is possible to reduce the number of models to be stored in a storage unit.
Further, according to the example embodiment, when the system is adapted to a different noise environment, and the like, for instance, it is only necessary to re-generate a noise model Mn. This facilitates adaptation to a different noise environment.
Further, according to the example embodiment, when reliability of a feature Fn1 of noise is instantaneously decreased, such as when speech is instantaneously included in the feature Fn1 of noise, the feature Fn1 of noise is substituted by an average value μn,gn of a noise model in (Equation 23). This makes it possible to prevent a situation in which speech is inadvertently suppressed as noise. Note that determination as to whether or not a feature Fn1 of noise is reliable may be performed by comparing the feature Fn1 of noise with a noise model Mn. For instance, when a feature Fn1 of noise is within the range: μn,gn±3σn,gn (where μn,gn is an average value of a noise model, and σn,gn is a standard deviation), reliability may be determined to be high, and when the feature Fn1 of noise is out of the range, reliability may be determined to be low.
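The reliability determination and substitution described above can be sketched as follows. The sketch assumes a scalar feature and a noise model regarded as a single Gaussian distribution; the function names `is_reliable` and `snr_feature`, the parameter `k`, and the sample values are hypothetical.

```python
import math


def is_reliable(fn1, noise_mean, noise_var, k=3.0):
    """True when Fn1 lies within noise_mean +/- k standard deviations."""
    return abs(fn1 - noise_mean) <= k * math.sqrt(noise_var)


def snr_feature(fs1, fn1, noise_mean, noise_var):
    """Equation 23, with Fn1 replaced by the model mean when unreliable."""
    noise = fn1 if is_reliable(fn1, noise_mean, noise_var) else noise_mean
    return fs1 - noise


# Hypothetical noise model: mean 1.0, dispersion 0.25 (standard deviation 0.5),
# so the reliable range is from -0.5 to 2.5.
reliable_case = snr_feature(5.0, 1.5, 1.0, 0.25)    # Fn1 used as-is
unreliable_case = snr_feature(5.0, 4.0, 1.0, 0.25)  # Fn1 replaced by the mean
```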
As described above, according to the example embodiment, an expectation of a feature of a priori S/N ratio is calculated with use of a feature of a priori S/N ratio, and a priori S/N ratio model constituted by a speech model and a noise model; and a noise suppression coefficient is obtained from the expectation of the feature of the a priori S/N ratio. As in the other example embodiments, the aforementioned configuration provides the operational advantage of suppressing a noise component with high accuracy without removing a desired signal component even when the magnitude of noise fluctuates. Further, the example embodiment provides the new operational advantages of reducing the capacity of a storage device and facilitating adaptation to a different noise environment.
A noise suppression system according to a fourth example embodiment of the present invention is described referring to
The operations of a first speech and first noise estimation unit 405, a noise suppression coefficient calculation unit 403, and a noise suppression unit 404 in
The a priori S/N ratio expectation calculation unit 406 receives output values S1 and N1 of the first speech and first noise estimation unit 405, and a speech model (a speech pattern) Ms prepared in advance. The a priori S/N ratio expectation calculation unit 406 outputs a priori S/N ratio expectation RsnE with use of estimated S1 and N1, and a speech model Ms.
The noise model generation unit 4064 receives a feature Fn1 of first estimated noise, generates (successively updates) a noise model Mn, and outputs the generated noise model Mn. In the following, to simplify the description, a noise model is described as a single Gaussian distribution. Note that it is needless to say that the fourth example embodiment of the present invention is not limited to such a distribution.
A noise model Mn is regarded as a single Gaussian distribution with an average value μn and a dispersion σ2n.
$\mu_n = \mathrm{AVE}[F_{n1}]$ (Equation 24)
$\sigma_n^2 = \mathrm{VAR}[F_{n1}]$ (Equation 25)
Note that AVE[ ] denotes an operator which calculates an average value, and VAR[ ] denotes an operator which calculates a dispersion value. For instance, an average value μn(t) and a dispersion σ2n(t) of a noise model Mn at the time index t are respectively and successively updated as expressed by the following (Equation 26) and (Equation 27).
$\mu_n(t) = \alpha_\mu \mu_n(t-1) + (1-\alpha_\mu) F_{n1}(t)$ (Equation 26)
$\sigma_n^2(t) = \alpha_\sigma \sigma_n^2(t-1) + (1-\alpha_\sigma)\{F_{n1}(t) - \mu_n(t)\}^2$ (Equation 27)
In this example, αμ and ασ are respectively a time constant (0.0 to 1.0) for calculating an average value and a dispersion value, and are normally set to a value of from 0.9 to 1.0 for obtaining an averaging effect. Note that it is needless to say that a noise model Mn may be generated by a method other than the aforementioned exemplary method.
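The successive update of (Equation 26) and (Equation 27) can be sketched as follows for a scalar feature. The function name `update_noise_model`, the default time constants of 0.95, and the initial values are hypothetical choices within the ranges stated above.

```python
def update_noise_model(mean_prev, var_prev, fn1, alpha_mean=0.95, alpha_var=0.95):
    """One update step of the single-Gaussian noise model.

    The average value is updated first (Equation 26), and the updated
    average value is then used in the dispersion update (Equation 27).
    """
    mean = alpha_mean * mean_prev + (1.0 - alpha_mean) * fn1
    var = alpha_var * var_prev + (1.0 - alpha_var) * (fn1 - mean) ** 2
    return mean, var


# Single step with time constants of 0.5, chosen only to keep the numbers simple.
one_step = update_noise_model(0.0, 1.0, 2.0, alpha_mean=0.5, alpha_var=0.5)

# Repeated updates with a constant noise feature of 1.0.
mean, var = 0.0, 1.0
for _ in range(500):
    mean, var = update_noise_model(mean, var, 1.0)
```

With a constant input, the average value approaches the input and the dispersion decays toward zero, which illustrates the averaging effect obtained with time constants close to 1.0.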
The expectation calculation unit 4062 receives:
outputs a feature FsnE of a priori S/N ratio expectation.
The operation of the expectation calculation unit 4062 is basically the same as the operation of the expectation calculation unit 3062 in
In this example, when it is difficult, in terms of the amount of calculation, for the expectation calculation unit 4062 to generate a priori S/N ratio model by combining a momentarily changing noise model Mn and a speech model Ms, the amount of calculation may be reduced by the following measures, for instance.
First of all, an average value μsn,g (=μs,gs−μn,gn) of a priori S/N ratio model is considered. In (Equation 13), calculation of a difference between a feature Fsn1 of a priori S/N ratio and an average value μsn,g of a priori S/N ratio model is rewritten with use of an average value μs,gs of a speech model and an average value μn,gn of a noise model.
$\{F_{sn1} - \mu_{sn,g}\} = \{F_{sn1} - (\mu_{s,g_s} - \mu_{n,g_n})\}$ (Equation 28)
When the number Gn of mixture distributions of a noise model Mn is smaller than the number Gs of mixture distributions of a speech model Ms, for instance, when the noise model Mn is regarded as a single Gaussian distribution, the following (Equation 29) is applied.
$\{F_{sn1} - (\mu_{s,g_s} - \mu_n)\} = \{(F_{sn1} + \mu_n) - \mu_{s,g_s}\}$ (Equation 29)
Specifically, a difference between an average value μs,gs of a speech model Ms, and a value obtained by adding an average value μn of a noise model to a feature Fsn1 of a priori S/N ratio is calculated. According to this configuration, calculation of an average value of a priori S/N ratio model is unnecessary.
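The rewriting in (Equation 29) can be checked with the following sketch, which assumes scalar features and a noise model regarded as a single Gaussian distribution. The function names and sample values are hypothetical. The first function forms the average value of the a priori S/N ratio model for every speech component, whereas the second function adds the noise average value to the feature once and then uses the speech average values directly.

```python
def diff_via_combined_means(fsn1, speech_means, noise_mean):
    """Left side of Equation 29: a combined mean is formed per component."""
    return [fsn1 - (mu_s - noise_mean) for mu_s in speech_means]


def diff_via_shifted_feature(fsn1, speech_means, noise_mean):
    """Right side of Equation 29: the feature is shifted once per frame."""
    shifted = fsn1 + noise_mean
    return [shifted - mu_s for mu_s in speech_means]


# Hypothetical values: three speech component means and one noise mean.
speech_means = [2.0, 4.0, 6.0]
left = diff_via_combined_means(1.5, speech_means, 0.5)
right = diff_via_shifted_feature(1.5, speech_means, 0.5)
```

Both functions return the same differences, so the average values of the a priori S/N ratio model need not be recalculated each time the noise model is updated.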
Next, a dispersion σ2sn,g (=σ2s,gs+σ2n,gn) of a priori S/N ratio model is considered.
As a speech model Ms, for instance, a tree-structured speech model as illustrated in
Further, by retrieving a tree structure from an upper layer according to a calculation result of (Equation 13), it is not necessary to calculate a dispersion σ2sn,g of all the a priori S/N ratio models.
Further, when a dispersion σ2n,gn of noise hardly changes, it is possible to reduce the amount of calculation while keeping the accuracy of noise suppression by reducing the calculation frequency of a dispersion σ2sn,g of a priori S/N ratio model.
According to the example embodiment, it is unnecessary to prepare a noise model in advance, because a noise model Mn is generated from an input signal X0.
Further, according to the example embodiment, it is possible to use a noise model suitable for noise included in an input signal X0 by successively updating a noise model Mn. As a result, it is possible to suppress noise with high accuracy, as compared with the third example embodiment.
As another example embodiment, the noise suppression system described in the aforementioned example embodiment may be applied to a microphone unit.
Further, the present invention is applicable to a configuration, in which a noise suppression program that implements the functions of the noise suppression systems of the aforementioned example embodiments is supplied directly or remotely to a system or a device. Therefore, the present invention also provides a program to be installed in a computer, a medium storing the program, or a World Wide Web (WWW) server from which the program is downloaded, in order to implement the functions on the computer. According to the present invention, a non-transitory computer readable medium storing a program which causes a computer to execute the processing steps included in the example embodiments is provided.
The present invention is not limited to the aforementioned example embodiments, but may be configured by combining the example embodiments in various ways, for instance. Further, the present invention may be applied to a system constituted by a plurality of devices, or may be applied to a single device.
Note that each of the disclosures of the aforementioned patent literatures and non-patent literature is incorporated by reference in the present specification. The example embodiments and examples may be modified/adjusted within the scope of all the disclosures of the present invention (including the claims), and based on the basic technical idea of the present invention. Further, a variety of combinations and selections of various disclosure elements (including the elements of the claims, the elements of the examples, the elements of the drawings and the like) are available within the scope of the claims of the present invention. Specifically, it is needless to say that the present invention includes various modifications and amendments, which could have been achieved by a person skilled in the art according to all the disclosures including the claims, and the technical idea.
This application claims priority based on Japanese Patent Application No. 2014-145753 filed on Jul. 16, 2014, the entire disclosure of which is hereby incorporated.
Number | Date | Country | Kind |
---|---|---|---|
2014-145753 | Jul 2014 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2015/003604 | 7/16/2015 | WO | 00 |