The present invention relates to a technology which suppresses an echo in audio.
In the above-mentioned technical field, a technology for suppressing an echo is known, as shown in patent document 1. This technology generates an artificial linear echo signal from an output audio signal (far-end signal) by using an adaptive filter, suppresses a linear echo component in an input audio signal, and further suppresses a non-linear echo component. In particular, by using the artificial linear echo signal to estimate the non-linear echo signal mixed into the input audio signal, a near-end audio signal is extracted relatively clearly from the input audio signal.
However, an echo generated in a stereophonic audio output cannot be appropriately suppressed by the technology described in patent document 1.
This is because the echo suppression device described in patent document 1 does not assume that two or more output audio signals (the far-end signal in patent document 1) exist for the input audio signal.
An object of the present invention is to provide a technology which solves the above-mentioned problem.
An audio processing device according to one aspect of the present invention includes
first audio output means for outputting first audio based on a first output audio signal,
second audio output means for outputting second audio based on a second output audio signal,
audio input means for inputting audio and outputting an input audio signal,
first artificial linear echo generation means for generating a first artificial linear echo signal estimated to be generated by the first audio travelling to the audio input means from the first output audio signal and outputting it,
second artificial linear echo generation means for generating a second artificial linear echo signal estimated to be generated by the second audio travelling to the audio input means from the second output audio signal and outputting it,
linear echo suppression means for generating a signal in which a linear echo signal mixed to the input audio signal is suppressed based on the outputs of the first artificial linear echo generation means and the second artificial linear echo generation means and outputting it,
non-linear echo estimation means for estimating a non-linear echo signal based on the first artificial linear echo signal and the second artificial linear echo signal, and
non-linear echo suppression means for suppressing the signal outputted by the linear echo suppression means based on the non-linear echo signal estimated by the non-linear echo estimation means.
An audio processing method according to one aspect of the present invention includes
an audio input step in which first audio and second audio that are outputted by two audio output means based on a first output audio signal and a second output audio signal are inputted by audio input means and an input audio signal is outputted,
a first artificial linear echo generation step in which a first artificial linear echo signal estimated to be generated by the first audio travelling to the audio input means is generated from the first output audio signal and outputted,
a second artificial linear echo generation step in which a second artificial linear echo signal estimated to be generated by the second audio travelling to the audio input means is generated from the second output audio signal and outputted,
a linear echo suppression step in which a signal in which a linear echo signal mixed to the input audio signal is suppressed is generated based on the first artificial linear echo signal and the second artificial linear echo signal and outputted,
a non-linear echo estimation step in which a non-linear echo signal is estimated based on the first artificial linear echo signal and the second artificial linear echo signal, and
a non-linear echo suppression step in which the signal outputted in the linear echo suppression step is suppressed based on the non-linear echo signal estimated in the non-linear echo estimation step.
A non-transitory medium according to one aspect of the present invention records an audio processing program that causes a computer to perform:
an audio input step in which first audio and second audio that are outputted by two audio output means based on a first output audio signal and a second output audio signal are inputted by audio input means and an input audio signal is outputted,
a first artificial linear echo generation step in which a first artificial linear echo signal estimated to be generated by the first audio travelling to the audio input means is generated from the first output audio signal and outputted,
a second artificial linear echo generation step in which a second artificial linear echo signal estimated to be generated by the second audio travelling to the audio input means is generated from the second output audio signal and outputted,
a linear echo suppression step in which a signal in which a linear echo signal mixed to the input audio signal is suppressed based on the first artificial linear echo signal and the second artificial linear echo signal is generated and outputted,
a non-linear echo estimation step in which a non-linear echo signal is estimated based on the first artificial linear echo signal and the second artificial linear echo signal, and
a non-linear echo suppression step in which the signal outputted in the linear echo suppression step is suppressed based on the non-linear echo signal estimated in the non-linear echo estimation step.
By using the present invention, the echo generated in a stereophonic audio output can be appropriately suppressed.
Exemplary embodiments of the present invention will be described in detail below with reference to the drawings. However, the components described in the following exemplary embodiments are shown only as examples. Therefore, the technical scope of the present invention is not limited to those descriptions.
First Exemplary Embodiment
An audio processing device 100 according to a first exemplary embodiment of the present invention will be described by using
As shown in
Among these units, the first audio output unit 101 and the second audio output unit 102 output audios that correspond to a first output audio signal and a second output audio signal, respectively.
Audio is inputted to the audio input unit 103.
The first artificial linear echo generation unit 104 generates a first artificial linear echo signal based on the first output audio signal sent to the first audio output unit 101 and outputs it.
The second artificial linear echo generation unit 105 generates a second artificial linear echo signal based on the second output audio signal sent to the second audio output unit 102 and outputs it.
The linear echo suppression unit 106 suppresses a linear echo signal mixed into an input audio signal based on the first artificial linear echo signal and the second artificial linear echo signal and outputs the resulting signal.
The non-linear echo estimation unit 107 estimates the non-linear echo signal based on the first artificial linear echo signal and the second artificial linear echo signal and outputs it.
The non-linear echo suppression unit 108 suppresses, based on the result of the estimation of the non-linear echo signal, the non-linear echo signal remaining in the input audio signal in which the linear echo signal has been suppressed, and outputs the resulting signal.
By using the above-mentioned configuration, the echo generated by a device having two audio output means, that is, a device with a stereophonic audio output, can be appropriately suppressed.
This is because the following configuration is included. First, the first artificial linear echo generation unit 104 and the second artificial linear echo generation unit 105 generate the first artificial linear echo signal and the second artificial linear echo signal based on the first output audio signal and the second output audio signal, respectively, and output them. Second, the linear echo suppression unit 106 suppresses the linear echo signal mixed into the input audio signal based on the first artificial linear echo signal and the second artificial linear echo signal. Third, the non-linear echo estimation unit 107 estimates the non-linear echo signal based on the first artificial linear echo signal and the second artificial linear echo signal, and the non-linear echo suppression unit 108 suppresses the non-linear echo signal and outputs the result.
Next, an audio processing device 200 according to a second exemplary embodiment of the present invention will be described by using
As shown in
Further, the audio processing device 200 includes an adaptive filter 214, an adaptive filter 224, and an addition unit 205. The adaptive filters 214 and 224 receive the first output signal xR(k) and the second output signal xL(k), respectively, generate artificial linear echo signals, and output them. The addition unit 205 adds the artificial linear echo signals outputted by the adaptive filter 214 and the adaptive filter 224 and outputs the sum as a combined artificial linear echo signal.
Further, the audio processing device 200 includes a linear echo canceller 206, a non-linear echo estimation unit 207, a flooring unit 208, and a non-linear echo suppressor 209. The combined artificial linear echo signal generated by the addition unit 205 is supplied to both of the linear echo canceller 206 and the non-linear echo estimation unit 207.
The linear echo canceller 206 subtracts the artificial linear echo signal combined by the addition unit 205 from a mixed signal P(k) and outputs the result. On the other hand, the non-linear echo estimation unit 207 estimates a non-linear echo signal based on the artificial linear echo signal combined by the addition unit 205. The flooring unit 208 applies a flooring process to the non-linear echo signal estimated by the non-linear echo estimation unit 207 and outputs a flooring result. The non-linear echo suppressor 209 suppresses the non-linear echo signal in the output signal of the linear echo canceller 206 by gain control based on the flooring result and outputs the result.
The above-mentioned configuration is conceived based on a new idea in which the influence of the echoes caused by two speakers is regarded as the influence of a linear echo caused by one speaker and is suppressed. As a result, the echoes caused by two speakers can be suppressed by using a very simple configuration.
Next, the circuit configuration of the audio processing device 200 will be explained by using
As explained by using
The addition unit 205 adds the generated artificial linear echo signals and generates the combined artificial linear echo signal.
A subtractor serving as the linear echo canceller 206 subtracts the combined artificial linear echo signal from the input audio signal outputted by the microphone 203, generates a residual signal d(k), and outputs it.
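The signal path described above (two adaptive filters, the addition unit 205, and the subtractor) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the circuit of the exemplary embodiment: the text specifies an "adaptive filter" without naming an update rule, so the normalized LMS (NLMS) update, the tap count, the step size, and all identifiers below are assumptions introduced for illustration.

```python
from typing import List, Tuple


def stereo_linear_echo_cancel(
    x_r: List[float],
    x_l: List[float],
    p: List[float],
    taps: int = 8,
    mu: float = 0.5,
    eps: float = 1e-8,
) -> Tuple[List[float], List[float]]:
    """Return (d, y): the residual signal d(k) and the combined echo replica y(k).

    NLMS is an assumed update rule; the text only says "adaptive filter".
    """
    w_r = [0.0] * taps    # coefficients standing in for adaptive filter 214
    w_l = [0.0] * taps    # coefficients standing in for adaptive filter 224
    buf_r = [0.0] * taps  # most recent samples of xR(k), newest first
    buf_l = [0.0] * taps  # most recent samples of xL(k), newest first
    d_out: List[float] = []
    y_out: List[float] = []
    for k in range(len(p)):
        buf_r = [x_r[k]] + buf_r[:-1]
        buf_l = [x_l[k]] + buf_l[:-1]
        y_r = sum(w * x for w, x in zip(w_r, buf_r))  # first artificial linear echo
        y_l = sum(w * x for w, x in zip(w_l, buf_l))  # second artificial linear echo
        y = y_r + y_l                                 # addition unit 205
        d = p[k] - y                                  # subtractor (linear echo canceller 206)
        # Both filters are updated from the same residual, normalized by the
        # total input power of the two channels (assumed joint NLMS update).
        norm = sum(x * x for x in buf_r) + sum(x * x for x in buf_l) + eps
        step = mu * d / norm
        w_r = [w + step * x for w, x in zip(w_r, buf_r)]
        w_l = [w + step * x for w, x in zip(w_l, buf_l)]
        d_out.append(d)
        y_out.append(y)
    return d_out, y_out
```

Because both filters share one residual and one summed replica, everything downstream of the subtractor only ever sees a single signal pair, d(k) and y(k), which is what allows the later stages to be shared between the two channels.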
The residual signal d(k) is inputted to a fast Fourier transform (FFT) unit 301 and a combined artificial linear echo signal y(k) is inputted to a fast Fourier transform unit 302.
The audio processing device 200 further includes the fast Fourier transform unit 301, the fast Fourier transform unit 302, the non-linear echo estimation unit 207, the flooring unit 208, the non-linear echo suppressor 209, and an inverse fast Fourier transform (IFFT) unit 306.
The fast Fourier transform units 301 and 302 convert the residual signal d(k) and the artificial linear echo signal y(k) into frequency spectrums, respectively.
The non-linear echo estimation unit 207, the flooring unit 208, and the non-linear echo suppressor 209 are provided for each frequency component.
The inverse fast Fourier transform unit 306 integrates an amplitude spectrum derived for each frequency component and a corresponding phase, performs an inverse fast Fourier transform, and recombines the result to form an output signal zi(k) in a time domain. That is, the output signal zi(k) in the time domain is a signal having an audio waveform sent to a communication partner.
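The recombination of a per-frequency amplitude with a corresponding phase can be illustrated with the following sketch. It assumes that the "corresponding phase" is the phase of the residual spectrum Di(m), which the text does not state explicitly, and it uses a naive discrete Fourier transform in place of an FFT so that the example needs only the standard library. All function names are hypothetical.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive DFT, standing in for the fast Fourier transform unit 301."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * i * k / n) for k in range(n))
        for i in range(n)
    ]


def idft_real(spec: List[complex]) -> List[float]:
    """Naive inverse DFT returning the real part (stand-in for IFFT unit 306)."""
    n = len(spec)
    return [
        sum(spec[i] * cmath.exp(2j * cmath.pi * i * k / n) for i in range(n)).real / n
        for k in range(n)
    ]


def resynthesize(d_frame: List[float], gains: List[float]) -> List[float]:
    """Scale each bin amplitude by its gain, keep the phase of D_i(m), and invert."""
    spec = dft(d_frame)
    shaped = [cmath.rect(g * abs(c), cmath.phase(c)) for g, c in zip(gains, spec)]
    return idft_real(shaped)
```

With all gains equal to 1 the frame is reproduced unchanged, and a uniform gain of 0.5 halves every sample, which is a quick way to check the amplitude and phase bookkeeping.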
Although the waveform of the linear echo signal is completely different from that of the non-linear echo signal, with respect to the spectral amplitude for each frequency, there is a correlation between the amplitudes of the both signals. Namely, when the amplitude of the artificial linear echo signal is large, the amplitude of the non-linear echo signal is large. In other words, an amount of the non-linear echo signal can be estimated based on the artificial linear echo signal.
Accordingly, the non-linear echo estimation unit 207 estimates the spectral amplitude of the desired audio signal based on the estimated amount of the non-linear echo signal. Because the estimated spectral amplitude of the audio signal contains an error, the flooring unit 208 performs a flooring process so that the estimation error does not cause subjective discomfort.
For example, when the estimated spectral amplitude of the audio signal is excessively small and smaller than the spectral amplitude of the background noise, the signal level varies according to the presence or absence of an echo, which sounds unnatural. As a countermeasure against this, the flooring unit 208 estimates the level of the background noise and uses it as a lower limit of the estimated spectral amplitude to reduce the level variation.
On the other hand, when a large residual echo remains in the estimated spectral amplitude because of the estimation error, the residual echo changes intermittently and rapidly and becomes an artificial additional sound called musical noise. As a countermeasure against this, the non-linear echo suppressor 209 does not eliminate the echo by subtracting the estimated non-linear echo signal. Instead, it functions as a spectral gain calculation unit which multiplies by a gain so as to obtain an amplitude that is approximately equal to the amplitude that would be obtained by the subtraction. By performing a smoothing process to prevent a sudden gain change, an intermittent change of the residual echo can be suppressed.
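The smoothing of the gain mentioned above can be illustrated as follows. The text does not specify the form of the smoothing, so this sketch assumes a first-order recursive average over frames for a single frequency component; the function name and the smoothing constant `beta` are hypothetical.

```python
from typing import List


def smooth_gains(raw_gains: List[float], beta: float = 0.8) -> List[float]:
    """First-order recursive average of a per-frame gain sequence for one bin.

    A larger beta (closer to 1) gives slower, smoother gain changes.
    """
    smoothed: List[float] = []
    g = raw_gains[0] if raw_gains else 0.0  # start from the first raw gain
    for raw in raw_gains:
        g = beta * g + (1.0 - beta) * raw
        smoothed.append(g)
    return smoothed
```

For the raw sequence [1, 1, 0, 1, 1] and `beta = 0.8`, the isolated drop to 0 is softened to about 0.8, so a one-frame estimation error no longer switches the residual echo on and off abruptly.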
Hereinafter, the internal configuration of the non-linear echo estimation unit 207, the flooring unit 208, and the non-linear echo suppressor 209 will be described by using a mathematical expression.
The residual signal d(k) inputted to the fast Fourier transform unit 301 is a sum of a near-end signal s(k) and a residual non-linear echo signal q(k).
d(k)=s(k)+q(k) (1)
It is assumed that the linear echo is almost completely eliminated by the adaptive filter 214, the adaptive filter 224, and the subtractor (the linear echo canceller 206). Only a non-linear component is considered in a frequency domain. By the fast Fourier transform units 301 and 302, equation (1) is converted into the following equation in frequency domain.
D(m)=S(m)+Q(m) (2)
Here, m is a frame number and the vectors D(m), S(m), and Q(m) are the frequency-domain representations of d(k), s(k), and q(k), respectively. It is assumed that each frequency is independent. By transforming equation (2), the following expression is obtained at the i-th frequency.
Si(m)=Di(m)−Qi(m) (3)
Because the adaptive filter 214, the adaptive filter 224, and the subtractor (the linear echo canceller 206) remove the correlation, there is hardly a correlation between Di(m) and Yi(m). Accordingly, the subtractor 276 performs a calculation of $\overline{|S_i(m)|^2}$ as follows.
$\overline{|S_i(m)|^2} = \overline{|D_i(m)|^2} - \overline{|Q_i(m)|^2}$ (4)
$\overline{|D_i(m)|^2}$ is derived from Di(m) by using an absolute value obtaining circuit 271 and an averaging circuit 273.
On the other hand, the non-linear echo signal |Qi(m)| can be modeled as a product of a regression coefficient ai and an average echo replica $\overline{|Y_i(m)|}$ as follows.
$|Q_i(m)| = a_i \cdot \overline{|Y_i(m)|}$
Accordingly, the absolute value obtaining circuit 272 and the averaging circuit 274 derive the average echo replica from Yi(m) and an integration unit 275 multiplies it by the regression coefficient ai. Here, the regression coefficient ai is a regression coefficient indicating a correlation between |Qi(m)| and |Yi(m)|. This model is based on an experimental result showing that there is a significant correlation between |Qi(m)| and |Yi(m)|.
Equation (3) is an additive model that is widely used for a noise suppression. In the spectral shaping shown in
A square root of equation (6) is taken, a mean square of equation (3) is taken, and $a_i^2 \cdot \overline{|Y_i(m)|}^2$ is substituted for $\overline{|Q_i(m)|^2}$ in equation (4). By performing this process, the estimation value $|\tilde{S}_i(m)|$ of |Si(m)| may be obtained as follows.
$|\tilde{S}_i(m)| = \sqrt{\overline{|D_i(m)|^2} - a_i^2 \cdot \overline{|Y_i(m)|}^2}$
By performing this method, the non-linear echo signal can be suppressed more effectively.
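As a small numerical illustration of this estimate, the following sketch computes the square root of the averaged residual power minus the modeled non-linear echo power for one frequency component. The clamp at zero, which guards against a negative value under the square root when over-subtraction occurs, is an assumption added here and is not described in the text. The function and parameter names are hypothetical.

```python
import math


def estimate_near_end_amplitude(mean_sq_d: float, mean_abs_y: float, a_i: float) -> float:
    """Return sqrt(mean|D|^2 - (a_i * mean|Y|)^2), clamped at zero.

    mean_sq_d  : averaged squared amplitude of the residual spectrum D_i(m)
    mean_abs_y : averaged amplitude of the echo replica spectrum Y_i(m)
    a_i        : regression coefficient for this frequency component
    """
    power = mean_sq_d - (a_i * mean_abs_y) ** 2
    return math.sqrt(max(power, 0.0))  # clamp is an assumption, not from the text
```

For example, with an averaged residual power of 25, an averaged replica amplitude of 3, and a coefficient of 1, the estimate is sqrt(25 - 9) = 4.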
Because the model is not elaborate, the estimated amplitude $|\tilde{S}_i(m)|$ has a non-negligible error. When the error is large and an over-subtraction occurs, a high-frequency component of the near-end signal decreases or a feeling of modulation occurs. In particular, when the near-end signal is constantly generated like a sound of an air conditioner, the feeling of modulation is uncomfortable. In order to reduce the feeling of modulation subjectively, the flooring on a spectrum is used by the flooring unit 208.
First, in the flooring unit 208, the averaging circuit 281 estimates a stationary component |Ni(m)| of the near-end signal Di(m). Next, a maximum value selection circuit 282 uses the stationary component |Ni(m)| as a lower limit and performs the flooring. As a result, an amplitude estimation value $|\hat{S}_i(m)|$ of the near-end signal that is better estimated can be obtained. After that, a divider 291 calculates a ratio of $|\hat{S}_i(m)|$ to the amplitude of the residual signal Di(m). Further, an averaging circuit 292 performs an averaging of this ratio and obtains the spectral gain Gi(m).
Finally, as shown in mathematical expression (5), an integrator 293 calculates the product of the spectral gain Gi(m) and the residual signal |Di(m)|. By performing this process, the amplitude |Zi(m)| can be obtained as the output signal. The inverse fast Fourier transform unit 306 performs an inverse Fourier transform of the amplitude |Zi(m)| and outputs the audio signal zi(k) in which the non-linear echo is effectively suppressed.
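The flooring, the gain calculation, and the final multiplication for one frequency component can be sketched as follows. This is an illustration under assumptions: the ratio is taken against the instantaneous residual amplitude |Di(m)|, the gain is capped at 1, and the averaging by the averaging circuit 292 is omitted for brevity. None of these choices is confirmed by the text, and the function names are hypothetical.

```python
def floored_gain(s_est: float, noise_floor: float, d_abs: float, eps: float = 1e-12) -> float:
    """Spectral gain for one bin from a floored near-end amplitude estimate."""
    s_floored = max(s_est, noise_floor)         # maximum value selection circuit 282
    return min(s_floored / (d_abs + eps), 1.0)  # divider 291; the cap at 1 is an assumption


def suppress_amplitude(s_est: float, noise_floor: float, d_abs: float) -> float:
    """Output amplitude |Z_i(m)| as the product of the gain and |D_i(m)|."""
    return floored_gain(s_est, noise_floor, d_abs) * d_abs
```

When the estimate (0.1) falls below the stationary component (0.5), the output amplitude is held at the noise level instead of dropping toward zero, which is the level variation that the flooring is meant to avoid.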
The regression coefficient ai can be estimated from the input to the microphone 203 when an audio is outputted from the speaker. As disclosed in republication 2009/051197, the regression coefficient may be updated according to the status.
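One possible way to obtain such a coefficient is sketched below. The text does not specify the estimation procedure, so this sketch assumes a least-squares fit through the origin between the residual amplitudes and the echo replica amplitudes, collected for one frequency component while only the far-end audio is active (so that the residual is dominated by the non-linear echo). The function name and this fitting rule are assumptions for illustration only.

```python
from typing import Sequence


def estimate_regression_coefficient(
    abs_d: Sequence[float], abs_y: Sequence[float], eps: float = 1e-12
) -> float:
    """Least-squares slope through the origin for |D_i(m)| ~ a_i * |Y_i(m)|.

    abs_d and abs_y hold per-frame amplitudes for a single frequency bin,
    gathered during far-end-only periods (assumed to be detected elsewhere).
    """
    num = sum(d * y for d, y in zip(abs_d, abs_y))
    den = sum(y * y for y in abs_y)
    return num / (den + eps)
```

For replica amplitudes [1, 2, 3] and residual amplitudes [0.5, 1.0, 1.5], the fitted coefficient is 0.5.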
By using the above-mentioned configuration, the linear echo signal and the non-linear echo signal caused by two speakers 201 and 202 can be effectively suppressed.
This is because the echo is suppressed by the linear echo canceller 206, the fast Fourier transform unit 301, the fast Fourier transform unit 302, the non-linear echo estimation unit 207, the flooring unit 208, the non-linear echo suppressor 209, and the inverse fast Fourier transform unit 306 based on the combined artificial linear echo signal obtained by combining the outputs of the adaptive filter 214 and the adaptive filter 224.
Further, when the above-mentioned configuration is used, a circuit design can be efficiently performed.
This is because, with respect to the first output signal xR(k) and the second output signal xL(k) sent to two speakers, the linear echo canceller 206, the fast Fourier transform unit 301, the fast Fourier transform unit 302, the non-linear echo estimation unit 207, the flooring unit 208, the non-linear echo suppressor 209, and the inverse fast Fourier transform unit 306 are shared.
Next, an audio processing device 400 according to a third exemplary embodiment of the present invention will be described by using
As compared with the audio processing device 200 according to the second exemplary embodiment, the audio processing device 400 according to the third exemplary embodiment is different in the respect that it does not include the non-linear echo estimation unit 207 but includes a non-linear echo estimation unit 417 and a non-linear echo estimation unit 427.
The non-linear echo estimation unit 417 functions as first non-linear echo estimation means that estimates a first non-linear echo signal from the first artificial linear echo signal, and the non-linear echo estimation unit 427 functions as second non-linear echo estimation means that estimates a second non-linear echo signal from the second artificial linear echo signal. The configuration and the operation of the audio processing device 400 according to the third exemplary embodiment are the same as those of the audio processing device 200 according to the second exemplary embodiment excluding the above-mentioned points.
Therefore, the same reference numbers are used for the components having the same configuration and operation as the second exemplary embodiment and the detailed explanation of these components is omitted.
The audio processing device 400 includes the fast Fourier transform unit 301, a fast Fourier transform unit 502, and a fast Fourier transform unit 503. Further, the audio processing device 400 includes a non-linear echo estimation unit 507, a non-linear echo estimation unit 508, the flooring unit 208, the non-linear echo suppressor 209, and the inverse fast Fourier transform unit 306.
The fast Fourier transform unit 301 converts the residual signal d(k) into a frequency spectrum Di(m). The fast Fourier transform unit 502 and the fast Fourier transform unit 503 convert two artificial linear echo signals y1(k) and y2(k) into frequency spectrums Yi1(m) and Yi2(m), respectively.
The non-linear echo estimation unit 507, the non-linear echo estimation unit 508, the flooring unit 208, and the non-linear echo suppressor 209 are provided for each frequency component.
The inverse fast Fourier transform unit 306 integrates an amplitude spectrum derived for each frequency component and a corresponding phase, performs an inverse fast Fourier transform, and recombines the result to form the output signal zi(k) in the time domain. That is, the output signal zi(k) in the time domain is a signal having an audio waveform that is sent to a communication partner.
The non-linear echo estimation units 507 and 508 estimate a spectral amplitude of a desired audio signal based on an estimated amount of a non-linear echo signal.
Because the adaptive filter 214, the adaptive filter 224, and the subtractor (the linear echo canceller 206) remove the correlation, there is hardly a correlation between Di(m) and Yi(m). Accordingly, $\overline{|S_i(m)|^2}$ can be obtained by the subtractor 276 as follows.
The non-linear echo signals |Qi1(m)| and |Qi2(m)| can each be modeled as a product of one of the regression coefficients ai1 and ai2 and the corresponding one of the average echo replicas $\overline{|Y_{i1}(m)|}$ and $\overline{|Y_{i2}(m)|}$ as follows.
$|Q_{i1}(m)| = a_{i1} \cdot \overline{|Y_{i1}(m)|}$
$|Q_{i2}(m)| = a_{i2} \cdot \overline{|Y_{i2}(m)|}$
Accordingly, an absolute value obtaining circuit 572 and an averaging circuit 574 derive the average echo replica $\overline{|Y_{i1}(m)|}$ from Yi1(m) and an integration unit 575 performs multiplication of the regression coefficient ai1. Further, an absolute value obtaining circuit 582 and an averaging circuit 584 derive the average echo replica $\overline{|Y_{i2}(m)|}$ from Yi2(m) and an integration unit 585 performs multiplication of the regression coefficient ai2.
On the other hand, the estimation value $|\tilde{S}_i(m)|$ of |Si(m)| may be obtained as follows. By performing this process, the non-linear echo signal can be suppressed more effectively.
In order to reduce the feeling of modulation subjectively, the flooring on the spectrum is performed by the flooring unit 208. The integrator 293 calculates the product of the spectral gain Gi(m) and the residual signal |Di(m)| and outputs the amplitude |Zi(m)| as the output signal. The inverse fast Fourier transform unit 306 performs an inverse Fourier transform of the amplitude |Zi(m)| and outputs the audio signal zi(k) in which the non-linear echo is effectively suppressed.
The regression coefficients ai1 and ai2 can be individually estimated from the input of the microphone 203 when the audio is individually outputted from one of the speakers 201 and 202. As disclosed in republication 2009/051197, the regression coefficient may be updated according to the status.
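To illustrate how two individually estimated regression coefficients could be used together, the following sketch extends the single-channel amplitude estimate to two echo replicas. The way the two modeled non-linear echo contributions are combined is not recoverable from the text as given here; this sketch assumes that they add in power, and it also assumes a clamp at zero. Both assumptions, and all names, are introduced only for illustration.

```python
import math


def estimate_near_end_amplitude_two_replicas(
    mean_sq_d: float,
    mean_abs_y1: float,
    mean_abs_y2: float,
    a_i1: float,
    a_i2: float,
) -> float:
    """Near-end amplitude estimate for one bin using two echo replicas.

    Assumes the two modeled non-linear echo components add in power,
    which is an illustrative assumption rather than the stated method.
    """
    q1 = a_i1 * mean_abs_y1  # modeled |Q_i1(m)|
    q2 = a_i2 * mean_abs_y2  # modeled |Q_i2(m)|
    power = mean_sq_d - q1 * q1 - q2 * q2
    return math.sqrt(max(power, 0.0))  # clamp guards against over-subtraction
```

Keeping a separate coefficient per speaker lets the estimate follow cases in which the two speakers distort differently, at the cost of one additional transform and estimation path.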
By using the above-mentioned configuration, the third exemplary embodiment can obtain the effect that is the same as that of the second exemplary embodiment.
This is because the non-linear echo estimation unit 417 and the non-linear echo estimation unit 427 are included instead of the non-linear echo estimation unit 207.
The exemplary embodiment of the present invention has been described in detail above. However, a system or a device in which the different features included in the respective exemplary embodiments are arbitrarily combined is also included in the scope of the present invention.
Further, the present invention may be applied to a system composed of a plurality of devices and it may be applied to a stand-alone device. Furthermore, the present invention can be applied to a case in which an information processing program which realizes the function of the exemplary embodiment is directly or remotely supplied to the system or the device.
Accordingly, a program installed in a computer to realize the function of the present invention by the computer, a medium storing the program, and a WWW (World Wide Web) server which downloads the program are also included in the scope of the present invention.
Hereinafter, as an example, in a case in which the audio process described in the second exemplary embodiment is realized by software, a flow of this process executed by a CPU (Central Processing Unit) 602 provided in a computer 600 will be described by using
First, the CPU 602 inputs, from the microphone 203, a first audio and a second audio outputted from the two speakers 201 and 202 based on a first output audio signal and a second output audio signal, and outputs an input audio signal (S601).
The CPU 602 generates a first artificial linear echo signal estimated to be generated by an audio travelling from the speaker 201 to the microphone 203 from the first output audio signal (S603).
The CPU 602 generates a second artificial linear echo signal estimated to be generated by an audio travelling from the speaker 202 to the microphone 203 from the second output audio signal (S605).
The CPU 602 suppresses a linear echo signal mixed to the input audio signal based on the first artificial linear echo signal and the second artificial linear echo signal (S607).
The CPU 602 estimates the non-linear echo signal based on the first artificial linear echo signal and the second artificial linear echo signal (S609). The CPU 602 suppresses the estimated non-linear echo signal (S611).
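The order of steps S603 to S611 can be summarized in the following sketch, which takes the input audio signal of step S601 as an argument. It is deliberately simplified and is not the processing of the exemplary embodiment: fixed filter coefficients stand in for converged adaptive filters, and the non-linear echo is estimated and suppressed over the whole frame in the time domain rather than per frequency component. All names, and the single broadband coefficient `a`, are hypothetical.

```python
import math
from typing import List, Sequence


def fir(x: Sequence[float], h: Sequence[float]) -> List[float]:
    """Filter x with fixed coefficients h (stand-in for a converged adaptive filter)."""
    return [
        sum(h[j] * x[k - j] for j in range(len(h)) if k - j >= 0)
        for k in range(len(x))
    ]


def process_frame(
    x1: Sequence[float],
    x2: Sequence[float],
    mic: Sequence[float],
    h1: Sequence[float],
    h2: Sequence[float],
    a: float,
) -> List[float]:
    """Run S603 to S611 on one frame; mic is the input audio signal from S601."""
    y1 = fir(x1, h1)                                 # S603: first artificial linear echo
    y2 = fir(x2, h2)                                 # S605: second artificial linear echo
    d = [p - u - v for p, u, v in zip(mic, y1, y2)]  # S607: linear echo suppression
    # S609: broadband non-linear echo level from both replicas (simplification).
    mean_abs_y = sum(abs(u + v) for u, v in zip(y1, y2)) / len(mic)
    q = a * mean_abs_y
    # S611: suppress by a gain rather than by subtraction.
    mean_sq_d = sum(v * v for v in d) / len(d)
    s_est = math.sqrt(max(mean_sq_d - q * q, 0.0))
    gain = min(s_est / (math.sqrt(mean_sq_d) + 1e-12), 1.0)
    return [gain * v for v in d]
```

With `a = 0` the function reduces to linear echo cancellation alone, which makes it easy to check the linear part separately from the gain-based suppression.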
By performing the above mentioned processes, this exemplary embodiment can obtain the effect that is the same as that of the second exemplary embodiment.
Further, an input unit 601 may include the audio input unit 103 and the microphone 203. An output unit 603 may include the first audio output unit 101, the second audio output unit 102, the speaker 201, and the speaker 202. A memory 604 stores information. When the CPU 602 performs the operation of each step, the CPU 602 writes the required information into the memory 604 and reads out the required information from the memory 604.
The recording medium 707, which records a code of the above-mentioned program (software), may be supplied to the computer 600, and the CPU 602 may read and execute the code of the program stored in the recording medium 707. Alternatively, the CPU 602 may store the code of the program recorded in the recording medium 707 in the memory 604. That is, the exemplary embodiment includes an exemplary embodiment of the recording medium 707 that records, temporarily or non-temporarily, the program executed by the computer 600 (CPU 602).
While the present invention has been described with reference to the exemplary embodiment, the present invention is not limited to the above-mentioned exemplary embodiment. Various changes, which a person skilled in the art can understand, can be added to the composition and the details of the invention of the present application in the scope of the invention of the present application.
This application claims priority from Japanese Patent Application No. 2011-112078 filed on May 19, 2011, the disclosure of which is hereby incorporated by reference in its entirety.
100 audio processing device
101 first audio output unit
102 second audio output unit
103 audio input unit
104 first artificial linear echo generation unit
105 second artificial linear echo generation unit
106 linear echo suppression unit
107 non-linear echo estimation unit
108 non-linear echo suppression unit
200 audio processing device
201 speaker
202 speaker
203 microphone
205 addition unit
206 linear echo canceller
207 non-linear echo estimation unit
208 flooring unit
209 non-linear echo suppressor
214 adaptive filter
224 adaptive filter
271 absolute value obtaining circuit
272 absolute value obtaining circuit
273 averaging circuit
274 averaging circuit
275 integration unit
276 subtractor
281 averaging circuit
282 maximum value selection circuit
291 divider
292 averaging circuit
293 integrator
301 fast Fourier transform unit
302 fast Fourier transform unit
306 inverse fast Fourier transform unit
400 audio processing device
417 non-linear echo estimation unit
427 non-linear echo estimation unit
502 fast Fourier transform unit
503 fast Fourier transform unit
507 non-linear echo estimation unit
508 non-linear echo estimation unit
572 absolute value obtaining circuit
574 averaging circuit
575 integration unit
582 absolute value obtaining circuit
584 averaging circuit
585 integration unit
600 computer
602 CPU
707 recording medium
Number | Date | Country | Kind |
---|---|---|---|
2011-112078 | May 2011 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2012/063408 | 5/18/2012 | WO | 00 | 11/4/2013 |