This application claims priority under 35 U.S.C. §119(a) to an application entitled “Apparatus and Method for Beamforming in Reflection of Actual Noise Environment Character” filed in the Korean Industrial Property Office on Feb. 7, 2007 and assigned Serial No. 2007-0012803, the contents of which are hereby incorporated by reference.
1. Field of the Invention
The present invention relates to a beamforming apparatus and a beamforming method, and more particularly to an apparatus and a method for performing beamforming for an input signal in consideration of an actual noise environment character.
2. Description of the Related Art
In general, a microphone is a transducer that converts acoustic signals conveyed through air vibration into electrical signals. With the recent development of robot control technologies, the microphone has come to be used as a robot audio interface, i.e. a means for communication between a robot and a user. The robot converts speech signals, which are input through the microphone used as the robot audio interface, into electrical signals and analyzes the converted data, thereby recognizing the user's speech. Beyond robots, speech recognition apparatuses that provide a speech recognition service through a built-in microphone are increasingly being developed.
When such a speech recognition apparatus receives specific speech signals, if a microphone of the apparatus is arranged to have directivity toward the direction from which the speech signals are input, the speech recognition apparatus can prevent the input of noise occurring in the surrounding environment. A single microphone having a high directivity can also be directed toward the direction from which specific speech signals are input. However, when a microphone array is formed by arranging a number of microphones instead of one microphone, it is possible to freely obtain a directivity characteristic suitable for the user's purposes. Therefore, it is common for a speech recognition apparatus to be equipped with a microphone array as its audio interface.
Meanwhile, when a software process is performed to eliminate noise from speech signals input through a microphone array, beams are formed from the microphone array toward a specific direction according to the software process. A beamforming technology is used to form such beams so that the microphone array achieves a high directivity toward a desired direction.
If a high directivity is formed toward the direction from which the user's speech is input through the above-described beamforming, speech signals input from outside the beam are attenuated. Therefore, it is possible to selectively acquire speech signals input from the direction of interest. The microphone array can suppress surrounding noise, such as noise from an indoor computer fan or television sounds, as well as part of the reverberation reflected from objects such as furniture and walls. That is, by using the beamforming technology, the microphone array can acquire a higher Signal to Noise Ratio (SNR) for speech signals arriving within the beam of the direction of interest. Therefore, beamforming points a beam at a sound source and plays an important role in spatial filtering, which suppresses signals input from other directions.
A beamformer performing beamforming for such input signals is effective when it has consistent performance over the whole frequency range. In this case, a beamformer using a Minimum Variance Distortionless Response (MVDR) algorithm is generally used in a noise environment having a stationary character.
A construction by which a beamformer using an MVDR algorithm performs a beamforming operation and outputs a noise-eliminated signal will be described with reference to
First, when speech signals on the time domain input through the microphone array 100 are transformed into signals on the frequency domain, and the resultant signals are input to the beamforming unit 110, the beamforming unit 110 can derive output values using Equation (1) below.
In Equation (1), N denotes the number of microphones constituting the microphone array 100, and Xi(ω) represents the frequency-domain input signal of the ith microphone from among the N microphones. Also, the filter factor Wi of Equation (1) is determined by the model used to define the noise environment.
The MVDR algorithm based on a minimum variance solution is widely used as an algorithm for performing beamforming so as to suppress noise from all directions except for a desired direction of input signals in the microphone array 100.
A filter factor value ‘W’ for performing beamforming through such an MVDR algorithm is defined by Equation (2) below.
In Equation (2), d is a vector that determines the look direction so that the microphone array 100 is oriented toward a sound source. In a Uniform Linear microphone Array (ULA), in which adjacent microphones are arranged at the same distance, d can be expressed as defined by Equation (3) below.
d = [d_1 d_2 . . . d_n]^T (3)
In Equations (2) and (3),
c represents the speed of sound, n represents a serial number of a corresponding microphone, d represents distance between microphones, and θ represents an angle of incident speech signals with respect to the array. Γ represents a coherence matrix, which can be expressed by Equation (4) below.
In Equation (4), each component of the coherence matrix corresponds to the coherence between a pair of inputs, for example X0 and X1, which can be defined by Equation (5) below. Herein, Φ represents the Power Spectral Density (PSD) between two input noise signals.
That is, the performance of the beamforming unit 110 is determined only by the spatial character of the input signal. Therefore, if the coherence of a noise environment is well defined, it is possible to effectively improve the performance of the beamforming unit 110.
Generally, in an indoor noise environment, signals are reflected and diffused by obstacles such as walls and furniture. Therefore, signals input to the microphones from all directions of the noise environment are regarded as having constant power; such an environment is called a diffuse environment. If dij represents the distance between a microphone i and a microphone j, the coherence in an ideal diffuse environment can be defined by using a sinc function, as shown in Equation (6) below. A beamformer to which coherences calculated by the sinc function of Equation (6) are applied is called a super-directive beamformer.
As such, a conventional beamformer calculates coherences by applying the above-described Equation (6) using the sinc function, which is fixed regardless of data measured in the actual noise environment. The beamformer then uses the calculated coherences for noise filtering.
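For illustration, the fixed diffuse-field coherence on which such a conventional beamformer relies can be sketched as follows. This is a minimal sketch, assuming the commonly used form sin(x)/x with x = 2πf·dij/c; the function name diffuse_coherence, the 5 cm spacing, and the speed of sound of 343 m/s are hypothetical choices for the example and are not taken from Equation (6).

```python
import numpy as np

def diffuse_coherence(freqs_hz, distance_m, c=343.0):
    """Ideal diffuse-field coherence between two microphones.

    Assumed form: sin(x) / x with x = 2*pi*f*d/c (hypothetical helper).
    """
    # np.sinc(t) computes sin(pi*t) / (pi*t), so t = 2*f*d/c gives sin(x)/x.
    return np.sinc(2.0 * np.asarray(freqs_hz) * distance_m / c)

# Illustrative frequencies and spacing only; not measured data.
freqs = np.array([0.0, 500.0, 1000.0, 4000.0])
gamma = diffuse_coherence(freqs, distance_m=0.05)
print(gamma)
```

The result depends only on frequency and microphone spacing, which is why this model cannot follow the gain differences or self-noise of real microphones.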
As described above, since an indoor environment, such as a house or an office has a reverberant character against signals, the environment can be assumed as a diffuse environment. However, an actual coherence significantly changes according to a noise environment, as shown in
If a speech recognition apparatus is placed in an ideal diffuse environment and speech signals are input from such a diffuse environment to the speech recognition apparatus, the coherence between two input signals in the low frequency domain should have a value of approximately 1. In practice, however, the coherence has different values depending on the position and the spacing at which the microphones are arranged. Even if the same kind of microphone is used, each microphone has a different gain. Moreover, the actually measured coherence often differs from the ideal value because the microphone itself generates noise.
However, a coherence used in a current beamformer corresponds to a coherence calculated by using only a fixed sinc function regardless of an actual noise environment, as shown in Equation (6). Therefore, as shown in
Accordingly, the present invention has been made to solve the above-mentioned problems occurring in the prior art, and the present invention provides a beamforming apparatus and a beamforming method for achieving an effective spatial filtering by employing a beamformer reflecting an actual noise environment character.
The present invention also provides a beamforming apparatus and a beamforming method for calculating a coherence value in consideration of an actual noise environment.
In accordance with an aspect of the present invention, there is provided an apparatus for beamforming in consideration of an actual noise environment character, the apparatus including a microphone array having at least one microphone, the microphone array outputting a signal input through the microphone; a coherence function generation unit for, when an input signal is input, calculating coherences for the input signals according to each space between microphones, calculating averages of the coherences for the same distance, filtering the calculated averages of the coherences, and outputting the resultant values; a spatial filter factor calculation unit for calculating and outputting a spatial filter factor by using the filtered average coherences; and a beamforming execution unit for performing beamforming for the input signals by using the spatial filter factor, thereby outputting a noise-processed signal.
In accordance with another aspect of the present invention, there is provided a method for beamforming in consideration of an actual noise environment in a speech recognition apparatus equipped with a microphone array including at least one microphone, the method including: when an input signal is input to the microphone, calculating coherences for the input signal according to spaces between microphones, and calculating averages of the coherences for each same distance between the microphones; filtering the calculated averages of the coherences and calculating a spatial filter factor by using the filtered average coherences; and performing beamforming for the input signal by using the spatial filter factor, thereby outputting a noise-processed signal.
The above and other aspects, features and advantages of the present invention will be more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
Hereinafter, an exemplary embodiment of the present invention will be described with reference to the accompanying drawings. In the following description, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present invention rather unclear.
The present invention provides a method which, in a speech recognition apparatus equipped with a microphone array including a plurality of microphones, reflects a noise character of an actual environment to a beamformer by analyzing a signal input from each of the microphones, calculating the coherence in consideration of the actual environment noise character, and applying the resultant values to the beamformer.
Hereinafter, an internal construction of a speech recognition apparatus for performing beamforming in consideration of an actual environment noise character according to an embodiment of the present invention will be described with reference to
First, the microphone array 300 includes a plurality of microphones 300-1 to 300-N, which are linearly arranged with the same space between adjacent microphones and each receive an input signal. In this case, the input signals contain both noise and speech, and each of the microphones outputs its input signal to the beamforming unit 310.
The beamforming unit 310 receives a signal input from each of the microphones 300-1 to 300-N and calculates coherences for a noise section of the input signal according to the space between microphones. Then, the beamforming unit 310 calculates averages of the coherences obtained for each same distance, and performs filtering so as to smooth rapidly changing parts of the average coherence function. Then, the beamforming unit 310 calculates a beamforming spatial filter factor by using the filtered coherences and performs beamforming for the input signal by using the calculated spatial filter factor, thereby outputting a noise-processed signal.
The beamforming unit 310 includes a coherence function generation unit 312, a spatial filter factor calculation unit 320, and a beamforming execution unit 322. The coherence function generation unit 312 in turn has a coherence calculation unit 314, a coherence average calculation unit 316, and a filter 318. Hereinafter, the detailed operation of the respective components of the beamforming unit 310 will be described.
First, the coherence calculation unit 314 analyzes a signal input from each of the microphones 300-1 to 300-N, and calculates coherences according to a space between microphones. The coherences calculated according to the space between microphones are input to the coherence average calculation unit 316, and the coherence average calculation unit 316 calculates an average value of the input coherences obtained from the same distance. That is, each coherence average value is calculated according to the same distance between the microphones.
Then, the coherence average values for each same distance calculated by the coherence average calculation unit 316 are input to the filter 318, and the filter 318 smooths the input average values by filtering and outputs the resultant values.
The spatial filter factor calculation unit 320 calculates the spatial filter factor for beamforming by using the input coherences. In this case, calculation of the spatial filter factor through the coherences will be described in more detail by Equation (9) below.
Such a spatial filter factor calculated from the spatial filter factor calculation unit 320 is input to the beamforming execution unit 322, and the beamforming execution unit 322 removes noise from the input signal through the spatial filtering process using the calculated spatial filter factor and outputs a noise-filtered signal.
Now, a beamforming operation for signals input to a microphone array including four microphones, for example, will be described.
First, the coherence calculation unit 314 calculates coherence functions for the input signals received by the four microphones according to the distance between microphones. In this case, since the number of microphones is assumed to be four, three coherence functions are calculated between adjacent microphones; if the number of microphones is N, the number of coherences to be calculated between adjacent microphones is N−1. Moreover, under the assumption that a leading part of the signal input to the microphones (for example, about 20 frames) is a noise section, the input signal is subjected to a discrete Fourier transform and the coherence is then calculated by applying Equation (5) to the signal of the noise section.
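The coherence calculation over the noise section can be sketched as follows. This is a minimal sketch, assuming the usual complex coherence, i.e. the cross power spectral density normalized by the square root of the product of the two auto power spectral densities; the exact form of Equation (5) may differ. The function name noise_coherence, the frame length of 256 samples, the regularizer eps, and the random test signal are hypothetical.

```python
import numpy as np

def noise_coherence(frames_a, frames_b, eps=1e-12):
    """Complex coherence per frequency bin between two microphones.

    frames_a, frames_b: (num_frames, frame_len) time-domain noise frames.
    Assumed estimator: PSDs are averaged over the frames of the noise section.
    """
    A = np.fft.rfft(frames_a, axis=1)   # DFT of each frame
    B = np.fft.rfft(frames_b, axis=1)
    phi_ab = np.mean(A * np.conj(B), axis=0)   # cross PSD
    phi_aa = np.mean(np.abs(A) ** 2, axis=0)   # auto PSD, microphone a
    phi_bb = np.mean(np.abs(B) ** 2, axis=0)   # auto PSD, microphone b
    # eps only guards against division by zero; it is not part of the source.
    return phi_ab / np.sqrt(phi_aa * phi_bb + eps)

# Synthetic stand-in for the first 20 frames of four microphones.
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 20, 256))
# N - 1 = 3 coherence functions between adjacent microphones.
adjacent = [noise_coherence(noise[i], noise[i + 1]) for i in range(3)]
```

Averaging over roughly 20 frames keeps the estimate cheap, at the cost of a noisy coherence curve across frequency.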
The coherence between adjacent microphones arranged with the same space has the similar distribution as shown in
That is, the coherence average calculation unit 316 calculates the coherence average values for the same distance between the microphones by Equation (7).
In the coherence matrix of equation (6), respective components are determined according to the distance between two microphones. As shown in
When the number of the microphones used in the microphone array is four, the average values of coherences for each of the distances a, 2a, and 3a are defined by Equation (7). That is, because there are three coherences having a distance of a, the average of the three coherences is calculated. Because there are two coherences having a distance of 2a, the average of the two coherences is calculated. Also, because there is only one coherence having a distance of 3a, it is possible to use the coherence having a distance of 3a as it is without calculating a separate average value.
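The averaging per distance can be sketched as follows. This is a minimal sketch, assuming the pairwise coherences are stored in an array indexed by microphone pair and frequency bin; the function name average_by_distance and the placeholder values are hypothetical, and Equation (7) itself is not reproduced here.

```python
import numpy as np

def average_by_distance(coh):
    """Average pairwise coherences that share the same microphone distance.

    coh: (N, N, F) array, coh[i, j] = coherence between microphones i and j.
    Returns an (N-1, F) array; row k-1 is the mean over all pairs with
    j - i = k, i.e. distance k*a in a uniform linear array.
    """
    n = coh.shape[0]
    return np.stack([
        np.mean([coh[i, i + k] for i in range(n - k)], axis=0)
        for k in range(1, n)
    ])

# Four microphones, two frequency bins, placeholder values (not measured data).
n_mics, n_bins = 4, 2
coh = np.zeros((n_mics, n_mics, n_bins))
for i in range(n_mics):
    for j in range(n_mics):
        coh[i, j] = i * j
avg = average_by_distance(coh)  # rows: distance a (3 pairs), 2a (2 pairs), 3a (1 pair)
```

For four microphones the three rows are formed from three, two, and one coherence respectively, so the last row is simply the single coherence at distance 3a.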
Also, Equation (7) may be differently applied according to the number of the microphones. For example, when the number of microphones is six, there are five spaces of a to 5a between microphones. Therefore, five combinations can be calculated. Also, respective average coherences calculated according to the same distance between each of the microphones also have a great fluctuation width in the range of the whole frequency bandwidth, as expressed by the dotted lines in the graph of
Therefore, in order to reduce errors caused by the sensitivity of the coherence, which changes rapidly depending on frequency, the filter 318 performs a filtering operation so as to smooth the fluctuation of the coherence function over frequency. In this case, one of the following methods can be used to smooth the rapidly changing coherences by filtering the average coherences: a first method of applying a moving average filter; a second method of subjecting the coherence function to a Fourier transform and passing the resultant function through a Low Pass Filter (LPF); a third method of using a median filter; and a fourth method of using a one-dimensional Gaussian smoothing filter.
When the coherence function is smoothed by applying the moving average filter, i.e. the first of the filtering methods, the filtering can be performed as shown in Equation (8) below.
In Equation (8), k=1, 2, 3, h=⅓, and n represents an index for a frequency.
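The moving average filtering can be sketched as follows. This is a minimal sketch, assuming a three-tap filter with equal weights h = 1/3 applied along the frequency index n; the function name moving_average and the edge handling (repeating the first and last values) are assumptions, since Equation (8) is not reproduced here.

```python
import numpy as np

def moving_average(coh, taps=3):
    """Smooth a coherence function along frequency with equal weights 1/taps.

    coh: 1-D array indexed by frequency bin n. Odd taps keeps the length.
    Edge bins are handled by repeating the boundary values (an assumption).
    """
    kernel = np.full(taps, 1.0 / taps)
    pad = taps // 2
    padded = np.pad(coh, pad, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

# Illustrative values only.
smoothed = moving_average(np.array([0.0, 3.0, 6.0, 9.0]))
print(smoothed)
```

A longer window smooths more strongly but also blurs genuine frequency-dependent structure of the coherence, so the window length is a tuning choice.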
The coherences filtered by the filter 318 are input to the spatial filter factor calculation unit 320. Then, the spatial filter factor calculation unit 320 calculates a beamforming spatial filter factor by using the input coherences.
Hereinafter, an operation in which the spatial filter factor calculation unit 320 calculates a beamforming spatial filter factor by using the input coherences will be described in more detail.
In the coherence matrix as shown in Equation (4), since the averages for the coherences obtained from the microphones arranged between the same distance is calculated, it can be said that ΓX
The spatial filter factor calculation unit 320 calculates spatial filter factors for beamforming by applying the coherence matrix as shown in Equation (9) to the above-described Equation (2).
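The calculation of spatial filter factors from the averaged coherences can be sketched as follows for a single frequency bin. This is a minimal sketch, assuming the standard MVDR solution W = Γ⁻¹d / (dᴴΓ⁻¹d), a coherence matrix whose entries depend only on the microphone distance, and a steering vector whose delay uses the cosine of the angle to the array axis; the function names, the diagonal loading term, and all numeric values are hypothetical and are not taken from Equations (2), (3), or (9).

```python
import numpy as np

def toeplitz_coherence(avg_coh):
    """Build an N x N coherence matrix from N-1 averaged coherences.

    avg_coh[k-1] is the averaged coherence at distance k*a; the diagonal is 1.
    The lower triangle is the conjugate of the upper one (Hermitian matrix).
    """
    col = np.concatenate(([1.0], np.asarray(avg_coh, dtype=complex)))
    i, j = np.indices((col.size, col.size))
    upper = col[np.abs(i - j)]
    return np.where(j >= i, upper, np.conj(upper))

def steering_vector(num_mics, spacing_m, theta_rad, freq_hz, c=343.0):
    """Steering vector of a uniform linear array (assumed angle convention)."""
    delay = np.arange(num_mics) * spacing_m * np.cos(theta_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delay)

def mvdr_weights(gamma, d, loading=1e-6):
    """MVDR weights; the small diagonal loading is an added safeguard."""
    gamma = gamma + loading * np.eye(len(d))
    gi_d = np.linalg.solve(gamma, d)          # Gamma^{-1} d
    return gi_d / (np.conj(d) @ gi_d)         # normalize by d^H Gamma^{-1} d

# Illustrative values for four microphones at one frequency bin.
gamma = toeplitz_coherence([0.8, 0.5, 0.2])
d = steering_vector(4, 0.05, np.pi / 2, 1000.0)
w = mvdr_weights(gamma, d)
```

The weights satisfy the distortionless constraint, i.e. the response toward the look direction is 1, and in practice the calculation would be repeated for every frequency bin.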
Then, the beamforming execution unit 322 performs beamforming for the input signal by using the calculated spatial filter factors. In this case, the signal output from the beamforming execution unit 322 can be calculated by Equation (1). The output signals are then subjected to an inverse discrete Fourier transform so as to obtain a noise-eliminated waveform.
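The beamforming execution step can be sketched as follows for one frame. This is a minimal sketch, assuming the output is the sum over microphones of the conjugated filter factor multiplied by the frequency-domain input; whether Equation (1) uses the conjugate is not stated in the text, so this convention, the function name beamform, and the test signal are assumptions.

```python
import numpy as np

def beamform(frames, weights):
    """Apply spatial filter factors to one frame per microphone.

    frames:  (N, frame_len) time-domain samples, one row per microphone.
    weights: (N, frame_len // 2 + 1) filter factors per microphone and bin.
    Returns the time-domain output frame after the inverse DFT.
    """
    X = np.fft.rfft(frames, axis=1)               # to the frequency domain
    Y = np.sum(np.conj(weights) * X, axis=0)      # weighted sum over microphones
    return np.fft.irfft(Y, n=frames.shape[1])     # back to a waveform

# Four identical channels and uniform weights 1/N leave the signal unchanged.
sig = np.sin(2 * np.pi * np.arange(256) / 32)
frames = np.tile(sig, (4, 1))
weights = np.full((4, 129), 0.25, dtype=complex)
out = beamform(frames, weights)
```

A complete implementation would process overlapping windowed frames and recombine them by overlap-add, which is omitted here for brevity.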
As noted from
Now, a process by which a speech recognition apparatus having the same construction of
In step 600, a speech signal is input through respective microphones constituting the microphone array 300, and the input signal is output to the coherence calculation unit 314 of the beamforming unit 310.
In step 602, the coherence calculation unit 314 calculates coherences for a noise section of the input signal between each space of microphones and outputs the resultant values to the coherence average calculation unit 316. Herein, a detailed operation for calculating coherences according to each space of microphones will be described with reference to the description of the coherence calculation unit 314 of
In step 604, the coherence average calculation unit 316 calculates averages of input coherences according to the same distance and outputs the resultant values to the filter 318.
In step 606, the filter 318 performs the filtering of the input average coherence so as to smoothen a rapidly changing part in the average coherence function. In this case, the filtering method can be achieved by selecting one of the four filtering methods described above in relation to the filter 318 of
In step 608, the spatial filter factor calculation unit 320 calculates a beamforming spatial filter factor by using the filtered average coherence, as shown in Equation (9).
In step 610, the beamforming execution unit 322 performs beamforming of the input signals by using the calculated spatial filter factor. In step 612, a noise-processed signal is output.
In the present invention as described above, when a beamformer performs beamforming of signals input through a microphone array, the coherence is applied to the beamformer in consideration of an actual noise environment. Therefore, it is possible to improve the performance of indoor noise removal. In the present invention, since a relatively simple operation formula is used in calculating coherences in consideration of an actual noise environment, it is possible to rapidly process speech signals which are frequently input to the microphone array and acquire output signals. Moreover, the beamforming technology of a microphone array according to the present invention provides a basis so that an audio interface technology, used between a person and either a robot, a computer, or a mobile device, can be effectively applied to a noisy environment.
While the invention has been shown and described with reference to a certain exemplary embodiment thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
10-2007-0012803 | Feb 2007 | KR | national |
Number | Date | Country | |
---|---|---|---|
20080187152 A1 | Aug 2008 | US |