The present invention will be described below in detail on the basis of the drawings showing the embodiments thereof. The embodiments are described for the case in which the sound signal to be processed is mainly voice generated by a human being.
The general purpose computer, operating as the sound arrival direction estimating apparatus 1 according to Embodiment 1 of the present invention, comprises at least an operation processing unit 11, such as a CPU, a DSP or the like, a ROM 12, a RAM 13, a communication interface unit 14 capable of carrying out data communication to and from an external computer, multiple voice input units 15 that accept voice input, and a voice output unit 16 that outputs voice. The voice output unit 16 outputs voice inputted from the voice input unit 31 of each of the communication terminal apparatuses 3 that can carry out data communication via a communication network 2. Voice in which noise has been suppressed is outputted from a voice output unit 32 of each of the communication terminal apparatuses 3.
The operation processing unit 11 is connected to each of the above-mentioned hardware units of the sound arrival direction estimating apparatus 1 via an internal bus 17. The operation processing unit 11 controls the above-mentioned hardware units, and performs various software functions according to processing programs stored in the ROM 12, such as, for example, a program for calculating the amplitude component of a signal on a frequency axis, a program for estimating a noise component from the calculated amplitude component, a program for calculating a signal-to-noise ratio (SN ratio) at each frequency on the basis of the calculated amplitude component and the estimated noise component, a program for extracting a frequency at which the SN ratio is larger than a predetermined value, a program for calculating the difference between the arrival distances on the basis of the phase difference (hereinafter referred to as a phase difference spectrum) at the extracted frequency, and a program for estimating the direction of the sound source on the basis of the difference between the arrival distances.
The ROM 12 is configured by a flash memory or the like and stores the above-mentioned processing programs and the numerical information referred to by those programs, which are required to make the general purpose computer function as the sound arrival direction estimating apparatus 1. The RAM 13 is configured by an SRAM or the like and stores temporary data generated during program execution. The communication interface unit 14 downloads the above-mentioned programs from an external computer, transmits output signals to the communication terminal apparatuses 3 via the communication network 2, and receives inputted sound signals.
Specifically, the voice input units 15 are configured by multiple microphones that respectively accept sound input and are used to specify the direction of a sound source, amplifiers, A/D converters and the like. The voice output unit 16 is an output device, such as a speaker. For convenience of explanation, the voice input units 15 and the voice output unit 16 are built in the sound arrival direction estimating apparatus 1 as shown in
As shown in
The voice accepting unit 201 accepts, as sound inputs from two microphones, voice generated by a human being, which is the sound source. In Embodiment 1, input 1 and input 2 are accepted via the voice input units 15 and 15, each being a microphone.
With respect to the inputted voice, the signal conversion unit 202 converts signals on a time axis into signals on a frequency axis, that is, complex spectra IN1(f) and IN2(f). Herein, f represents a frequency (radian). In Embodiment 1, the signal conversion unit 202 converts the inputted voice into the spectra IN1(f) and IN2(f) by carrying out a time-frequency conversion process, such as Fourier transform.
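As a concrete illustration of this conversion step (a sketch only, not the source implementation; NumPy's one-sided FFT is assumed), two time-domain frames can be converted into the complex spectra as follows:

```python
import numpy as np

def to_spectra(frame1, frame2):
    """Convert two time-domain frames into the complex spectra
    IN1(f) and IN2(f) via a one-sided FFT."""
    return np.fft.rfft(frame1), np.fft.rfft(frame2)

# Example: a 1 kHz sine sampled at 8 kHz, 256-sample frame
fs = 8000
t = np.arange(256) / fs
x = np.sin(2 * np.pi * 1000 * t)
in1, in2 = to_spectra(x, x)
```

The one-sided spectrum of a 256-sample frame has 129 bins, and the 1 kHz tone lands in bin 1000 × 256 / 8000 = 32.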
The phase difference spectrum calculating unit 203 calculates phase spectra on the basis of the frequency-converted spectra IN1(f) and IN2(f), and calculates, for each frequency, the phase difference spectrum DIFF_PHASE(f), which is the difference between the calculated phase spectra. Note that the phase difference spectrum DIFF_PHASE(f) may be obtained not by obtaining each phase spectrum of the spectra IN1(f) and IN2(f), but by obtaining the phase component of IN1(f)/IN2(f). The amplitude spectrum calculating unit 204 calculates one of the amplitude spectra, that is, the amplitude spectrum |IN1(f)|, which is the amplitude component of the input signal spectrum IN1(f) of input 1 in the example shown in
Embodiment 1 has a configuration in which the amplitude spectrum |IN1(f)| is calculated for each frequency in the Fourier-transformed spectra. However, Embodiment 1 may also have a configuration in which band division is performed, and the representative value of the amplitude spectrum |IN1(f)| is obtained in each divided band, the bands being divided according to a specific center frequency and interval. The representative value in that case may be the average value of the amplitude spectrum |IN1(f)| in the divided band, or may be the maximum value thereof. The representative value of the amplitude spectrum after the band division becomes |IN1(n)|, where n represents the index of a divided band.
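The band-division variant described above can be sketched as follows (the band boundaries used here are hypothetical, chosen only for illustration):

```python
import numpy as np

def band_representative(amp, band_edges, use_max=False):
    """Representative value |IN1(n)| for each divided band:
    the mean (or maximum) of |IN1(f)| over the bins in band n.
    band_edges are hypothetical bin boundaries, not values from
    the source."""
    reps = [amp[lo:hi].max() if use_max else amp[lo:hi].mean()
            for lo, hi in zip(band_edges[:-1], band_edges[1:])]
    return np.array(reps)
```

For example, with `amp = [1, 2, 3, 4]` and edges `[0, 2, 4]`, the means are `[1.5, 3.5]` and the maxima are `[2, 4]`.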
The background noise estimating unit 205 estimates a background noise spectrum |NOISE1(f)| on the basis of the amplitude spectrum |IN1(f)|. The method of estimating the background noise spectrum |NOISE1(f)| is not limited to any particular method. It is also possible to use known methods, such as the voice section detecting process used in speech recognition, or the background noise estimating process carried out in the noise canceling process used in mobile phones. In other words, any method of estimating the background noise spectrum can be used. In the case that the amplitude spectrum is band-divided as described above, the background noise spectrum |NOISE1(n)| should be estimated for each divided band, where n represents the index of a divided band.
The SN ratio calculating unit 206 calculates the SN ratio SNR(f) by calculating the ratio between the amplitude spectrum |IN1(f)| calculated in the amplitude spectrum calculating unit 204 and the background noise spectrum |NOISE1(f)| estimated in the background noise estimating unit 205. The SN ratio SNR(f) is calculated by the following expression (1). In the case that the amplitude spectrum is band-divided, SNR(n) should be calculated for each divided band, where n represents the index of a divided band.
SNR(f)=20.0×log₁₀(|IN1(f)|/|NOISE1(f)|)   (1)
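Expression (1) translates directly into code; a minimal sketch:

```python
import numpy as np

def snr_db(amp, noise):
    """Expression (1): SNR(f) = 20.0 * log10(|IN1(f)| / |NOISE1(f)|),
    computed per frequency bin, in decibels."""
    return 20.0 * np.log10(amp / noise)
```

An amplitude ten times the noise floor gives an SN ratio of 20 dB.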
The phase difference spectrum selecting unit 207 extracts the frequency or the frequency band at which an SN ratio larger than a predetermined value is calculated in the SN ratio calculating unit 206, and selects the phase difference spectrum corresponding to the extracted frequency or the phase difference spectrum in the extracted frequency band.
The arrival distance difference calculating unit 208 obtains a function in which the relation between the selected phase difference spectrum and frequency f is linear-approximated with a straight line passing through an origin. On the basis of this function, the arrival distance difference calculating unit 208 calculates the difference between the distances to the voice input units 15 and 15 from the sound source, that is, the distance difference D between the distances along which voice arrives at the voice input units 15 and 15.
The sound arrival direction calculating unit 209 calculates the incident angle θ of the sound input, that is, the angle θ indicating the direction in which the human being that is the sound source is estimated to be present, using the distance difference D calculated by the arrival distance difference calculating unit 208 and the installation interval L of the voice input units 15 and 15.
The procedure performed by the operation processing unit 11 of the sound arrival direction estimating apparatus 1 according to Embodiment 1 of the present invention will be described below.
First, the operation processing unit 11 of the sound arrival direction estimating apparatus 1 accepts sound signals (analog signals) from the voice input units 15 and 15 (step S301). After A/D-conversion of the accepted sound signals, the operation processing unit 11 performs framing of the accepted sound signals in a predetermined time unit (step S302). The framing unit is determined depending on the sampling frequency, the kind of application, etc. At this time, for the purpose of obtaining stable spectra, a time window, such as a Hamming window or a Hanning window, is applied to the framed sample signals. For example, framing is carried out in 20 to 40 ms units while being overlapped every 10 to 20 ms, and the following processes are performed for each of the frames.
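The framing and windowing step can be sketched as follows (a sketch only; the frame and hop lengths are examples from the ranges given above, e.g. a 32 ms frame and a 16 ms hop at 8 kHz correspond to 256 and 128 samples):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames and apply a Hamming window
    to each frame, as in step S302."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] * win
                     for i in range(n_frames)])

frames = frame_signal(np.ones(1024), frame_len=256, hop=128)
```

A 1024-sample signal with these settings yields 1 + (1024 − 256) / 128 = 7 frames.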
The operation processing unit 11 converts signals on a time axis in frame units into signals on a frequency axis, that is, spectra IN1(f) and IN2(f) (step S303). Here, f represents a frequency (radian). In Embodiment 1, the operation processing unit 11 performs this conversion by carrying out a time-frequency conversion process, such as Fourier transform.
Next, the operation processing unit 11 calculates phase spectra using the real parts and the imaginary parts of the frequency-converted spectra IN1(f) and IN2(f), and calculates the phase difference spectrum DIFF_PHASE(f) which is the phase difference between the calculated phase spectra, for each frequency (step S304).
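The phase difference calculation of step S304 can be sketched as follows; taking the angle of IN1(f)·conj(IN2(f)) is equivalent to subtracting the two phase spectra, with the result wrapped to (−π, π]:

```python
import numpy as np

def phase_difference_spectrum(in1, in2):
    """DIFF_PHASE(f): the phase of IN1(f)/IN2(f), i.e. the
    per-bin difference of the two phase spectra."""
    return np.angle(in1 * np.conj(in2))
```

For example, a bin where input 1 has phase 0 and input 2 has phase π/2 yields a phase difference of −π/2.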
On the other hand, the operation processing unit 11 calculates the value of the amplitude spectrum |IN1(f)| which is the amplitude component of the input signal spectrum IN1(f) of input 1 (step S305).
However, the calculation is not required to be limited to the calculation of the amplitude spectrum with respect to the input signal spectrum IN1(f) of input 1. For example, as another method, it is possible to calculate the amplitude spectrum with respect to the input signal spectrum IN2(f) of input 2, or to calculate the average value or the maximum value of the amplitude spectra of both inputs 1 and 2 as the representative value of the amplitude spectra. Herein, a configuration is adopted in which the amplitude spectrum |IN1(f)| is calculated for each frequency in the Fourier-transformed spectra. However, it is also possible to adopt a configuration in which band division is performed, and the representative value of the amplitude spectrum |IN1(f)| is calculated in each divided band, the bands being divided according to a specific center frequency and interval. The representative value may be the average value of the amplitude spectrum |IN1(f)| in the divided band, or may be the maximum value thereof. Furthermore, the configuration is not limited to one in which amplitude spectra are calculated; it is also possible to adopt a configuration in which power spectra are calculated. The SN ratio SNR(f) in this case is calculated according to the following expression (2).
SNR(f)=10.0×log₁₀(|IN1(f)|²/|NOISE1(f)|²)   (2)
The operation processing unit 11 estimates a noise section on the basis of the calculated amplitude spectrum |IN1(f)|, and estimates the background noise spectrum |NOISE1(f)| on the basis of the amplitude spectrum |IN1(f)| of the estimated noise section (step S306).
Note that the method of estimating the noise section is not limited to any particular method. With respect to the method of estimating the background noise spectrum |NOISE1(f)|, it is also possible to use known methods, such as the voice section detecting process used in speech recognition, or the background noise estimating process carried out in the noise canceling process used in mobile phones. In other words, any method of estimating the background noise spectrum can be used. For example, it is possible to estimate a background noise level using power information over the whole frequency band, and to make the voice/noise judgment by obtaining a threshold value for judging voice/noise based on the estimated background noise level. When the judgment result is noise, the background noise spectrum |NOISE1(f)| is then generally updated by correcting it using the amplitude spectrum |IN1(f)| at that time.
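One simple realization of such a correction is exponential smoothing of the noise estimate during frames judged to be noise (a sketch only; the smoothing constant beta is an assumption, not a value from the source):

```python
import numpy as np

def update_noise(noise, amp, is_noise, beta=0.9):
    """When the current frame is judged to be noise, correct
    |NOISE1(f)| using the frame's amplitude spectrum |IN1(f)|;
    otherwise keep the previous estimate. beta is a hypothetical
    smoothing constant."""
    if is_noise:
        return beta * noise + (1.0 - beta) * amp
    return noise
```

With `beta = 0.9`, a noise frame moves the estimate 10% of the way toward the observed amplitude; voice frames leave it unchanged.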
The operation processing unit 11 calculates the SN ratio SNR(f) for each frequency or frequency band according to expression (1) (or expression (2) in the case of power spectra) (step S307). The operation processing unit 11 then selects a frequency or a frequency band at which the calculated SN ratio is larger than the predetermined value (step S308). The frequency or frequency band to be selected can be changed according to the method of determining the predetermined value. For example, the frequency or frequency band at which the SN ratio has the maximum value can be selected by comparing the SN ratios of adjacent frequencies or frequency bands and successively retaining the one having the larger SN ratio in the RAM 13. It is also possible to select N (N denotes a natural number) frequencies or frequency bands in decreasing order of SN ratio.
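The two selection strategies described for step S308, thresholding and top-N, can be sketched as follows:

```python
import numpy as np

def select_bins(snr, threshold=None, top_n=None):
    """Select frequency bins whose SNR exceeds a threshold, or
    the N bins with the largest SNR (in decreasing order)."""
    if top_n is not None:
        return np.argsort(snr)[::-1][:top_n]
    return np.where(snr > threshold)[0]
```

With SNRs `[1, 5, 3, 8]`, a threshold of 2.5 selects bins 1, 2 and 3, while top-2 selection returns bins 3 and 1.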
On the basis of the phase difference spectrum DIFF_PHASE(f) corresponding to the one or more selected frequencies or frequency bands, the operation processing unit 11 linear-approximates the relation between the phase difference spectrum DIFF_PHASE(f) and the frequency f (step S309). As a result, it is possible to exploit the fact that the reliability of the phase difference spectrum DIFF_PHASE(f) is high at a frequency or frequency band at which the SN ratio is large. It is thus possible to raise the estimating accuracy of the proportional relation between the phase difference spectrum DIFF_PHASE(f) and the frequency f.
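A least-squares fit of a straight line constrained to pass through the origin, as used in step S309, has the closed form a = Σ f·DIFF_PHASE(f) / Σ f², sketched here:

```python
import numpy as np

def slope_through_origin(freqs, phases):
    """Least-squares slope a of DIFF_PHASE(f) ≈ a*f for a line
    through the origin: a = sum(f*phase) / sum(f*f)."""
    return np.dot(freqs, phases) / np.dot(freqs, freqs)
```

If the selected phase differences are exactly proportional to frequency with slope 0.5, the fit recovers 0.5.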
The operation processing unit 11 then calculates the difference D between the arrival distances of a sound input from the sound source according to the following expression (3), using the value of the linear-approximated phase difference spectrum DIFF_PHASE(F) at the Nyquist frequency F, that is, R in
In addition, in
D=(R×c)/(F×2π) (3)
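Expression (3) in code (c, the propagation speed of sound, is assumed to be 340 m/s here; the source may use a different value):

```python
import math

def arrival_distance_difference(R, F, c=340.0):
    """Expression (3): D = (R * c) / (F * 2*pi), where R is the
    fitted phase difference at the Nyquist frequency F and c is
    the speed of sound (340 m/s assumed)."""
    return (R * c) / (F * 2.0 * math.pi)
```

For example, R = π at F = 4000 Hz gives D = 340 / 8000 = 0.0425 m.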
The operation processing unit 11 calculates the incident angle θ of sound input, that is, the angle θ indicating the direction in which it is estimated that the sound source is present using the calculated difference D between the arrival distances (step S311).
As shown in
θ=sin⁻¹(D/L)   (4)
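Expression (4) in code:

```python
import math

def arrival_angle(D, L):
    """Expression (4): theta = asin(D / L), the incident angle in
    radians, for microphone spacing L."""
    return math.asin(D / L)
```

With D = 0.05 m and L = 0.10 m, the incident angle is asin(0.5) = π/6, i.e. 30° off the front direction; D = 0 corresponds to a source directly in front.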
In the case that N frequencies or frequency bands are selected in decreasing order of SN ratio, as described above, the linear approximation is performed by using the top N phase difference spectra. As another method, it is possible to replace F and R in expression (3) with f and r, respectively, using not the value R of the linear-approximated phase difference spectrum DIFF_PHASE(F) at the Nyquist frequency F but the phase difference spectrum r (=DIFF_PHASE(f)) at each selected frequency f, to calculate the difference D between the arrival distances for each selected frequency, and then to calculate the angle θ indicating the direction in which the sound source is estimated to be present by using the average value of the calculated differences D. The calculation method is, as a matter of course, not limited to this kind of method. For example, it is also possible to calculate the angle θ by calculating the representative value of the difference D between the arrival distances with weighting depending on the SN ratio.
Furthermore, in the case of estimating the direction in which a human being who generates voice is present, it may also be possible to calculate the angle θ indicating the direction in which it is estimated that the sound source is present by judging whether a sound input is a voice section indicating the voice generated by the human being, and by performing the above-mentioned process only when it is judged as a voice section.
Moreover, even if it is judged that the SN ratio is larger than the predetermined value, in the case that the phase difference is an unintended phase difference in view of the usage states, usage conditions, etc. of an application, it is preferable that the corresponding frequency or frequency band be eliminated from those to be selected. For example, in the case that the sound arrival direction estimating apparatus 1 according to Embodiment 1 is applied to an apparatus, such as a mobile phone, in which voice is assumed to be generated from the front direction, and the angle θ indicating the direction in which the sound source is estimated to be present is calculated as θ<−90° or 90°<θ, where the front is assumed to be 0°, it is judged as an unintended state.
Still further, even if it is judged that the SN ratio is larger than the predetermined value, it is preferable that frequencies or frequency bands that are not desirable for estimating the direction of the target sound source be eliminated from those to be selected, in view of the usage states, usage conditions, etc. of an application. For example, in the case that the target sound source is voice generated by a human being, there is no sound signal at frequencies of 100 Hz or less. Hence, frequencies of 100 Hz or less can be eliminated from the frequencies to be selected.
As described above, in the sound arrival direction estimating apparatus 1 according to Embodiment 1, the SN ratio for each frequency or frequency band is obtained on the basis of the amplitude component of the inputted sound signal, that is, the so-called amplitude spectrum, and the estimated background noise spectrum, and the phase difference (phase difference spectrum) at the frequency at which the SN ratio is large is used, whereby the difference D between the arrival distances can be obtained more accurately. Therefore, it is possible to accurately calculate the incident angle of the sound signal, that is, the angle θ indicating the direction in which it is estimated that the target sound source (a human being in Embodiment 1) is present, on the basis of the accurate difference D between the arrival distances.
A sound arrival direction estimating apparatus 1 according to Embodiment 2 of the present invention will be described below in detail referring to the drawings. Because the configuration of the general purpose computer operating as the sound arrival direction estimating apparatus 1 according to Embodiment 2 of the present invention is similar to that according to Embodiment 1, the configuration can be understood referring to the block diagram of
As shown in
The voice accepting unit 201 accepts, as sound inputs from two microphones, voice generated by a human being, which is the sound source. In Embodiment 2, input 1 and input 2 are accepted via the voice input units 15 and 15, each being a microphone.
With respect to the inputted voice, the signal conversion unit 202 converts signals on a time axis into signals on a frequency axis, that is, complex spectra IN1(f) and IN2(f). Herein, f represents a frequency (radian). In Embodiment 2, the signal conversion unit 202 converts the inputted voice into the spectra IN1(f) and IN2(f) by carrying out a time-frequency conversion process, such as Fourier transform.
After A/D-conversion of the input signals accepted by the voice input units 15 and 15, the obtained sample signals are framed in a predetermined time unit. At this time, for the purpose of obtaining stable spectra, a time window, such as a Hamming window or a Hanning window, is applied to the framed sample signals. The framing unit is determined depending on the sampling frequency, the kind of application, etc. For example, framing is carried out in 20 to 40 ms units while being overlapped every 10 to 20 ms, and the following processes are performed for each of the frames.
The phase difference spectrum calculating unit 203 calculates phase spectra in frame units on the basis of the frequency-converted spectra IN1(f) and IN2(f), and calculates the phase difference spectrum DIFF_PHASE(f), which is the phase difference between the calculated phase spectra, in frame units. Meanwhile, the amplitude spectrum calculating unit 204 calculates one of the amplitude spectra, that is, the amplitude spectrum |IN1(f)|, which is the amplitude component of the input signal spectrum IN1(f) of input 1 in the example shown in
The background noise estimating unit 205 estimates a background noise spectrum |NOISE1(f)| on the basis of the amplitude spectrum |IN1(f)|. The method of estimating the background noise spectrum |NOISE1(f)| is not limited to any particular method. It is also possible to use known methods, such as the voice section detecting process used in speech recognition, or the background noise estimating process carried out in the noise canceling process used in mobile phones. In other words, any method of estimating the background noise spectrum can be used.
The SN ratio calculating unit 206 calculates the SN ratio SNR(f) by calculating the ratio between the amplitude spectrum |IN1(f)| calculated in the amplitude spectrum calculating unit 204 and the background noise spectrum |NOISE1(f)| estimated in the background noise estimating unit 205.
On the basis of the SN ratio calculated in the SN ratio calculating unit 206 and the phase difference spectrum DIFF_PHASEt-1(f) that was calculated at the last sampling time, corrected by the phase difference spectrum correcting unit 210 and stored in the RAM 13, the phase difference spectrum correcting unit 210 corrects the phase difference spectrum DIFF_PHASEt(f) calculated at the current sampling time. At the current sampling time, the SN ratio and the phase difference spectrum DIFF_PHASEt(f) are calculated in the same way as up to the last time, and the corrected phase difference spectrum DIFF_PHASEt(f) of the frame at the current sampling time is calculated according to the following expression (5), using a correction coefficient α (0≦α≦1) that is set according to the SN ratio.
The correction coefficient α will be described later. For example, together with each program, the correction coefficient α is stored in the ROM 12 as numerical information which corresponds to the SN ratio and is referred to by the processing program.
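Expression (5) itself is not reproduced in this excerpt. A common form for such a correction is a weighted average of the current and previous phase difference spectra; the sketch below assumes that form, which may differ from the source:

```python
import numpy as np

def correct_phase_diff(curr, prev, alpha):
    """Hypothetical form of expression (5): blend the current phase
    difference spectrum DIFF_PHASEt(f) with the stored previous one
    DIFF_PHASEt-1(f) using the correction coefficient alpha
    (0 <= alpha <= 1). The exact weighting used by the source is an
    assumption here."""
    return alpha * curr + (1.0 - alpha) * prev
```

A large α (high SN ratio) trusts the current frame; a small α (low SN ratio) carries forward the more reliable past estimate.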
The arrival distance difference calculating unit 208 obtains a function in which the relation between the selected phase difference spectrum and frequency f is linear-approximated with a straight line passing through an origin. On the basis of this function, the arrival distance difference calculating unit 208 calculates the difference between the distances to the voice input units 15 and 15 from the sound source, that is, the distance difference D between the distances along which voice arrives at the voice input units 15 and 15.
The sound arrival direction calculating unit 209 calculates the incident angle θ of the sound input, that is, the angle θ indicating the direction in which the human being that is the sound source is estimated to be present, using the distance difference D calculated by the arrival distance difference calculating unit 208 and the installation interval L of the voice input units 15 and 15.
The procedure performed by the operation processing unit 11 of the sound arrival direction estimating apparatus 1 according to Embodiment 2 of the present invention will be described below.
First, the operation processing unit 11 of the sound arrival direction estimating apparatus 1 accepts sound signals (analog signals) from the voice input units 15 and 15 (step S701). After A/D-conversion of the accepted sound signals, the operation processing unit 11 performs framing of the accepted sound signals in a predetermined time unit (step S702). The framing unit is determined depending on the sampling frequency, the kind of application, etc. At this time, for the purpose of obtaining stable spectra, a time window, such as a Hamming window or a Hanning window, is applied to the framed sample signals. For example, framing is carried out in 20 to 40 ms units while being overlapped every 10 to 20 ms, and the following processes are performed for each of the frames.
The operation processing unit 11 converts signals on a time axis in frame units into signals on a frequency axis, that is, spectra IN1(f) and IN2(f) (step S703). Here, f represents a frequency (radian) or a frequency band having a constant width at sampling. In Embodiment 2, the operation processing unit 11 performs this conversion by carrying out a time-frequency conversion process, such as Fourier transform.
Next, the operation processing unit 11 calculates phase spectra using the real parts and the imaginary parts of the frequency-converted spectra IN1(f) and IN2(f), and calculates the phase difference spectrum DIFF_PHASEt(f) which is the phase difference between the calculated phase spectra, for each frequency or frequency band (step S704).
On the other hand, the operation processing unit 11 calculates the value of the amplitude spectrum |IN1(f)| which is the amplitude component of the input signal spectrum IN1(f) of input 1 (step S705).
However, the calculation is not required to be limited to the calculation of the amplitude spectrum with respect to the input signal spectrum IN1(f) of input 1. For example, as another method, it is possible to calculate the amplitude spectrum with respect to the input signal spectrum IN2(f) of input 2, or to calculate the average value or the maximum value of the amplitude spectra of both inputs 1 and 2 as the representative value of the amplitude spectra. Furthermore, the configuration is not limited to one in which amplitude spectra are calculated; it is also possible to adopt a configuration in which power spectra are calculated.
The operation processing unit 11 estimates a noise section on the basis of the calculated amplitude spectrum |IN1(f)|, and estimates the background noise spectrum |NOISE1(f)| on the basis of the amplitude spectrum |IN1(f)| of the estimated noise section (step S706).
The method of estimating the noise section is not limited to any particular method; any method of estimating the background noise spectrum |NOISE1(f)| can be used. For example, it is possible to estimate a background noise level using power information over the whole frequency band, and to make the voice/noise judgment by obtaining a threshold value for judging voice/noise based on the estimated background noise level. When the judgment result is noise, the background noise spectrum |NOISE1(f)| is estimated by correcting it using the amplitude spectrum |IN1(f)| at that time.
The operation processing unit 11 calculates the SN ratio SNR(f) for each frequency or frequency band according to the above-mentioned expression (1) (step S707). Next, the operation processing unit 11 judges whether the phase difference spectrum DIFF_PHASEt-1(f) at the last sampling time is stored in the RAM 13 or not (step S708).
In the case that the operation processing unit 11 judges that the phase difference spectrum DIFF_PHASEt-1(f) at the last sampling time is stored (YES at step S708), the operation processing unit 11 reads from the ROM 12 the correction coefficient α corresponding to the SN ratio calculated at the current sampling time (step S710). Alternatively, the correction coefficient α may be obtained by calculation, using a function which represents the relation between the SN ratio and the correction coefficient α and is built into the program in advance.
The operation processing unit 11 corrects the phase difference spectrum DIFF_PHASEt(f) according to the above-mentioned expression (5), using the correction coefficient α read from the ROM 12 according to the SN ratio (step S711). After that, the operation processing unit 11 replaces the corrected phase difference spectrum DIFF_PHASEt-1(f) stored in the RAM 13 with the corrected phase difference spectrum DIFF_PHASEt(f) at the current sampling time, and stores it (step S712).
In the case that the operation processing unit 11 judges that the phase difference spectrum DIFF_PHASEt-1(f) at the last sampling time is not stored (NO at step S708), the operation processing unit 11 judges whether the phase difference spectrum DIFF_PHASEt(f) at the current sampling time is to be used or not (step S717). As the criterion for this judgment, an indication of whether the sound signal is generated from the target sound source (whether a human being is generating voice), such as the SN ratio over the whole frequency band or the voice/noise judgment result, is used.
In the case that the operation processing unit 11 judges that the phase difference spectrum DIFF_PHASEt(f) at the current sampling time is not to be used, that is, judges that there is a low possibility that a sound signal is generated from the sound source (NO at step S717), the operation processing unit 11 sets a predetermined initial value of the phase difference spectrum as the phase difference spectrum at the current sampling time (step S718). In this case, for example, the initial value of the phase difference spectrum is set to 0 (zero) for all frequencies. However, the setting at step S718 is not limited to this value.
Next, the operation processing unit 11 stores the initial value of the phase difference spectrum as the phase difference spectrum at the current sampling time in the RAM 13 (step S719), and advances the processing to step S713.
In the case that the operation processing unit 11 judges that the phase difference spectrum DIFF_PHASEt(f) at the current sampling time is to be used, that is, judges that there is a high possibility that a sound signal is generated from the sound source (YES at step S717), the operation processing unit 11 stores the phase difference spectrum DIFF_PHASEt(f) at the current sampling time in the RAM 13 (step S720), and advances the processing to step S713.
On the basis of the phase difference spectrum DIFF_PHASE(f) stored at any one of steps S712, S719 and S720, the operation processing unit 11 linear-approximates the relation between the phase difference spectrum DIFF_PHASE(f) and the frequency f with a straight line passing through the origin (step S713). As a result, when the linear approximation is based on the corrected phase difference spectrum, it is possible to use a phase difference spectrum DIFF_PHASE(f) which reflects the phase difference information at frequencies or frequency bands at which the SN ratio was large (that is, of high reliability), not only at the current sampling time but also at past sampling times. It is thus possible to raise the estimating accuracy of the proportional relation between the phase difference spectrum DIFF_PHASE(f) and the frequency f.
The operation processing unit 11 calculates the difference D between the arrival distances of the sound signal from the sound source according to the above-mentioned expression (3), using the value of the linear-approximated phase difference spectrum DIFF_PHASE(F) at the Nyquist frequency F (step S714). Note that the difference D between the arrival distances can also be calculated by replacing F and R in expression (3) with f and r, respectively, using the value r (=DIFF_PHASE(f)) of the phase difference spectrum at an arbitrary frequency f instead of the value R of the linear-approximated phase difference spectrum DIFF_PHASE(F) at the Nyquist frequency F. Then, the operation processing unit 11 calculates the incident angle θ of the sound signal, that is, the angle θ indicating the direction in which the sound source (human being) is estimated to be present, using the calculated difference D between the arrival distances (step S715).
Furthermore, in the case of estimating the direction in which a human being who generates voice is present, it may also be possible to calculate the angle θ indicating the direction in which it is estimated that the sound source is present by judging whether the sound input is a voice section, that is, a section containing voice generated by the human being, and by performing the above-mentioned process only when the input is judged to be a voice section.
Moreover, even if it is judged that the SN ratio is larger than the predetermined value, in the case that the phase difference is an unintended phase difference in view of the usage states, usage conditions, etc. of an application, it is preferable that the corresponding frequency or frequency band should be eliminated from those used to correct the phase difference spectrum at the current sampling time. For example, suppose that the sound arrival direction estimating apparatus 1 according to Embodiment 2 is applied to an apparatus, such as a mobile phone, in which it is assumed that voice is generated from the front direction. In the case that the angle θ indicating the direction in which the sound source is estimated to be present is calculated as θ < −90° or 90° < θ, where the front is assumed to be 0°, the state is judged to be unintended. In this case, the phase difference spectrum at the current sampling time is not used; instead, the phase difference spectrum calculated at the last sampling time or before is used.
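The range check described above can be expressed as a small guard that decides whether the current spectrum is usable; the function name and default limits simply mirror the mobile-phone example in the text:

```python
def use_current_spectrum(theta_deg, min_deg=-90.0, max_deg=90.0):
    """Decide whether the phase difference spectrum at the current
    sampling time should be used. An estimated angle outside the
    expected front range (0 deg = front) is treated as an unintended
    state, so the caller falls back to the previously stored
    spectrum instead."""
    return min_deg <= theta_deg <= max_deg
```

For applications with other geometries, the limits would be chosen to match the directions from which the target sound can plausibly arrive.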
Still further, even if it is judged that the SN ratio is larger than the predetermined value, it is preferable that frequencies or frequency bands at which estimating the direction of the target sound source is not desirable should be eliminated from those to be selected, in view of the usage states, usage conditions, etc. of an application. For example, in the case that the target sound source is voice generated by a human being, the sound signal contains no components at frequencies of 100 Hz or less. Hence, frequencies of 100 Hz or less can be eliminated from the frequencies to be selected.
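The two selection criteria above (SN ratio above the threshold, and frequency above an application-specific cutoff) combine into a single mask over the frequency bins. The sketch below assumes NumPy arrays of per-bin frequencies and SN ratios; the names are illustrative:

```python
import numpy as np

def select_frequencies(freqs, snr, snr_threshold, min_freq=100.0):
    """Select frequency bins whose SN ratio exceeds snr_threshold,
    excluding bins at or below min_freq (e.g. 100 Hz when the target
    sound source is human voice, per the text)."""
    freqs = np.asarray(freqs, dtype=float)
    snr = np.asarray(snr, dtype=float)
    mask = (snr > snr_threshold) & (freqs > min_freq)
    return freqs[mask]
```

Only the bins returned here would contribute phase difference values to the straight-line fit of step S713.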
As described above, in the sound arrival direction estimating apparatus 1 according to Embodiment 2, in the case that the phase difference spectrum is calculated at a frequency or a frequency band at which the SN ratio is large, correction is carried out with the phase difference spectrum at the current sampling time weighted more heavily than the phase difference spectrum calculated at the last sampling time; in the case that the SN ratio is small, correction is carried out with the phase difference spectrum at the last sampling time weighted more heavily. Hence, newly calculated phase difference spectra can be corrected sequentially. Phase difference information at frequencies at which the SN ratios at past sampling times were large is also reflected in the corrected phase difference spectrum. Accordingly, the phase difference spectrum does not vary significantly under the influence of the state of background noise, changes in the content of the sound signal generated from the target sound source, etc. Therefore, it is possible to accurately calculate the incident angle of the sound signal, that is, the angle θ indicating the direction in which it is estimated that the target sound source is present, on the basis of the more accurate and stable difference D between the arrival distances. The method of calculating the angle θ indicating the direction in which it is estimated that the target sound source is present is not limited to the method in which the above-mentioned difference D between the arrival distances is used; it is needless to say that various methods can be used, provided that they can carry out estimation with similar accuracy.
As described above in detail, according to a first aspect of the present invention, the signal-to-noise ratio (SN ratio) for each frequency is obtained on the basis of the amplitude component of the inputted sound signal, that is, the so-called amplitude spectrum, and the estimated background noise spectrum, and only the phase difference (phase difference spectrum) at the frequency at which the signal-to-noise ratio is large is used, whereby the difference between the arrival distances can be obtained more accurately. Therefore, it is possible to accurately estimate the incident angle of the sound signal, that is, the direction in which it is estimated that the sound source is present, on the basis of the accurate difference between the arrival distances.
In addition, according to a second aspect of the present invention, because the difference between the arrival distances is calculated by preferentially selecting frequencies that are less affected by noise components, the calculation result of the difference between the arrival distances does not vary significantly. Hence, it is possible to more accurately estimate the incident angle of the sound signal, that is, the direction in which the target sound source is present.
Furthermore, according to a third aspect of the present invention, in the case that the phase difference (phase difference spectrum) is calculated to obtain the difference between the arrival distances, newly calculated phase differences can be corrected sequentially on the basis of the phase differences calculated at the past sampling times. Because phase difference information at frequencies at which the SN ratios at the past sampling times are large is reflected in the corrected phase difference spectrum, the phase difference does not vary significantly depending on the state of background noise, the change in the content of the sound signal generated from a target sound source, etc. Therefore, it is possible to accurately estimate the incident angle of the sound signal, that is, the direction in which the target sound source is present, on the basis of the more accurate and stable difference between the arrival distances.
Moreover, according to a fourth aspect of the present invention, it is possible to accurately estimate the direction in which a sound source, such as a human being, generating voice is present.
As this invention may be embodied in several forms without departing from the spirit or essential characteristics thereof, the present embodiments are therefore illustrative and not restrictive, since the scope of the invention is defined by the appended claims rather than by the description preceding them, and all changes that fall within metes and bounds of the claims, or equivalence of such metes and bounds thereof, are therefore intended to be embraced by the claims.
Number | Date | Country | Kind |
---|---|---|---|
2006-217293 | Aug 2006 | JP | national |
2007-033911 | Feb 2007 | JP | national |