This invention relates to a device, a method and a program for voice detection, and a recording medium. More particularly, it relates to a device, a method and a program for voice detection, and a recording medium, usable for detecting the voice domain in a dialog system that allows a plurality of speakers to utter simultaneously from different microphones allocated to them.
In a voice collection method, disclosed in Patent Document 1, an output from each of two microphones is divided into a plurality of frequency domains. The difference in parameter values of sound signals, arriving at the microphones, and which are variable by reason of microphone positions, is detected. Based on this difference in detection, frequency components of the respective sound signals are selected for sound source separation. The sound of interest is distinguished from the sound not of interest based on the difference in their frequency characteristics. The sound not of interest is suppressed in the frequency domain. The output frequency components of the respective sound signals are synthesized into sound source signals.
In a noise removal method, disclosed in Patent Document 2, an input time domain signal is separated into a plurality of subcomponents by a signal separation unit. The noise contained in the subcomponents, resulting from the signal separation, is estimated by a noise estimation unit, using the subcomponents. A noise removal unit removes the so estimated noise from the subcomponents.
JP Patent Kokai Publication No. JP2000-081900A
JP Patent Kokai Publication No. JP2005-308771A
It is noted that the total contents disclosed in the above Patent Documents 1 and 2 are to be incorporated by reference herein. The following analysis is given on the part of the present invention.
The methods of the above mentioned Patent Documents 1 and 2 suffer from the problem that voice detection may not be correctly made, for the following reason, in a region where the voices of a plurality of speakers overlap, viz., in a cross-talk region. In the methods of the above mentioned Patent Documents 1 and 2, large-small comparison is first made of the power values of the frequency components of each microphone. The power values of certain predetermined frequency bands or all of the frequency bands are summed together to calculate the total power. As a result, priority is put on the voice of a speaker that has a globally larger power.
It is now presupposed that, during the time a speaker A in front of a microphone A is uttering, a speaker B in front of a microphone B has uttered. In such case, interchange of detection domains occurs at a time point when the large-small relationship between the voice power of the speaker A and that of the speaker B is interchanged. It may be feared at this time that, insofar as the speaker A is concerned, detection is halted while his/her utterance has not yet come to a close and, insofar as the speaker B is concerned, detection is commenced only after some time lapse from the start of his/her utterance. It may also be feared that, depending on the utterance timings of the speakers A and B, the voice from the microphone A and that from the microphone B are detected only in small chunks or fragments.
In view of the above depicted status of the art, it is an object of the present invention to provide a device, a method and a program for voice detection, and a recording medium, usable for detecting the voice domain in a dialog system that allows a plurality of speakers to utter simultaneously from different microphones, according to which the voice may be detected with high accuracy in the cross-talk regions.
Thus, there is much to be desired in the art.
In a first aspect, a voice detection device according to the present invention includes a band-based power calculation unit that calculates, from one preset frequency band width (sub-band) to another, a total of values of the signal power entered from each of a plurality of microphones (sub-band power), and a band-based noise estimation unit that estimates the noise power from one sub-band to another. The voice detection device also includes a band-based SNR calculation unit that, from one sub-band to another, for each of the microphones, calculates a sub-band SNR, and that outputs a largest one of the sub-band SNRs for each microphone, as a microphone of interest, as being an SNR of a microphone of interest. The voice detection device further includes a voice/non-voice decision unit that determines the voice/non-voice for each microphone using the SNR of each microphone.
In a second aspect, for use in a dialog system in which a plurality of speakers are allowed to utter simultaneously from microphones allocated to them, a voice detection method for detecting a voice domain according to the present invention includes a band-based power calculation step that calculates, from one preset frequency band width (sub-band) to another, a total of values of the signal power entered from each of a plurality of microphones (sub-band power), and a band-based noise estimation step that estimates the noise power from one sub-band to another. The voice detection method also includes a band-based SNR calculation step that, from one sub-band to another, for each of the microphones, calculates a sub-band SNR, and that outputs a largest one of the sub-band SNRs for each microphone, as a microphone of interest, as being an SNR of a microphone of interest. The voice detection method further includes a voice/non-voice decision step that determines the voice/non-voice for each microphone using the SNR of each microphone.
In a third aspect, for use in a dialog system in which a plurality of speakers are allowed to utter simultaneously from microphones allocated to them, a voice detection program according to the present invention allows, in order to detect a voice domain, a computer system to execute a band-based power calculation processing that calculates, from one preset frequency band width (sub-band) to another, a total of values of the signal power entered from each of a plurality of microphones (sub-band power), and a band-based noise estimation processing that estimates the noise power from one sub-band to another. The program also allows the computer to execute a band-based SNR calculation processing that, from one sub-band to another, for each of the microphones, calculates a sub-band SNR, and that outputs a largest one of the sub-band SNRs for each microphone, as a microphone of interest, as being an SNR of a microphone of interest. The program further allows the computer to execute a voice/non-voice decision processing that determines the voice/non-voice for each microphone using the SNR of each microphone.
The meritorious effects of the present invention are summarized as follows.
According to the present invention, the voice may be detected to high accuracy in a region of overlap of the voices of a plurality of speakers (cross-talk region). The reason is that the power values of signals, entered from each of a plurality of microphones, may be summed together from one sub-band to another to calculate sub-band SNRs for a given microphone, and the largest one of the sub-band SNRs is used to make voice/non-voice decision for the microphone in question.
A first exemplary embodiment of the present invention will now be described with reference to the drawings.
The band-based power calculation unit 200 includes a frequency power calculation unit 101 and a band-based power integration unit 201.
The frequency power calculation unit 101 slices out an input signal at a preset interval of, for example, 10 msec, and processes the so sliced out signal by pre-emphasis and windowing followed by FFT (Fast Fourier Transform). After the FFT, the frequency power calculation unit 101 calculates the power at a preset frequency division step of M to output the so calculated power values. For example, if a signal with a sampling frequency of 44.1 kHz is processed with FFT at 1024 points, the signal power may be calculated at an interval of approximately 43 Hz. This processing operation is carried out on each of a plurality of microphone signals entered simultaneously. It should be noted that the frequency-based power may be calculated by taking square sums of real and imaginary parts obtained on FFT. The power obtained at such constant frequency division step is here defined as the frequency power.
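By way of illustration only, the frame processing described above may be sketched in Python as follows. The pre-emphasis coefficient of 0.97, the Hamming window, the naive DFT (used here in place of an FFT so that the sketch has no external dependencies) and the function names frequency_power and bin_spacing_hz are assumptions made for this sketch and are not taken from the disclosure.

```python
import cmath
import math

def frequency_power(frame, pre_emphasis=0.97):
    """Return the power of each frequency bin for one frame of samples."""
    n = len(frame)
    # Pre-emphasis: y[t] = x[t] - a * x[t-1] (the coefficient is an assumed value).
    emphasized = [frame[0]] + [frame[t] - pre_emphasis * frame[t - 1]
                               for t in range(1, n)]
    # Hamming window (an assumed choice of window function).
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t, s in enumerate(emphasized)]
    powers = []
    # Naive DFT over the non-redundant bins 0..n/2; an FFT yields the same values.
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        # Frequency power: square sum of the real and imaginary parts.
        powers.append(acc.real ** 2 + acc.imag ** 2)
    return powers

def bin_spacing_hz(sampling_rate_hz, fft_points):
    """Frequency division step between adjacent bins."""
    return sampling_rate_hz / fft_points
```

With a sampling frequency of 44.1 kHz and 1024 points, bin_spacing_hz(44100, 1024) gives about 43.07 Hz, in agreement with the interval stated above.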
Based on these frequency power values, output from the frequency power calculation unit 101, the band-based power integration unit 201 finds a total of the frequency power values for each frequency division step of N, where N>M, to calculate a total of power values for each frequency division step of N. The frequency division step N is here termed the sub-band. The sub-band based power is termed a sub-band power. The band-based power integration unit 201 also saves the sub-band power values for a preset time duration, and calculates the sum of the power values of the preset time duration.
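The integration carried out by the band-based power integration unit 201 may be pictured with the following minimal Python sketch. It assumes sub-bands of constant width expressed as a number of frequency bins, and a retention period expressed as a number of frames. The names subband_power and SubbandAccumulator are hypothetical.

```python
from collections import deque

def subband_power(freq_power, bins_per_subband):
    """Sum consecutive frequency-power values into sub-band power values."""
    return [sum(freq_power[i:i + bins_per_subband])
            for i in range(0, len(freq_power), bins_per_subband)]

class SubbandAccumulator:
    """Keep the sub-band power of the last `frames` frames and return their sum."""

    def __init__(self, frames):
        # Older frames are discarded automatically once `frames` are stored.
        self.history = deque(maxlen=frames)

    def push(self, subband_values):
        self.history.append(list(subband_values))
        # Sum each sub-band over the retained frames.
        return [sum(column) for column in zip(*self.history)]
```

One accumulator of this kind would be kept per microphone, since the sub-band power is calculated for each microphone signal.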
For the sub-band, a constant frequency division step N, where N>M, may be used. However, the width (frequency division step) of taking the sum may be varied from one frequency band to another. An example of varying the width (frequency division step) of taking the sum is varying the frequency division step according to the mel scale, by means of which the principal components of the voice may be expressed with emphasis. In calculating the mel frequency based total, the frequency division step becomes finer (narrower) for a low frequency range, while becoming coarser (broader) for a high frequency range. It should be noted that the sub-band power saving time interval may be constant, or may individually be set from one sub-band to another.
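As one possible way of obtaining sub-bands whose width varies according to the mel scale, the Python sketch below places the sub-band edges at equal spacing on the mel axis. The conversion formula mel = 2595 log10(1 + f/700) is a commonly used approximation and is an assumption here, since the disclosure does not specify a particular mel formula. The function names are likewise hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mel (commonly used approximation)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(low_hz, high_hz, num_bands):
    """Return num_bands + 1 edge frequencies equally spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / num_bands
    return [mel_to_hz(low_mel + i * step) for i in range(num_bands + 1)]
```

Because the edges are equally spaced in mel, the resulting sub-bands are narrower in the low frequency range and broader in the high frequency range.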
The band-based noise estimation unit 202 calculates the sub-band noise power which is the power of the sub-band based noise. The sub-band based noise power may be calculated in accordance with the following sequence from one sub-band to another. Initially, the sub-band power is compared from one microphone to another to select the microphone (speaker) with the maximum power value. The sub-band power is compared from one microphone to another to select the microphone with the minimum power value. The sub-band power of the so selected microphone with the minimum power value is stored. The above mentioned minimum power value stored is rendered the power of the sub-band noise associated with the microphone of the maximum power value. The sub-band noise power values of the remaining microphones are rendered the sub-band power values per se of these microphones. The reason the power values of the remaining microphones are rendered the sub-band power values per se of these microphones is that it is necessary to suppress the mistaken detection otherwise caused by the voice turning around. On the other hand, an SNR of the microphone with the maximum power value is enhanced because its noise power is replaced by the sub-band power of the minimum power value.
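The sequence described above may be sketched in Python as follows. The layout subband_power[m][b] (microphone m, sub-band b), the function name estimate_subband_noise, and the choice of the first microphone in case of a tie for the maximum are assumptions made for illustration.

```python
def estimate_subband_noise(subband_power):
    """Return noise[m][b] for sub-band power values subband_power[m][b]."""
    num_mics = len(subband_power)
    num_bands = len(subband_power[0])
    noise = [[0.0] * num_bands for _ in range(num_mics)]
    for b in range(num_bands):
        column = [subband_power[m][b] for m in range(num_mics)]
        # Microphone with the maximum power in this sub-band (first one on a tie).
        loudest = max(range(num_mics), key=column.__getitem__)
        # Minimum power among all microphones in this sub-band.
        floor = min(column)
        for m in range(num_mics):
            # The loudest microphone takes the minimum power as its noise;
            # every other microphone takes its own sub-band power as noise.
            noise[m][b] = floor if m == loudest else column[m]
    return noise
```

For three microphones with sub-band power values [[8, 1], [2, 4], [3, 2]], the sketch yields the noise values [[2, 1], [2, 1], [3, 2]]: in the first sub-band the first microphone is the loudest and receives the minimum value 2, while the other two keep their own power values.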
The above described processing of band-based noise estimation will now be described with reference to
For each of the microphones, the band-based SNR calculation unit 203 divides the sub-band power by the sub-band noise power from one sub-band to another to find a sub-band based power ratio of the signal to the noise (SNR). This power ratio is termed the sub-band SNR. For each microphone, the largest one of its sub-band SNRs is selected as the SNR of the microphone of interest.
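A minimal Python sketch of this calculation is given below. The function name microphone_snr and the small constant eps, which guards against division by zero, are assumptions of the sketch and not part of the disclosure.

```python
def microphone_snr(subband_power, subband_noise, eps=1e-12):
    """Return, for each microphone, the largest of its sub-band SNRs."""
    result = []
    for power_row, noise_row in zip(subband_power, subband_noise):
        # Sub-band SNR: sub-band power divided by sub-band noise power.
        ratios = [p / max(n, eps) for p, n in zip(power_row, noise_row)]
        # The largest sub-band SNR becomes the SNR of the microphone.
        result.append(max(ratios))
    return result
```

With the power values [[8, 1], [2, 4], [3, 2]] and the noise values [[2, 1], [2, 1], [3, 2]], the first two microphones each obtain an SNR of 4, in different sub-bands, while the third obtains an SNR of 1.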
The processing of calculating the band-based SNR will now be described with reference to
If the SNR, calculated for a given signal by the band-based SNR calculation unit 203, is smaller than a preset threshold value, the voice/non-voice detection unit 104 determines the signal in question to be the non-voice. If the SNR is determined to be larger than the preset threshold value, the voice/non-voice detection unit 104 determines the signal in question to be the voice.
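The threshold decision may be sketched in Python as follows. The threshold value of 2.0 and the function name is_voice are purely illustrative. Treating an SNR exactly equal to the threshold as the non-voice is also an assumption, since that case is not specified above.

```python
def is_voice(snr_per_mic, threshold=2.0):
    """Return True (voice) or False (non-voice) for each microphone."""
    # An SNR larger than the threshold is judged to be the voice.
    return [snr > threshold for snr in snr_per_mic]
```

For the microphone SNR values [4.0, 4.0, 1.0], the result is [True, True, False], so that two microphones may be judged to be the voice at the same time.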
The SNR, calculated by the band-based SNR calculation unit 203 as described above, has taken into account the fact that, depending on the difference in quality of the voice from one speaker to another or on the difference in the contents being uttered, there may be cases where the voice uttered differs in frequency. See the voice power waveforms of the speakers A and B of
To clarify the above mentioned advantageous effect of the above described exemplary embodiment, a formulation of
To calculate the power of the entire frequency range, an SNR calculation unit 103 of
Thus, in the formulation of
A second exemplary embodiment of the present invention takes into account possible applications of the present invention to an environment where the sorts of microphones used by speakers differ from one another or where the transmission systems of the input voices differ from one another. This second exemplary embodiment will now be described. It is presupposed that there are a plurality of microphones and a plurality of speakers each present in front of each of these microphones. Under this presupposition, the formulation of
In order for this presupposition to hold good, all of the microphones must be of the same sort, while the microphones and a sound recording or collecting section must be interconnected in the same way, as the matter of premises. On the other hand, the above premises may not hold good when the microphones are of variable sorts, for example, a fixed microphone or a pin microphone, or when the transmission systems between the microphones and the sound recording or collecting section are of variable types, as when the transmission used is a wired or wireless transmission system. In these cases, the microphones may be of variable characteristics, depending on their types, such that, if the signal of the same level is applied to these microphones, the power values derived from these microphones may differ from one microphone to another. It may also be feared that a signal obtained from a given microphone and transmitted over a transmission system, such as a wired or wireless transmission route, may arrive at the sound recording or collecting section at variable time points.
If these differences are taken into account, the presupposition of the formulation of
The delay estimation unit 21 calculates the power of the voice at a stated interval, from one microphone to another, in order to make the measurement of the time point of rapid rise in the power value. The delay estimation unit calculates a difference from an earliest one of time points of such rapid rises in the power value, and outputs the difference as delay time to the delay correction unit 22. At this time, the power may be calculated as a square sum of the waveforms of division steps of A/D conversion. The time juncture of rapid rise in the power value may be such a time juncture when the power has become larger than a preset threshold value.
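The threshold-based delay estimation may be illustrated by the Python sketch below. It assumes that the power of each microphone has already been measured at a fixed interval. The names onset_index and estimate_delays, and the raising of an error when no rapid rise is found, are assumptions of the sketch.

```python
def onset_index(power_series, threshold):
    """Index of the first measurement whose power exceeds the threshold, or None."""
    for i, p in enumerate(power_series):
        if p > threshold:
            return i
    return None

def estimate_delays(power_per_mic, threshold, interval_ms):
    """Delay of each microphone relative to the earliest rise, in milliseconds."""
    onsets = [onset_index(series, threshold) for series in power_per_mic]
    if any(o is None for o in onsets):
        raise ValueError("no rapid rise found for at least one microphone")
    # The delay is the difference from the earliest time point of rapid rise.
    earliest = min(onsets)
    return [(o - earliest) * interval_ms for o in onsets]
```

For example, if the power of the first microphone exceeds the threshold at the third measurement and that of the second microphone at the fourth, with a measurement interval of 10 msec, the delays are 0 msec and 10 msec.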
In the above described method, the delay time is estimated based on comparison of the power value itself with its threshold value. In an alternative method, a preset time span as from the start of sound recording is assumed to be a noise domain and, using this noise domain, the power of the steady-state noise is estimated. Then, a ratio between the power value of the steady-state noise and each of the signal power values at each time point of power measurement is found as an SNR, and the time point when the SNR has become larger than a threshold value is then found. Such time point is found from one microphone to another. The delay time may be measured by subtracting an earliest one of the time points of the microphones from the time point as measured with each microphone.
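The alternative SNR-based method may be sketched in Python as follows. The number of leading measurements treated as the noise domain (noise_frames), the use of their mean as the steady-state noise power, the constant eps and the function names are assumptions made for illustration.

```python
def snr_onset_index(power_series, noise_frames, snr_threshold, eps=1e-12):
    """Index of the first measurement after the noise domain whose SNR exceeds the threshold."""
    # The leading noise_frames measurements are assumed to contain only noise.
    noise = max(sum(power_series[:noise_frames]) / noise_frames, eps)
    for i in range(noise_frames, len(power_series)):
        if power_series[i] / noise > snr_threshold:
            return i
    return None

def snr_based_delays(power_per_mic, noise_frames, snr_threshold):
    """Delay of each microphone, in measurement intervals, relative to the earliest one."""
    onsets = [snr_onset_index(series, noise_frames, snr_threshold)
              for series in power_per_mic]
    if any(o is None for o in onsets):
        raise ValueError("SNR never exceeded the threshold for some microphone")
    earliest = min(onsets)
    return [o - earliest for o in onsets]
```

Since each microphone is compared against its own noise power, this variant does not require the absolute power levels of the microphones to be alike.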
The delay correction unit 22 holds the input signal from each microphone for a preset time duration and outputs it at a timing hastened by a time corresponding to the delay time output from the delay estimation unit 21. It should be noted that the lower limit of the volume of the signal held by the delay correction unit 22 is to be not less than the delay caused between the microphones, that is, the differences of signal arrival timings. For example, if no delay is caused in the first microphone and a delay of 500 msec is caused in the second microphone, the delay time of 500 msec is output as the delay time from the delay estimation unit 21. The delay correction unit 22 then outputs the signal of the first microphone after a delay time of 500 msec.
In more detail, in case an input signal is subjected to A/D conversion, with the sampling frequency of 44.1 kHz and the number of quantization bits of 24, 22050 samples are held as a 500 msec signal. The memory used for holding this signal is termed a buffer. The delay correction unit 22 takes out the signal of the first microphone from the leading end of the buffer, while taking out the signal of the second microphone from the trailing end of the buffer. These signals of the first and second microphones are output simultaneously. Each time a new A/D converted signal is entered to the buffer, the old signal stored in the buffer is updated to the new signal. Thus, by continuing this sequence of operations, it is possible to output non-delayed signals continuously.
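The buffering operation may be pictured with the following Python sketch, in which each microphone has its own delay line rather than a single shared buffer. This arrangement, the zero-filled initial contents of the delay lines, and the names samples_for_delay and DelayAligner are assumptions made for illustration.

```python
from collections import deque

def samples_for_delay(sampling_rate_hz, delay_ms):
    """Number of samples corresponding to a delay given in milliseconds."""
    return round(sampling_rate_hz * delay_ms / 1000)

class DelayAligner:
    """Hold each microphone signal so that all signals are output in step."""

    def __init__(self, delays_in_samples):
        longest = max(delays_in_samples)
        # A microphone arriving d samples late is held for (longest - d) samples,
        # so the least delayed microphone is held the longest.
        self.lines = [deque([0.0] * (longest - d)) for d in delays_in_samples]

    def push(self, samples):
        """Enter one new sample per microphone and return the aligned samples."""
        out = []
        for line, sample in zip(self.lines, samples):
            line.append(sample)
            out.append(line.popleft())
        return out
```

Here samples_for_delay(44100, 500) gives 22050, in agreement with the number of samples stated above.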
The correction sound volume estimation unit 23 calculates power values of signals of the microphones for a preset time duration. After the calculations, the correction sound volume estimation unit divides the power values by the time duration to find averaged power values. The correction sound volume estimation unit then divides the power values of all of the microphones by the largest one of the averaged power values of the respective microphones. The correction sound volume estimation unit then outputs resulting values as correction coefficients to the sound volume correction unit 24. It should be noted that the signal used for calculating the correction coefficients may preferably be the signal equally supplied to the respective microphones, such as, for example, the background noise.
Or, the smallest power value or the smallest averaged power value, which may prove to be a reference power, may be selected in place of the largest averaged power value. The values of the ratio of the power values of the respective microphones to the so selected reference power may then be used as the correction coefficients.
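The calculation of the correction coefficients, with either the largest or the smallest averaged power value as the reference power, may be sketched in Python as follows. The function names and the reference argument are hypothetical, and the sketch only forms the ratios described above.

```python
def average_power(samples):
    """Square sum of the samples divided by their number."""
    return sum(s * s for s in samples) / len(samples)

def correction_coefficients(signals, reference="max"):
    """Ratio of each microphone's averaged power to the reference power."""
    averages = [average_power(sig) for sig in signals]
    # The reference power is the largest or the smallest averaged power value.
    ref = max(averages) if reference == "max" else min(averages)
    if ref == 0:
        raise ValueError("reference power is zero")
    return [a / ref for a in averages]
```

For two microphones whose averaged power values are 4 and 1, the ratios are [1.0, 0.25] with the largest value as the reference and [4.0, 1.0] with the smallest value as the reference.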
The sound volume correction unit 24 multiplies the input signals from the respective microphones by the correction coefficients output from the correction sound volume estimation unit 23, and outputs the resulting signals. Specifically, the output signals may be obtained by multiplying the signals output from the A/D conversion by the above mentioned correction coefficients. An analog signal prior to the A/D conversion may be amplified by a general-purpose amplifier for audio equipment. This operation is to be carried out for each microphone signal.
The voice detection device of the present exemplary embodiment is configured for eliminating the delay and differences in the sound volume, otherwise caused from one microphone to another, as described above. It is thus possible to improve the accuracy in voice detection in an environment with variable microphone types and variable transmission systems. The reason is that timing adjustment corresponding to the delay time as well as sound volume correction with the correction coefficients has already been made with the input signal.
In particular, if the present exemplary embodiment is applied to the voice detection device of the above described first exemplary embodiment, it is possible to further improve the voice detection accuracy in a cross-talk region. The arrangement of the present exemplary embodiment may, of course, be applied to the voice detection device shown in
A third exemplary embodiment of the present invention, improved in connection with the above described second exemplary embodiment, will now be described in detail.
The sudden sound generation unit 25 is run in operation by a preset starting means, such as a switch, and outputs a large sound (sudden sound). The sudden sound is preferably a sound that covers the entire frequency range and that has its power value enlarged precipitously.
The delay estimation unit 21 and/or the correction sound volume estimation unit 23 is set into operation by the abrupt sound output from the sudden sound generation unit 25, whereby it is possible to improve the measurement accuracy of the correction coefficients as well as the delay time. The delay time and the correction coefficients may both be correctly calculated if, in a room where a plurality of microphones of variable types are set, the sudden sound generation unit 25 is run into operation after keeping the room in a state of silence for some time.
Although certain preferred exemplary embodiments of the present invention have so far been described, the present invention is not to be limited to these exemplary embodiments, such that further alterations, substitutions or adjustments may be made without departing from the fundamental technical concept of the present invention. For example, in an environment where no delay is likely to be caused, the delay estimation unit 21 and the delay correction unit 22 in the above described second and third exemplary embodiments may be dispensed with. In similar manner, in an environment where the difference in the sound volume is not likely to be produced, both the correction sound volume estimation unit 23 and the sound volume correction unit 24 in the above described second exemplary embodiment may be dispensed with.
In addition, in the above described first exemplary embodiment, the band-based power, that is, the sub-band power, is calculated by a setup composed of the frequency power calculation unit 101 and the band-based power integration unit 201. It is however possible to combine the frequency power calculation unit 101 and the band-based power integration unit 201 in one processing block in which to carry out the processing operations of the respective units.
It is to be noted that the equations for calculating the SNR or the signal power shown in the above described exemplary embodiments are given only by way of example for illustration. Viz., a variety of methods for calculations that may occur to those skilled in the art may be used without departing from the scope of the invention.
The present invention may be used for a variety of applications, including a voice detection device and a program for implementing the voice detection device on a computer. The particular exemplary embodiments or examples may be modified or adjusted within the gamut of the entire disclosure of the present invention, inclusive of claims, based on the fundamental technical concept of the invention. Further, a wide variety of combinations or selections of elements disclosed herein may be made within the framework of the claims. That is, the present invention may encompass a variety of modifications or corrections that may occur to those skilled in the art in accordance with and within the gamut of the entire disclosure of the present invention, inclusive of claim and the technical concept of the present invention.
In the following, preferred modes are summarized. (refer to the voice detection device of the first aspect)
The voice detection device according to mode 1, wherein
said band-based noise estimation unit sets the sub-band noise power of other microphones so as to be the sub-band power of said other microphones.
The voice detection device according to mode 1 or 2, wherein
said sub-band is set so as to be narrower in width in a low frequency range and so as to be broader in width in a high frequency range.
The voice detection device according to any one of modes 1-3, further comprising:
a delay correction unit that corrects the delay of a signal entered from each of said microphones.
The voice detection device according to any one of modes 1-4, further comprising:
a sound volume correction unit that corrects the sound volume of a signal entered from each of said microphones.
The voice detection device according to mode 4 or 5, further comprising:
a delay time measurement unit that measures time points of rapid change in the power values of signals from said microphones to output the differences between said time points as the delay time to said delay correction unit.
The voice detection device according to mode 5 or 6, further comprising:
a correction sound volume estimation unit that calculates the values of the ratio of the power values of the respective microphones to output the resulting ratio values as correction coefficients to said sound volume correction unit.
The voice detection device according to mode 6 or 7, further comprising:
a sudden sound generation unit that outputs an abrupt sound of a short time duration.
The voice detection device according to any one of modes 1-8, wherein
said band-based power calculation unit calculates, from one preset frequency width (sub-band) to another, a total of power values for the preset frequency widths (sub-band power) for a preset time duration.
(refer to the voice detection method of the second aspect)
The voice detection method according to mode 10, wherein,
said band-based noise estimation unit sets the sub-band noise power of other microphones so as to be the sub-band power of said other microphones.
The voice detection method according to mode 10 or 11, wherein
said sub-band is set so as to be narrower in width in a low frequency range and so as to be broader in width in a high frequency range.
The voice detection method according to any one of modes 10-12, further comprising:
a delay correction step that corrects the delay of a signal entered from each of said microphones.
The voice detection method according to any one of modes 10-13, further comprising:
a sound volume correction step that corrects the sound volume of a signal entered from each of said microphones.
The voice detection method according to mode 13 or 14, further comprising:
a delay time measurement step of measuring time points of rapid change in the power values of signals from said microphones to output the differences between said time points as the delay time to said delay correction unit.
The voice detection method according to mode 14 or 15, further comprising:
a correction sound volume estimation step that calculates the values of the ratio of the power values of the respective microphones to output the resulting ratio values as correction coefficients to said sound volume correction unit.
The voice detection method according to mode 15 or 16, wherein
the delay time or the power ratio of signals from the respective microphones is calculated based on an output signal from a sudden sound generation unit that outputs a sudden sound of a short time duration.
The voice detection method according to any one of modes 10-17, wherein
said band-based power calculation step calculates, from one frequency width (sub-band) to another, a total of power values at an interval of said frequency width for a preset time duration.
(refer to the voice detection program of the third aspect)
The voice detection program according to mode 19, wherein,
in said band-based noise estimation processing, said band-based noise estimation unit sets the sub-band noise power of other microphones so as to be the sub-band power of said other microphones.
The voice detection program according to mode 19 or 20, wherein
said sub-band is set so as to be narrower in width in a low frequency range and so as to be broader in width in a high frequency range.
The voice detection program according to any one of modes 19-21, wherein the program further allows a computer to execute a delay correction processing that corrects the delay of a signal entered from each of said microphones.
The voice detection program according to any one of modes 19-22, further comprising:
a sound volume correction processing that corrects the sound volume of a signal entered from each of said microphones.
The voice detection program according to mode 22 or 23, further comprising:
a delay time measurement processing of measuring time points of rapid change in the power values of signals from said microphones to output the differences between said time points as the delay time to said delay correction unit.
The voice detection program according to mode 23 or 24, further comprising:
a correction sound volume estimation processing that calculates the values of the ratio of the power values of the respective microphones to output the resulting ratio values as correction coefficients to said sound volume correction unit.
The voice detection program according to mode 24 or 25, wherein
the delay time or the power ratio of signals from the respective microphones is calculated based on an output signal from a sudden sound generation unit that outputs a sudden sound of a short time duration.
The voice detection program according to any one of modes 19-26, wherein
said band-based power calculation processing calculates, from one frequency width to another, a total of power values at an interval of said frequency width for a preset time duration.
A recording medium having stored therein the program according to any one of modes 19 to 27.
Number | Date | Country | Kind |
---|---|---|---|
2008-139541 | May 2008 | JP | national |
The present application is the National Phase of PCT/JP2009/059610, filed May 26, 2009, which claims priority rights based on the Japanese Patent Application 2008-139541 filed on May 28, 2008. The total of the contents disclosed in the Application of the senior filing date is to be incorporated by reference herein.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2009/059610 | 5/26/2009 | WO | 00 | 11/17/2010 |