This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2014-098021, filed on May 9, 2014, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a speech enhancement device, a speech enhancement method, and a speech enhancement computer program which are configured to enhance an input signal, for example.
An input signal generated by collecting speech with a microphone may include a noise component, or a signal component corresponding to voice of a speaker may be small in the input signal. When an input signal includes a noise component or when the signal component is small, speech of a speaker may be unclear in the input signal. In addition, in the case of a device configured to recognize speech of a speaker in an input signal and perform processing corresponding to the speech, if the speech of the speaker is unclear, the device may fail to perform desired processing due to deterioration of the accuracy of speech recognition. To address this, a technology called Auto Gain Control (AGC) that automatically adjusts the level of an input signal has been utilized (see Japanese Laid-open Patent Publication No. 56-84013, for example).
However, excessive adjustment of the level of an input signal may increase distortion of the input signal or may even enhance a noise component, so that speech of a speaker does not necessarily become clear. In particular, when one word is long, the voice of a speaker tends to become smaller as the speech approaches the ending of the word. As a result, a signal corresponding to the word may not be clearly identified in the input signal. In such a case, even if the conventional AGC is applied to the input signal, the speech of the speaker included in that input signal may remain unclear.
Hence, as one aspect, an object of the specification is to provide a speech enhancement device capable of making clear speech of a speaker which is included in an input signal, even when volume of speech produced by the speaker changes according to a time from beginning of speech production.
According to an aspect of the invention, a speech enhancement device includes: a speech production section detection unit configured to detect a speech production section in which a speaker produces speech, from an input signal generated by a speech input unit; a timer unit configured to measure an elapsed time from a starting point of the speech production section; a gain determination unit configured to determine a gain, which represents a level of enhancement of the input signal, according to the elapsed time; and an enhancement unit configured to enhance the input signal or a spectrum signal of the input signal in the speech production section according to the gain, whereby the input signal is enhanced only at necessary portions thereof.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
A speech enhancement device according to an embodiment is described hereinafter with reference to the drawings.
When a speaker continuously produces speech for a long period of time, volume of speech produced by the speaker may be reduced towards the ending of a word. Thus, even when the level of an input signal is adjusted using a same gain over an entire speech production section that is a section in an input signal in which the speaker produces speech, the speech of the speaker does not necessarily become clear.
In addition, even when an input signal is separated into sub-sections each being shorter than a speech production section and the levels of the input signals in the sub-sections are individually and independently adjusted, a gain may discontinuously change in adjacent sections. Thus, speech may be distorted, or noise may be enhanced in a part, in which the volume of speech produced by the speaker is temporarily reduced, between two consecutive speech production sections or within a single speech production section. Consequently, the speech of the speaker may not become clear.
Hence, this speech enhancement device adjusts a gain, which represents a level of enhancement of an input signal, according to an elapsed time from a starting point of a speech production section of a speaker, and thereby makes the speech of the speaker clear in the input signal even when the volume of the speech produced by the speaker changes according to the elapsed time. In addition, by enhancing the input signal from a time point when the elapsed time reaches a predetermined time, this speech enhancement device may make the speech of the speaker clear in the input signal even when the volume of produced speech is reduced at the ending of the word.
The microphone 2 is an example of a speech input unit and is configured to collect sound around the speech enhancement device 1, generate an analog input signal corresponding to intensity of the sound, and output the analog input signal to the amplifier 3. The amplifier 3 is configured to amplify the analog input signal and then output the amplified analog input signal to the analog/digital converter 4. The analog/digital converter 4 is configured to generate a digitized input signal by sampling the amplified analog input signal in a predetermined sampling cycle. Then, the analog/digital converter 4 is configured to output the digitized input signal to the processor 5. Note that the digitized input signal is hereinafter simply referred to as an input signal.
The processor 5 has one or more processor components, a readable and writable memory circuit, and a peripheral circuit of the memory circuit. Then, the processor 5 obtains a corrected input signal by performing speech enhancement processing on an input signal. Then, the processor 5 performs speech recognition processing on the corrected input signal and performs processing according to speech of a speaker. Alternatively, the processor 5 may output the corrected input signal to other devices via a communication interface (not illustrated).
The power calculation unit 11 is configured to divide an input signal into frames each having a predetermined length and to calculate power of speech in every frame. The frame length is set to 32 msec, for example. Note that the power calculation unit 11 may also make two consecutive frames partially overlap. In this case, the power calculation unit 11 may set the frame shift amount, that is, the length of the portion newly included in the next frame when a shift is made from the current frame to the next frame, to 10 msec to 16 msec, for example.
The power calculation unit 11 uses time-frequency transform to transform an input signal from a time domain into a spectrum signal in a frequency domain, for every frame. The power calculation unit 11 may use, for example, fast Fourier transform (FFT) or modified discrete cosine transform (MDCT) as the time-frequency transform. Note that the power calculation unit 11 may also perform the time-frequency transform after multiplying each frame by a window function like Hamming window or Hanning window.
For example, when frame length is 32 msec and a sampling rate of the analog/digital converter 4 is 8 kHz, every frame includes 256 sample points. Thus, the power calculation unit 11 performs FFT on the 256 points.
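The framing described above may be sketched as follows. This is a minimal illustration only; the function name `split_frames`, the 16 msec shift, and the choice to drop a trailing partial frame are assumptions made for the example and are not taken from the embodiment.

```python
def split_frames(samples, sample_rate_hz=8000, frame_ms=32, shift_ms=16):
    """Split a sample sequence into fixed-length, possibly overlapping frames."""
    frame_len = sample_rate_hz * frame_ms // 1000  # 8000 * 32 / 1000 = 256 samples
    shift = sample_rate_hz * shift_ms // 1000      # 8000 * 16 / 1000 = 128 samples
    frames = []
    start = 0
    # Assumption: a trailing partial frame is simply dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


frames = split_frames([0.0] * 8000)  # one second of silence sampled at 8 kHz
```

With a 32 msec frame at 8 kHz, each element of `frames` holds the 256 sample points on which the FFT is performed.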
For every frame, the power calculation unit 11 calculates from a spectrum signal of that frame a power integration value in a frequency band in which human voice is included, as a characteristic amount representative of characteristics of the human voice.
The power calculation unit 11 calculates a power integration value in a frequency band in which human voice is included, according to the following expression, for example:
P=Σ|S(f)|² (sum over fmin≦f≦fmax) (1)
where S(f) is a spectrum signal at a frequency f, and |S(f)|² is a power spectrum at the frequency f. In addition, fmin and fmax represent a lower limit and an upper limit, respectively, of a frequency band in which human voice is included. Then, P is a power integration value.
Note that the power calculation unit 11 may directly determine a power integration value from a sum of squares of the sample points in every frame, without performing the time-frequency transform of the frame.
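The two ways of obtaining a power integration value may be sketched as follows. The sketch assumes that the power integration value is the plain sum of |S(f)|² over the band; the naive DFT (used here in place of an FFT to keep the example free of external libraries), the function names, and the 300 Hz to 3400 Hz band are all assumptions for illustration.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) discrete Fourier transform; an FFT would be used in practice."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
            for k in range(n)]


def band_power(frame, sample_rate_hz, f_min_hz, f_max_hz):
    """Sum |S(f)|^2 over the bins whose frequency lies in [f_min_hz, f_max_hz]."""
    n = len(frame)
    spectrum = dft(frame)
    total = 0.0
    for k in range(n // 2 + 1):  # non-negative frequencies only
        freq = k * sample_rate_hz / n
        if f_min_hz <= freq <= f_max_hz:
            total += abs(spectrum[k]) ** 2
    return total


def time_domain_power(frame):
    """Sum of squares of the sample points, without any transform."""
    return sum(x * x for x in frame)


# A 1 kHz tone sampled at 8 kHz, one 256-sample frame.
tone = [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(256)]
in_band = band_power(tone, 8000, 300, 3400)       # assumed voice band
out_of_band = band_power(tone, 8000, 2000, 3000)  # band that excludes the tone
```

In this example the tone falls exactly on one frequency bin, so nearly all of its power appears in `in_band` and almost none in `out_of_band`.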
The power calculation unit 11 notifies the speech production section detection unit 12 of the power integration value in every frame. The power calculation unit 11 also outputs a spectrum signal of each frequency for every frame to the speech production section detection unit 12 and the enhancement unit 15. Note that instead of enhancing a spectrum signal, the power calculation unit 11 may directly enhance an inputted input signal, as depicted by a dot-line in
The speech production section detection unit 12 detects a speech production section from the input signal based on the power integration values for the respective frames. In the first embodiment, the speech production section detection unit 12 detects a speech production section by judging whether or not each frame is included in the speech production section based on the power integration value for the frame.
When a power integration value of a frame on which the speech production section detection unit 12 focuses is larger than a noise judgment threshold Thn, the speech production section detection unit 12 judges that the frame is included in a speech production section. In addition, it is preferable that the noise judgment threshold Thn is adaptively set according to a background noise level included in an input signal. To this end, when an integration value of a power spectrum of an entire frequency band of a frame is less than a predetermined power threshold, the speech production section detection unit 12 judges that the frame is a silent frame in which no sound other than the background noise is included. Then, the speech production section detection unit 12 estimates the background noise level based on the power integration value of the silent frame. For example, the speech production section detection unit 12 estimates the background noise level according to the following expression:
noiseP′=0.01·Ps+0.99·noiseP (2)
where Ps is a power integration value in the newest silent frame, and noiseP is the background noise level prior to updating. Then, noiseP′ is the background noise level after updating. In this case, the noise judgment threshold Thn is set according to the following expression, for example:
Thn=noiseP+γ (3)
where γ is a preset constant and set to 2 to 3 [dB], for example.
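Expressions (2) and (3) may be sketched as follows. The function names are hypothetical, and the sketch assumes that the power integration value and the background noise level are both expressed in dB, since γ is given in dB.

```python
def update_noise_level(noise_p, silent_frame_p, weight=0.01):
    """Expression (2): noiseP' = 0.01 * Ps + 0.99 * noiseP."""
    return weight * silent_frame_p + (1.0 - weight) * noise_p


def is_speech_frame(frame_p_db, noise_p_db, gamma_db=3.0):
    """Expression (3): the frame is speech when P exceeds Thn = noiseP + gamma."""
    threshold = noise_p_db + gamma_db
    return frame_p_db > threshold


noise = update_noise_level(40.0, 50.0)  # one silent frame at 50 dB nudges the estimate
```

Because the weight on the newest silent frame is small, a single loud silent frame moves the noise estimate only slightly, which keeps the threshold Thn stable.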
For every frame, the speech production section detection unit 12 notifies the timer unit 13 of a judgment result of whether or not the frame is included in a speech production section.
The timer unit 13 has a timer, for example, and is configured to measure an elapsed time after a speech production section starts. In the first embodiment, the timer unit 13 starts time measurement when the immediately preceding frame is not included in a speech production section and the current frame is included in the speech production section. Then, the timer unit 13 continues the time measurement of the elapsed time while receiving from the speech production section detection unit 12 a judgment result that a frame is included in the speech production section. Then, when receiving from the speech production section detection unit 12 a judgment result that a frame is not included in the speech production section, the timer unit 13 finishes the time measurement and resets the elapsed time to 0. In addition, the timer unit 13 sets the elapsed time to 0 for a frame which is not included in a speech production section.
For every frame, the timer unit 13 notifies the gain determination unit 14 of the elapsed time after a speech production section starts.
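The behavior of the timer unit 13 may be sketched as follows. The class name `SpeechTimer`, the counting of time in multiples of a constant frame shift, and the choice that the first frame of a section has an elapsed time of 0 are assumptions for illustration.

```python
class SpeechTimer:
    """Tracks the elapsed time from the start of the current speech production section."""

    def __init__(self, frame_shift_s):
        self.frame_shift_s = frame_shift_s  # time advanced per frame (assumed constant)
        self.frames_in_section = 0          # speech frames seen before the current one
        self.in_section = False

    def update(self, frame_is_speech):
        """Feed one judgment result and return the elapsed time in seconds."""
        if not frame_is_speech:
            # Outside a speech production section: stop and reset to 0.
            self.in_section = False
            self.frames_in_section = 0
            return 0.0
        if not self.in_section:
            # Previous frame was not speech, current frame is: start measuring.
            self.in_section = True
            self.frames_in_section = 0
        else:
            self.frames_in_section += 1
        return self.frames_in_section * self.frame_shift_s


timer = SpeechTimer(frame_shift_s=0.016)
elapsed = [timer.update(flag) for flag in (False, True, True, True, False, True)]
```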
The gain determination unit 14 adjusts a gain which represents a level of enhancement of an input signal according to the elapsed time after a speech production section starts. In the first embodiment, the gain determination unit 14 keeps the gain at a certain level till the elapsed time after the start of the speech production section exceeds adjustment start time. When the elapsed time exceeds the adjustment start time, the gain determination unit 14 sets the gain higher as the elapsed time is longer. With this, even if volume of produced speech of a speaker becomes smaller towards the ending of a word, the speech enhancement device 1 may selectively enhance the speech of an ending part of the word. On the other hand, the speech enhancement device 1 may control excessive enhancement of a leading part of a speech production section whose volume is sufficient, and thereby suppress distortion of a corrected input signal.
G=ρ^(t−β) (when β≦t<β′) (4)
where t represents an elapsed time from a starting point of a speech production section. In addition, ρ is a constant larger than 1.0.
Depending on a speaker, volume is rapidly reduced when the speaker comes closer to the ending of a word. Even in such a case, according to the above example, since the speech enhancement device 1 rapidly increases the gain G as the speaker comes closer to a termination of a speech production section, the speech enhancement device 1 may appropriately enhance a part in which the volume is reduced in speech of the speaker.
In addition, the adjustment start time β may be set to 0. More specifically, the gain G may be adjusted from the starting point of the speech production section. In this case, it is preferable that the gain G is calculated according to the expression (4), so that an input signal is not excessively enhanced in a leading part of a speech production section in which volume of produced speech of a speaker is sufficient.
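The dependence of the gain on the elapsed time may be sketched as follows, assuming that expression (4) is read as ρ raised to the power (t − β). The function name is hypothetical. In addition, holding the gain after the adjustment completion time β′ at the value reached at β′ is an assumption made here so that the gain stays continuous.

```python
def gain_from_elapsed_time(t, beta, beta_end, rho):
    """Gain G for elapsed time t (same time unit as beta and beta_end)."""
    if t < beta:
        return 1.0                     # leading part: no enhancement
    if t < beta_end:
        return rho ** (t - beta)       # expression (4), read as an exponential
    # Assumption: after beta_end the gain is held at the value reached at beta_end.
    return rho ** (beta_end - beta)


# Gain sampled every 0.5 s with beta = 0.5 s, beta' = 1.5 s, rho = 2.0.
curve = [gain_from_elapsed_time(0.5 * i, 0.5, 1.5, 2.0) for i in range(5)]
```

Because ρ is larger than 1.0, the gain starts at exactly 1.0 when t equals β and then grows faster and faster, so the leading part of the speech production section is barely enhanced while the ending part is enhanced strongly.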
The gain determination unit 14 determines the gain G by the graph in
The enhancement unit 15 enhances an input signal for every frame, according to the gain G received from the gain determination unit 14. In the first embodiment, the enhancement unit 15 enhances a spectrum signal of each frequency, according to the following expression:
where |S′(f)|² represents a power spectrum at a frequency f after enhancement. Then, S′(f) represents a spectrum signal at the frequency f after enhancement. Note that the enhancement unit 15 may reduce a noise component from the enhanced power spectrum |S′(f)|².
The enhancement unit 15 obtains a corrected input signal for every frame by transforming a corrected spectrum signal into a signal in a time domain through frequency-time transform. Note that the frequency-time transform is inverse transform of the time-frequency transform performed by the power calculation unit 11. Lastly, the enhancement unit 15 obtains a corrected input signal by combining a corrected input signal of every continuous frame.
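The enhancement of a spectrum signal may be sketched as follows. The sketch assumes that the gain G multiplies the power spectrum, that is, |S′(f)|² = G·|S(f)|², which is one reading consistent with a gain of 1.0 leaving the signal unadjusted; the actual form of expression (5) may differ. The function name is hypothetical.

```python
import math


def enhance_spectrum(spectrum, gain):
    """Scale each complex spectrum value so that its power is multiplied by gain."""
    scale = math.sqrt(gain)  # amplitude scale corresponding to a power gain
    return [scale * s for s in spectrum]


spectrum = [1 + 1j, 2j, 0.5 + 0j]
enhanced = enhance_spectrum(spectrum, 4.0)
unchanged = enhance_spectrum(spectrum, 1.0)
```

Scaling by a real factor keeps the phase of every spectrum value, so only the level changes when the corrected spectrum is transformed back to the time domain.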
The power calculation unit 11 divides an input signal for every frame and calculates a power integration value in a current frame (step S101). Then, the power calculation unit 11 outputs the power integration value to the speech production section detection unit 12 and a spectrum signal of each frequency to the speech production section detection unit 12 and the enhancement unit 15.
The speech production section detection unit 12 judges based on the power integration value whether or not the current frame is included in the speech production section (step S102). When the current frame is not included in the speech production section (step S102—No), the processor 5 does not enhance the input signal. Then, the processor 5 finishes the speech enhancement processing. On the other hand, when the current frame is included in the speech production section (step S102—Yes), the speech production section detection unit 12 notifies the timer unit 13 of the judgment result.
The timer unit 13 measures an elapsed time t from a starting point of the speech production section to the current frame, according to the judgment result received from the speech production section detection unit 12 (step S103). Then, the timer unit 13 notifies the gain determination unit 14 of the elapsed time t.
The gain determination unit 14 judges whether or not the elapsed time t from beginning of the speech production section is between the adjustment start time β, inclusive, and the adjustment completion time β′, exclusive (step S104). When the elapsed time t does not reach the adjustment start time β (step S104—No), the gain determination unit 14 sets the gain G to 1.0 (step S105). In addition, when the elapsed time t reaches or exceeds the adjustment completion time β′ (step S104—No), the gain determination unit 14 sets the gain G to α (step S106). On the other hand, when the elapsed time t is between the adjustment start time β, inclusive, and the adjustment completion time β′, exclusive (step S104—Yes), the gain determination unit 14 sets the gain G to a value which is higher as the elapsed time t is longer (step S107). After step S105, S106, or S107, the gain determination unit 14 notifies the enhancement unit 15 of the gain G.
The enhancement unit 15 enhances the input signal of the current frame according to the gain G to obtain a corrected input signal (step S108).
Then, the speech enhancement device 1 finishes the speech enhancement processing.
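The flow of steps S102 to S107 may be sketched for a sequence of frames as follows. The function name, the fixed threshold, and the linear ramp between 1.0 and α used for step S107 are assumptions for illustration; the embodiment only requires that the gain be higher as the elapsed time is longer.

```python
def gains_per_frame(frame_powers, threshold, frame_shift_s, beta, beta_end, alpha):
    """Return one gain per frame from per-frame power integration values."""
    gains = []
    speech_frames = 0  # speech frames already seen in the current section
    for power in frame_powers:
        if power <= threshold:              # step S102: No
            speech_frames = 0
            gains.append(1.0)               # the frame is not enhanced
            continue
        t = speech_frames * frame_shift_s   # step S103: elapsed time
        speech_frames += 1
        if t < beta:                        # step S105
            gain = 1.0
        elif t >= beta_end:                 # step S106
            gain = alpha
        else:                               # step S107 (assumed linear ramp)
            gain = 1.0 + (alpha - 1.0) * (t - beta) / (beta_end - beta)
        gains.append(gain)
    return gains


gains = gains_per_frame([0, 10, 10, 10, 10, 10, 0, 10], threshold=5,
                        frame_shift_s=1.0, beta=1.0, beta_end=3.0, alpha=2.0)
```

The silent frame in the middle resets the elapsed time, so the last frame starts a new speech production section with a gain of 1.0.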
As described above, since the speech enhancement device adjusts a gain according to an elapsed time from a starting point of a speech production section, the speech enhancement device may appropriately correct an input signal according to a change in volume of produced speech of a speaker in the speech production section. For example, even when speech of a long word is produced while causing the volume of produced speech to drop towards the ending of the word, the speech enhancement device may correct the input signal so that the speech of the speaker becomes clear. Since the speech enhancement device determines a gain depending on an elapsed time from beginning of a speech production section, the gain continuously changes unlike a case in which gain is determined for every short period of time. Such continuous change in the gain makes it less likely to generate a discontinuous part in a corrected input signal. Thus, the speech enhancement device may obtain a corrected input signal which may contribute to an improvement in the accuracy of the speech recognition.
Then, a speech enhancement device according to a second embodiment is described hereinafter. The speech enhancement device according to the second embodiment determines likelihood of human voice in a speech production section and increases a gain as the human voice likelihood is higher.
The processor 51 of the speech enhancement device according to the second embodiment is different from the processor 5 of the speech enhancement device according to the first embodiment in that the processor 51 has the speech likelihood measurement unit 16 and performs different processing of the gain determination unit 14. Thus, the speech likelihood measurement unit 16 and the gain determination unit 14 are described hereinafter. For other components of the speech enhancement device, see the description on the corresponding components of the first embodiment.
The speech likelihood measurement unit 16 determines speech likelihood, which is a degree representative of human voice likelihood, for every frame of an input signal included in a speech production section. In the second embodiment, a microphone 2 is installed to collect sound of speaker's voice. Thus, when power of an input signal is large, it is considered that the speaker is producing speech. Then, the speech likelihood measurement unit 16 determines speech likelihood τ based on a power integration value P of an input signal in a speech production section. In addition, in the second embodiment, the speech likelihood τ takes a value from 0 to 1 and indicates that an input signal more likely represents human voice as the value is larger.
On the other hand, when the power integration value P exceeds the lower limit threshold γ and is equal to or less than an upper limit threshold γ′, the speech likelihood measurement unit 16 linearly and monotonically increases the speech likelihood τ as the power integration value P is larger. Then, when the power integration value P exceeds the upper limit threshold γ′, the speech likelihood measurement unit 16 sets the speech likelihood τ to 1.0. More specifically, the speech likelihood measurement unit 16 calculates the speech likelihood τ according to the following expression:
τ=0.0 (when P<γ)
τ=(P−γ)/(γ′−γ) (when γ≦P<γ′)
τ=1.0 (when γ′≦P) (6)
In addition, the lower limit threshold γ is set to an average value of power integration values P of respective frames included in an immediately preceding predetermined period, for example. The predetermined period is set to several seconds to several tens of seconds so that more than one speech production section is included, for example. Alternatively, the lower limit threshold γ may be a background noise estimated value noiseP′ calculated with the expression (2) or a value obtained by adding a predetermined offset value (1 to 3 dB, for example) to the background noise estimated value noiseP′. Alternatively, the lower limit threshold γ may also be a fixed value that is set in advance. In addition, the upper limit threshold γ′ is set to a value obtained by adding a predetermined value to the lower limit threshold γ. Note that the predetermined value is experimentally defined and set to +12 dB, for example, so that the upper limit threshold γ′ is a power integration value from which it is estimated that an input signal is certainly human voice.
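Expression (6) may be sketched as follows. The function name is hypothetical, and the example thresholds (a lower limit of 50 dB and an upper limit 12 dB above it) are assumed values chosen only to exercise the three branches.

```python
def speech_likelihood_from_power(p, lower, upper):
    """Expression (6): map a power integration value P to a likelihood in [0, 1]."""
    if p < lower:
        return 0.0
    if p < upper:
        return (p - lower) / (upper - lower)  # linear increase between the thresholds
    return 1.0


lower_db = 50.0
upper_db = lower_db + 12.0  # upper limit = lower limit + predetermined value
tau_mid = speech_likelihood_from_power(56.0, lower_db, upper_db)
```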
The speech likelihood measurement unit 16 outputs the determined speech likelihood τ to the gain determination unit 14.
The gain determination unit 14 determines a gain G according to an elapsed time from a starting point of a speech production section, similar to the gain determination unit 14 according to the first embodiment. Then, the gain determination unit 14 corrects the gain G, which has been determined according to the elapsed time from the starting point of the speech production section, so that the gain G is higher as the speech likelihood τ is higher. In the second embodiment, the gain determination unit 14 corrects the gain G according to the following expression:
G′=1.0+τ(G−1.0) (7)
In the expression (7), G′ is a corrected gain. As apparent from the expression (7), when the gain G prior to correction is 1.0 or the speech likelihood is 0.0, the corrected gain G′ is also 1.0. More specifically, even when the corrected gain G′ is used, the input signal remains unadjusted. On the other hand, when the gain G prior to correction is larger than 1.0 and the speech likelihood τ is also larger than 0.0, the corrected gain G′ is also higher as the gain G is higher and the speech likelihood τ is higher. Therefore, an input signal in the speech production section is more enhanced, as the input signal comes closer to the trailing end of a speech production section and as the input signal more likely represents human voice.
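Expression (7) may be sketched as follows; the function name is hypothetical. The corrected gain interpolates between 1.0 (no enhancement) and the gain G in proportion to the speech likelihood τ.

```python
def corrected_gain(gain, likelihood):
    """Expression (7): G' = 1.0 + tau * (G - 1.0)."""
    return 1.0 + likelihood * (gain - 1.0)


examples = [corrected_gain(g, tau)
            for g, tau in ((1.0, 0.7), (2.0, 0.0), (2.0, 0.5), (2.0, 1.0))]
```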
The gain determination unit 14 outputs the corrected gain G′ for every frame to the enhancement unit 15. In the second embodiment, the enhancement unit 15 enhances the input signal in the speech production section using the corrected gain G′ instead of the gain G used in the first embodiment described above. More specifically, the enhancement unit 15 calculates a corrected frequency spectrum using the corrected gain G′ instead of the gain G in the expression (5).
When it is judged in step S104 that the elapsed time t is between the adjustment start time β, inclusive, and the adjustment completion time β′, exclusive, the speech likelihood measurement unit 16 determines the speech likelihood τ of the input signal in the current frame, based on power of the current frame (step S201). Then, the speech likelihood measurement unit 16 notifies the gain determination unit 14 of the speech likelihood τ.
The gain determination unit 14 sets the gain G so that the gain G is higher, as the elapsed time t is longer and as the speech likelihood τ is higher (step S202). Then, the gain determination unit 14 outputs the gain G to the enhancement unit 15. Subsequently, the processor 51 performs the processing after step S108.
According to the second embodiment, the speech enhancement device enhances an input signal more as the input signal included in a speech production section more likely represents human voice. Thus, the speech enhancement device may enhance human voice included in the input signal more than other speech. Accordingly, since human voice included in the input signal becomes clear, the speech enhancement device may further improve the recognition accuracy of the speech recognition processing which utilizes a corrected input signal.
In addition, the speech enhancement device may have a plurality of microphones. In this case, the speech enhancement device may detect a sound source direction, which is an incoming sound direction, from a phase difference in spectra of input signals collected by each of the microphones. Then, a speech enhancement device according to a third embodiment utilizes a plurality of microphones to detect a sound source direction and determines speech likelihood of an input signal in a speech production section according to the sound source direction. Then, depending on the speech likelihood of an input signal estimated from the sound source direction, the speech enhancement device corrects a gain which is set according to an elapsed time from the starting point of the speech production section.
The speech enhancement device 10 according to the third embodiment is different from the speech enhancement device according to the second embodiment in that the speech enhancement device 10 has two microphones and that a part of processing performed by the processor 52 is different. Thus, the microphones 2-1 and 2-2, and the processor 52 are described hereinafter.
The microphones 2-1 and 2-2 are spaced at a certain distance so that a sound source direction may be detected. For example, when the speech enhancement device 10 is to selectively enhance an input signal including voice of a driver in a car compartment, the microphone 2-1 and the microphone 2-2 are arranged, for example, in front of a driver seat side by side in a direction almost parallel to a line connecting the driver seat and a front passenger seat and are arranged to face the driver seat. Then, the microphone 2-1 and the microphone 2-2 are arranged so that a distance d between the microphone 2-1 and the microphone 2-2 is a value (V/Fs) obtained by dividing sound speed V by a sampling frequency Fs of the analog/digital converter 4. When the distance between the microphones is larger than this value, phase rotation occurs in a phase spectrum on a high frequency side, and the detection accuracy of a sound source direction degrades.
In addition, it is assumed that the microphone 2-1 is arranged to the left of the microphone 2-2, and thus hereinafter, an input signal collected by the microphone 2-1 is referred to as a left input signal and an input signal collected by the microphone 2-2 a right input signal.
Sound collected by the microphone 2-1 and sound collected by the microphone 2-2 are each amplified by the amplifier 3, then digitalized by the analog/digital converter 4, and inputted to the processor 52.
In the third embodiment, the speech production section detection unit 112 may detect a speech production section based on either of a left input signal and a right input signal. For example, the speech production section detection unit 112 may detect a speech production section based on whichever of the left input signal and the right input signal has a larger power integration value. Similar to the enhancement unit 15 according to the second embodiment, the enhancement unit 152 enhances any one of a left input signal and a right input signal or both by using a corrected gain G′ calculated by the gain determination unit 114.
The sound source direction detection unit 17 detects a direction of a sound source based on a left input signal and a right input signal, for every frame. For example, when a difference between an arrival time of a left input signal and an arrival time of a right input signal is δ, the sound source direction detection unit 17 calculates a sound source direction θ with the following expression. Note that a direction orthogonal to the arrangement direction of the microphone 2-1 and the microphone 2-2 is 0 degree.
θ=sin⁻¹(vδ/d)=sin⁻¹(Fs·δ) (8)
where v represents a sound velocity, d represents a distance between the two microphones, and Fs represents a sampling frequency. The second equality holds because the distance d is set to v/Fs.
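Expression (8) may be sketched as follows. The function name, the sound velocity of 340 m/s, and the clamping of the arcsine argument to the range from −1 to 1 (to guard against rounding or measurement error) are assumptions for illustration.

```python
import math


def source_direction(delta_s, mic_distance_m, sound_speed_mps=340.0):
    """Expression (8): theta = asin(v * delta / d), in radians."""
    x = sound_speed_mps * delta_s / mic_distance_m
    x = max(-1.0, min(1.0, x))  # assumption: clamp so that asin stays defined
    return math.asin(x)


fs_hz = 8000.0
d_m = 340.0 / fs_hz  # microphone distance d = v / Fs
theta_front = source_direction(0.0, d_m)          # no arrival time difference
theta_half = source_direction(0.5 / fs_hz, d_m)   # half a sampling period
theta_side = source_direction(1.0 / fs_hz, d_m)   # one full sampling period
```

With d set to v/Fs, an arrival time difference of one sampling period corresponds to sound arriving along the arrangement direction of the microphones, and a difference of 0 corresponds to the direction orthogonal to it.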
In addition, the sound source direction detection unit 17 calculates, for example, a cross-correlation value of the left input signal and the right input signal and may use the time difference at which the cross-correlation value is maximum as the difference δ between the arrival time of the left input signal and the arrival time of the right input signal. Alternatively, the sound source direction detection unit 17 may calculate the difference δ in the arrival time from a difference between a phase of a spectrum signal of the left input signal and a phase of a spectrum signal of the right input signal. The sound source direction detection unit 17 outputs the sound source direction θ determined for every frame, to the speech likelihood measurement unit 116. The speech likelihood measurement unit 116 calculates speech likelihood for every frame in the speech production section, based on the sound source direction θ.
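The cross-correlation approach to the arrival time difference may be sketched as follows. The function name, the search range `max_lag`, and the sign convention (a positive lag means that the right input signal is a delayed copy of the left input signal) are assumptions for illustration.

```python
def estimate_delay_samples(left, right, max_lag):
    """Return the lag, in samples, that maximizes sum(left[i] * right[i + lag])."""
    best_lag = 0
    best_value = float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        value = 0.0
        for i in range(len(left)):
            j = i + lag
            if 0 <= j < len(right):
                value += left[i] * right[j]
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


left = [0.0] * 10 + [1.0, 2.0, 1.0] + [0.0] * 20
right = [0.0, 0.0] + left[:-2]  # the same pulse arriving two samples later
lag = estimate_delay_samples(left, right, max_lag=5)
```

The arrival time difference δ is then obtained by dividing the lag by the sampling frequency Fs.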
Like a case in which a microphone targets voice of a driver in a car compartment for sound collection, a direction of voice produced by a specific speaker is estimated in advance. Then, when the sound source direction θ is included in a range of the estimated speaker direction, the speech likelihood measurement unit 116 sets the speech likelihood relatively higher. In contrast, when the sound source direction θ is out of the range of the estimated speaker direction, the speech likelihood measurement unit 116 sets the speech likelihood relatively lower.
On the other hand, when the sound source direction θ is equal to or more than 0 and equal to or less than an upper limit threshold μ, the speech likelihood measurement unit 116 linearly and monotonically reduces the speech likelihood τ as the sound source direction θ is larger. Note that the upper limit threshold μ is set to 0.1 radian, for example. Then, when the sound source direction θ exceeds the upper limit threshold μ, the speech likelihood measurement unit 116 sets the speech likelihood τ to 0.0.
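The direction-based speech likelihood may be sketched as follows. The function name is hypothetical. The sketch assumes that the speech likelihood τ is 1.0 when the sound source direction θ is 0 and falls linearly to 0.0 at the upper limit threshold μ; the value of 1.0 for a negative θ is likewise an assumption, because that range is not specified in the text above.

```python
def speech_likelihood_from_direction(theta, mu=0.1):
    """Map a sound source direction theta (radians) to a likelihood in [0, 1]."""
    if theta < 0.0:
        return 1.0                 # assumption: treated as the speaker's side
    if theta <= mu:
        return 1.0 - theta / mu    # assumed linear decrease from 1.0 to 0.0
    return 0.0


tau_values = [speech_likelihood_from_direction(t) for t in (-0.05, 0.0, 0.05, 0.1, 0.2)]
```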
The speech likelihood measurement unit 116 outputs the speech likelihood τ for every frame in the speech production section to the gain determination unit 114. The gain determination unit 114 calculates a corrected gain G′ according to the expression (7), similar to the second embodiment. Then, the gain determination unit 114 outputs the corrected gain G′ to the enhancement unit 152. Then, the enhancement unit 152 uses the corrected gain G′ to enhance at least one of the left input signal and the right input signal.
When it is judged in step S104 that an elapsed time t is between adjustment start time β, inclusive, and adjustment completion time β′, exclusive, the sound source direction detection unit 17 detects the sound source direction θ from a difference between the arrival time of a left input signal and the arrival time of a right input signal (step S301). Then, the sound source direction detection unit 17 notifies the speech likelihood measurement unit 116 of the sound source direction θ. The speech likelihood measurement unit 116 determines speech likelihood τ of an input signal in a current frame based on the sound source direction θ (step S302). Then, the speech likelihood measurement unit 116 notifies the gain determination unit 114 of the speech likelihood τ.
The gain determination unit 114 sets a gain G so that the gain G is higher as the elapsed time t is longer and speech likelihood τ is higher (step S303). Then, the gain determination unit 114 outputs the gain G to the enhancement unit 152. Subsequently, the processor 52 performs the processing after step S108.
According to the third embodiment, since the speech enhancement device determines speech likelihood of an input signal in a speech production section based on a sound source direction determined from input signals collected by a plurality of microphones, the speech enhancement device may evaluate the speech likelihood appropriately. Therefore, the speech enhancement device may set an appropriate gain.
A speech enhancement device according to a fourth embodiment is described hereinafter. The speech enhancement device according to the fourth embodiment adjusts a gain according to a result of comparison of power of an input signal in a first half of a speech production section and power of an input signal in a second half. Note that the first half and the second half need not each be exactly 50% of the entire speech production section.
The speech enhancement device 20 according to the fourth embodiment is different from the speech enhancement device 1 according to the first embodiment in that the speech enhancement device 20 has the storage 6 and that a part of processing to be performed by the processor 53 is different. Thus, the storage 6 and the processor 53 are described hereinafter.
The storage 6 has a readable and writable volatile memory circuit. Then, the storage 6 stores an input signal outputted from the analog/digital converter 4 till speech enhancement processing ends. For every speech production section, the storage 6 also stores a power integration value of each frame in the speech production section.
The processor 53 has a power calculation unit 11, a speech production section detection unit 12, a timer unit 13, a gain determination unit 14, and an enhancement unit 15, similar to the processor 5 of the speech enhancement device 1 according to the first embodiment.
The speech production section detection unit 12 judges for every frame whether or not the frame is included in a speech production section, and stores, in the storage 6, a power integration value P of each frame that is judged to be included in the speech production section.
In addition, when the speech production section detection unit 12 judges that the speech production section ends, more specifically, when the previous frame is included in the speech production section and the current frame is not included in the speech production section, the speech production section detection unit 12 notifies the gain determination unit 14 that the speech production section ends.
The gain determination unit 14 reads a power integration value of each frame in a speech production section from the storage 6. Then, the gain determination unit 14 calculates an average value Pfav of power integration values of respective frames included in the first half of the speech production section and an average value Psav of power integration values of respective frames included in the second half of the speech production section.
The gain determination unit 14 determines an upper limit α of the gain G following the expression below, according to a result of comparison of the average value Pfav of power integration values of frames included in the first half of the speech production section and the average value Psav of power integration values of frames included in the second half of the speech production section.
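$$
\alpha =
\begin{cases}
\left(P_{fav}/P_{sav}\right)^{0.5} & \text{when } P_{fav} > P_{sav} \text{ and } P_{sav} \neq 0.0 \\
1.0 & \text{in other cases}
\end{cases}
\tag{9}
$$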
As illustrated in the expression (9), when the average value Psav of power integration values of frames included in the second half of the speech production section falls below the average value Pfav of power integration values of frames included in the first half of the speech production section, the gain determination unit 14 sets the upper limit α of the gain G larger than 1.0. On the other hand, when the average value Psav of power integration values of frames included in the second half of the speech production section does not drop with respect to the average value Pfav of power integration values of frames included in the first half of the speech production section, the gain determination unit 14 sets the upper limit α of the gain G to 1.0. Therefore, in the fourth embodiment, when volume of speech produced by the speaker drops in the second half of the speech production section, the input signal is enhanced, whereas the input signal is not enhanced when the volume of speech produced by the speaker does not drop in the second half of the speech production section. Hence, in the fourth embodiment, excessive enhancement of an input signal is controlled and consequently distortion of the input signal is suppressed.
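The computation of Pfav, Psav, and the upper limit α can be sketched as follows. This is a minimal illustration under stated assumptions: the power integration values are taken to be on a linear (not dB) scale, the section is split at its middle frame, and the second condition of expression (9) is read as a guard against division by zero. The function name is hypothetical.

```python
def gain_upper_limit(frame_powers):
    """Upper limit alpha of the gain G for one speech production section.

    frame_powers: per-frame power integration values on a linear scale
    (an assumption; the passage does not state the scale).
    """
    n = len(frame_powers)
    if n < 2:
        # Too short to compare two halves; leave the signal unenhanced.
        return 1.0
    half = n // 2  # assumed split point; the halves need not be exact
    p_fav = sum(frame_powers[:half]) / half
    p_sav = sum(frame_powers[half:]) / (n - half)
    # Raise alpha above 1.0 only when power dropped in the second half.
    if p_sav != 0.0 and p_fav > p_sav:
        return (p_fav / p_sav) ** 0.5
    return 1.0
```

The square root converts a power ratio into an amplitude ratio, so a second half with one quarter of the first half's average power yields α = 2.0.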
Note that the adjustment start time β may be set at any point in the first half of the speech production section, for example, a midpoint in the first half of the speech production section. In addition, the adjustment completion time β′ may be set at any point in the second half of the speech production section, for example, a midpoint in the second half of the speech production section. Alternatively, the adjustment start time β and the adjustment completion time β′ may be set similar to those in the embodiments described above.
Following the graph illustrated in
The power calculation unit 11 divides an input signal for every frame and calculates a power integration value of a current frame (step S401). Then, the power calculation unit 11 outputs the power integration value to the speech production section detection unit 12 and a spectrum signal of each frequency to the speech production section detection unit 12 and the enhancement unit 15.
Based on the power integration value, the speech production section detection unit 12 judges whether or not a speech production section ends (step S402). When the speech production section does not end (step S402—No), the speech production section detection unit 12 stores the power integration value in the storage 6. Then, the processor 53 finishes the speech enhancement processing. On the other hand, when the speech production section ends (step S402—Yes), the speech production section detection unit 12 notifies the gain determination unit 14 of the judgment result.
The gain determination unit 14 reads a power integration value of each frame in the speech production section from the storage 6 and calculates a power average value Pfav of the first half and a power average value Psav of the second half of the speech production section (step S403). Then, the gain determination unit 14 determines an upper limit α of the gain G according to Pfav/Psav (step S404).
The gain determination unit 14 determines the gain G according to the upper limit α and the elapsed time t from a starting point of the speech production section (step S405). Then, the gain determination unit 14 notifies the enhancement unit 15 of the gain G.
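How the gain G might depend on the upper limit α and the elapsed time t can be sketched as follows. The passage only states that G is determined from α and t, so the linear ramp between the adjustment start time β and the adjustment completion time β′ is an assumption, as is the function name.

```python
def ramp_gain(t, alpha, beta, beta_end):
    """Gain G at elapsed time t from the start of the speech production section.

    Assumed shape (not confirmed by this passage): G is 1.0 before beta,
    rises linearly to alpha at beta_end, and stays at alpha afterwards.
    t, beta, and beta_end share the same unit (seconds or frames).
    """
    if beta_end <= beta:
        raise ValueError("beta_end must be later than beta")
    if t < beta:
        return 1.0
    if t < beta_end:
        return 1.0 + (alpha - 1.0) * (t - beta) / (beta_end - beta)
    return alpha
```

When α is 1.0 the ramp collapses to a constant gain of 1.0, which matches the case where the second half of the section shows no power drop and no enhancement is applied.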
The enhancement unit 15 reads an input signal from the storage 6 and enhances an input signal in the speech production section according to the gain G to obtain a corrected input signal (step S406). Subsequently, the speech enhancement device 20 finishes the speech enhancement processing.
According to the fourth embodiment, since the speech enhancement device adjusts a gain according to a result of comparison of power in a first half and power in a second half of a speech production section, the speech enhancement device may set the gain according to the degree of power drop in the second half of the speech production section. In addition, according to the fourth embodiment, since the speech enhancement device may adjust the timing at which the gain begins to increase according to the length of the speech production section, the speech enhancement device may appropriately set the gain adjustment timing according to individual differences such as speech speed.
Then, a speech enhancement device according to a fifth embodiment is described hereinafter. The speech enhancement device according to the fifth embodiment adaptively determines the adjustment start time β of a gain G by detecting attenuation of power in an input signal according to an elapsed time in a speech production section.
The speech enhancement device 30 according to the fifth embodiment is different from the speech enhancement device 1 according to the first embodiment in that the speech enhancement device 30 has the delay buffer 7. Furthermore, the speech enhancement device 30 according to the fifth embodiment is different, in a part of processing of the processor 54, from the speech enhancement device 1 according to the first embodiment. Thus, the following description is provided for the delay buffer 7, the processor 54, and parts related thereto.
The delay buffer 7 has a delay circuit configured to delay an input signal by a predetermined delay time and then output the delayed input signal. In the fifth embodiment, the delay time is set to the time that the processor 54 takes to detect attenuation of an input signal, 200 msec, for example. Then, the delayed input signal outputted from the delay buffer 7 is inputted to the processor 54.
The attenuation judgment unit 18 judges for each frame in a speech production section whether or not attenuation occurs on an input signal at a leading part of the speech production section. To this end, the attenuation judgment unit 18 detects a maximum value Pmax of the power integration values of the respective frames from the starting point of the speech production section until the threshold determination period elapses, as a reference value for determining an attenuation judgment threshold Th used to detect power attenuation. Note that the threshold determination period is set to a period during which the volume of speech produced by a speaker does not attenuate, 100 msec, for example, which corresponds to one to two vowels.
The attenuation judgment unit 18 sets, as the attenuation judgment threshold Th, a value obtained by subtracting a predetermined offset value (1.0 dB, for example) from the maximum value Pmax of the power integration values. Then, the attenuation judgment unit 18 compares the power integration value P with the attenuation judgment threshold Th for each frame after the threshold determination period has elapsed from the starting point of the speech production section. Then, when the power integration value P is continuously less than the attenuation judgment threshold Th for a predetermined period T, the attenuation judgment unit 18 judges that the input signal has attenuated. Note that the predetermined period T is set to the delay time of the delay buffer 7, or to a time obtained by multiplying the delay time by a safety coefficient less than 1 (0.9 to 0.95, for example), and is 200 msec, for example.
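The behavior of such a delay line can be illustrated in software as follows. This is only a sketch: the specification describes a delay circuit, and the class name, the sample-based interface, and the zero fill value are assumptions. At a hypothetical 16 kHz sampling rate, a 200 msec delay would correspond to 3200 samples.

```python
from collections import deque


class DelayBuffer:
    """Fixed-length delay line: each pushed sample comes out delay_samples later."""

    def __init__(self, delay_samples, fill=0.0):
        if delay_samples < 0:
            raise ValueError("delay_samples must be non-negative")
        # Pre-fill so that the first outputs are silence (assumed fill value).
        self._buf = deque([fill] * delay_samples)

    def push(self, sample):
        """Store the newest sample and return the one delayed by delay_samples."""
        self._buf.append(sample)
        return self._buf.popleft()
```

Because the processor sees the signal only after the delay, a gain decided at the moment attenuation is confirmed can still be applied to the frames in which the attenuation actually began.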
The attenuation judgment unit 18 notifies the gain determination unit 14 of a time point earlier by the predetermined period T than the time point when it was judged that the input signal attenuated, as an attenuation start time.
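The attenuation judgment described above can be sketched as follows, assuming per-frame power integration values in dB. The 20 msec frame length and the function name are assumptions. The returned index is the first frame of the below-threshold run, which is one reading of "earlier by the predetermined period T".

```python
def find_attenuation_start(powers_db, frame_ms=20.0, threshold_period_ms=100.0,
                           offset_db=1.0, period_ms=200.0):
    """Return the frame index where attenuation starts, or None if not detected.

    powers_db: power integration value of each frame in the speech
    production section, in dB, starting at the section's first frame.
    frame_ms is an assumed frame length.
    """
    n_thresh = int(round(threshold_period_ms / frame_ms))
    n_period = int(round(period_ms / frame_ms))
    if n_thresh < 1 or n_period < 1 or len(powers_db) < n_thresh:
        return None
    # Th = Pmax over the threshold determination period minus the offset.
    threshold = max(powers_db[:n_thresh]) - offset_db
    run = 0  # number of consecutive frames below the threshold
    for i in range(n_thresh, len(powers_db)):
        run = run + 1 if powers_db[i] < threshold else 0
        if run >= n_period:
            # Step back by the period T to the first frame of the run.
            return i - n_period + 1
    return None
```

For example, with 20 msec frames, a section that stays at 60 dB for eight frames and then drops to 50 dB is judged to have attenuated once ten consecutive low frames have been observed, and the start is reported at the ninth frame (index 8).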
The gain determination unit 14 determines the gain G setting the attenuation start time as the adjustment start time β. Then, the gain determination unit 14 outputs the gain G to the enhancement unit 153.
The enhancement unit 153 uses the gain G from the attenuation start time to perform speech enhancement processing on the input signal inputted from the delay buffer 7.
The power calculation unit 11 divides an input signal for every frame and calculates a power integration value of a current frame (step S501). Then, the power calculation unit 11 outputs the power integration value to the speech production section detection unit 12 and the attenuation judgment unit 18 and a spectrum signal of each frequency to the speech production section detection unit 12 and the enhancement unit 153.
Based on the power integration value, the speech production section detection unit 12 judges whether or not the current frame is in the speech production section (step S502). If the current frame is out of the speech production section (step S502—No), the processor 54 finishes the speech enhancement processing. On the other hand, when the current frame is included in the speech production section (step S502—Yes), the speech production section detection unit 12 notifies the attenuation judgment unit 18 and the gain determination unit 14 of the judgment result.
The attenuation judgment unit 18 judges whether or not the threshold determination period from the beginning of the speech production section ends in the current frame (step S503). When the threshold determination period does not end (step S503—No), the processor 54 finishes the speech enhancement processing. On the other hand, when the threshold determination period ends (step S503—Yes), the attenuation judgment unit 18 determines the attenuation judgment threshold Th based on the maximum value Pmax of the power integration values in the threshold determination period (step S504).
The attenuation judgment unit 18 also judges whether or not a continuation period during which the power integration value P is kept less than the attenuation judgment threshold Th reaches the predetermined period T (step S505). When the continuation period does not reach the predetermined period T (step S505—No), the processor 54 finishes the speech enhancement processing. On the other hand, when the continuation period reaches the predetermined period T (step S505—Yes), the attenuation judgment unit 18 sets a time point earlier by the predetermined period T than the current frame as the attenuation start time. Then, the attenuation judgment unit 18 notifies the gain determination unit 14 of the attenuation start time.
The gain determination unit 14 sets the attenuation start time as the adjustment start time β (step S506). Then, the gain determination unit 14 sets the gain G higher as the elapsed time t from the starting point of the speech production section is longer, for each of frames after the adjustment start time β and before the adjustment completion time β′ (step S507). Then, the gain determination unit 14 notifies the enhancement unit 153 of the gain G.
The enhancement unit 153 enhances the delayed input signal, inputted from the delay buffer 7, according to the gain G to obtain a corrected input signal (step S508). Subsequently, the speech enhancement device 30 finishes the speech enhancement processing.
According to the fifth embodiment, the speech enhancement device may start speech enhancement processing of an input signal when the input signal begins to attenuate in a speech production section. Thus, the speech enhancement device may appropriately enhance the input signal in the speech production section.
Note that more than one of the embodiments described above may be combined. For example, the second or third embodiment may be combined with the fourth or fifth embodiment. Alternatively, the fourth embodiment and the fifth embodiment may be combined.
In addition, when the speech enhancement device has a plurality of microphones, the speech production section detection unit 12 may judge for every frame whether or not the sound source direction θ is included in an estimated speaker's direction range. Then, when the sound source direction θ is included in the estimated speaker's direction range, the speech production section detection unit 12 may judge that the frame is included in the speech production section.
Furthermore, the speech enhancement device according to each of the embodiments described above or a variation may be incorporated in a mobile phone, for example, to correct an input signal generated by another device. In this case, the input signal corrected by the speech enhancement device is reproduced from a speaker of the device incorporating the speech enhancement device.
Furthermore, a computer program that causes a computer to implement the functions of the processor of the speech enhancement device according to the embodiments described above or the variation may be provided in a form recorded on a computer-readable medium such as a magnetic recording medium or an optical recording medium. Note that the recording medium does not include a carrier wave.
A computer 100 has a user interface unit 101, an audio interface unit 102, a communication interface unit 103, a storage 104, a storage medium access device 105, and a processor 106. The processor 106 is connected with the user interface unit 101, the audio interface unit 102, the communication interface unit 103, the storage 104, and the storage medium access device 105 via a bus, for example.
The user interface unit 101 has an input device such as a keyboard and a mouse, for example, and a display unit such as a liquid crystal display. Alternatively, the user interface unit 101 may have a device such as a touch panel display, in which an input device and a display device are integrated. Then, according to user manipulation, for example, the user interface unit 101 outputs, to the processor 106, an operation signal to start speech enhancement processing on an input signal inputted via the audio interface unit 102.
The audio interface unit 102 has an interface circuit configured to connect the computer 100 to a speech input device, such as a microphone, which generates an input signal. Then, the audio interface unit 102 acquires an input signal from the speech input device and passes the input signal to the processor 106.
The communication interface unit 103 has a communication interface configured to connect the computer 100 to a communication network that complies with a communication standard such as Ethernet (registered trademark), and a control circuit of the communication interface. Then, the communication interface unit 103 outputs a data stream including a corrected input signal, which is received from the processor 106, to another device via the communication network. The communication interface unit 103 may also acquire a data stream including an input signal from another device connected to the communication network and pass the data stream to the processor 106.
The storage 104 has a readable and writable semiconductor memory and a read-only semiconductor memory, for example. Then, the storage 104 stores a computer program to execute speech enhancement processing which is performed on the processor 106 and data generated in the course of the processing or as a result of the processing.
The storage medium access device 105 is a device that accesses a storage medium 107 such as a magnetic disk, a semiconductor memory card, and an optical recording medium, for example. The storage medium access device 105 reads a computer program for speech enhancement processing, which is stored in the storage medium 107 and performed on the processor 106, and passes the computer program to the processor 106.
The processor 106 corrects the input signal received via the audio interface unit 102 or the communication interface unit 103 by executing the computer program for speech enhancement processing according to any of the embodiments described above or the variation. Then, the processor 106 stores the corrected input signal in the storage 104 or outputs the corrected input signal to other devices via the communication interface unit 103.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2014-098021 | May 2014 | JP | national |
Number | Name | Date | Kind |
---|---|---|---|
4811404 | Vilmur et al. | Mar 1989 | A |
20040151303 | Park | Aug 2004 | A1 |
20100121634 | Muesch | May 2010 | A1 |
20100198593 | Yu | Aug 2010 | A1 |
20110054889 | Konchitsky | Mar 2011 | A1 |
20110125489 | Shin | May 2011 | A1 |
20140270200 | Usher et al. | Sep 2014 | A1 |
Number | Date | Country |
---|---|---|
56-84013 | Jul 1981 | JP |
Entry |
---|
Search Report dated Dec. 3, 2015 in corresponding United Kingdom Patent Application No. GB1507405.7, 3 pages. |
Number | Date | Country | |
---|---|---|---|
20150325253 A1 | Nov 2015 | US |