1. Field of the Invention
The present invention relates to a speech section detection apparatus, and more particularly to a speech section detection apparatus capable of reliably detecting a speech section even for a word containing a glottal stop sound or for a word containing a succession of “s” column sounds (sounds belonging to the third column in the Japanese Goju-on Zu syllabary table) or “h” column sounds (sounds belonging to the sixth column in the same table).
2. Description of the Related Art
In speech recognition, the speech sections on which recognition is based must be extracted from a time-series signal captured through a microphone. A method has been proposed that takes a period during which the short-duration power of speech is greater than a predetermined threshold as a speech section, but with this method it has been difficult to achieve sufficient accuracy for speaker-independent systems intended to recognize a large variety of words spoken by unspecified speakers.
The applicant has previously proposed a pitch period extraction apparatus and method that can detect, with high accuracy and in a time domain, the pitch (the highness or lowness of tone) of a speech signal (Japanese Unexamined Patent Publication No. 9-50297). A speech section can also be determined based on this pitch period.
However, in the case of a word A which contains a glottal stop sound in the word (for example, Japanese word “chisso”), a word B which contains a succession of “s” column sounds (sounds in the third column in the Japanese Goju-on Zu syllabary table) (for example, Japanese word “sushiya”), or a word C which contains a succession of “h” column sounds (sounds in the sixth column in the Japanese Goju-on Zu syllabary table) (for example, Japanese word “hihuka”), it has not been possible to avoid the possibility of erroneous detection resulting from a failure to detect all the constituent sounds of the word as one continuous speech section.
As can be seen from the figures, in the case of the “word A”, the sound in the first half of the word (“chi” in the Japanese word “chisso”) is detected in the speech section, but the sound in the last half (“sso” in the Japanese word “chisso”) is not detected.
In the case of the Japanese word “sushiya”, there is a break in the speech section between “sushi” and “ya”, while in the case of the Japanese word “hihuka”, there is a break between “hifu” and “ka”; in either case, the word is not detected as one continuous speech section.
Possible causes for such erroneous detection include the following.
A: In the word A, the fricative “ss” that follows the glottal stop, and in the word B, the fricative “sh” that follows the “s” column sound “su”, are not only low in level but also difficult to differentiate from noise, and as a result, it is difficult to detect the pitch period itself.
B: When there is no aspirated sound part or noise part preceding the word, and when the tone is low, the pitch period cannot be detected.
C: In the case of the word C, there is a relatively long pause between the series of “h” sounds (“hihu” in the Japanese word “hihuka”) and the succeeding sound (“ka” in the Japanese word “hihuka”).
D: Noise during a pause.
The present invention has been devised in view of the above problem, and it is an object of the invention to provide a speech section detection apparatus capable of reliably detecting a speech section even for a word containing a glottal stop sound or for a word containing a succession of “s” column sounds or “h” column sounds.
A speech section detection apparatus according to a first aspect of the invention comprises: preprocessing means for removing noise contained in a speech signal; speech pitch extracting means for extracting a speech pitch signal from the speech signal from which noise has been removed by the preprocessing means; gate signal generating means for generating a gate signal based on the speech pitch extracted by the speech pitch extracting means; and speech section signal generating means for generating a speech section signal based on the gate signal generated by the gate signal generating means. In this apparatus, the gate signal is controlled based on the speech pitch extracted from the speech signal, and the speech section signal is controlled based on this gate signal.
In a speech section detection apparatus according to a second aspect of the invention, the apparatus further comprises speech signal segmenting means for segmenting the speech signal, from which noise has been removed by the preprocessing means, into a plurality of speech sections based on the speech section signal generated by the speech section signal generating means. In this apparatus, the speech signal is segmented into a plurality of speech sections based on the speech section signal.
In a speech section detection apparatus according to a third aspect of the invention, the speech pitch extracting means comprises: subtraction processing means for applying subtraction processing, for removing any speech signal smaller than a prescribed amplitude, to the speech signal from which noise has been removed by the preprocessing means; constant amplitude means for making essentially constant the amplitude of the speech signal to which the subtraction processing has been applied by the subtraction processing means; negative peak emphasizing means for detecting a positive peak and a negative peak subsequent to the positive peak from the speech signal whose amplitude has been made essentially constant by the constant amplitude means, and for generating a speech signal whose negative peak is emphasized by subtracting the positive peak from the negative peak; and differentiating means for detecting the speech signal whose negative peak has been emphasized by the negative peak emphasizing means, and for differentiating the detected signal. In this apparatus, the speech pitch is extracted by processing the speech signal in a time domain.
In a speech section detection apparatus according to a fourth aspect of the invention, the subtraction processing means comprises: envelope difference calculating means for calculating a positive envelope and a negative envelope of the speech signal from which noise has been removed by the preprocessing means, and for calculating an envelope difference representing the difference between the positive envelope and the negative envelope; subtraction processing threshold value calculating means for calculating a subtraction processing threshold value by multiplying the envelope difference calculated by the envelope difference calculating means by a prescribed coefficient factor; and subtraction processing threshold value subtracting means for subtracting the subtraction processing threshold value from the amplitude of the speech signal when the amplitude of the speech signal from which noise has been removed by the preprocessing means is equal to or greater than the subtraction processing threshold value calculated by the subtraction processing threshold value calculating means. In this apparatus, the subtraction processing threshold value is calculated by multiplying the envelope difference of the speech signal by a prescribed factor.
In a speech section detection apparatus according to a fifth aspect of the invention, the subtraction processing means further comprises zero setting means for setting the amplitude of the speech signal to zero when the amplitude of the speech signal from which noise has been removed by the preprocessing means is smaller than the subtraction processing threshold value calculated by the subtraction processing threshold value calculating means. In this apparatus, when the amplitude of the speech signal is smaller than the subtraction processing threshold value, the amplitude of the speech signal is set to zero.
In a speech section detection apparatus according to a sixth aspect of the invention, the constant amplitude means comprises: envelope difference calculating means for calculating a positive envelope and a negative envelope of the speech signal from which noise has been removed by the preprocessing means, and for calculating an envelope difference representing the difference between the positive envelope and the negative envelope; maximum envelope difference holding means for holding a maximum envelope difference out of envelope differences previously calculated by the envelope difference calculating means; and constant-amplitude gain calculating means for calculating a constant-amplitude gain by dividing by the present envelope difference the maximum envelope difference held by the maximum envelope difference holding means. In this apparatus, the constant-amplitude gain is determined based on the envelope difference of the speech signal.
In a speech section detection apparatus according to a seventh aspect of the invention, the constant amplitude means further comprises: unity gain setting means for setting the constant-amplitude gain to unity gain when the constant-amplitude gain calculated by the constant-amplitude gain calculating means is equal to or larger than a predetermined threshold value. In this apparatus, when the constant-amplitude gain is equal to or larger than the predetermined threshold value, the constant-amplitude gain is set to unity gain.
In a speech section detection apparatus according to an eighth aspect of the invention, the gate signal generating means comprises gate signal opening means for opening the gate signal when an average value taken over a predetermined number of consecutive speech pitches extracted by the speech pitch extracting means becomes equal to or larger than a predetermined gate opening threshold value. In this apparatus, when the average value of the predetermined number of speech pitches becomes equal to or larger than the predetermined gate opening threshold value, the gate signal is opened.
In a speech section detection apparatus according to a ninth aspect of the invention, the gate signal generating means further comprises gate signal open state maintaining means for maintaining the gate signal in an open state once the gate signal is opened by the gate signal opening means, as long as the average value of the predetermined number of consecutive speech pitches extracted by the speech pitch extracting means does not become smaller than a gate closing threshold value which is smaller than the gate opening threshold value. In this apparatus, the gate signal is maintained in an open state as long as the average value of the predetermined number of consecutive speech pitches does not become smaller than the gate closing threshold value.
In a speech section detection apparatus according to a 10th aspect of the invention, the gate signal generating means further comprises gate signal closing means for closing the gate signal when the average value of the predetermined number of consecutive speech pitches extracted by the speech pitch extracting means becomes smaller than the gate closing threshold value. In this apparatus, when the speech pitch average value becomes smaller than the gate closing threshold value, the gate signal is closed.
In a speech section detection apparatus according to an 11th aspect of the invention, the speech section signal generating means comprises: first prescribed period counting means for counting a first prescribed period from the time the gate signal generated by the gate signal generating means is opened; and speech section signal opening means for setting the speech section signal open by going back in time for a second prescribed period from the time the counting of the first prescribed period by the first prescribed period counting means is completed. In this apparatus, when the gate signal has remained open continuously for the first prescribed period, the speech section signal is set open by going back in time for the second prescribed period from the end of the first prescribed period.
In a speech section detection apparatus according to a 12th aspect of the invention, the speech section signal generating means further comprises: third prescribed period counting means for counting a third prescribed period from the time the gate signal generated by the gate signal generating means is closed; and speech section signal closing means for closing the speech section signal when the counting of the third prescribed period by the third prescribed period counting means is completed. In this apparatus, the speech section signal is closed when the third prescribed period has elapsed from the time the gate signal was closed.
In a speech section detection apparatus according to a 13th aspect of the invention, the speech section signal generating means further comprises speech section signal open state maintaining means for maintaining the speech section signal in an open state when the speech section signal is set open by the speech section signal opening means by going back in time for the second prescribed period before the counting of the third prescribed period by the third prescribed period counting means is completed. In this apparatus, the speech section signal is maintained in an open state when the third prescribed period and the second prescribed period overlap each other.
Further features and advantages of the present invention will be apparent from the following description with reference to the accompanying drawings, in which:
A gate signal generator 26 generates a gate signal based on a pitch detected by a pitch detector 25, and a speech section signal generator 27 generates a speech section signal based on the gate signal generated by the gate signal generator 26. Based on the speech section signal generated by the speech section signal generator 27, a word extractor 28 processes the digital signal stored in the memory 24 and extracts and outputs a word contained in the speech section.
In the present embodiment, the analog/digital converter 23, the memory 24, the pitch detector 25, the gate signal generator 26, the speech section signal generator 27, and the word extractor 28 are constructed using, for example, a personal computer, and the pitch detector 25, the gate signal generator 26, the speech section signal generator 27, and the word extractor 28 are implemented in software.
In step 32, an index i which indicates the order of storage in the memory 24 is set to “1”. Next, in steps 33 to 35, speech signals X(i) already stored in the memory 24 are sequentially shifted by the following processing.
X(i+1)←X(i)
When the shifting is completed, the newly read speech signal V is stored at the starting location X(1) in the memory 24, and the routine is terminated.
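The shifting described above can be sketched as follows. This is a minimal illustration, not the routine of the embodiment: a fixed-length Python list stands in for the memory 24, `buf[0]` plays the role of X(1), the name `push_sample` is hypothetical, and discarding the oldest sample at the end of the buffer is an assumption.

```python
def push_sample(buf, v):
    """Shift stored samples one slot toward the end, then store v at the front.

    buf is a fixed-length, non-empty list standing in for memory 24
    (buf[0] plays the role of X(1)). Dropping the oldest sample is an assumption.
    """
    # Walk from the end so each value is copied before it is overwritten.
    for i in range(len(buf) - 1, 0, -1):
        buf[i] = buf[i - 1]      # X(i+1) <- X(i)
    buf[0] = v                   # newly read signal V goes to X(1)
    return buf
```

With this layout the most recent sample is always at the front, so X(i−1), X(i−2), and so on in the later routines refer to neighbouring samples of the same buffer.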
In the above embodiment, the high-frequency noise removal processing and the low-frequency noise removal processing are performed by software, but these may be performed by incorporating a hardware filter in the line amplifier 22.
In step 51b, it is determined whether the envelope value difference ΔE is smaller than a predetermined amplitude elimination threshold value r. If the answer is Yes, that is, if the envelope value difference ΔE is smaller than the threshold value r, the speech signal X(i) is set to “0” in step 51c, and the process proceeds to step 51d. On the other hand, if the answer in step 51b is No, that is, if the envelope value difference ΔE is not smaller than the threshold value r, the process proceeds directly to step 51d.
In step 51d, it is determined whether the present positive envelope value Ep is larger than the previous positive envelope value Epb. If the answer in step 51d is Yes, that is, if the present positive envelope value Ep is larger than the previous positive envelope value Epb, which means that the positive envelope value has increased, then the index S is set to “1” in step 51e, and the process proceeds to step 51g. On the other hand, if the answer in step 51d is No, that is, if the present positive envelope value Ep is not larger than the previous positive envelope value Epb, which means that the positive envelope value has not increased, then the index S is set to “0” in step 51f, and the process proceeds to step 51g.
In step 51g, it is detected whether or not the previous value Sb of the index S is “1” and the present index S is “0”, that is, whether or not a positive peak is detected. If the answer in step 51g is Yes, that is, if the positive peak is detected, the threshold value bc for the subtraction processing is calculated using the following equation in step 51h, and thereafter, the process proceeds to step 51i.
bc←α*ΔE
Here, α is a predetermined value, and can be set to a constant value “0.05” when using the speech section detection apparatus of the invention in an automobile. On the other hand, if the answer in step 51g is No, that is, if no positive peak is detected, the process proceeds directly to step 51i.
In step 51i, it is determined whether the speech signal X(i) is equal to or greater than the subtraction processing threshold value bc, that is, whether the amplitude of the speech signal X(i) is large. If the answer in step 51i is Yes, that is, if the amplitude of the speech signal X(i) is equal to or larger than the threshold value bc, then in step 51j the value obtained by subtracting the subtraction processing threshold value bc from the speech signal X(i) is set as the subtraction-processed speech signal Xs(i), and the process proceeds to step 51l.
Xs(i)←X(i)−bc
On the other hand, if the answer in step 51i is No, that is, if the amplitude of the speech signal X(i) is smaller than the threshold value bc, Xs(i) is set to 0 in step 51k, and the process proceeds to step 51l. Here, the processing in step 51k may be omitted, and the process may proceed directly to step 51l when the answer in step 51i is No.
Finally, in step 51l, the previous positive envelope value Epb, the previous negative envelope value Emb, and the previous index Sb are updated, after which the routine is terminated.
Epb←Ep
Emb←Em
Sb←S
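Steps 51b to 51l can be sketched as a single per-sample function. This is only a sketch under stated assumptions: the names `SubtractionState` and `subtract_step` are hypothetical, the envelope values Ep and Em are supplied by the caller, the value of r is an illustrative guess (the text gives none), and the comparison in step 51i is applied literally to the signed sample, since the handling of negative samples is not spelled out in this passage.

```python
from dataclasses import dataclass


@dataclass
class SubtractionState:
    epb: float = 0.0  # previous positive envelope value Epb
    emb: float = 0.0  # previous negative envelope value Emb
    sb: int = 0       # previous index Sb
    bc: float = 0.0   # subtraction processing threshold value bc


def subtract_step(x, ep, em, state, alpha=0.05, r=0.01):
    """One pass of steps 51b-51l; returns Xs(i). r=0.01 is an assumed value."""
    delta_e = ep - em
    if delta_e < r:                          # steps 51b-51c
        x = 0.0
    s = 1 if ep > state.epb else 0           # steps 51d-51f
    if state.sb == 1 and s == 0:             # step 51g: positive peak passed
        state.bc = alpha * delta_e           # step 51h
    if x >= state.bc:                        # step 51i
        xs = x - state.bc                    # step 51j
    else:
        xs = 0.0                             # step 51k (optional in the text)
    state.epb, state.emb, state.sb = ep, em, s   # step 51l
    return xs
```

Keeping bc in the state object reflects the fact that the threshold is recalculated only when a positive peak of the envelope is detected and is otherwise carried over unchanged.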
Ep=Epb·exp{−1/(τ·fs)}
where τ is a time constant, and fs is the sampling frequency.
Likewise, in step a2, the present negative envelope value Em is calculated by the following equation.
Em=Emb·exp{−1/(τ·fs)}
Next, in step a3, the maximum of the subtraction-processed speech signal Xs(i) and the present positive envelope value Ep calculated in step a1 is obtained, and the obtained value is taken as the new present positive envelope value Ep. Likewise, in step a4, the minimum of the subtraction-processed speech signal Xs(i) and the present negative envelope value Em calculated in step a2 is obtained, and the obtained value is taken as the new present negative envelope value Em.
In the final step a5, the envelope value difference ΔE is calculated by the following equation, and the routine is terminated.
ΔE=Ep−Em
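The envelope calculation of steps a1 to a5 amounts to an exponentially decaying peak follower. It can be sketched as below; the function name `update_envelopes` is hypothetical, and the values of τ and fs used in the usage note are illustrative assumptions, since the text does not fix them.

```python
import math


def update_envelopes(xs, ep_prev, em_prev, tau, fs):
    """Return (Ep, Em, delta_E) for one sample Xs(i), following steps a1-a5."""
    decay = math.exp(-1.0 / (tau * fs))
    ep = ep_prev * decay      # step a1: let the positive envelope decay
    em = em_prev * decay      # step a2: let the negative envelope decay
    ep = max(xs, ep)          # step a3: follow the sample if it is higher
    em = min(xs, em)          # step a4: follow the sample if it is lower
    return ep, em, ep - em    # step a5: envelope value difference
```

For example, with an assumed τ of 0.01 second and fs of 8000 Hz, each sample multiplies the previous envelope values by exp(−1/80), so an envelope that is not refreshed by a new peak falls to about 1/e of its value after 80 samples.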
Next, in step 52c, it is determined whether the conditions
Xs(i−2)<Xs(i−1)
Xs(i)<Xs(i−1) and
Xs(i−1)>0
are satisfied, that is, whether the subtraction-processed speech signal Xs(i−1) sampled Δt before is a positive peak.
If the answer in step 52c is Yes, that is, if the subtraction-processed speech signal Xs(i−1) is the positive peak, then in step 52d the maximum of the envelope value difference ΔE and the previously determined maximum envelope value difference ΔEmax is taken as the new maximum envelope value difference ΔEmax to update the maximum envelope value difference ΔEmax, and the process proceeds to step 52e. On the other hand, if the answer in step 52c is No, that is, if the speech signal Xs(i−1) is not a positive peak, the process proceeds directly to step 52e.
In step 52e, it is determined whether the envelope value difference ΔE calculated in step 52b is “0”. If the answer is No, that is, if ΔE is not “0”, gain G is set to ΔEmax/ΔE in step 52f. Next, in step 52g, it is determined whether the gain G is equal to or larger than a predetermined threshold value β (for example, 10); if the answer is Yes, the gain G is set to “1” in step 52h, and the process proceeds to step 52i. Here, the decision in step 52g may be omitted, and the process may proceed directly from step 52f to step 52i.
On the other hand, if the answer in step 52g is No, that is, if the gain G is smaller than the predetermined threshold value β, the process proceeds directly to step 52i. In the earlier step 52e, if the answer is Yes, that is, if ΔE is “0”, then the process proceeds to step 52h where the gain G is set to “1”, after which the process proceeds to step 52i.
Finally, in step 52i, the AGC-processed speech signal XG(i−1) is calculated by multiplying the subtraction-processed speech signal Xs(i−1) by the gain G, and the routine is terminated.
XG(i−1)←G*Xs(i−1)
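The gain calculation of steps 52c to 52i can be sketched as follows. The function name `agc_step` and its argument layout are hypothetical; the caller is assumed to have already evaluated the positive-peak conditions of step 52c and to pass the result as a flag. The default β of 10 is the example value given above.

```python
def agc_step(xs_mid, is_positive_peak, delta_e, delta_e_max, beta=10.0):
    """Return (XG(i-1), updated delta_E_max) for the sample Xs(i-1)."""
    if is_positive_peak:                          # steps 52c-52d
        delta_e_max = max(delta_e, delta_e_max)
    if delta_e == 0:                              # step 52e: avoid dividing by zero
        gain = 1.0                                # step 52h
    else:
        gain = delta_e_max / delta_e              # step 52f
        if gain >= beta:                          # step 52g
            gain = 1.0                            # step 52h
    return gain * xs_mid, delta_e_max             # step 52i
```

The fallback to unity gain means that a very small envelope difference, which is more likely to be noise than speech, is not amplified up to the level of the loudest speech seen so far.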
XG(i−3)<XG(i−2)
XG(i−1)<XG(i−2) and
0<XG(i−2)
If the answer in step 53a is Yes, that is, if the positive peak is detected in the AGC-processed speech signal, the peak value XG(i−2) is stored as P in step 53b, and the routine is terminated. If the answer in step 53a is No, that is, if no positive peak is detected in the AGC-processed speech signal, the routine is terminated.
XG(i−3)>XG(i−2)
XG(i−1)>XG(i−2) and
0>XG(i−2)
If the answer in step 54a is Yes, that is, if the negative peak is detected in the AGC-processed speech signal, the clamping-processed speech signal XC(i−2) with its negative peak emphasized is calculated in step 54b by subtracting the peak value P from the AGC-processed speech signal XG(i−2), and the routine is terminated.
XC(i−2)←XG(i−2)−P
If the answer in step 54a is No, that is, if no negative peak is detected in the AGC-processed speech signal, the AGC-processed speech signal XG(i−2) is taken as the clamping-processed speech signal XC(i−2), and the routine is terminated.
XC(i−2)←XG(i−2)
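The peak holding of steps 53a and 53b and the clamping of steps 54a and 54b can be illustrated together. The sketch below is a simplification that processes a whole list at once rather than one sample per call; the function name `clamp_negative_peaks` is hypothetical, and the initial value of P (zero) is an assumption.

```python
def clamp_negative_peaks(xg):
    """Return XC for a list of AGC-processed samples XG."""
    xc = list(xg)          # default: XC <- XG
    p = 0.0                # last positive peak value P (initial value assumed)
    for k in range(1, len(xg) - 1):
        a, b, c = xg[k - 1], xg[k], xg[k + 1]
        if a < b and c < b and b > 0:        # step 53a: positive peak
            p = b                            # step 53b: hold it as P
        elif a > b and c > b and b < 0:      # step 54a: negative peak
            xc[k] = b - p                    # step 54b: emphasize it
    return xc
```

For the input [0.0, 1.0, 0.0, -0.5, 0.0], the positive peak 1.0 is held as P and the negative peak −0.5 becomes −1.5, while all other samples pass through unchanged.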
XD(i−3)←E·exp{−Δt/(τ)}
where Δt is the sampling time, and τ is a predetermined time constant. E will be described later.
In step 55b, it is determined whether the absolute value of the clamping-processed speech signal XC(i−3) is greater than the absolute value of the detected output XD(i−3). If the answer in step 55b is No, that is, if the absolute value of XC(i−3) is not greater than the absolute value of XD(i−3), the detected output XD(i−3) is set as E in step 55c, and the process proceeds to step 55f.
If the answer in step 55b is Yes, that is, if the absolute value of XC(i−3) is greater than the absolute value of XD(i−3), then it is determined in step 55d whether there is a negative peak in the clamping-processed speech signal. That is, when the following conditions are satisfied, it is determined that XC(i−3) is the negative peak.
XC(i−4)>XC(i−3)
XC(i−2)>XC(i−3) and
0>XC(i−3)
If the answer in step 55d is Yes, that is, if the negative peak is detected in the clamping-processed speech signal, the negative peak value XC(i−3) is set as E in step 55e, and the process proceeds to step 55f. On the other hand, if the answer in step 55d is No, that is, if no negative peak is detected in the clamping-processed speech signal, the process proceeds to the step 55c described above.
In step 55f, the value stored as E is set as the detected signal XD(i−3), and in the next step 55g, the detected-signal change ΔXD is calculated by the following equation.
ΔXD←XD(i−3)−XD(i−4)
In step 55h, it is determined whether the absolute value of the detected-signal change ΔXD is either equal to or greater than a predetermined threshold value γ. If the answer in step 55h is Yes, that is, if the detected output has decreased greatly, then the speech pitch signal XP(i−3) is set to “−1” in step 55i, and the routine is terminated. On the other hand, if the answer in step 55h is No, that is, if the detected output has not decreased greatly, then the speech pitch signal XP(i−3) is set to “0” in step 55j, and the routine is terminated.
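Steps 55a to 55j can be sketched as a decaying detector followed by a difference test. As with the other sketches, this is a list-based simplification under stated assumptions: the function name `pitch_marks` is hypothetical, E and the previous detected output are assumed to start at zero, and the values of Δt, τ, and γ in the usage note are chosen only for illustration.

```python
import math


def pitch_marks(xc, dt, tau, gamma):
    """Return the speech pitch signal XP (-1 at a pitch mark, 0 elsewhere)."""
    xp = [0] * len(xc)
    e = 0.0                 # held value E (initial value assumed)
    xd_prev = 0.0           # previous detected output
    decay = math.exp(-dt / tau)
    for k in range(1, len(xc) - 1):
        xd = e * decay                                    # step 55a
        neg_peak = xc[k - 1] > xc[k] and xc[k + 1] > xc[k] and xc[k] < 0
        if abs(xc[k]) > abs(xd) and neg_peak:             # steps 55b, 55d
            e = xc[k]                                     # step 55e
        else:
            e = xd                                        # step 55c
        xd = e                                            # step 55f
        if abs(xd - xd_prev) >= gamma:                    # steps 55g-55h
            xp[k] = -1                                    # step 55i
        xd_prev = xd
    return xp
```

With xc = [0.0, -1.5, 0.0, 0.0, -1.5, 0.0], Δt = 1.0, τ = 10.0, and γ = 0.2, the detected output jumps at both negative peaks and decays slowly in between, so the result is [0, -1, 0, 0, -1, 0]. The threshold γ has to be larger than the per-sample decay step, otherwise the decay itself would be marked as a pitch.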
If the answer in step 160 is Yes, that is, if the speech pitch signal XP(i−3) is “−1”, and if the index j is unequal to (i−3), then the process proceeds to step 161 to calculate the pitch frequency f by the following equation.
f(i−3)=fs/{(i−3)−j}
Here, fs is the sampling frequency which is equal to 1/Δt.
In step 162, it is determined whether the pitch frequency f is higher than a maximum frequency 500 Hz; if it is higher than the maximum frequency, the pitch frequency f is set to “0” in step 163, and the process proceeds to step 164. On the other hand, if the answer in step 162 is No, the process proceeds directly to step 164. In step 164, the index j indicating the last time at which the speech pitch signal was “−1” is updated to (i−3).
Next, in step 165, after updating the pitch frequency as shown below, an average pitch frequency fm is calculated. In the present embodiment, the average pitch frequency is calculated by taking the arithmetic mean of three pitch frequencies, but the number of pitch frequencies used is not limited to three. Further, the calculation method for the average pitch frequency is not limited to taking the arithmetic mean, but other methods, such as a weighted average or moving average, may be used to calculate the average.
f3←f2
f2←f1
f1←f(i−3)
fm=(f3+f2+f1)/3
Then, in step 166, it is determined whether the average pitch frequency fm is either equal to or higher than a predetermined first threshold Th1 (for example, 200 Hz). If the answer in step 166 is Yes, that is, if the average pitch frequency fm is either equal to or higher than the first threshold Th1, it is determined that a speech section has begun here, and the gate signal g1 is set to “1” in step 167, after which the routine is terminated.
On the other hand, if the answer in step 166 is No, that is, if the average pitch frequency fm is lower than the first threshold Th1, then it is determined in step 168 whether the average pitch frequency fm is either equal to or higher than a predetermined second threshold Th2 (for example, 80 Hz). If the answer in step 168 is Yes, that is, if the average pitch frequency fm is either equal to or higher than the second threshold Th2, it is determined that the speech section is continuing, and the process proceeds to step 167 to maintain the gate signal g1 at “1”, after which the routine is terminated.
On the other hand, if the answer in step 168 is No, that is, if the average pitch frequency fm is lower than the second threshold Th2, it is determined that the speech section has ended, and the process proceeds to step 169 to reset the gate signal g1 to “0”, after which the routine is terminated.
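The first gate signal generation routine can be sketched as a small class that is called once per pitch mark. The class name `PitchGate` and its method are hypothetical. The sketch reads "maintain the gate signal" in the sense of the eighth to tenth aspects, that is, as hysteresis: the gate opens at Th1, closes below Th2, and otherwise keeps its previous state. That reading, and the zero initial values of the three stored pitch frequencies, are assumptions.

```python
class PitchGate:
    def __init__(self, fs, th1=200.0, th2=80.0, f_max=500.0):
        self.fs, self.th1, self.th2, self.f_max = fs, th1, th2, f_max
        self.j = None                 # index of the last pitch mark
        self.f = [0.0, 0.0, 0.0]      # f1, f2, f3 (initial values assumed)
        self.g1 = 0                   # gate signal

    def on_pitch_mark(self, k):
        """Update the gate for a pitch mark at sample index k; return g1."""
        if self.j is not None and k != self.j:
            f = self.fs / (k - self.j)             # step 161
            if f > self.f_max:                     # steps 162-163
                f = 0.0
            self.f = [f, self.f[0], self.f[1]]     # f3<-f2, f2<-f1, f1<-f
        self.j = k                                 # step 164
        fm = sum(self.f) / 3.0                     # step 165
        if fm >= self.th1:                         # steps 166-167: open
            self.g1 = 1
        elif fm < self.th2:                        # steps 168-169: close
            self.g1 = 0
        return self.g1                             # otherwise keep the state
```

With fs = 8000 Hz and pitch marks every 40 samples (200 Hz), the gate opens only at the fourth mark, when all three stored frequencies have reached 200 Hz. This is the delayed opening that the speech section signal described later compensates for.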
As can be seen from these figures, the duration period of the speech signal coincides with the period that the gate signal g1 remains open, but if noise occurs after the voice stops, a noise-induced pitch frequency (marked by ◯ in
Dt←{(i−3)−j}/fs
Next, in step 191, it is determined whether the elapsed time Dt is longer than a predetermined threshold time Dtth (for example, 0.025 second) and whether the gate signal g1 is “1” (that is, the gate is open). If the answer in step 191 is Yes, that is, if the gate is open, and if a time longer than 25 milliseconds has elapsed from the last time at which the speech pitch signal was “−1”, then in step 193 the corrected gate signal g1 is set to “0” to close the gate and, at the same time, the index j is updated and f2 and f3 are reset, after which the routine is terminated.
On the other hand, if the answer in step 191 is No, that is, if the gate is closed, or if a time longer than 25 milliseconds has not yet elapsed from the last time at which the speech pitch signal was “−1”, then the first gate signal generation routine shown in
In the above embodiment, the threshold time Dtth is set to 25 milliseconds because an interval longer than 25 milliseconds corresponds to a frequency lower than 40 Hz, and the pitch frequency of a human voice is hardly ever lower than 40 Hz. The corrected gate signal generated in the second gate signal generation routine is shown in
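The timeout test of step 191 reduces to a single comparison, sketched below. The function name `gate_timed_out` is hypothetical; the default threshold of 0.025 second is the example value given in the text, and the sampling frequency in the usage note is an assumption.

```python
def gate_timed_out(k, j, fs, g1, dt_th=0.025):
    """Return True when the gate is open and no pitch mark has been seen
    for longer than dt_th seconds (k: current index, j: last pitch mark)."""
    dt = (k - j) / fs          # elapsed time Dt in seconds
    return g1 == 1 and dt > dt_th
```

At an assumed fs of 8000 Hz, 25 milliseconds corresponds to 200 samples, so `gate_timed_out(300, 0, 8000.0, 1)` is true while `gate_timed_out(100, 0, 8000.0, 1)` is false. When the test succeeds, the caller closes the gate, updates the index j, and resets f2 and f3 as described for step 193.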
The speech section can be detected accurately by using the above corrected gate, but further accurate detection of the speech section can be achieved by solving the following problems.
1. As the gate is opened when the average value of three pitch frequencies becomes equal to or higher than the first threshold Th1, the open timing tends to be delayed.
2. It is not possible to discriminate between large-amplitude single-shot noise and a speech signal.
3. It is not possible to discriminate between an aspirated sound and noise.
4. It is not possible to detect a glottal stop sound since the amplitude of glottal stop sound is small.
The present invention solves the above problems by introducing a speech section signal which is controlled in the following manner by the gate signal (including the corrected gate signal). That is, to solve the problems 1, 2, and 3, when the gate signal has remained open for a time equal to or longer than a first prescribed period (for example, 50 milliseconds), the speech section signal is set open by going back in time (retroacting) for a second prescribed period (for example, 100 milliseconds) from the current point in time. To solve the problem 4, the speech section signal is maintained in the open state for a third prescribed period (for example, 150 milliseconds) from the moment the gate signal is closed.
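The three prescribed periods can be illustrated with an offline sketch that converts a complete gate signal into a speech section signal. It is not the sample-by-sample routine of the embodiment: the function name `speech_section` is hypothetical, the whole gate signal is assumed to be available as a list, and leaving the third-period counter untouched during a brief reopening of the gate is a simplification.

```python
def speech_section(g1, fs, t_open=0.05, t_back=0.10, t_hold=0.15):
    """Return the speech section signal g2 for a gate signal g1 (lists of 0/1)."""
    n_open, n_back, n_hold = (round(t * fs) for t in (t_open, t_back, t_hold))
    g2 = [0] * len(g1)
    run = 0     # consecutive samples with the gate open
    hold = 0    # samples of the third prescribed period still remaining
    for k, g in enumerate(g1):
        if g == 1:
            run += 1
            if run == n_open:
                # First period completed: open retroactively (second period).
                for m in range(max(0, k - n_back), k + 1):
                    g2[m] = 1
            elif run > n_open:
                g2[k] = 1
            if run >= n_open:
                hold = n_hold          # arm the third period
        else:
            run = 0
            if hold > 0:               # keep open after the gate closes
                g2[k] = 1
                hold -= 1
    return g2
```

With fs = 1000 Hz (one sample per millisecond), a gate that is open for samples 200 to 299 gives a speech section from sample 149 to sample 449. A 20-millisecond burst never completes the first period and produces no section, and a 120-millisecond gap such as a glottal stop is bridged because the hold and the retroaction overlap.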
If the answer in step 201 is Yes, that is, if the gate remains closed, closed state maintaining processing is performed in step 202, after which the process proceeds to step 207. If the answer in step 201 is No, that is, if the gate that was closed is now open, gate opening processing is performed in step 203, after which the process proceeds to step 207.
On the other hand, if the answer in step 200 is No, that is, if the gate was open, then it is determined in step 204 whether the gate signal g1 calculated this time is “1”, that is, whether the gate remains open. If the answer in step 204 is Yes, that is, if the gate remains open, open state maintaining processing is performed in step 205, after which the process proceeds to step 207. If the answer in step 204 is No, that is, if the gate that was open is now closed, gate closing processing is performed in step 206, after which the process proceeds to step 207.
In step 207, the speech section signal is output, and in the next step 208, the previously calculated gate signal g1b is updated to the gate signal g1 calculated this time, after which the routine is terminated.
If the answer in step 2b is Yes, that is, if 150 milliseconds have elapsed from the time the gate signal g1 was closed, then g2(i−3) as the speech section signal when the index indicating the processing time instant is (i−3) is set to “0” in step 2c, after which the routine is terminated. On the other hand, if the answer in step 2b is No, that is, if 150 milliseconds have not yet elapsed from the time the gate signal g1 was closed, the speech section signal g2(i−3) at the processing time instant (i−3) is set to “1” in step 2d, after which the routine is terminated.
If the answer in step 5b is No, that is, if 50 milliseconds have not yet elapsed from the time the gate signal g1 was opened, then g2(i−3) as the speech section signal when the index indicating the processing time instant is (i−3) is set to “0” in step 5c, after which the routine is terminated.
If the answer in step 5b is Yes, that is, if 50 milliseconds have elapsed from the time the gate signal g1 was opened, the index iB indicating the time instant that is 100 milliseconds, i.e., the second prescribed period, back from the processing time instant is calculated by the following equation.
iB←(i−3)−0.1/Δt
Here, the second term on the right-hand side indicates the number of samplings occurring in the 100-millisecond period. In step 5e, the index iB is set not smaller than zero in order to prevent going back into a region where no speech signal is present.
In step 5f, g2(iB) as the speech section signal when the index indicating the time instant is iB is set to “1”. In step 5g, it is determined whether the index iB is equal to the index (i−3) indicating the processing time instant, that is, whether the time has been made to go back for the second prescribed period. If the answer is No, that is, if the going back of time (retroaction) is not completed yet, the index iB is incremented in step 5h, and the process returns to step 5f. On the other hand, if the answer in step 5g is Yes, that is, if the going back of time is completed, the routine is terminated.
iB←(i−3)−0.1/Δt
In step 7b, the index iB is limited so that it is not smaller than zero, in order to prevent the time from going back into a region where no speech signal is present, and in step 7c g2(iB) is output, after which the routine is terminated.
W(iB)←X(iB)*g2(iB)
Here, X(iB) is the speech signal stored in the memory 24. In step 261, W(iB) is output, after which the routine is terminated.
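The delayed read-out and the gating W(iB)←X(iB)*g2(iB) can be illustrated with a short Python sketch. It is a sketch under assumptions only: the function name `gated_sample` is hypothetical, X and g2 are modeled as Python lists indexed by sample number, and `dt` stands for the sampling interval Δt in seconds.

```python
def gated_sample(x: list, g2: list, i: int, dt: float):
    """Return (iB, W(iB)) for processing index i.

    x:  stored speech samples (the text's X, held in memory)
    g2: speech section signal, 1 inside a speech section and 0 outside
    """
    i_b = (i - 3) - round(0.1 / dt)  # iB <- (i-3) - 0.1/dt, i.e. 100 ms back
    i_b = max(i_b, 0)                # never index before the first stored sample
    w = x[i_b] * g2[i_b]             # W(iB) <- X(iB) * g2(iB)
    return i_b, w
```

Reading the output 100 milliseconds behind the processing time instant means that the retroactively set portion of g2 is already settled when the corresponding sample of X is gated.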
As described above, according to the speech section detection apparatus in the first aspect of the invention, the gate signal is controlled based on the speech pitch extracted by processing the speech signal in the time domain, and the speech section is detected based on the gate signal; accordingly, the speech section can be detected using a simple configuration.
According to the speech section detection apparatus in the second aspect of the invention, it becomes possible to segment the speech signal into a plurality of speech sections, based on the speech section.
According to the speech section detection apparatus in the third aspect of the invention, as the speech section is detected based on the speech pitch extracted by processing the speech signal in the time domain, the speech section can be detected in near real time.
According to the speech section detection apparatus in the fourth aspect of the invention, it becomes possible to suppress variations in the amplitude of the speech signal.
According to the speech section detection apparatus in the fifth aspect of the invention, it becomes possible to reliably remove noise contained in the speech signal.
According to the speech section detection apparatus in the sixth aspect of the invention, it becomes possible to reliably extract the speech pitch because the amplitude of the speech signal is made essentially constant.
According to the speech section detection apparatus in the seventh aspect of the invention, it becomes possible to prevent the introduction of noise by re-setting the constant-amplitude gain to unity gain when the constant-amplitude gain is equal to a predetermined threshold value.
According to the speech section detection apparatus in the eighth aspect of the invention, it becomes possible to prevent the gate signal from being erroneously opened by being affected by noise.
According to the speech section detection apparatus in the ninth aspect of the invention, it becomes possible to prevent the gate signal from being erroneously closed by being affected by noise.
According to the speech section detection apparatus in the 10th aspect of the invention, it becomes possible to reliably close the gate signal when the speech pitch is no longer extracted.
According to the speech section detection apparatus in the 11th aspect of the invention, it becomes possible to compensate for a delay in closing the gate signal and also to reliably eliminate noise by discriminating noise from an aspirated sound.
According to the speech section detection apparatus in the 12th aspect of the invention, it becomes possible to reliably detect a glottal stop sound whose amplitude is small.
According to the speech section detection apparatus in the 13th aspect of the invention, it becomes possible to prevent erroneous detection even when one speech section overlaps with another speech section.
The invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiment is therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Number | Name | Date | Kind |
---|---|---|---|
4959865 | Stettiner et al. | Sep 1990 | A |
5121428 | Uchiyama et al. | Jun 1992 | A |
5123048 | Miyamae et al. | Jun 1992 | A |
5596680 | Chow et al. | Jan 1997 | A |
5774837 | Yeldener et al. | Jun 1998 | A |
6782360 | Gao et al. | Aug 2004 | B1 |
6871176 | Choi et al. | Mar 2005 | B2 |
Number | Date | Country |
---|---|---|
9-50297 | Feb 1997 | JP |
Number | Date | Country |
---|---|---|
20040193406 A1 | Sep 2004 | US |