This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2010-084328, filed on Mar. 31, 2010; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to speech dialog processing.
U.S. Pat. No. 5,155,760 discloses technology that interrupts response speech when an interrupting utterance by a user is detected while the response speech is being output. According to the method of U.S. Pat. No. 5,155,760, the user does not have to continue listening to response speech needlessly.
However, with the method of U.S. Pat. No. 5,155,760, even when the user wants to continue listening to the response speech, the response speech may be interrupted unintentionally because noise or the like is erroneously detected as the start of an utterance by the user.
In general, according to one embodiment, a speech dialog apparatus includes a speech detection unit that detects a start and an end of echo removed speech obtained by removing an echo of response speech contained in input speech; a response interruption control unit that outputs a response interruption command if the end is not yet detected when a predetermined period from the detection of the start passes; and a dialog control unit that causes a response speech output unit to interrupt output of the response speech upon receipt of the response interruption command from the response interruption control unit.
Exemplary embodiments of the speech dialog apparatus will be described with reference to the accompanying drawings.
A speech dialog apparatus according to this embodiment determines that detected speech is not user speech but noise, and does not interrupt the response speech, when the end of the speech is detected before a predetermined period T has passed from its start. Conversely, if the end of the speech is first detected after the predetermined period T has passed from its start, the apparatus determines that the detected speech is not noise but user speech and interrupts the response speech. Thus, the function of interrupting response speech when the user starts to utter is realized, while interruptions of response speech caused by erroneously detecting short noise as user speech are reduced.
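As a rough illustration of this decision rule (not part of the embodiment; the function name and the default value of T are hypothetical), a detected segment can be classified by comparing its duration with the period T:

```python
def classify_segment(start_time: float, end_time: float, period_t: float = 0.2) -> str:
    """Classify a detected speech segment by its duration.

    A segment whose end arrives before period_t has elapsed since its start
    is treated as noise; otherwise it is treated as a user utterance that
    should interrupt the response speech.
    """
    duration = end_time - start_time
    return "user_speech" if duration >= period_t else "noise"


if __name__ == "__main__":
    print(classify_segment(1.00, 1.08))  # short burst -> 'noise'
    print(classify_segment(1.00, 1.45))  # sustained speech -> 'user_speech'
```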
When a talk switch 2 is pressed by a user 1 and a session start command is sent to a dialog control unit 3 (step S1), the dialog control unit 3 sends a command instructing output of response speech to a response speech output unit 4.
Upon receipt of the command, the response speech output unit 4 generates a signal x (t) of the instructed response speech. The signal x (t) is amplified and output from a speaker 5 toward the user 1.
At this point, the dialog control unit 3 sends a speech input control command instructing the start of speech input to a speech input unit 6. Upon receipt of the command, the speech input unit 6 starts speech input via a microphone 7 so that speech uttered by the user 1 can be input (step S2).
In conjunction with the input of speech by the speech input unit 6, an echo cancel unit 10 generates and outputs a signal e (t) that is a signal of echo removed speech obtained by canceling (removing), from a signal m (t) of microphone input speech by the speech input unit 6, an echo of the signal x (t) of the response speech (step S3).
A speech detection unit 8 calculates an evaluation value S from the signal e (t) of echo removed speech by the echo cancel unit 10 to determine whether the user 1 has uttered based on the evaluation value S. If the start or end of speech uttered by the user 1 is detected or determined, the speech detection unit 8 also sends start/end information representing the detection/determination to a speech recognition unit 9 and a response interruption control unit 11 (step S4).
If the start is detected by the speech detection unit 8 (“Yes” in step S5), the response interruption control unit 11 activates a timer with the period T as a timeout period, and the speech recognition unit 9 starts speech recognition of the signal e (t) of echo removed speech (step S6).
The period T set to the timer as a timeout period by the response interruption control unit 11 may be set to the shortest utterance length of the user 1; for example, Japanese Patent No. 4282704 suggests that this length may be about 200 ms. The length of the user's reply to response speech can be predicted to a certain extent from the content of that response speech. Thus, a table that associates the period T with each type of system response speech may be created so that the period T can be switched by referring to the table as the response speech changes while the dialog proceeds.
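Such switching can be illustrated by a simple lookup table keyed by the type of system response. In the sketch below the response types and the values of T are hypothetical examples only, not values taken from the embodiment.

```python
# Hypothetical mapping from response type to the timeout period T (seconds).
# The values are examples only; an actual table would be tuned per system.
PERIOD_T_BY_RESPONSE_TYPE = {
    "yes_no_question": 0.20,   # short replies such as "yes"/"no" are expected
    "open_question": 0.35,     # longer replies are expected
    "confirmation": 0.25,
}

DEFAULT_PERIOD_T = 0.20  # fallback, e.g. the shortest expected utterance length


def period_t_for(response_type: str) -> float:
    """Return the timeout period T for the current system response."""
    return PERIOD_T_BY_RESPONSE_TYPE.get(response_type, DEFAULT_PERIOD_T)


if __name__ == "__main__":
    print(period_t_for("open_question"))  # 0.35
    print(period_t_for("greeting"))       # falls back to 0.20
```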
Next, conditional branching in step S7, illustrated in Table (a), will be described.
If a timeout of the timer has occurred at the time when the first end, which is detected after detection of the start, is detected (the timer is “Timeout” and the first end is “Detected” in Table (a)), the processing branches to “B” in step S7, and the response interruption control unit 11 outputs a response interruption command so that output of the response speech is interrupted (step S9).
If a timeout of the timer has not occurred when the first end, which is detected after detection of the start, is detected (the timer is “Not-yet-timeout” and the first end is “Detected” in Table (a)), the processing branches to “C” in step S7, and the timer and the speech recognition are stopped, the detected speech being discarded as noise (step S8).
Otherwise, processing proceeds without doing anything (“A” in step S7).
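A minimal sketch of the branching in Table (a), assuming a hypothetical helper that is evaluated in step S7 with the current timer state and whether the first end has been detected, could look as follows; the branch labels correspond to “A”, “B”, and “C” above.

```python
def branch_table_a(timer_timed_out: bool, first_end_detected: bool) -> str:
    """Conditional branching corresponding to Table (a).

    Returns:
        "B" - a timeout has occurred: output a response interruption command.
        "C" - the first end was detected before the timeout: treat the
              segment as noise, stop the timer and speech recognition.
        "A" - neither condition holds yet: do nothing and continue.
    """
    if timer_timed_out:
        return "B"
    if first_end_detected:
        return "C"
    return "A"


if __name__ == "__main__":
    print(branch_table_a(False, False))  # "A": keep waiting
    print(branch_table_a(False, True))   # "C": short segment, discard as noise
    print(branch_table_a(True, False))   # "B": interrupt the response speech
    print(branch_table_a(True, True))    # "B": end arrived exactly at timeout
```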
Subsequently, if the end is determined by the speech detection unit 8 (“Yes” in step S10), the speech recognition unit 9 ends speech recognition of the signal e (t) of echo removed speech and outputs the recognition result to the dialog control unit 3 (step S11).
Upon receipt of the recognition result of the speech recognition unit 9, the dialog control unit 3 causes the response speech output unit 4 to stop response speech output. Furthermore, the dialog control unit 3 stops speech input of the speech input unit 6 and the operation of the echo cancel unit 10 linked thereto, performs a service/process for the user 1 in accordance with the recognition result, and proceeds with the dialog sequence (step S12).
Thus, the speech dialog apparatus according to the embodiment operates in such a way that output of the response speech is interrupted if the end has only just been detected, or has not yet been detected, when the period T measured by the timer passes after the start is detected. As a result, short noise whose start and end both fall within the period T does not cause the response speech to be interrupted, while an actual utterance by the user 1 does.
Incidentally, the speech detection unit 8 detects or determines the start/end of speech by an automaton. In an initial noise state 101, when the evaluation value S becomes equal to or greater than a start detection threshold Th1, the automaton makes a transition to a start detection state 102.
In the start detection state 102, if a period D1 in which the evaluation value S is equal to or greater than the threshold Th1 continues for a start detection time Ts, the speech detection unit 8 detects the start time of the period D1 as the start of the speech interval and outputs start/end information indicating that the start is detected. Then, the automaton makes a transition to an end detection state 103. Conversely, if the period D1 does not continue for the start detection time Ts, the automaton is brought back to the noise state 101.
In the end detection state 103, if a period D2 in which the evaluation value S falls below an end detection threshold Th2 continues for an end detection time Te1 or longer, the speech detection unit 8 detects the start time of the period D2 as the end of the speech interval and immediately outputs start/end information indicating that the end is detected. Then, the automaton makes a transition from the end detection state 103 to an end determination state 104.
In the end determination state 104, if the period D2 further continues for an end determination time Te2 (>Te1) or longer, the speech detection unit 8 determines that the previously detected end is the end and outputs start/end information indicating that the end is determined. Then, the automaton makes a transition to the initial noise state 101. Conversely, if the period D2 does not continue for the end determination time Te2 or longer, the automaton is brought back to the previous end detection state 103. As a result, the end may be detected a plurality of times between the start detection and end determination.
Thus, the start/end information output by the speech detection unit 8 is output in the order of the start detection (the transition from the start detection state 102 to the end detection state 103), end detection (the transition from the end detection state 103 to the end determination state 104, which may occur a plurality of times in some cases), and end determination (the transition from the end determination state 104 to the noise state 101) and at least delays in accordance with Ts, Te1, and Te2 arise therebetween, respectively.
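One possible frame-based reading of this automaton is sketched below in Python; the class name, the parameter values, and the assumption that one call corresponds to one analysis frame are illustrative and not taken from the embodiment.

```python
from enum import Enum, auto
from typing import Optional


class State(Enum):
    NOISE = auto()              # noise state 101
    START_DETECTION = auto()    # start detection state 102
    END_DETECTION = auto()      # end detection state 103
    END_DETERMINATION = auto()  # end determination state 104


class SpeechDetector:
    """Frame-based sketch of the start/end detection automaton."""

    def __init__(self, th1=2.0, th2=1.0, ts=5, te1=10, te2=30):
        self.th1, self.th2 = th1, th2               # start / end detection thresholds
        self.ts, self.te1, self.te2 = ts, te1, te2  # Ts, Te1, Te2 in frames
        self.state = State.NOISE
        self.d1 = 0  # length of the current period D1 (S >= Th1), in frames
        self.d2 = 0  # length of the current period D2 (S < Th2), in frames

    def process(self, s: float) -> Optional[str]:
        """Consume one evaluation value S and return an event name, if any."""
        if self.state is State.NOISE:
            if s >= self.th1:                       # D1 begins
                self.state, self.d1 = State.START_DETECTION, 1
        elif self.state is State.START_DETECTION:
            if s >= self.th1:
                self.d1 += 1
                if self.d1 >= self.ts:              # D1 lasted Ts: start detected
                    self.state, self.d2 = State.END_DETECTION, 0
                    return "start_detected"
            else:                                   # D1 too short: back to noise
                self.state = State.NOISE
        elif self.state is State.END_DETECTION:
            if s < self.th2:
                self.d2 += 1
                if self.d2 >= self.te1:             # D2 lasted Te1: end detected
                    self.state = State.END_DETERMINATION
                    return "end_detected"
            else:
                self.d2 = 0                         # D2 broken: keep waiting
        elif self.state is State.END_DETERMINATION:
            if s < self.th2:
                self.d2 += 1
                if self.d2 >= self.te2:             # D2 lasted Te2: end determined
                    self.state = State.NOISE
                    return "end_determined"
            else:                                   # speech resumed before Te2
                self.state, self.d2 = State.END_DETECTION, 0
        return None
```

Because the automaton may bounce between the end detection state 103 and the end determination state 104, this sketch can emit "end_detected" several times before a single "end_determined", matching the behavior described above.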
The flow of processing described above can be summarized as follows.
Before the start is detected, the processing proceeds like S3→S4→S5→S7→S10→S3. At this point, the start is not yet detected and the timer is not activated and thus, the processing always proceeds to “A” from S7.
If the processing proceeds to “Yes” after the start is detected in S5, the timer is activated in S6 and speech recognition is also started. Thereafter, speech recognition continues until it is stopped.
After the start is detected, the processing proceeds again like S3→S4→S5→S7→S10→S3 until the first end is detected or a timeout of the timer occurs.
The timer has been activated at this point. If a timeout of the timer occurs between the time when the start is detected and the time when the first end is detected, the processing branches to “B” in S7 and proceeds like S9→S3, determining that the detected speech is an utterance whose length exceeds the period T.
If a timeout of the timer has not yet occurred when the first end is detected after the start is detected, the processing branches to “C” in S7, and the timer and speech recognition are stopped in S8 to wait for detection of a new start, determining that the detected speech sandwiched between the start and end is noise shorter than the period T. At this point, the automaton is in the end determination state 104 and returns to the initial noise state 101 once the end is determined.
If a timeout of the timer occurs at the time when the first end is detected after the start is detected, the processing similarly branches to “B” in S7 and proceeds like S9→S3, determining that the detected speech sandwiched between the start and end is an utterance whose length is equal to the period T.
Thus, the interruption of response speech is decided based on whether the elapsed time between the detection of the start and the detection of the first end is longer or shorter than the predetermined period T. The end determination needs an extra time of Te2−Te1 compared with the end detection, so a faster response is ensured by making the response interruption determination at the time when the first end is detected.
In consideration of the possibility that the detected start/end is discarded as noise, a recognition result is configured to be output after waiting until the end is determined. The processing in this case proceeds like S3→S4→S5→S7→S10→S11→S12.
Furthermore, conditional branching illustrated in Table (b) of
The difference between Table (a) and Table (b) in
Next, the echo cancel unit 10 will be described.
A microphone signal m (t) input by a microphone signal input unit 22 is the signal m (t) of microphone input speech from the speech input unit 6. A reference signal x (t) input by a reference signal input unit 21 is the signal x (t) of response speech from the response speech output unit 4. An error signal e (t) output by an error signal output unit 25 is the signal e (t) of echo removed speech to be output by the echo cancel unit 10. A response interruption command input by a response interruption command input unit 27 is the response interruption command from the response interruption control unit 11.
The echo cancel unit 10 includes a filter unit 23 that imitates transfer characteristics of an echo path from the speaker 5 to the microphone 7 and thereby generates an echo replica signal y (t) from the reference signal x (t).
The echo replica signal y (t) is a signal imitating an echo of response speech mixed in the microphone signal m (t). The error signal e (t) from which an echo is removed can be generated, according to Formula 1, by subtracting the echo replica signal y (t) from the microphone signal m (t) using a subtracter 24. By outputting the error signal e (t) from the error signal output unit 25, the echo cancel unit 10 can send the signal e (t) of echo removed speech to the subsequent stage.
Echo cancel output calculation formulas are represented by Formula 1 and Formula 2.
e(t)=m(t)−y(t) (1)
y(t)=WT(t)X(t) (2)
where W (t) and X (t) are represented by column vectors in Formula 3.
W(t)=[w(0,t),w(1,t), . . . w(N−1,t)]T
X(t)=[x(t),x(t−1), . . . x(t−N+1)]T (3)
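As a concrete illustration of Formulas 1 to 3, the following Python sketch (using NumPy; the function and variable names are illustrative and not taken from the embodiment) computes the echo replica and the echo removed sample for a single time step.

```python
import numpy as np


def echo_cancel_sample(w, x_buf, m_t):
    """Apply Formulas 1 and 2 for a single time step t.

    w     : tap coefficient vector W(t), shape (N,)
    x_buf : reference signal vector X(t) = [x(t), x(t-1), ..., x(t-N+1)]
    m_t   : microphone sample m(t)
    Returns (e_t, y_t): the echo-removed sample and the echo replica.
    """
    y_t = float(w @ x_buf)   # Formula 2: y(t) = W(t)^T X(t)
    e_t = m_t - y_t          # Formula 1: e(t) = m(t) - y(t)
    return e_t, y_t


if __name__ == "__main__":
    w = np.array([0.5, 0.3, 0.1, 0.05])      # example tap coefficients W(t)
    x_buf = np.array([0.2, -0.1, 0.4, 0.0])  # X(t), newest sample first
    m_t = 0.15                               # microphone sample m(t)
    print(echo_cancel_sample(w, x_buf, m_t))
```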
If transfer characteristics of the echo path from the speaker 5 to the microphone 7 are correctly provided to the filter unit 23 by preparations in advance, there is no need to execute an adaptive algorithm described later. However, in an operating environment in which the transfer characteristics of the echo path change from moment to moment, an adaptive algorithm that asymptotically finds the correct transfer characteristics based on observed signals needs to be executed.
As a group of adaptive algorithms, the stochastic gradient algorithm is known, which corrects the tap coefficients in the direction of the gradient (called the stochastic gradient), with respect to the tap coefficients, of the instantaneous squared error, that is, the square of e (t). The tap coefficient correction formula of the stochastic gradient algorithm can be generalized by the recurrence formula of Formula 4, where e (t) denotes the instantaneous value of the error signal at time t.
W(t+1)=W(t)+μ·γ·G(e(t))·X(t) (4)
where the positive number γ is a normalization coefficient, the positive number μ is a step size that controls the scale of the correction, and G (e (t)) is a function of the instantaneous value e (t), each of which is a scalar quantity. The second term of the right-hand side of Formula 4 represents the amount of coefficient correction, indicating how much to correct the tap coefficient value W (t) at time t. A tap coefficient correction unit 26 corrects the tap coefficients according to Formula 4.
An algorithm obtained by applying a function G defined in Formula 5 and the normalization coefficient γ to Formula 4 is the NLMS algorithm (normalized LMS algorithm).
G(e(t))=e(t), γ=1/(XT(t)X(t)) (5)
where XT(t)X(t) is the sum of the power of the N reference signal values from the current time back to (N−1) samples in the past. The NLMS algorithm asymptotically determines the tap coefficients that minimize the mean-square value of the error signal by using the instantaneous error value e (t) at each time.
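Assuming the standard NLMS form shown above for Formula 5, the tap coefficient correction of Formula 4 can be sketched as follows. The names are illustrative, and the small constant eps is an added safeguard against division by zero, not part of the formulas.

```python
import numpy as np


def nlms_update(w, x_buf, e_t, mu=0.5, eps=1e-8):
    """One NLMS tap coefficient correction (Formula 4 with Formula 5).

    w     : current tap coefficients W(t), shape (N,)
    x_buf : reference signal vector X(t), shape (N,)
    e_t   : instantaneous error e(t) = m(t) - y(t)
    mu    : step size controlling the scale of the correction
    Returns W(t+1).
    """
    gamma = 1.0 / (float(x_buf @ x_buf) + eps)  # normalization: 1 / (X^T X)
    return w + mu * gamma * e_t * x_buf         # W(t+1) = W(t) + mu*gamma*e(t)*X(t)


if __name__ == "__main__":
    w = np.zeros(4)                          # initial tap coefficients
    x_buf = np.array([0.2, -0.1, 0.4, 0.0])  # reference signal vector X(t)
    e_t = 0.15                               # instantaneous error e(t)
    print(nlms_update(w, x_buf, e_t))
```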
However, generation of the echo replica signal y (t) by Formula 2 and correction of the tap coefficients by Formula 4 described above incur high computation costs. On the other hand, the operations of Formula 1 and the related formulas for determining the error signal e (t) are needed only while response speech is output, that is, while an echo of response speech is present. Thus, if the response speech is interrupted, it is desirable to stop the operations of at least Formula 1 and Formula 2 and, if the tap coefficients are being corrected adaptively, to further stop the operations of Formula 4.
Accordingly, when a response interruption command from the response interruption control unit 11 is received via the response interruption command input unit 27, the filter unit 23 stops the operations of Formula 2, the subtracter 24 stops the operations of Formula 1, and the tap coefficient correction unit 26 stops the operations of Formula 4.
However, the error signal e (t) output by the error signal output unit 25 becomes indefinite after these operations are stopped; therefore, switching is performed by a signal switching unit 28 so that the microphone signal m (t) is output as the error signal e (t). Processing in this manner causes no problem because no echo of response speech is superimposed on the microphone signal m (t) once the response speech has been interrupted.
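A minimal sketch of this stop-and-bypass behavior, assuming a per-sample processing function with a hypothetical interrupted flag, might look like the following.

```python
import numpy as np


def echo_cancel_step(w, x_buf, m_t, interrupted, mu=0.5, eps=1e-8):
    """Per-sample echo cancellation with the stop/bypass behavior.

    When `interrupted` is True (a response interruption command has been
    received), Formulas 1, 2 and 4 are not evaluated and m(t) is passed
    through as e(t), mirroring the switching by the signal switching unit 28.
    Returns (e_t, w): the output sample and the (possibly updated) taps.
    """
    if interrupted:
        return m_t, w                        # bypass: e(t) = m(t), no adaptation
    y_t = float(w @ x_buf)                   # Formula 2: echo replica
    e_t = m_t - y_t                          # Formula 1: echo removed sample
    gamma = 1.0 / (float(x_buf @ x_buf) + eps)
    w = w + mu * gamma * e_t * x_buf         # Formula 4 (NLMS correction)
    return e_t, w


if __name__ == "__main__":
    w = np.zeros(4)
    x_buf = np.array([0.2, -0.1, 0.4, 0.0])
    print(echo_cancel_step(w, x_buf, 0.15, interrupted=False)[0])
    print(echo_cancel_step(w, x_buf, 0.15, interrupted=True)[0])  # passes m(t)
```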
By stopping the echo cancel processing of the echo cancel unit 10 as described above once the response speech is interrupted, the period in which echo cancel processing and speech recognition processing run concurrently can be kept within the period T. The shorter the period T is made, the more easily the apparatus can be realized with an arithmetic unit whose capability is only enough to perform the more computationally demanding of the two kinds of processing. This brings the advantage that arithmetic units of lower capability can be used, so that the apparatus can be provided at a lower price.
Otherwise, the period in which echo cancel processing by the echo cancel unit 10 and speech recognition processing by the speech recognition unit 9 run concurrently would extend, and the computation cost would be the sum of both, so that a more expensive arithmetic unit with capabilities high enough to perform both kinds of processing in real time would be needed.
Incidentally, the embodiment can be practiced in various modified forms.
For example, it is possible to allow the user to select the method of interrupting response speech from a plurality of choices, including the interruption method described above, in accordance with the operating environment or the user's preferences.
For example, if there is almost no noise in the operating environment of the speech dialog apparatus, erroneous detection due to noise can be considered to occur very rarely. In that case, if, like the conventional technology, the response interruption control unit 11 interrupts response speech when the start is detected, the period in which echo cancel processing and speech recognition processing run concurrently can be eliminated, that is, the computation cost can be reduced as much as possible. On the other hand, there may be a case where sufficient arithmetic capability is available and, according to the user's preferences, there is no need to interrupt the response speech as soon as the user starts to speak. In this case, if the response interruption control unit 11 interrupts response speech when the end is determined, the user can continue to listen to the response speech until the user ceases to talk.
Three interruption modes can thus be considered: interruption mode 1, in which the response speech is interrupted when the start is detected; interruption mode 2, in which the response speech is interrupted based on the period T as described above; and interruption mode 3, in which the response speech is interrupted when the end is determined.
The flow of processing in interruption mode 2 is as described above.
Next, the flow of processing in interruption mode 1 will be described.
When the talk switch 2 is pressed by the user 1 and a session start command is sent to the dialog control unit 3 (step S21), the dialog control unit 3 sends a command instructing output of response speech to the response speech output unit 4.
Upon receipt of the command, the response speech output unit 4 generates the signal x (t) of the instructed response speech. The signal x (t) is amplified and output from the speaker 5 toward the user 1.
At this point, the dialog control unit 3 sends a speech input control command instructing the start of speech input to the speech input unit 6. Upon receipt of the command, the speech input unit 6 starts speech input via the microphone 7 so that speech uttered by the user 1 can be input (step S22).
In conjunction with the input of speech by the speech input unit 6, the echo cancel unit 10 generates and outputs the signal e (t) of echo removed speech obtained by canceling (removing) an echo of the signal x (t) of response speech from the signal m (t) of microphone input speech by the speech input unit 6 (step S23).
The speech detection unit 8 calculates the evaluation value S from the signal e (t) of echo removed speech by the echo cancel unit 10 to determine whether the user 1 has uttered based on the evaluation value S. If the start or end of speech uttered by the user 1 is detected, the speech detection unit 8 also sends start/end information to the speech recognition unit 9 and the response interruption control unit 11 (step S24).
When start/end information indicating the start detection is output by the speech detection unit 8 (“Yes” in step S25), the response interruption control unit 11 immediately sends a response interruption command to the dialog control unit 3 so that output of the response speech is interrupted, and the speech recognition unit 9 starts speech recognition of the signal e (t) of echo removed speech (steps S26 and S27).
Then, when start/end information indicating the end determination is output from the speech detection unit 8 (right branching in step S28), the speech recognition unit 9 ends speech recognition of the signal e (t) and outputs the recognition result to the dialog control unit 3 (step S29).
Upon receipt of the recognition result of the speech recognition unit 9, the dialog control unit 3 causes the response speech output unit 4 to stop response speech output. Furthermore, the dialog control unit 3 stops speech input of the speech input unit 6 and the operation of the echo cancel unit 10 linked thereto, performs a service/process for the user 1 in accordance with the recognition result, and proceeds with the dialog sequence (step S30).
Next, the flow of processing in interruption mode 3 will be described.
When the talk switch 2 is pressed by the user 1 and a session start command is sent to the dialog control unit 3 (step S41), the dialog control unit 3 sends a command instructing output of response speech to the response speech output unit 4.
Upon receipt of the command, the response speech output unit 4 generates the signal x (t) of the instructed response speech. The signal x (t) is amplified and output from the speaker 5 toward the user 1.
At this point, the dialog control unit 3 sends a speech input control command instructing the start of speech input to the speech input unit 6. Upon receipt of the command, the speech input unit 6 starts speech input via the microphone 7 so that speech uttered by the user 1 can be input (step S42).
In conjunction with the input of speech by the speech input unit 6, the echo cancel unit 10 generates and outputs the signal e (t) of echo removed speech obtained by canceling (removing) an echo of the signal x (t) of response speech from the signal m (t) of microphone input speech by the speech input unit 6 (step S43).
The speech detection unit 8 calculates the evaluation value S from the signal e (t) of echo removed speech by the echo cancel unit 10 to determine whether the user 1 has uttered based on the evaluation value S. If the start or end of speech uttered by the user 1 is detected, the speech detection unit 8 also sends start/end information to the speech recognition unit 9 and the response interruption control unit 11 (step S44).
When start/end information indicating the start detection is output by the speech detection unit 8 (“Yes” in step S45), the speech recognition unit 9 starts speech recognition of the signal e (t) of echo removed speech (step S46).
Then, when start/end information indicating the end determination is output from the speech detection unit 8 (“Yes” in step S47), the response interruption control unit 11 sends a response interruption command to the dialog control unit 3, the dialog control unit 3 causes the response speech output unit 4 to interrupt output of the response speech, and the speech recognition unit 9 ends speech recognition of the signal e (t) and outputs the recognition result to the dialog control unit 3 (steps S48 and S49).
Upon receipt of the recognition result of the speech recognition unit 9, the dialog control unit 3 stops speech input of the speech input unit 6 and the operation of the echo cancel unit 10 linked thereto, performs a service/process for the user 1 in accordance with the recognition result, and proceeds with the dialog sequence (step S50).
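As a summary of the three interruption modes, the following hypothetical dispatch sketches when the response interruption control unit 11 would issue a response interruption command. The mode numbers follow the description above, while the event names and the function itself are illustrative assumptions; in interruption mode 2 the timer handling shown earlier supplies the timeout information.

```python
def should_interrupt(mode: int, event: str, timer_timed_out: bool = False) -> bool:
    """Decide whether to send a response interruption command.

    mode 1: interrupt as soon as the start of user speech is detected.
    mode 2: interrupt when the period T passes without the end being
            detected, or when the first end arrives once the timer has
            already timed out.
    mode 3: interrupt only when the end of user speech is determined.
    `event` is one of "start_detected", "end_detected", "end_determined",
    "timer_timeout" (names are illustrative).
    """
    if mode == 1:
        return event == "start_detected"
    if mode == 2:
        return event == "timer_timeout" or (
            event == "end_detected" and timer_timed_out)
    if mode == 3:
        return event == "end_determined"
    raise ValueError(f"unknown interruption mode: {mode}")


if __name__ == "__main__":
    print(should_interrupt(1, "start_detected"))                       # True
    print(should_interrupt(2, "end_detected", timer_timed_out=False))  # False: noise
    print(should_interrupt(3, "end_determined"))                       # True
```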
Embodiments can also be practiced as a computer program that performs the processing described above.
More specifically, the speech dialog apparatus can be implemented by a computer that stores a speech dialog processing program for executing the processing steps described above.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.