Traditional speech dialog systems usually playback prompts as soon as the respective information is available to the system. This happens regardless of the current conversational situation the user may be in at that time. For example, the driver of a vehicle can be in a conversation with a passenger, yet the navigation system may barge-in and interrupt the conversation. This may not only be perceived as “impolite” or annoying by the user, e.g., the driver, but the user might also miss the information being prompted.
Disclosed herein are systems and methods that are aware of an ongoing conversation and that are configured to make use of this awareness to intelligently schedule a speech prompt to an intended addressee.
An example embodiment of a method for intelligently scheduling a speech prompt in a speech dialog system includes monitoring an acoustic environment to detect an intended addressee's availability for a speech prompt having a measure of urgency corresponding therewith. Based on the intended addressee's availability, a time is predicted that is convenient to present the speech prompt to the intended addressee. The speech prompt is scheduled based on the predicted time and the measure of urgency.
Monitoring the acoustic environment can include detecting an acoustic signal associated with the acoustic environment to produce a detected acoustic signal, applying speech signal enhancement to the detected acoustic signal to produce an enhanced detected acoustic signal, and generating an enhanced speech signal and a speech activity signal as a function of the enhanced detected acoustic signal.
The method for intelligently scheduling the speech prompt can include detecting dialog from the speech activity signal. Alternatively, or in addition, the method can include capturing a video signal associated with the acoustic environment and applying visual speech activity detection to the video signal to generate a visual speech activity signal. The dialog can be detected from the speech activity signal, the visual speech activity signal, or both.
The method can include applying voice biometry analysis to the enhanced speech signal to detect involvement of the intended addressee in the dialog. The method can include applying one or more of automatic speech recognition, prosody analysis, and syntactic analysis to the enhanced speech signal to generate one or more speech analysis results. Pause prediction can be applied to the enhanced speech signal based on the one or more speech analysis results.
Predicting the time that is convenient to present the speech prompt can include estimating rudeness of interruption based on the pause prediction and dialog detection to generate a measure of rudeness. The measure of rudeness can be estimated using a cost function that includes cost for presence of an utterance, cost for presence of a conversation, and cost for involvement of the intended addressee in the conversation.
Scheduling the speech prompt can include trading off the measure of urgency and the measure of rudeness. The trading off can include computing an urgency-rudeness ratio as the ratio of the measure of urgency and the measure of rudeness. The prompt can be scheduled based on a comparison of the urgency-rudeness ratio to a threshold. The threshold may be pre-selected according to a particular application but the system may allow adjustment of the threshold, e.g., in response to user input or in response to timing considerations.
An example embodiment of a speech dialog system for intelligently scheduling a speech prompt includes a dialog manager, a scheduler configured to schedule the speech prompt, and a processor in communication with the dialog manager and scheduler. The dialog manager is configured to monitor an acoustic environment to detect an intended addressee's availability for a speech prompt having a measure of urgency corresponding therewith. The processor is configured to (i) predict a time that is convenient to present the speech prompt to the intended addressee based on the intended addressee's availability, and (ii) cause the scheduler to schedule the speech prompt based on the predicted time and the measure of urgency.
The system can include a microphone system configured to detect an acoustic signal associated with the acoustic environment to produce a detected acoustic signal. A speech processor, in communication with the dialog manager, can be configured to apply speech signal enhancement to the detected acoustic signal to produce an enhanced detected acoustic signal. The speech processor can be configured to generate an enhanced speech signal and a speech activity signal as a function of the enhanced detected acoustic signal. The dialog manager can be configured to detect dialog from the speech activity signal.
The system can include a camera that is configured to capture a video signal associated with the acoustic environment. A video processor, in communication with the dialog manager, can be configured to apply visual speech activity detection to the video signal to generate a visual speech activity signal. The dialog manager can be configured to detect the dialog from the speech activity signal and the visual speech activity signal.
The system can include a voice analyzer that is in communication with the dialog manager and that is configured to apply voice biometry analysis to the enhanced speech signal to detect involvement of the intended addressee in the dialog.
The system can include a speech recognition engine that is in communication with the processor and configured to apply one or more of automatic speech recognition, prosody analysis, and syntactic analysis to the enhanced speech signal to generate one or more speech analysis results.
The processor can be configured to apply pause prediction to the enhanced speech signal based on the one or more speech analysis results. For example, the processor can be configured to predict the time that is convenient to present the speech prompt by estimating rudeness of interruption based on the pause prediction and dialog detection to generate a measure of rudeness. The processor can be configured to cause the scheduler to schedule the speech prompt by trading off the measure of urgency and the measure of rudeness.
An example embodiment of a non-transitory computer-readable medium includes computer code instructions stored thereon for intelligently scheduling a speech prompt in a speech dialog system, the computer code instructions, when executed by a processor, cause the system to perform at least the following: monitor an acoustic environment to detect an intended addressee's availability for a speech prompt having a measure of urgency corresponding therewith; based on the intended addressee's availability, predict a time that is convenient to present the speech prompt to the intended addressee; and schedule the speech prompt based on the predicted time and the measure of urgency.
Embodiments have several advantages over prior approaches. Embodiments improve the situation of perceived impoliteness or rudeness that plagues traditional dialog systems by making the dialog system aware of ongoing conversations and by introducing “empathy” into the human machine conversation. Advantageously, a speech dialog system in accordance with an embodiment will be perceived as less annoying by the user. Further, prompts are more likely to be understood by the user. This can lead to a higher acceptance of the speech dialog system by the user. Also, this can increase the likelihood of successfully conveying the prompted information to the user.
Making human-machine communication as natural as possible has a high commercial potential because the feature of the dialog system's awareness of ongoing conversations is detectable to the end user and directly improves user experience.
The foregoing will be apparent from the following more particular description of example embodiments, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments.
A description of example embodiments follows.
Automatic speech recognition (ASR) systems typically are equipped with a signal preprocessor to cope with interference and noise, as described in WO2013/137900A1, entitled “User Dedicated Automatic Speech Recognition” and published Sep. 19, 2013. Often multiple microphones are used, e.g., microphones arranged in an array, particularly for distant talking interfaces where the speech enhancement algorithm is spatially steered towards the assumed direction of the speaker (beamforming). Consequently, interferences from other directions can be suppressed. This improves the ASR performance for the desired speaker, but decreases the ASR performance for others. Thus, the ASR performance depends on the spatial position of the speaker relative to the microphone array and on the steering direction of the beamforming algorithm.
Embodiments of the invention can include an improved system that employs advanced methods, including ASR and syntactic analysis, to predict good points in time when it is acceptable (e.g., “polite”) for the system to speak. Unlike the prior approach, the improved system listens at all times with a large vocabulary, not just for selected key words, similar to a “just talk” mode.
As illustrated in
The dialog system 200 may use audio information from the available microphone systems in the vehicle to schedule the speech prompt. In a simple case, the microphone system includes a single microphone, which may be located near the driver 202. Speech signal enhancement typically includes applying noise reduction to the detected audio signal from the microphone. Speech can be detected based on energy in the audio signal. For example, if the total energy in a time frame is above a background noise energy, a speech signal is considered to be detected in the time frame. In a more sophisticated setting, the system may focus on the tonal part of the detected audio signal to determine whether speech is present or not. The system may also use detection of fricatives in the detected audio signal as an indication that speech is present. When multiple microphones are available, for example, two microphones in an overhead console of the vehicle, the system may employ beamforming steered towards the driver 202 or toward the co-driver 204, depending on who is detected to be speaking. For example, the dialog system may have access to a signal that indicates a high value when the driver 202 is detected to be speaking and a low value when the driver is detected not to be speaking. Similarly, a speech activity signal may be available for the co-driver 204. The speech activity signal(s) can be used to detect dialog. The system can look for relative timing and other patterns among the speech activity signals of the driver and co-driver. Alternating patterns of speech activity can be indicative of dialog, and such information can be made available for further processing.
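The frame-energy detection described above can be sketched as follows. This is a minimal illustration, not the system's actual detector; the frame length, noise-floor estimate, and decibel margin are all illustrative assumptions.

```python
import numpy as np

def speech_activity(signal, frame_len=160, noise_margin_db=6.0):
    """Return one boolean per frame: True where the frame's energy
    exceeds the estimated background noise energy by a margin (in dB)."""
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    energy = np.sum(frames.astype(np.float64) ** 2, axis=1) + 1e-12
    # Crude noise-floor estimate: a low percentile of the frame energies,
    # assuming speech is absent in at least some frames.
    noise_energy = np.percentile(energy, 10)
    return 10.0 * np.log10(energy / noise_energy) > noise_margin_db
```

The resulting boolean sequence corresponds to the high/low speech activity signal described above; a per-seat version would be obtained by running the same logic on each beamformed channel.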
Tonal information of the detected audio signal can be used to predict when somebody who is speaking is about to stop talking. It is known from linguistics and psychology that humans use tonal and syntactic information to predict pauses in the speech of their counterpart, and these methods can be modeled based on computer analysis of the tonal qualities of the speech, as further described herein. This may allow the system to predict when it is a good time to interrupt and prompt. When multiple microphones are available, such as illustrated in
The camera 210 can be used to measure cognitive load based on observation of the driver 202. For example, the camera 210 can provide a video signal, which can be used to observe the driver 202 as the driver is operating the vehicle. In addition, other modalities of monitoring the driver may be available in the vehicle or may become available, such as heart rate monitoring or other physiological monitoring. For example, a wearable device on the driver, such as a smartwatch or fitness tracker, can provide such monitoring information. The information may be available to the dialog system through wireless connectivity, e.g., Bluetooth® technology, of the wearable device. The speech dialog system may consider a measure of cognitive load of the driver 202 in making a determination when to prompt and how to present information relevant to the driver when prompting.
With access to multiple microphones and speech signal enhancement (SSE), the system can determine if passengers 206, 208 are talking in the back seat as opposed to the driver 202 being in a conversation with the co-driver 204 or another passenger. If only passengers in the back seat are talking, the system may not want to wait to prompt the driver with important information. The SSE technology may also provide scene information of who is currently speaking based on voice biometrics and/or other available information. If the information indicates that the driver 202 is engaged in a conversation, the system may first call attention before delivering a prompt, to increase the likelihood that the prompt will not interrupt the ongoing conversation, that the driver will pay attention to the information being delivered, or both. The system may trade off (or weigh) perceived rudeness of the interruption against urgency of the information to be presented to the user. If the system cannot determine a good point in the conversation at the current time to present information, the system may choose to wait until a later time. However, if faced with a prompt having a high measure of urgency or if the urgency of a prompt increases to a certain threshold, the system may decide to interrupt the conversation, at the risk of being perceived as rude.
An advantage of a speech dialog system according to an embodiment of the present invention is that the system waits until a reasonable gap appears in the detected conversation between users. The system trades off urgency versus politeness in order to determine when to prompt and how to prompt. If it is possible to wait a moment, the prompt can be put in a queue until it is possible to prompt without interrupting any user. If interrupting an ongoing conversation cannot be avoided, the system can choose a polite way to first make the user aware of an important message to be prompted.
Speech Signal Enhancement (SSE) is typically applied as preprocessing for speech dialog systems. A prominent application of SSE is the automotive use case. An integral part of SSE is the detection of speech activity. This is true for both single- as well as multi-microphone systems. For multi-microphone SSE, it is possible to detect which passenger is currently speaking. This also allows for the detection of a conversation, e.g., between the driver and co-driver, or between the driver and another passenger. An SSE module may provide information about speech activity to a dialog manager so that the prompting behavior of the dialog system can be controlled accordingly. The dialog manager may consider the information about an ongoing dialog among the passengers in order to display a prompt only when none of the passengers are talking (by looking for gaps in the conversation, or predicting such gaps based on tonal and/or syntactic information). The prompts may be queued and scheduled according to their urgency, and, in particular, so as to not interrupt any detected speech in the vehicle. In case speech is detected and an urgent prompt is scheduled, the system may, for instance, ask for attention before prompting the scheduled message.
As illustrated in
As shown in
In general, the processor 320 is configured to predict a time that is convenient to present the speech prompt to the intended addressee based on the intended addressee's availability, and cause the scheduler 315 to schedule the speech prompt based on the predicted time and a measure of urgency. The processor 320 can be configured to predict the time that is convenient to present the speech prompt by estimating rudeness of interruption based on the pause prediction and dialog detection to generate a measure of rudeness. The processor 320 can schedule or cause the scheduler 315 to schedule the speech prompt by trading off the measure of urgency and the measure of rudeness. As further described herein, the measure of rudeness can be estimated using a cost function that includes cost for presence of an utterance, cost for presence of a conversation, and cost for involvement of the intended addressee in the conversation. Scheduling the speech prompt can include trading off the measure of urgency and the measure of rudeness. The trading off can include computing an urgency-rudeness ratio as the ratio of the measure of urgency, e.g., U(k), and the measure of rudeness, e.g., R(k). The prompt can be scheduled based on a comparison of the urgency-rudeness ratio to a threshold T.
The arrangement of the system illustrated in
A camera or computer vision (CV) software can be used to determine whether someone is speaking or not, and also to detect whether someone may be too distracted to listen.
Instead of just using SSE or voice activity detection (VAD) to find “speaking pauses,” the system can also employ automatic speech recognition (ASR) and natural language understanding (NLU) on what is spoken, parse what is spoken and predict good points in time when it is socially acceptable to interrupt. This can be based on the Transition Relevance Place (TRP) theory. Previously, TRP theory has been used for the reverse case, i.e., predicting when it is likely that users interrupt the system, as described in U.S. Pat. No. 9,026,443, which is incorporated herein by reference. For example, it is generally considered to be more acceptable to interrupt at the end of syntactic phrases or sentences than in the middle of such units. As described in U.S. Pat. No. 9,026,443, when a human listener wants to interrupt a human speaker in a person-to-person interaction, the listener tends to choose specific contextual locations in the speaker's speech to attempt to interrupt. People are skilled at predicting these Transition Relevance Places (TRPs). Cues that are used to predict such TRPs include syntax, pragmatics (utterance completeness), pauses and intonation patterns. Human listeners tend to use these TRPs to try to acceptably take over the next speaking turn, to avoid being seen as exhibiting “rude” behavior.
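The syntactic cue among the TRP cues above can be illustrated with a toy heuristic that checks whether a running ASR hypothesis ends at a plausible clause boundary. This is only a sketch under stated assumptions: a real system would combine parser state, prosody, and pause information, and the function name and punctuation set here are illustrative, not part of the described system.

```python
# Clause-final punctuation marks treated as likely TRP cues
# (an assumption; the ASR output is assumed to carry punctuation).
CLAUSE_FINAL = (".", "!", "?", ";")

def is_likely_trp(asr_hypothesis):
    """Return True if the partial transcript ends at a point where an
    interruption is more socially acceptable (end of clause or sentence),
    per the syntax cue of TRP theory."""
    text = asr_hypothesis.strip()
    return bool(text) and text.endswith(CLAUSE_FINAL)
```

A dialog manager could gate prompt delivery on such a flag in combination with the voice-activity-based gap detection described earlier.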
Based on the intended addressee's availability, a time is predicted 410 that is convenient to present the speech prompt to the intended addressee and the speech prompt is scheduled 415 based on the predicted time and the measure of urgency.
Example: Spatial Voice Activity Based Dialog Detection
It is assumed that SSE provides voice activity information for at least two speakers. The speakers are distinguished spatially (driver and passenger seat, for instance). The voice activity information is furthermore available on a frame basis (e.g., every 10 ms). In a first step, the frame-based speech activity information can be processed to remove short pauses and hence to provide coarse information about the presence of an utterance per speaker. Secondly, the “utterance present” information of all speakers is considered jointly in its temporal sequence. A dialog among two speakers can be detected based on an utterance transition from one speaker to another within a predefined amount of time. For example, an utterance from speaker 1 is followed by an utterance of speaker 2, where the gap between the two is no longer than, for instance, 3 seconds. This also includes simultaneous utterances of the two speakers. A transition back to speaker 1 is of course an indication that this dialog continues. Utterance transitions may also take place among several speakers, which may be used to monitor how many speakers are involved in the dialog. In particular, information is available on who is involved in the conversation. Generally speaking, conversations can be detected by tracking the temporal sequence of utterance transitions.
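The two steps above (pause smoothing, then transition tracking) can be sketched as follows. The function names, the 30-frame pause threshold, and the 300-frame (3 s at 10 ms) transition gap are illustrative assumptions consistent with the numbers in the text.

```python
def frames_to_utterances(activity, min_pause_frames=30):
    """Step 1: merge active frames separated by short pauses into
    (start, end) utterance intervals (end exclusive)."""
    utterances, start, last_active = [], None, None
    for i, active in enumerate(activity):
        if active:
            if start is None:
                start = i
            elif i - last_active > min_pause_frames:
                # Pause was long enough: close the previous utterance.
                utterances.append((start, last_active + 1))
                start = i
            last_active = i
    if start is not None:
        utterances.append((start, last_active + 1))
    return utterances

def dialog_detected(act1, act2, max_gap_frames=300):
    """Step 2: flag a dialog when an utterance of one speaker is followed
    by an utterance of the other within max_gap_frames (3 s at 10 ms)."""
    utts = [(s, e, spk) for spk, act in ((1, act1), (2, act2))
            for (s, e) in frames_to_utterances(act)]
    utts.sort()
    for (s1, e1, spk_a), (s2, e2, spk_b) in zip(utts, utts[1:]):
        # Overlapping utterances yield a negative gap, which also counts.
        if spk_a != spk_b and s2 - e1 <= max_gap_frames:
            return True
    return False
```

Extending the transition tracking to more than two activity channels would provide the multi-speaker involvement information mentioned above.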
Example: Measuring Rudeness of Interruption
To quantify how ‘rude’ it would be to interrupt speech as part of a conversation, or even without a detected conversation, a cost function can be used. This cost function can include:
A possible metric to combine these factors is:
The resulting value would also lie in the same interval [0 1] as all individual contributions. Values close to 1 indicate a high level of rudeness. The involvement of the prompt-addressee is “floored” to a minimum value α_I.
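Since the combining metric itself is not reproduced here, the following is a hedged reconstruction consistent only with the stated properties: each cost term lies in [0 1], the combined rudeness stays in [0 1], and the involvement term is floored at α_I. The product form and the default floor value are assumptions, not the disclosed metric.

```python
def rudeness(c_utterance, c_conversation, c_involvement, alpha_i=0.2):
    """Combine the three cost terms (each in [0, 1]) into a rudeness
    measure R(k) in [0, 1]. The involvement cost is floored at alpha_i
    so that detected speech is never entirely 'free' to interrupt,
    even when the addressee is not involved in the conversation."""
    floored_involvement = max(c_involvement, alpha_i)
    return c_utterance * c_conversation * floored_involvement
```

With this form, rudeness is zero when no utterance is present, and maximal when an utterance, a conversation, and full addressee involvement coincide.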
Example: Trading Off Rudeness vs Urgency
Given that the urgency U_n(k) of each scheduled prompt is available in the system, it can be traded off against the rudeness R_n(k). Note that U_n(k) is also speaker dependent. The urgency is also scaled between 0 and 1 to allow for a meaningful comparison with rudeness. The decision to display a prompt can be made based on requiring the Urgency-Rudeness Ratio to exceed some chosen threshold:
The threshold T can be used to adjust the “politeness” of the system. It may furthermore be considered to trigger a prompt only if the Urgency-Rudeness Ratio has exceeded the threshold for some time in order to achieve robustness.
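The thresholding with a hold time for robustness, as described above, can be sketched as a small gate. The class name, default threshold, hold length, and division-by-zero guard are illustrative assumptions.

```python
class PromptGate:
    """Release a queued prompt once the Urgency-Rudeness Ratio
    U_n(k) / R_n(k) has exceeded the threshold T for a minimum number
    of consecutive frames, as a robustness measure."""

    def __init__(self, threshold=1.0, hold_frames=3, eps=1e-6):
        self.threshold = threshold    # T: adjusts the system's "politeness"
        self.hold_frames = hold_frames
        self.eps = eps                # guard against zero rudeness
        self.count = 0                # consecutive frames above threshold

    def update(self, urgency, rudeness):
        """Feed one frame's urgency and rudeness; return True when the
        prompt may be presented."""
        ratio = urgency / max(rudeness, self.eps)
        self.count = self.count + 1 if ratio > self.threshold else 0
        return self.count >= self.hold_frames
```

Raising the threshold makes the system more polite (more willing to defer prompts), while a higher urgency eventually forces delivery even during detected speech, matching the trade-off described earlier.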
It should be understood that the example embodiments described above may be implemented in many different ways. In some instances, the various methods and machines described herein may each be implemented by a physical, virtual or hybrid general purpose or application specific computer having a central processor, memory, disk or other mass storage, communication interface(s), input/output (I/O) device(s), and other peripherals. The general purpose or application specific computer is transformed into the machines that execute the methods described above, for example, by loading software instructions into a data processor, and then causing execution of the instructions to carry out the functions described herein.
As is known in the art, such a computer may contain a system bus, where a bus is a set of hardware lines used for data transfer among the components of a computer or processing system. The bus or busses are essentially shared conduit(s) that connect different elements of the computer system, e.g., processor, disk storage, memory, input/output ports, network ports, etc., and that enable the transfer of information between the elements. One or more central processor units are attached to the system bus and provide for the execution of computer instructions. Also attached to the system bus are typically I/O device interfaces for connecting various input and output devices, e.g., keyboard, mouse, displays, printers, speakers, etc., to the computer. Network interface(s) allow the computer to connect to various other devices attached to a network. Memory provides volatile storage for computer software instructions and data used to implement an embodiment. Disk or other mass storage provides non-volatile storage for computer software instructions and data used to implement, for example, the various procedures described herein.
Embodiments may therefore typically be implemented in hardware, firmware, software, or any combination thereof.
In certain embodiments, the procedures, devices, and processes described herein constitute a computer program product, including a computer readable medium, e.g., a removable storage medium such as one or more DVD-ROM's, CD-ROM's, diskettes, tapes, etc., that provides at least a portion of the software instructions for the system. Such a computer program product can be installed by any suitable software installation procedure, as is well known in the art. In another embodiment, at least a portion of the software instructions may also be downloaded over a cable, communication and/or wireless connection.
Embodiments may also be implemented as instructions stored on a non-transitory machine-readable medium, which may be read and executed by one or more processors. A non-transitory machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine, e.g., a computing device. For example, a non-transitory machine-readable medium may include read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; and others.
Further, firmware, software, routines, or instructions may be described herein as performing certain actions and/or functions of the data processors. However, it should be appreciated that such descriptions contained herein are merely for convenience and that such actions in fact result from computing devices, processors, controllers, or other devices executing the firmware, software, routines, instructions, etc.
It also should be understood that the flow diagrams, block diagrams, and network diagrams may include more or fewer elements, be arranged differently, or be represented differently. But it further should be understood that certain implementations may dictate the block and network diagrams and the number of block and network diagrams illustrating the execution of the embodiments be implemented in a particular way.
Accordingly, further embodiments may also be implemented in a variety of computer architectures, physical, virtual, cloud computers, and/or some combination thereof, and, thus, the data processors described herein are intended for purposes of illustration only and not as a limitation of the embodiments.
The teachings of all patents, published applications and references cited herein are incorporated by reference in their entirety.
While example embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the embodiments encompassed by the appended claims.