The present invention relates generally to voice dialog systems.
Voice-based interaction with PCs and mobile devices is becoming more widespread, and these devices are able to accept longer and more complex speech expressions. However, when errors occur in speech input, the user is often not informed of the problem until he or she has completely finished speaking, and is not provided adequate information as to where the problem occurred. As a result, the user is often required to repeat long commands or to reword inputs, without adequate information as to where the misunderstanding occurred, in order to make the speech input understood.
In previous voice input systems, processing of a speech utterance begins only after the utterance is complete. These systems do not provide real-time feedback to the end user; instead, feedback and recovery from errors are addressed in terms of dialog, as described above. Adaptive natural language processing systems can change operation in response to inputs, but they do not address real-time feedback related to speech understanding. Although some systems are able to process speech inputs in real time, error recovery is still addressed only in terms of a response that follows a complete utterance, rather than through real-time feedback.
Other voice dialog systems use ‘barge-in’ responses, in which the system immediately interrupts the user when recognition processing fails. However, such systems are perceived by users as rude and do not enable normal dialog recovery processes, such as in-context repetition or rewording of commands.
Some voice output systems that use avatars synchronize motion of the avatar with the output speech, so as to give the impression that the avatar is producing the speech (as in a cartoon).
The accompanying figures, in which like reference numerals refer to identical or functionally similar elements throughout the separate views and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and to explain various principles and advantages all in accordance with the present invention.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
Before describing in detail embodiments that are in accordance with the present invention, it should be observed that the embodiments reside primarily in combinations of method steps and apparatus components related to the provision of visual feedback to a user of a voice dialog system. Accordingly, the apparatus components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated that embodiments of the invention described herein may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions related to the provision of visual feedback to a user of a voice dialog system described herein. The non-processor circuits may include, but are not limited to, audio input and output devices, signal conditioning circuits, graphical display units, etc. As such, these functions may be interpreted as a method to provide visual feedback to the user. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used. Thus, methods and means for these functions have been described herein. Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions and programs and ICs with minimal experimentation.
Multiple avatars, or the parameters that specify them, may be stored in a database or memory 116. The avatars may be selected according to the voice dialog application being executed or according to the user of the voice dialog system.
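By way of non-limiting illustration, selection of an avatar from memory 116 might resemble the following Python sketch, in which the store is modeled as a simple in-memory dictionary keyed by user and application; the key names, fallback order, and generic default avatar are assumptions introduced here for illustration, not features of the disclosure.

    # Hypothetical avatar store keyed by (user_id, application_id); a real
    # system might hold mesh, texture, or animation parameters rather than
    # simple name strings.
    AVATAR_STORE = {
        ("alice", "navigation"): "alice_nav_avatar",
        ("alice", None): "alice_default_avatar",   # user-level personalization
        (None, "navigation"): "nav_app_avatar",    # application-level default
    }

    def select_avatar(user_id, application_id):
        """Prefer a user+application avatar, then a per-user avatar, then a
        per-application avatar, then a generic fallback."""
        for key in ((user_id, application_id), (user_id, None),
                    (None, application_id)):
            if key in AVATAR_STORE:
                return AVATAR_STORE[key]
        return "generic_avatar"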
Facial expressions and/or body poses (body expressions) depicted by the avatar are used to communicate, in real time, the ongoing status of the voice input processing. For example, different facial expressions and/or body poses may be used to indicate: (1) that the system is in an idle mode, (2) that the system is actively listening to the user, (3) that processing is proceeding normally and with adequate confidence levels, (4) that processing is proceeding but confidence measures are below some system threshold—a potential misunderstanding has occurred, (5) that errors have been detected and the user should perform a recovery process, such as repeating the voice utterance, (6) that a non-recoverable failure has occurred, or (7) that the system desires an opportunity to respond to the user.
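By way of non-limiting illustration, the seven statuses enumerated above might be represented in software as an enumeration mapped to named avatar expressions, as in the following Python sketch; the expression names are hypothetical, and a practical system would map each status to rendering parameters rather than strings.

    from enum import Enum, auto

    class DialogStatus(Enum):
        """Processing statuses the avatar may communicate ((1)-(7) above)."""
        IDLE = auto()                       # (1) idle mode
        LISTENING = auto()                  # (2) actively listening
        PROCESSING_OK = auto()              # (3) confidence above threshold
        PROCESSING_LOW_CONFIDENCE = auto()  # (4) potential misunderstanding
        RECOVERABLE_ERROR = auto()          # (5) user should repeat or reword
        NON_RECOVERABLE_FAILURE = auto()    # (6) cannot recover
        RESPONSE_REQUESTED = auto()         # (7) system wishes to respond

    # Hypothetical mapping from status to a named expression/body pose.
    EXPRESSIONS = {
        DialogStatus.IDLE: "neutral_resting",
        DialogStatus.LISTENING: "attentive_lean_forward",
        DialogStatus.PROCESSING_OK: "nodding",
        DialogStatus.PROCESSING_LOW_CONFIDENCE: "puzzled_brow",
        DialogStatus.RECOVERABLE_ERROR: "hand_cupped_to_ear",
        DialogStatus.NON_RECOVERABLE_FAILURE: "apologetic_shrug",
        DialogStatus.RESPONSE_REQUESTED: "raised_hand",
    }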
The apparatus provides real-time feedback to a user regarding speech recognition in a natural, user-friendly manner. This feedback may be provided both when a new state is entered and within a state. Feedback within a state may be used to convey a sense of ‘liveness’ or ‘awareness’ to the user. This may be advantageous during a listening state, for example.
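One possible realization of such within-state feedback, sketched below under the assumption of a timer-driven animation loop with hypothetical gesture names and a hypothetical avatar rendering interface, periodically plays a small gesture while the listening state persists:

    import random
    import time

    LIVENESS_GESTURES = ["blink", "slight_head_tilt", "eyebrow_raise"]

    def run_liveness_loop(avatar, is_listening, period_s=2.0):
        """While the dialog system remains in the listening state,
        periodically play a small gesture so the avatar appears
        attentive rather than frozen."""
        while is_listening():
            avatar.play(random.choice(LIVENESS_GESTURES))
            time.sleep(period_s)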
In some embodiments, an avatar that gives an appearance of listening is used so as to make voice interaction with a dialog system more comfortable for a user.
In some embodiments, different avatars are used to represent different applications that the user is interacting with, or to indicate personalization (i.e., each individual user has their own avatar). The presence of a user's personal avatar indicates that the voice dialog system recognizes who is speaking and that personalization factors may be used. Data representing the different avatars may be stored in a computer memory, for example.
The avatar provides rapid feedback to the user without interfering with the flow of speech, and helps to mediate dialog flow. This can avoid the need for a user to repeat long segments of speech. Additionally, the avatar enables the user to easily discern whether the dialog system is attentive.
The dialog status data may be representative, for example, of an idle state, an actively listening state, a normal operation state (with confidence above a threshold), a normal operation state (with confidence below a threshold), a recoverable error state, a non-recoverable failure state, or a system-response-requested state.
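A minimal sketch of how such dialog status data might be derived from recognizer output is given below; it reuses the DialogStatus enumeration from the earlier sketch and assumes, for illustration only, a recognizer result that exposes boolean error flags and a confidence score in [0, 1] compared against a system-defined threshold.

    CONFIDENCE_THRESHOLD = 0.6  # assumed threshold, not part of the disclosure

    def status_from_recognizer(result):
        """Map an in-progress recognition result to a DialogStatus (defined
        in the earlier sketch). `result` is assumed to expose boolean .error
        and .fatal flags and a .confidence score in [0, 1]."""
        if result.error:
            return (DialogStatus.NON_RECOVERABLE_FAILURE if result.fatal
                    else DialogStatus.RECOVERABLE_ERROR)
        if result.confidence >= CONFIDENCE_THRESHOLD:
            return DialogStatus.PROCESSING_OK
        return DialogStatus.PROCESSING_LOW_CONFIDENCE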
The avatar expressions may be designed to correspond to those that commonly mediate person-to-person conversation.
In the foregoing specification, specific embodiments of the present invention have been described. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all of the claims. The invention is defined solely by the appended claims, including any amendments made during the pendency of this application and all equivalents of those claims as issued.