The present invention relates to the field of multimodal devices and, more particularly, to systems and methods for prompting and cueing user input for multimodal interfaces using speech input.
Computing devices containing multimodal interfaces have been proliferating. A multimodal interface as used herein refers to an interface that includes both voice processing and visual presentation capabilities. For example, numerous cellular telephones can include a graphical user interface and be capable of responding to speech commands and other speech input. Other multimodal devices can include personal data assistants, notebook computers, video telephones, teleconferencing devices, vehicle navigation devices, and the like.
Traditional methods for vocally interacting with multimodal devices typically involve first audibly or textually prompting a user for speech input. Responsive to this prompting, the device receives a requested speech input. Next, an audible or textual confirmation of the speech input can be presented to the user. Such interactions are typically slow due to the need of such methods to serially relay messages between the user and the multimodal devices. The inefficiency of these methods of prompting and confirmation can result in considerable user frustration and dissatisfaction.
Such interactions, typical of conventional systems, fail to take advantage of the capabilities of visual displays in multimodal devices to provide alternative approaches to prompt or cue multi-token speech input for speech recognition purposes. Accordingly, there is a need for systems and methods utilized with multimodal devices that enable such devices to use the capabilities of visual interfaces advantageously to provide simple, efficient, and accurate mechanisms for prompting and cueing users to provide multi-token input speech.
The present invention is directed to systems and methods for providing more efficient and accurate prompting for multi-token speech from users of devices using multimodal interfaces. The present invention utilizes an enhanced visual interface which allows minimal prompting for users to provide multi-token input speech. The user can then interact with the multimodal interface accordingly, relying on the visual cues, rather than direct prompting, to construct a multi-token speech input for multiple fields.
In a first embodiment of the invention, a method for prompting user input for a multimodal interface includes providing the multimodal interface to a user, where the interface includes a visual interface having a plurality of input regions, where each input region includes at least one input field. The method also includes selecting one of the input regions and processing a multi-token speech input provided by the user, where the processed speech input includes at least one value for an input field of the selected input region. The method also includes storing the one value in the one input field.
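By way of illustration only, the arrangement of input regions and input fields recited above might be modeled as in the following Python sketch. The class names, the field names, and the dictionary of recognized values are hypothetical and are not part of the disclosure; the sketch assumes that speech recognition has already reduced a multi-token speech input to field/value pairs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InputRegion:
    """A group of related input fields, e.g. one block of a form."""
    name: str
    # Field name -> stored value; None marks a field that has no value yet.
    fields: Dict[str, Optional[str]] = field(default_factory=dict)

    def store(self, values: Dict[str, str]) -> None:
        """Store recognized values, ignoring names that are not in this region."""
        for field_name, value in values.items():
            if field_name in self.fields:
                self.fields[field_name] = value

    def incomplete_fields(self) -> List[str]:
        """Return the names of the fields that still lack a value."""
        return [n for n, v in self.fields.items() if v is None]


@dataclass
class VisualInterface:
    """A visual interface holding several input regions, one of them active."""
    regions: Dict[str, InputRegion]
    active: Optional[str] = None

    def select(self, region_name: str) -> InputRegion:
        """Make the named region active and return it."""
        region = self.regions[region_name]
        self.active = region_name
        return region


# Hypothetical form with two input regions of two input fields each.
ui = VisualInterface(regions={
    "departure": InputRegion("departure", {"city": None, "date": None}),
    "arrival": InputRegion("arrival", {"city": None, "date": None}),
})
region = ui.select("departure")
# Values as a recognizer might return them for a single multi-token utterance.
region.store({"city": "Boston", "date": "May 5"})
```

In this sketch a single call to `store` fills every field of the selected region that the utterance supplied, which corresponds to storing the values of one multi-token speech input.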
In a second embodiment of the invention, a system for prompting voice interactions for a multimodal interface is provided. The system includes a controller element that generates the multimodal interface, where the interface comprises a visual interface having a plurality of input regions, and wherein each input region includes at least one input field. The controller element can also select one of the input regions and process a multi-token speech input received by the controller element, where the processed speech input includes at least one value for at least one input field of the selected input region. The controller element can also store at least one value in at least one input field.
In a third embodiment of the invention, a computer-readable storage is provided. The storage has stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to provide a multimodal interface to a user, where the interface comprises a visual interface having a plurality of input regions, and where each input region comprises at least one input field. The code sections can also cause the machine to select one of the input regions and process a multi-token speech input provided by the user, where the processed speech input comprises at least one value for at least one input field of the selected input region. The code sections can also cause the machine to store at least one value in at least one input field.
In some embodiments, the user can be prompted to provide a multi-token speech input. The prompt can comprise visually indicating within the multimodal interface the selected input region. Alternatively, the prompt can comprise visually indicating within the multimodal interface at least one incomplete input field of the selected input region. Additionally, the prompt can comprise audibly indicating the selected input region.
In other embodiments, the selected input region can be determined by processing a selection speech input from the user. Alternatively, the selected input region can be chosen from among one or more of the plurality of input regions having at least one incomplete input field.
There are shown in the drawings, embodiments which are presently preferred, it being understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.
a) depicts an exemplary user interface layout, according to another embodiment of the present invention.
b) depicts an exemplary user interface operating according to still another embodiment of the present invention.
c) depicts another exemplary user interface operating according to yet another embodiment of the present invention.
With reference now to the various figures in which like elements are identically numbered throughout, a description of the various embodiments of the invention will now be provided. While the invention is disclosed in the context of a limited number of arrangements, it can be appreciated that the invention can include numerous modifications from the presented embodiments.
The invention disclosed herein provides alternative methods and systems for prompting and cueing voice interactions via a visual interface of a multimodal device. The voice interactions can occur via a multimodal interface that includes a speech recognition engine for receiving speech input and using one or more visual indicia within the multimodal interface to signal a user as to the contents of speech input currently requested by the multimodal device.
For example, a system 100 can be a multimodal computing device such as, but not limited to, a personal data assistant (PDA) device equipped with a microphone, a cellular telephone device with a display screen, a computing tablet device, a personal computer device, or a vehicle navigation system device. The system 100 can also be a multimodal computing device comprising a series of connected computing devices with both audio and visual capabilities, such as telephony devices communicatively linked to a liquid crystal display, or teleconference devices linked to a television or other visual presentation device. Additionally, the speech recognition capabilities of the system 100 can be remotely or locally located. For example, the speech recognition capabilities for a cellular telephone can reside within a voice recognition network element that is communicatively linked to the cellular telephone. Alternatively, a personal data assistant can have embedded speech recognition capabilities.
The system 100 described herein and as shown in the schematic
As illustrated in
The system 100 can also comprise one or more manual input devices 114 coupled to the processing element 102, including, but not limited to, touch-screen devices, keyboards or other direct input devices, and pointing devices, to allow a user to manually input information. Additionally, the system 100 can also comprise a microphone or other speech input device 116, coupled to a speech recognition element 118, which in turn is coupled to the processing element 102, to accept user speech input and to process the user speech input to discern one or more words for the processing element 102 to use as values for input fields defined in an interface.
In conventional computing devices, values for input fields, such as in an electronic form, are typically entered using an input device, such as a keyboard or pointing device. For multimodal devices, the typical entry method disclosed in the prior art is for the multimodal device to provide audible or textual prompts to guide users in providing the necessary input speech for the device. One method has been to prompt a user for a single speech input or token, on an input field by input field basis, which typically results in a long and tedious entry of multiple fields. Another method has been to allow entry to be completed using a multi-token entry, allowing the entry of several fields at the same time. However, in such multi-token speech approaches, as the number of items being entered into the device increases, a more thorough prompt is generally required, which makes the prompt longer and again more tedious for the user. In addition, as the prompt becomes longer, a distracted user may fail to provide a complete or proper response. Additionally, even when a textual prompt is provided, methods in the prior art generally still require a field by field approach or require a separate input interface.
However, the present invention does not rely on the generation of complicated audible or textual cues to cue a user as to how to provide an appropriate multi-token input to generate values for input fields of a visual interface, such as a form as illustrated in
A visual user interface 202 for use in the present invention, as illustrated in
The present invention can be further used to identify an active input region 210 of the visual interface to the user. An active input region can include one or more of the input fields 204. The processing element 102 and/or the speech recognition element 118 can then be configured to expect a multi-token speech input based on the input fields 204 of the active input region 210. For example, as shown in
Therefore, by using the additional indicia rather than just a textual or audible prompting for one or more fields, the present invention provides an enhanced visual interface 202 which provides visual multi-token speech cues to the user, as illustrated by the visual interface in
Additionally, because the input fields 204 required for a multi-token speech input are already visually identified in the interface 202, a simpler audible or textual prompting of the user to provide a speech input for the active input region 210 can also be provided. For example, in the illustrated interface in
Additionally, the interface 202 can be further configured to cue a user to enter information for incomplete input fields 212 by highlighting one or more of the incomplete input fields 212 existing in an input region 206, as shown in the exemplary interface in
Together, highlighting incomplete input fields 212 and/or visually defining active input regions 210 of an interface 202 can be used to provide alternate methods for prompting and cueing users to provide multi-token speech input during an interaction with a multimodal device, such as in the exemplary method 300 for the system 100, as shown in
A speech input device 116 can then be configured to receive a speech input from the user at step 304. The speech input provided by the user can be a multi-token entry comprising one or more words which can define values for the various input fields 204 of the active input region 210. Upon receipt of a user speech input, a speech recognition element 118 can then be configured to process the input speech received according to at least one grammar and provide the output to the processing element 102. Although a single grammar for the entire interface 202 can be provided, a different grammar for each input region 206, 208 can be provided to allow more efficient processing of a received multi-token speech input. Using the grammar appropriate for the active input region 210, the speech recognition element 118 can then be used to identify any values included in the multi-token speech input. However, a multi-token speech input can also be processed based on at least one additional grammar. For example, an additional grammar used can be one for the processing of user speech input to identify an input region 206, 208 of the visual interface 202. In such embodiments, the processing element 102 can be configured to determine, based on the output of the speech recognition element 118, at step 306, whether the input speech comprises a multi-token speech input for the active input region 210 or an identification of an input region 206, 208 the user wishes to switch to.
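By way of illustration only, the determination at step 306 might be sketched in Python as follows. The sketch assumes that the speech input has already been transcribed to text, and it stands in for the grammars with regular expressions having one named group per input field. The region names, the phrase patterns, and the function name `classify` are hypothetical; an actual speech recognition element would use its own grammar format.

```python
import re
from typing import Dict, Tuple

# Hypothetical per-region grammars: one named group per input field.
REGION_GRAMMARS: Dict[str, "re.Pattern[str]"] = {
    "departure": re.compile(r"from (?P<city>\w+) on (?P<date>\w+ \d+)"),
    "arrival": re.compile(r"to (?P<city>\w+) on (?P<date>\w+ \d+)"),
}

# Additional grammar that recognizes a request to switch input regions.
SWITCH_GRAMMAR = re.compile(r"(?:go to|switch to) (?P<region>\w+)")


def classify(text: str, active_region: str) -> Tuple[str, Dict[str, str]]:
    """Decide whether an utterance is a switch request or field values.

    Returns ("switch", {"region": name}), ("values", {field: value, ...}),
    or ("unrecognized", {}) when neither grammar matches.
    """
    text = text.lower().strip()
    switch = SWITCH_GRAMMAR.fullmatch(text)
    if switch and switch.group("region") in REGION_GRAMMARS:
        return "switch", {"region": switch.group("region")}
    match = REGION_GRAMMARS[active_region].fullmatch(text)
    if match:
        return "values", match.groupdict()
    return "unrecognized", {}


values_result = classify("From Boston on May 5", "departure")
switch_result = classify("switch to arrival", "departure")
```

Because the text is lowercased before matching, the values returned in this sketch are lowercase; restoring capitalization is left out for brevity.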
If the processing element 102 determines that the input speech is a switch request then at step 308, the processing element 102 identifies which input region 206, 208 is being selected by the user input speech and instructs the display device 106 to display as active the newly identified input region, step 310. Afterwards, the system 100 can continue to wait for new user speech input to be received at step 304, which can be subsequently processed based on the grammar for the newly active input region 210. In some embodiments, the processing element 102 can further be configured to provide a textual or audible prompt to identify the newly active input region to the user.
However, if the processing element 102 instead determines that the speech input is a multi-token speech input for the active input region 210 then at step 312, the processing element 102 can use the processed input speech from the speech recognition element 118 and store values for the input fields 204 in the active input region 210 at step 314. However, if the processing element 102 determines that the input speech comprises neither a switch request nor a multi-token input for the active input region 210, the system 100 can be configured to continue to wait for further speech input at step 304. Additionally, in some embodiments, if the processing element 102 determines that the input speech comprises neither a switch request nor a multi-token input for the active input region 210, or if the speech recognition element 118 is otherwise unable to process the user input speech, a grammar associated with one or more other input regions 206, 208 can be used to process the input speech instead. In such embodiments, if the processing element 102 determines, based on the output of the speech recognition element 118, that a grammar associated with another input region 206, 208, other than the active input region 210, is appropriate, the processing element 102 can be configured to automatically make the other input region 206, 208 active and the speech recognition element 118 can process the input speech according to the grammar of the newly active input region 210 instead.
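By way of illustration only, the fallback to the grammars of other input regions might be sketched in Python as follows. As an assumption of the sketch, the speech input is taken to be already transcribed to text and each grammar is represented by a regular expression; the region names, the patterns, and the function name `resolve_region` are hypothetical.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical grammars, one per input region, with a named group per field.
GRAMMARS: Dict[str, "re.Pattern[str]"] = {
    "departure": re.compile(r"from (?P<city>\w+) on (?P<date>\w+ \d+)"),
    "arrival": re.compile(r"to (?P<city>\w+) on (?P<date>\w+ \d+)"),
}


def resolve_region(
    text: str, active: str
) -> Tuple[str, Optional[Dict[str, str]]]:
    """Match an utterance against the active region first, then the others.

    Returns the region whose grammar matched together with the field values,
    so the caller can make that region active. If no grammar matches, the
    active region is returned unchanged with None in place of the values.
    """
    text = text.lower().strip()
    # Active region first, then the remaining regions in declaration order.
    order = [active] + [name for name in GRAMMARS if name != active]
    for name in order:
        match = GRAMMARS[name].fullmatch(text)
        if match:
            return name, match.groupdict()
    return active, None


# The utterance fits the "arrival" grammar although "departure" is active.
new_active, values = resolve_region("to Denver on June 12", "departure")
```

Trying the active region before any other keeps the common case cheap and only widens the search when the active grammar fails, which is one possible reading of the behavior described above.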
After the processing element 102 has stored one or more values for the input fields 204 of the active input region 210 based on the input speech, the processing element 102 can be configured to determine, at step 316, whether any input fields 204 of the active input region 210 still require an entry. In such embodiments, the absence or existence of incomplete input fields 212 can be used to signal the device to make active a different input region 206, 208 or to prompt the user for more information for the currently active input region 210. As illustrated, when the processing element 102 determines that there are missing values for one or more input fields 204 at step 318, the processing element 102 can configure the interface, at step 320, to highlight or identify only incomplete input fields 212, as illustrated in
However, if the processing element 102 determines in step 318 that there are no incomplete input fields 212 in the active input region 210, the processing element 102 can be configured to make active a different input region 206, 208 of the interface 202. In some embodiments, the processing element 102 can be configured to prompt the user to identify a next input region 206, 208 to make active. In other embodiments, as illustrated in the exemplary method 300 at step 322, the device can be configured to review all the input regions 206, 208 of the interface 202 and to consider, as the next input region 206, 208 to make active, only input regions still having incomplete input fields 212. If the processing element 102 can identify at least one other input region having one or more incomplete input fields 212 at step 324, then the processing element 102 can be configured, at step 326, to select the associated input region 206 to make active. Additionally, the processing element 102 can instruct the display device 106 to provide indicia to identify in the interface 202 one or more incomplete input fields 212, as previously discussed. The system 100 can then wait for further speech input from the user at step 304 to provide the additional information. If no incomplete input fields are found in step 324, the interaction necessary for the visual interface 202 is complete (step 328) and the system 100 can continue with its operation.
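By way of illustration only, the decisions made at steps 316 through 328 might be sketched in Python as follows. The nested-dictionary representation of the form, the action labels, and the function names are hypothetical, and a field is assumed to be incomplete when its value is `None`.

```python
from typing import Dict, List, Optional, Tuple

# Region name -> field name -> stored value (None means no value yet).
Form = Dict[str, Dict[str, Optional[str]]]


def incomplete_fields(form: Form, region: str) -> List[str]:
    """Return the names of fields in a region that still lack a value."""
    return [name for name, value in form[region].items() if value is None]


def next_action(form: Form, active: str) -> Tuple[str, Optional[str], List[str]]:
    """Choose what to do after values have been stored for the active region.

    Returns one of:
      ("highlight", active, fields)  the active region still has gaps
      ("activate", region, fields)   another region has gaps; make it active
      ("complete", None, [])         every field of every region is filled
    """
    missing = incomplete_fields(form, active)
    if missing:
        return "highlight", active, missing
    # Review all regions and consider only those with incomplete fields.
    for region in form:
        missing = incomplete_fields(form, region)
        if missing:
            return "activate", region, missing
    return "complete", None, []


# Hypothetical form state after one utterance filled only the city.
form: Form = {
    "departure": {"city": "Boston", "date": None},
    "arrival": {"city": None, "date": None},
}
first = next_action(form, "departure")

form["departure"]["date"] = "May 5"
second = next_action(form, "departure")

form["arrival"] = {"city": "Denver", "date": "June 12"}
third = next_action(form, "arrival")
```

In this sketch the next region is simply the first one, in declaration order, that still has an incomplete field; other orderings would serve equally well.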
The present invention can be realized in hardware, software, or a combination of hardware and software. The present invention can be realized in a centralized fashion in one computer system or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system or other apparatus adapted for carrying out the methods described herein is suited. A typical combination of hardware and software may be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.
The present invention also can be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which when loaded in a computer system is able to carry out these methods. Computer program in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.
This invention can be embodied in other forms without departing from the spirit or essential attributes thereof. Accordingly, reference should be made to the following claims, rather than to the foregoing specification, as indicating the scope of the invention.
Number | Date | Country |
---|---|---|
20080162143 A1 | Jul 2008 | US |