The present invention relates generally to voice communications with machines. More particularly, the present invention relates to voice communication with a machine based on a guide containing input elements.
There are various ways of communicating with a machine such as a computer. Widely used methods include QWERTY keyboards and mice. A limitation of the QWERTY keyboard is that it does not readily accommodate non-Roman-alphabet languages because of the large number of character alternatives and variations.
Another way of communicating with a machine is by using voice utterances or commands. However, even with current advances in speech processing technologies, it remains a challenge to process voice utterances from different users with varying pronunciations, while catering for large vocabularies, with a high degree of accuracy. Further, speech recognition capability does not exist for several languages. Current speech recognition systems favor voice commands that are very distinct, and typically perform efficient voice recognition only when the pre-defined voice database is relatively small or when significant data collection has been carried out. Further, in many parts of the world, a significant proportion of the population is illiterate. Many of these people can only speak colloquially and often rely heavily on visual aids such as signs and pictures for communication. These limitations prevent a large group of people from benefiting from electronic devices and voice services in their daily lives. This is increasingly a problem as the use of such technologies becomes the norm in a progressive society.
Accordingly, there is a need to provide a simple alternative that allows users to interact with electronic devices using a substantially limited set of voice commands.
A system and method of voice communication with a machine are provided. The system includes a guide for containing at least one input element disposed in an arrangement, the arrangement having a coordinate system for locating the input element, and a processor for processing a user selection of the input element.
Embodiments of the invention are herein described, purely by way of example, with reference to the accompanying drawings, in which:
A system and method of voice communication with a machine according to embodiments of the invention are described hereinafter with reference to the accompanying drawings. The system and method enable users to communicate effectively with the machine by using voice utterances or commands to select an input element from a guide containing one or more input elements. The input elements can include alphabet characters, words, symbols, pictures, signs, computer control commands, and like ways of presenting information, and combinations thereof.
In an embodiment, the system and method use a relatively small number of voice commands (i.e. a small vocabulary) for selecting the input elements from the guide. The system includes a small list of pre-defined labels. The pre-defined labels are used as indices of a coordinate system for locating the input elements, which are arranged in a table or matrix in the guide. The pre-defined labels include a text form (typically used for display) and an audio form, wherein each text form label corresponds to an audio form label. The pre-defined labels can be in the form of colors, numbers, characters, words, images, symbols, and like easily recognized and distinguished forms of reference that can be represented as audio input.
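The paired text-form and audio-form labels can be sketched as a simple mapping. This is illustrative only: the label values and the `text_label_for` helper are hypothetical, and the audio forms are shown here as phoneme-like transcriptions standing in for actual audio data.

```python
# Hypothetical pre-defined label list: each text-form label (used for
# display) is paired with an audio-form label (a phoneme-like
# transcription standing in for stored audio features).
LABELS = {
    "1": "wahn",
    "2": "too",
    "3": "three",
    "RED": "rehd",
    "BLUE": "bloo",
}

def text_label_for(audio_form):
    """Return the text-form label whose audio form matches the input."""
    for text, audio in LABELS.items():
        if audio == audio_form:
            return text
    return None
```

A small, fixed label list like this is what keeps the required recognition vocabulary small regardless of how many input elements the guide presents.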
A method 100 of effective voice communication with a machine according to an embodiment is shown in
Step 104 involves receiving audio input of the coordinates from the user and decoding the indices of the coordinates to determine the input element the user desires to select. The process of decoding the indices involves comparing the audio characteristics of the indices against the pre-defined labels using an audio recognizer. Once the indices are determined, they are used as search parameters for identifying the selected input element according to a display data structure. The display data structure is created to keep track of the location of each of the input elements in the matrix, and stores those locations using either text form labels or audio form labels.
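The coordinate decoding of step 104 can be reduced to a lookup once the recognizer has resolved the uttered indices to labels. The data below is a minimal, hypothetical sketch of such a display data structure keyed by (column, row) text-form labels.

```python
# Illustrative display data structure: (column_label, row_label) -> element.
# The contents are hypothetical, not taken from the drawings.
display_data = {
    ("1", "1"): "A",
    ("2", "1"): "B",
    ("3", "2"): "R",
}

def decode_selection(column_index, row_index):
    """Return the input element at the uttered coordinates, or None."""
    return display_data.get((column_index, row_index))
```

Under this sketch, an utterance recognized as the indices ("3", "2") resolves directly to the element stored at that coordinate.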
Upon finding the desired input element, the selected input element can be buffered in a step 106 for further processing depending on the intended user application. The selected input element can also be output to the user as a feedback mechanism in step 108. In step 108, the selected input element can be output to a display or by playing back the audio of the selected input element or a combination of both.
An example of the input elements is shown in
In the above example, if the matrix is not large enough to accommodate all the possible input elements in one guide, a “next screen” element 206B (as seen in
In an embodiment, if a user is interested in seeking information relating to an input element, a new matrix containing information relating to the selected input element can be provided. For example, suppose the user is interested in words starting with the letter “R” 206C. Upon uttering the coordinates (3, 2), a new matrix can be displayed containing words starting with the letter “R”. The words displayed can also be accompanied by pictures and sounds for added information. Further, this feature is useful for composing text messages in written languages such as Hindi, Thai, and the like, where generic characters can be augmented with accent marks or post-character modifier strokes to form a complete word. Thus, the first or primary matrix can contain the generic characters, and the secondary matrix can contain enhanced forms or variations of the selected generic character.
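The primary/secondary matrix relationship described above can be sketched as nested lookups. The matrices and the variant strings here are hypothetical placeholders, not the actual character sets of any language.

```python
# Hypothetical primary matrix of generic characters; selecting one can
# trigger a secondary matrix of variants (e.g. accented/modified forms).
primary = {("3", "2"): "R"}
secondary = {
    "R": {("1", "1"): "RA", ("2", "1"): "RI", ("3", "1"): "RU"},
}

def select(matrix, coords):
    """Look up the element at the given (column, row) coordinates."""
    return matrix.get(coords)

element = select(primary, ("3", "2"))   # generic character "R"
variants = secondary.get(element, {})   # secondary matrix for "R"
word = select(variants, ("2", "1"))     # variant "RI"
```

The same `select` call serves every level, so chaining a third or fourth matrix only adds further entries to the mapping, not new selection logic.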
A further example can be seen in
It is noted that not every input element in the first matrix 200B needs to have a second matrix associated with it. Further, the secondary matrix can itself trigger a third matrix to be presented, the third matrix can trigger a fourth matrix, and so on. This cycle can be continued as needed depending on the user application.
In the above example, the display used for showing the input elements 206 to the user can be either an electronic display or a hardcopy material display such as a piece of paper, printed signboard, plastic sheet, metal plate, concrete, block of wood, or like material upon which information can be presented. Therefore, in the case of a hardcopy material display, the “next screen” element 206B as seen in
In another embodiment, a matrix 200D uses colors as the column-index 210 and row-index 216. Using colors as indices is beneficial for illiterate users, for users with limited knowledge of the language, such as tourists, and for young users who have yet to learn to read. Take, for example, a tourist in a foreign country looking for a hotel to stay in. The tourist can simply select the hotel input element 224 by uttering the coordinates of the hotel input element 224 in terms of colors. In this case, the coordinates are (BLUE 214, RED 220).
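Because the indices are just labels, the coordinate lookup is unchanged when colors are used: the recognizer maps the utterances “blue” and “red” to the text-form labels and the same search applies. The color matrix below is a hypothetical sketch, not the matrix 200D of the drawings.

```python
# Hypothetical color-indexed matrix: (column_color, row_color) -> element.
color_matrix = {
    ("BLUE", "RED"): "hotel",
    ("BLUE", "GREEN"): "restaurant",
}

def select_by_color(column_color, row_color):
    """Look up the input element at color coordinates (column, row)."""
    return color_matrix.get((column_color, row_color))
```

So the tourist's utterance, once recognized as the labels BLUE and RED, selects the hotel element without any text being read or typed.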
A system 300 for enabling voice communication with a machine according to an embodiment is shown in
The input processor 302 includes an audio recognizer 303 and a user selection processor 320. The audio recognizer 303 receives an audio input 301 from a user and processes the audio input 301 to provide a text equivalent, which is subsequently used by the user selection processor 320. For speech inputs, typically an utterance from the user, the audio recognizer 303 processes the input by matching it against the labels in audio form 308. Upon finding a match, the text equivalent of the speech input is obtained from the text form labels 307 and provided to the user selection processor 320 for further processing. Audio recognizers are known in the art; the operational details and components thereof are therefore not further described. Any number of variations and techniques of the audio recognizer 303 can be used.
The display processor 310 retrieves pre-defined input elements from an input element database 304 and arranges the input elements in a matrix on a display 312. The matrix includes a coordinate system having a column-index and a row-index. The column and row indices are pre-defined labels provided in the label database 306. Examples of different matrices are shown in
The display processor 310 also creates a display data structure 314 every time a matrix is generated for display. The display data structure 314 contains information about the matrix displayed. The information includes the labels used for the column and row indices, the input elements and the coordinates or position of each of the input elements in the matrix. The display data structure 314 stores the information in text form. In the case where colors or symbols are used as indices, the display data structure 314 contains the equivalent texts representing the colors and symbols used. The display data structure 314 is subsequently used by the user selection processor 320 for determining the input elements selected by the user.
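One plausible way the display processor 310 could populate such a display data structure is to assign a flat list of input elements to coordinates row by row. The function below is a hypothetical sketch under that assumption; the actual fill order and structure may differ.

```python
# Illustrative sketch: build a display data structure mapping
# (column_label, row_label) -> input element, filling row by row.
def build_display_data(elements, column_labels, row_labels):
    data = {}
    it = iter(elements)
    for row in row_labels:
        for col in column_labels:
            try:
                data[(col, row)] = next(it)
            except StopIteration:
                return data  # fewer elements than cells: stop early
    return data

# build_display_data(["A", "B", "C", "D"], ["1", "2"], ["1", "2"])
# → {("1","1"): "A", ("2","1"): "B", ("1","2"): "C", ("2","2"): "D"}
```

Where colors or symbols serve as indices, the `column_labels` and `row_labels` passed in would simply be their text equivalents (e.g. "BLUE"), matching the text-form storage described above.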
In an alternative embodiment, the display data structure 314 may store the information in audio form. Thus, if the labels used are words or phrases, the phonemes are stored, and if the labels used are sounds, the waveform features are stored. In this case, the audio recognizer simply passes the extracted phonemes or waveform features directly to the user selection processor 320 without first finding the text equivalent thereof.
The user selection processor 320 determines the input elements selected by the user by matching the inputs received from the audio recognizer 303 against the information in the display data structure 314. As described in the foregoing, the outputs received from the audio recognizer 303 can be in text form, in phonemes, or in waveform features, depending on which embodiment of the display data structure 314 is used. Where the display data structure 314 stores the information of the matrix using text, the user selection processor 320 matches the text received from the audio recognizer 303 with the text in the display data structure 314 to decipher the user-selected input elements. However, if the display data structure 314 stores the information of the matrix using audio, the user selection processor 320 matches the phonemes or waveform features received from the audio recognizer 303 with the phonemes or waveform features in the display data structure 314, respectively.
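Whichever form the structure stores, the matching step reduces to comparing the recognizer's coordinate outputs against the stored coordinate keys in the same form. The stores below are hypothetical sketches of the two embodiments (text form vs. phoneme form).

```python
# Sketch of the user selection processor's matching step: the same
# lookup works whether the structure stores text-form labels or
# audio-form (phoneme) labels, as long as recognizer output matches.
def match_selection(recognizer_output, data_structure):
    """Return the element whose stored coordinate labels match the
    recognizer output (a sequence of index labels), or None."""
    return data_structure.get(tuple(recognizer_output))

text_store = {("3", "2"): "R"}           # text-form embodiment
phoneme_store = {("three", "too"): "R"}  # audio-form embodiment

match_selection(["3", "2"], text_store)           # text-form match
match_selection(["three", "too"], phoneme_store)  # audio-form match
```

The design choice is symmetry: keeping recognizer output and stored keys in the same form means the selection processor never needs its own conversion step.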
The output from the user selection processor 320 is stored in a buffer 330 for further processing depending on the intended application. Further, the output from the user selection processor 320 can be displayed on the display 312 as feedback to the user.
In an alternative embodiment, a system 400 for enabling voice communication with a machine is shown in
The input processor 402 includes an audio recognizer 403 and a user selection processor 406. The audio recognizer 403 receives an audio input 401 from a user and processes the audio input 401 to provide a text equivalent, which is subsequently used by the user selection processor 406. For speech inputs, typically an utterance from the user, the audio recognizer 403 can extract phonemes from the speech input and match the phonemes with the labels in audio form 412. Alternatively, the audio recognizer 403 can translate the speech input into text, which is subsequently matched with the text form labels 412. Upon finding a match, the result is provided to the user selection processor 406 for further processing. Audio recognizers are known in the art; the operational details and components thereof are therefore not further described. Any number of variations and techniques of the audio recognizer 403 can be used.
The input guide 404 contains input elements, like the exemplary input elements shown in
The system 400 also includes at least an input data structure 414. The input data structure 414 is for containing information about the location of each of the input elements in the matrix in the input guide 404. Each input guide 404 has a corresponding input data structure 414. Similar to the display data structure 314 in
The user selection processor 406 determines the input elements selected by the user by matching the inputs received from the audio recognizer 403 against the information in the input data structure 414. As described in the foregoing, the outputs received from the audio recognizer 403 can be in text form, in phonemes, or in waveform features, depending on which embodiment of the input data structure 414 is used. Where the input data structure 414 stores the information of the matrix using text, the user selection processor 406 matches the text received from the audio recognizer 403 with the text in the input data structure 414 to decipher the user-selected input elements. However, if the input data structure 414 stores the information of the matrix using audio, the user selection processor 406 matches the phonemes or waveform features received from the audio recognizer 403 with the phonemes or waveform features in the input data structure 414, respectively.
The output from the user selection processor 406 is stored in a buffer 416 for further processing depending on the intended application. Further, the output from the user selection processor 406 can be presented back to the user as feedback in audio form through a speaker (not shown) coupled to the system 400.
In the foregoing, embodiments of the invention are described with reference to
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IN2005/000057 | 2/23/2005 | WO | 00 | 8/6/2007 |