The present invention relates to providing input into a computing device. More particularly, the present invention relates to a multimodal method of providing input that includes speech recognition and identification of desired input from a set of alternatives to improve efficiency.
Small computing devices such as personal information manager (PIM) devices and portable phones are used with ever increasing frequency by people in their day-to-day activities. With the increase in processing power now available for microprocessors used to run these devices, the functionality of these devices is increasing, and in some cases, merging. For instance, many portable phones can now be used to access and browse the Internet as well as to store personal information such as addresses, phone numbers and the like.
Because these computing devices are being used for an ever increasing number of tasks, it is necessary to enter information into the computing device easily and efficiently. Unfortunately, due to the desire to keep these devices as small as possible so that they are easily carried, conventional keyboards having all the letters of the alphabet as isolated buttons are usually not possible due to the limited surface area available on the housings of the computing devices. Likewise, handwriting recognition requires a pad or display having an area convenient for entry of characters, which can increase the overall size of the computing device. Moreover, handwriting recognition is generally a slow input methodology.
There is thus an ongoing need to improve upon the manner in which data, commands and the like are entered into computing devices. Such improvements would allow convenient data entry for small computing devices such as PIMs, telephones and the like, and can further be useful in other computing devices such as personal computers, televisions, etc.
A method and system for providing input into a computer includes receiving input speech from a user and providing data corresponding to the input speech. The data is used to search a collection of phrases and identify one or more phrases from the collection having a relation to the data. The one or more phrases are visually rendered to the user. An indication is received of a selection from the user of one of the phrases and the selected phrase is provided to an application operating on the computing device.
The combined use of speech input and selection of visually rendered possible phrases provides an efficient method for users to access information, particularly on a mobile computing device where hand manipulated input devices are difficult to implement. By allowing the user to provide an audible search query, the user can quickly provide search terms, which can be used to search a comprehensive collection of possible phrases the user would like to input. In addition, since the user can easily scan a visually rendered list of possible phrases, the user can quickly find the desired phrase, and using for example a pointing device, select the phrase that is then used as input for an application executing on the computing device.
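The input cycle described above can be sketched as a short program. The following is a minimal illustrative sketch only; all function and parameter names are hypothetical and not part of the described system:

```python
# Sketch of the speech-query / visual-selection input cycle: recognize
# speech, search a phrase collection, render alternatives, accept the
# user's selection, and hand the selected phrase to an application.
# Every callable here is a stand-in supplied by the caller.

def multimodal_input(recognize, search_index, render_list, get_selection, app):
    """Run one speech-to-selection input cycle and return the chosen phrase."""
    speech_data = recognize()               # data corresponding to input speech
    candidates = search_index(speech_data)  # phrases related to the query
    render_list(candidates)                 # visually render the alternatives
    choice = get_selection(candidates)      # user indicates the desired phrase
    app(choice)                             # selected phrase goes to the app
    return choice
```

In use, `recognize` might wrap a speech recognizer, `search_index` a phonetic-lattice search, and `app` a form field's insert routine.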
Before describing aspects of the present invention, it may be useful to describe generally computing devices that can incorporate and benefit from these aspects. Referring now to
An exemplary form of a data management mobile device 30 is illustrated in
Referring now to
RAM 54 also serves as storage for the code in a manner analogous to the function of a hard drive on a PC that is used to store application programs. It should be noted that although non-volatile memory is used for storing the code, it alternatively can be stored in volatile memory that is not used for execution of the code.
Wireless signals can be transmitted/received by the mobile device through a wireless transceiver 52, which is coupled to CPU 50. An optional communication interface 60 can also be provided for downloading data directly from a computer (e.g., desktop computer), or from a wired network, if desired. Accordingly, interface 60 can comprise various forms of communication devices, for example, an infrared link, modem, a network card, or the like.
Mobile device 30 includes a microphone 29, an analog-to-digital (A/D) converter 37, and an optional recognition program (speech, DTMF, handwriting, gesture or computer vision) stored in store 54. By way of example, in response to audible information, instructions or commands from a user of device 30, microphone 29 provides speech signals, which are digitized by A/D converter 37. The speech recognition program can perform normalization and/or feature extraction functions on the digitized speech signals to obtain intermediate speech recognition results. Speech recognition can be performed on mobile device 30; alternatively, using wireless transceiver 52 or communication interface 60, speech data can be transmitted to a remote recognition server 200 over a local or wide area network, including the Internet as illustrated in
In addition to the portable or mobile computing devices described above, it should also be understood that the present invention can be used with numerous other computing devices such as a general desktop computer. For instance, the present invention will allow a user with limited physical abilities to input or enter text into a computer or other computing device when other conventional input devices, such as a full alpha-numeric keyboard, are too difficult to operate.
The invention is also operational with numerous other general purpose or special purpose computing systems, environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, regular telephones (without any screen), personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The following is a brief description of a general purpose computer 120 illustrated in
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices. Tasks performed by the programs and modules are described below and with the aid of figures. Those skilled in the art can implement the description and figures provided herein as processor executable instructions, which can be written on any form of a computer readable medium.
With reference to
Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 150 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 151 and random access memory (RAM) 152. A basic input/output system 153 (BIOS), containing the basic routines that help to transfer information between elements within computer 120, such as during start-up, is typically stored in ROM 151. RAM 152 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 140. By way of example, and not limitation,
The computer 120 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 120 through input devices such as a keyboard 182, a microphone 183, and a pointing device 181, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 140 through a user input interface 180 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 184 or other type of display device is also connected to the system bus 141 via an interface, such as a video interface 185. In addition to the monitor, computers may also include other peripheral output devices such as speakers 187 and printer 186, which may be connected through an output peripheral interface 188.
The computer 120 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 194. The remote computer 194 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 120. The logical connections depicted in
When used in a LAN networking environment, the computer 120 is connected to the LAN 191 through a network interface or adapter 190. When used in a WAN networking environment, the computer 120 typically includes a modem 192 or other means for establishing communications over the WAN 193, such as the Internet. The modem 192, which may be internal or external, may be connected to the system bus 141 via the user input interface 180, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 120, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Many known techniques of information retrieval can be used. In one embodiment, to accelerate the retrieval process, an index 220 of information to be searched and possibly retrieved is created. For instance, the index 220 can be based on content 222 available on the computing device (e.g. addresses, appointments, e-mail messages, etc.) as well as input 224 otherwise manually entered into the computing device, herein mobile device 30. Although index 220 is illustrated as functioning for both content 222 and input 224, it should be understood that separate indexes can be provided if desired. The use of separate indexes or an index 220 adapted to reference information based on categories allows a user to specify a search in only certain categories of information as may be desired.
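A category-aware index of the kind just described can be sketched as follows. This is an illustrative simplification (plain substring matching rather than phonetic search), and the class and method names are hypothetical:

```python
# Minimal sketch of an index whose entries are tagged with a category
# (e.g. "addresses", "appointments") so a search can be restricted to
# only certain categories, as described in the text.

from collections import defaultdict

class CategoryIndex:
    def __init__(self):
        self._by_category = defaultdict(list)

    def add(self, category, phrase):
        """File a phrase under the given category."""
        self._by_category[category].append(phrase)

    def search(self, term, categories=None):
        """Return phrases containing the term, optionally limited to categories."""
        cats = categories if categories is not None else list(self._by_category)
        return [phrase
                for cat in cats
                for phrase in self._by_category.get(cat, [])
                if term.lower() in phrase.lower()]
```

A single index over all categories and separate per-category indexes behave equivalently here; the `categories` parameter models the user restricting a search to certain kinds of information.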
Index 220 can take many forms. In one preferred embodiment, index 220 comprises pre-computed phonetic lattices of the words in content 222 and/or input 224. Conversion of words in content 222 and input 224 to phonetic lattices is relatively straightforward by referencing a dictionary in order to identify component phonemes and phonetic fragments. Alternative pronunciations of words can be included in the corresponding lattice. For example, for the word “either”, one node of the lattice begins with the initial pronunciation of “ei” as “i” (as in “like”) and another node begins with the alternate initial pronunciation of “ei” as “ee” (as in “queen”), both followed by “ther”. Another example is the word “primer”, which has alternate pronunciations of “prim-er”, with “prim” rhyming with “him”, or “pri-mer”, with “pri” rhyming with “high”.
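The "either" example above can be made concrete with a tiny lattice builder. This is an illustrative sketch; the phoneme labels are informal stand-ins for a real dictionary's symbols, and the representation (an adjacency map) is one of many possible lattice encodings:

```python
# Build a tiny pronunciation lattice in which several alternative initial
# pronunciations branch from a start node and rejoin at a shared tail,
# as with "either": /ay/ (as in "like") or /iy/ (as in "queen"),
# both followed by "ther".

def build_pronunciation_lattice(alternative_heads, shared_tail):
    """Return a lattice as {node: [successor nodes]}, with 'S' start and 'E' end."""
    lattice = {"S": [], shared_tail: ["E"], "E": []}
    for head in alternative_heads:
        lattice["S"].append(head)
        lattice[head] = [shared_tail]
    return lattice

# Two alternative initial pronunciations of "ei", joining at "ther".
either = build_pronunciation_lattice(["ay", "iy"], "th-er")
```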
The voice search server 206 includes a lattice generation module 240 that receives the results from the speech recognizer 200 and/or 208 to identify phonemes and phonetic fragments according to a dictionary. Using this output, lattice generation module 240 constructs a lattice of phonetic hypotheses, wherein each hypothesis includes an associated time boundary and accuracy score.
If desired, approaches can be used to alter the lattice for more accurate and efficient searching. For example, the lattice can be altered to allow crossover between phonetic fragments. Additionally, penalized back-off paths can be added to allow transitions between hypotheses with mismatching paths in the lattice. Thus, output scores can include inconsistent hypotheses. In order to reduce the size of the lattice, hypotheses can be merged to increase the connectivity of phonemes and thus reduce the amount of audio data stored in the lattice.
The speech recognizer 200, 208 operates based upon a dictionary of phonetic word fragments. In one embodiment, the fragments are determined based on a calculation of mutual-information of adjacent units v and w, (which may be phonemes or combinations of phonemes). Mutual information MI can be defined as follows:
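The equation itself does not appear in this text. A standard formulation consistent with the surrounding description, commonly used for fragment selection, is the frequency-weighted pointwise mutual information, given here as an assumed reconstruction (the exact form in the original may differ):

```latex
\mathrm{MI}(v, w) \;=\; P(v, w)\,\log\frac{P(v, w)}{P(v)\,P(w)}
```

where P(v, w) is the probability of units v and w occurring adjacently in the training corpus, and P(v) and P(w) are their individual occurrence probabilities.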
Any pairs (v, w) having an MI above a particular threshold can be used as candidates for fragments to be chosen for the dictionary. A pair of units can be eliminated from a candidate list if one or both of the constituent units are part of a pair with a higher MI value. Pairs that span word boundaries are also eliminated from the list. Remaining candidate pairs v w are replaced in a training corpus by single units v-w. The process for determining candidate pairs can be repeated until a desired number of fragments is obtained. Examples of fragments generated by the mutual information process described above are /-k-ih-ng/ (the syllable “-king”), /ih-n-t-ax-r/ (the syllable “inter-”), /ih-z/ (the word “is”) and /ae-k-ch-uw-ax-l-iy/ (the word “actually”).
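The selection-and-merge loop just described can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the frequency-weighted pointwise-MI statistic, it treats each utterance independently (so it does not separately handle pairs spanning word boundaries), and the threshold and round count are arbitrary:

```python
# Greedy fragment induction: count adjacent-unit pairs in a phoneme corpus,
# score them with an MI-style statistic, keep the best pairs (dropping any
# pair that shares a unit with a higher-scoring pair), and merge chosen
# adjacent pairs v w into single units v-w. Repeat for a few rounds.

import math
from collections import Counter

def merge_fragments(corpus, rounds=2, threshold=0.0):
    """corpus: list of utterances, each a list of phoneme-unit strings."""
    for _ in range(rounds):
        unit_counts = Counter(u for utt in corpus for u in utt)
        pair_counts = Counter(p for utt in corpus for p in zip(utt, utt[1:]))
        total = sum(unit_counts.values())

        def mi(pair):
            p_vw = pair_counts[pair] / total
            p_v = unit_counts[pair[0]] / total
            p_w = unit_counts[pair[1]] / total
            return p_vw * math.log(p_vw / (p_v * p_w))

        # Rank candidates; eliminate pairs sharing a unit with a better pair.
        used, chosen = set(), []
        for pair in sorted(pair_counts, key=mi, reverse=True):
            if mi(pair) <= threshold:
                break
            if pair[0] in used or pair[1] in used:
                continue
            chosen.append(pair)
            used.update(pair)
        if not chosen:
            break

        merged = set(chosen)
        new_corpus = []
        for utt in corpus:
            out, i = [], 0
            while i < len(utt):
                if i + 1 < len(utt) and (utt[i], utt[i + 1]) in merged:
                    out.append(utt[i] + "-" + utt[i + 1])  # v w -> v-w
                    i += 2
                else:
                    out.append(utt[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return corpus
```

Repeated rounds let merged units combine again, so multi-phoneme fragments such as /ae-k-ch-uw-ax-l-iy/ can emerge from successive pair merges.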
Voice search engine 206 accesses index 220 in order to determine if the speech input includes a match in content 222 and/or input 224. The lattice generated by voice search engine 206 based on the speech input can be a phonetic sequence or a grammar of alternative sequences. During matching, lattice paths that match or closely correspond to the speech input are identified and a probability is calculated based on the recognition scores in the associated lattice. The hypotheses identified are then output by voice search engine 206 as potential matches.
As mentioned, the speech input can be a grammar corresponding to alternatives that define multiple phonetic possibilities. In one embodiment, the grammar query can be represented as a weighted finite-state network. The grammar may also be represented by a context-free grammar, a unified language model, N-gram model and/or a prefix tree, for example.
In each of these situations, nodes can represent possible transitions between phonetic word fragments and paths between nodes can represent the phonetic word fragments. Alternatively, nodes can represent the phonetic word fragments themselves. Additionally, complex expressions such as telephone numbers and dates can be searched based on an input grammar defining these expressions. Other alternatives can also be searched using a grammar as the query; for example, for speech input stating “Paul's address”, the query can include alternatives, shown here in parentheses: “Paul's (address|number)”.
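A grammar of this simple parenthesized-alternatives form can be expanded into the concrete query strings it covers. The following sketch handles only the one-level "(a|b|...)" notation used in the example above, not a full context-free grammar:

```python
# Expand an alternatives grammar such as "Paul's (address|number)" into
# the list of concrete query strings it denotes. Nested groups are handled
# by recursing on the partially expanded string.

import re

def expand_alternatives(pattern):
    """Return every concrete string covered by the parenthesized grammar."""
    m = re.search(r"\(([^)]*)\)", pattern)
    if m is None:
        return [pattern]  # no alternatives left: a concrete query
    head, tail = pattern[:m.start()], pattern[m.end():]
    results = []
    for alt in m.group(1).split("|"):
        results.extend(expand_alternatives(head + alt + tail))
    return results
```

In a real system the alternatives would more likely be compiled into a weighted finite-state network and matched directly, rather than enumerated; enumeration is shown here only to make the semantics of the grammar concrete.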
In a further embodiment, filtering can be applied to the speech input before searching is performed to remove command information. For instance, speech input comprising “find Paul's address”, “show me Paul's address”, or “search Paul's address” would each yield the same query “Paul's address”, where “find”, “show me” and “search” would not be used in pattern matching. Such filtering can be based on semantic information included with the results received from the speech recognizer 200, 208.
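A minimal sketch of this command filtering follows. The command list is illustrative (the text names "find", "show me" and "search"); a real system would use semantic tags from the recognizer rather than string prefixes:

```python
# Strip a leading command word or phrase from the speech input so that
# "find Paul's address", "show me Paul's address" and "search Paul's
# address" all reduce to the same query, "Paul's address".

COMMAND_PREFIXES = ("find ", "show me ", "search ")

def strip_command(utterance):
    """Remove a recognized command prefix, if present, from the utterance."""
    lowered = utterance.lower()
    for prefix in COMMAND_PREFIXES:
        if lowered.startswith(prefix):
            return utterance[len(prefix):]
    return utterance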
It is also worth noting that a hybrid approach to searching can also be used. In a hybrid approach, a phonetic fragment search can be used for queries that have a large number of phones, for example, seven or more phones. For shorter queries, a word-based search can be used.
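The hybrid dispatch amounts to a length test on the phone sequence. In this sketch the cutoff of seven comes from the text, while both search functions are hypothetical stand-ins:

```python
# Hybrid search strategy: route long queries (many phones) to a phonetic
# fragment search and short queries to a word-based search.

def hybrid_search(phones, fragment_search, word_search, min_phones=7):
    """phones: list of phone symbols for the query."""
    if len(phones) >= min_phones:
        return fragment_search(phones)
    return word_search(phones)
```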
For example, the score of the path from node p to node q is represented as s1. If a query matches node r, paths associated with scores s7 and s8 will be explored to node t to see if any paths match. Then, paths associated with scores s10 and s11 will be explored to node u. If the paths reach the end of the query, a match is determined. The associated scores along the paths are then added to calculate a hypothesis score. To speed the search process, paths need not be explored if matches share identical or near identical time boundaries.
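The score accumulation along a matched path can be shown in a few lines. The node and score labels here (r, t, u; 0.7, 0.11) are illustrative and do not correspond to the figure referenced above:

```python
# Sum the per-edge recognition scores along a matched lattice path, as in
# the node-to-node exploration described above; the total is the
# hypothesis score for that path.

def path_score(edges, path):
    """edges: {(from_node, to_node): score}; path: sequence of node names."""
    return sum(edges[(a, b)] for a, b in zip(path, path[1:]))
```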
The result of the search operation is a list of hypotheses (W, ts, te, P(W, ts, te|O)) that match the query string W in a time range from ts to te. The probability P(W, ts, te|O), known as the “posterior probability”, is a measure of the closeness of the match. W is represented by a phoneme sequence and O denotes the acoustic observation, expressed as a sequence of feature vectors ot. Summing the probabilities of all paths that contain the query string W from ts to te yields the following equation:

P(W, ts, te|O) = [ Σ over W−, W+ of p(O|W− W W+) P(W− W W+) ] / [ Σ over W′ of p(O|W′) P(W′) ]
Here, W− and W+ denote any word sequences before ts and after te, respectively, and W′ is any word sequence. Furthermore, the value p(O|W− W W+) is represented as:

p(O|W− W W+) = p(o1 … ots−1|W−) · p(ots … ote|W) · p(ote+1 … oT|W+)
Using speech input to form queries with visual rendering of alternatives and selection therefrom provides a very easy and efficient manner in which to enter desired data for any computing device, and particularly, a mobile device for the reasons mentioned in the Background section.
At step 406, the one or more text phrases are visually rendered to the user.
Having indicated which text phrase is desired at step 408, the desired text phrase can be provided to an application for further processing at step 410. Typically, this includes inserting the selected phrase in a field of a form being visually rendered on the computing device. In the example of
The combined use of speech input and selection of visually rendered alternatives provides an efficient method for users to access information, since the user can provide a semantically rich query audibly in a single sentence or phrase without worrying about the exact order of words or the grammatical correctness of the phrase. The speech input is not simply converted to text and used by the application being executed on the mobile device, but rather is used to form a query to search known content on the mobile device having the same or similar words. The amount of content that is searched can now be much more comprehensive since it need not all be rendered to the user. Rather, the content ascertained to be relevant to the speech input is rendered in a list of alternatives, through a visual medium. The user can easily scan the list of alternatives and choose the most appropriate alternative.
Although the present invention has been described with reference to preferred embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.