The present invention relates generally to speech recognition systems, and particularly to methods and systems for querying an electronic dictionary using spoken input.
Many mobile devices and desktop applications enable users to query electronic dictionaries. A dictionary may comprise, for example, a thesaurus or lexicon that provides definitions of words or phrases. In other applications, bilingual or multilingual dictionaries provide translation of words from one language to another.
A number of data entry methods are known in the art for entering a word or phrase to be looked up in the dictionary. In some applications, the user types the query word using a keyboard or keypad. For example, Ectaco, Inc. (Long Island City, N.Y.) offers a number of handheld electronic dictionaries and translators. One exemplary product is described in www.ectaco.com/dictionaries/view_info.php3?refid=831&pagelang=23&dict_id=92. Other applications use speech recognition methods, in which the user vocally pronounces the query word. For example, Ectaco, Inc., offers a multilingual translator called “UT-103 Universal Translator” that supports voice input. Additional details regarding this product can be found at www.universal-translator.net.
Some dictionary applications use Optical Character Recognition (OCR) methods for entering queries. For example, Wizcom Technologies, Ltd. (Jerusalem, Israel), offers a family of translators and dictionaries called “Quicktionary.” The Quicktionary products are pen-shaped handheld devices that use OCR methods to scan and analyze printed text. Additional details regarding the Quicktionary products can be found at www.wizcomtech.com. Another example of the use of OCR techniques is described by Elgan in “Nothing Lost in Translation,” HP World Magazine, (5:6), June 2002. This article is also available at www.interex.org/hpworldnews/hpw206/pub_hpw_features1.jsp. According to this method, the user takes a picture of the required word using a digital camera. An OCR module produces a string comprising the letters of the word, which is then used for querying the dictionary.
Generally speaking, data entry methods are prone to errors. Therefore, some applications use methods for detecting errors or reducing the possibility of erroneous data entry. One way of reducing the probability of error is using two or more different data entry methods for the same word. This approach is sometimes referred to as “multimodal” data entry. For example, some speech recognition applications use alphanumeric data entry from a telephone keypad. Such a technique is described by Parthasarathy in “Experiments in Keypad-Aided Spelling Recognition,” The 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), Quebec, Canada, May, 2004. The author describes several schemes for augmenting speech input with input from a telephone keypad in a call-center application.
Another example is a flight reservation system that uses keypad entry for error detection, described by Filisko and Seneff in “Error Detection and Recovery in Spoken Dialogue Systems,” Proceedings of the Human Language Technology Conference, North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL 2004), Workshop on Spoken Language Understanding for Conversational Systems, Boston, Mass., May, 2004, pages 31-38.
Some applications use letter spelling or phonetic spelling as a mode for data entry. The paper by Filisko and Seneff cited above also describes a “speak and spell” method, in which the user is asked to spell words as an error recovery measure. Another application, in which a user enters a target word using phonetic spelling, is described in U.S. Pat. No. 6,321,196. Spelling a word phonetically means representing each letter in the word to be spelled by a commonly understood word. For example, one may phonetically spell the word “key” by saying “kilo echo yankee.” The inventor describes a speech recognition system in which the user says a sequence of words selected from a given vocabulary without being restricted to a pre-specified phonetic alphabet. The system recognizes the spoken words, associates letters with these words and then arranges the letters to form the target word.
Another spelling-based application is described in U.S. Pat. No. 5,995,928. The inventors describe a speech recognition system capable of recognizing a word based on a continuous spelling of the word by a user. The system continuously outputs an updated string of hypothesized letters, based on the letters uttered by the user. The system compares each string of hypothesized letters to a vocabulary list of words and returns a best match for the string.
In some speech recognition applications, the user is presented with several alternative results following the automatic recognition process. For example, U.S. Pat. No. 5,027,406 describes a method for creating word models in a natural language dictation system. After the user dictates a word, the system displays a list of the words in the active vocabulary which best match the spoken word. By keyboard or voice command, the user may choose the correct word from the list or may choose to edit a similar word if the correct word is not on the list. Alternatively, the user may type or speak the initial letters of the word.
Another user-assisted method is described in U.S. Patent Application Publication 2002/0064257 A1. The inventors describe a voice-activated dialing system that uses a DTMF (dual-tone multi-frequency) entry device to narrow the possibilities for the selection of a phonetically based name. The user enters a DTMF signature of a name and the signature is used by a dictionary to generate likely possibilities for the word. The user is asked to confirm whether the suggested name is the name entered.
There is therefore provided, in accordance with an embodiment of the present invention, a method for querying an electronic dictionary using letters of an alphabet enunciated by a user. The method includes accepting a speech input from the user, the speech input including a sequence of spelled letters enunciated by the user that spell a query word. The speech input is analyzed to determine one or more sequences of the letters that approximate the sequence of spelled letters. The one or more sequences of the letters are post-processed so as to produce a plurality of recognized words approximating the query word. The electronic dictionary is queried with the plurality of recognized words so as to retrieve a respective plurality of dictionary entries. A list of results including the plurality of recognized words and the respective plurality of dictionary entries is presented to the user.
In an embodiment, analyzing the speech input includes applying at least one of an acoustic model and a language model to the speech input. Additionally or alternatively, applying the language model includes representing at least part of the dictionary in terms of a finite state grammar (FSG). Further additionally or alternatively, applying the language model includes assigning probabilities to the sequences of the letters based on a probabilistic language model.
In another embodiment, post-processing the sequences includes defining two or more letter classes including subsets of the letters in the alphabet that have similar sounds, and constructing sequences of the letters by substituting at least one of the letters belonging to the same letter class as at least one of the letters of the query word, so as to produce the plurality of recognized words.
In yet another embodiment, querying the dictionary includes accepting a user command including at least one of a typed input and a voice command, and modifying at least one letter of one of the recognized words based on the user command.
In still another embodiment, presenting the list of results includes assigning likelihood scores to the recognized words on the list and sorting the list based on the likelihood scores. Additionally or alternatively, presenting the list of results includes converting at least part of the list to a speech output, and playing the speech output to the user. Further additionally or alternatively, presenting the list of results includes accepting a user command including at least one of a typed input and a voice command, and scrolling through the list responsively to the user command.
In an embodiment, accepting the speech input includes receiving the speech input via an audio interface associated with a mobile device including at least one of a mobile telephone, a portable computer and a personal digital assistant (PDA), and presenting the list includes providing the list via an output of the mobile device.
In another embodiment, accepting the speech input includes sending the speech input from the mobile device to a remote server that serves one or more users, and presenting the list of results includes transmitting the list of results from the remote server to the mobile device for presentation to the user.
Apparatus and a computer software product for querying an electronic dictionary are also provided.
There is additionally provided, in accordance with an embodiment of the present invention, a system for querying an electronic dictionary using letters of an alphabet enunciated by a user. The system includes a remote server including a memory, which is coupled to store the electronic dictionary.
The system includes one or more spelling processors, which are coupled to accept a speech input from the user, the speech input including a sequence of spelled letters enunciated by the user that spell a query word, to analyze the speech input so as to determine one or more sequences of the letters approximating the sequence of spelled letters, to post-process the one or more sequences of the letters so as to produce a plurality of recognized words approximating the query word, to query the electronic dictionary stored in the memory with the plurality of recognized words so as to retrieve a respective plurality of dictionary entries, and to generate a list of results including the plurality of recognized words and the respective plurality of dictionary entries.
The system also includes a user device, including a client processor, which is coupled to receive the speech input from the user and to send the speech input to the remote server, and which is coupled to receive, responsively to the speech input, the list of results. The user device includes an output device, which is coupled to present the list of results generated by the spelling processor to the user.
The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
Embodiments of the present invention provide improved methods and systems that allow users of mobile devices to query an electronic dictionary using spelling recognition. Instead of pronouncing the query word as a whole, as implemented in conventional speech recognition systems, the user vocally spells the query word letter by letter. A spelling processor in the mobile device captures and processes the spelled word. A list of possible recognized words is produced, according to predefined models. A list of results, comprising the recognized words along with the corresponding dictionary entries, is presented to the user. The user can then scroll through the results and identify the correct word and dictionary entry.
In comparison with conventional speech recognition methods that recognize the entire word, spelling recognition typically achieves better recognition performance. Embodiments of the present invention provide a method and a system that are particularly suitable for users who are not familiar with the language in question, such as tourists or foreigners. Such users may not know the correct pronunciation of words but can easily spell them out. Users with speech impairments, whose pronunciation of words may be difficult to understand, may also benefit from the disclosed methods.
On the other hand, reliable letter-by-letter spelling recognition is a non-trivial task that introduces other types of error mechanisms, as will be explained below. The disclosed methods address these error mechanisms by defining appropriate models that determine the list of alternative recognized words. The list is typically sorted by relevance, using relevance measures that are based on the same error mechanisms and/or the model being used.
Some embodiments of the present invention also provide a quick and simple user interface for users of mobile devices. The user interface combines spelling recognition with keypad functions and/or voice commands. This multimodal functionality enables quick and smooth operation of the dictionary application by both ordinary users and users with special needs.
Additionally, the disclosed user interface enables the user to query the dictionary without having to move his or her eyes from the written text. For blind users who read text written in Braille, the user interface enables querying the dictionary without moving the user's fingers away from the page.
In a disclosed embodiment, the result list is converted to speech and played to the user using a text-to-speech (TTS) generator. This implementation is also particularly suitable for blind users and for users who operate the system while driving or carrying out other tasks that require continuous visual attention.
In another embodiment, the dictionary query system is implemented in a remote server configuration using distributed speech recognition (DSR).
The mobile device typically comprises a microphone 27 for accepting speech from the user and a keypad 28 for accepting user input. A display 30 presents textual information to the user. In some embodiments, mobile device 26 also comprises a speaker 31 for playing synthesized speech to the user, as will be explained below.
The electronic dictionary application may comprise a thesaurus or a lexicon, in which case querying the dictionary means retrieving a definition of a word. Alternatively, the dictionary may comprise a bilingual or multilingual dictionary, in which case querying the dictionary means retrieving a translation of the word to another language. Additional dictionary applications comprise dictionaries that are specific to particular professional disciplines and phrasebooks that translate phrases from one language to another. Other dictionary applications will be apparent to those skilled in the art, and can be implemented using the methods described hereinbelow. In the context of the present patent application and in the claims, the term “dictionary” pertains to any such dictionary application. The term “dictionary entry” refers to the definition or the translation of a word or phrase, as relevant to the particular application.
The spelling processor is typically implemented as a software process that runs on a central processing unit (CPU) of the mobile device. The spelling processor queries an electronic dictionary 36, which is stored in a memory of the mobile device, and retrieves dictionary entries corresponding to the recognized words. The spelling processor typically displays the list of results using an output device such as display 30. Additionally or alternatively, the output device comprises a text-to-speech (TTS) generator 38 that converts the list of results, or parts of it, to speech and plays it to the user. Again, a detailed description of the method and the associated user interfaces is given in the description of
A post processor 41 in spelling processor 34 accepts the letter sequences and associated probabilities from recognizer 39. The post processor queries dictionary 36 with the recognized words and produces an ordered list of results. The list comprises the recognized words and the associated dictionary definitions of these words. The configuration of spelling processor 34 shown in
A centralized dictionary configuration is sometimes preferred because it enables the use of larger dictionaries. Large dictionaries, or dictionaries holding large and detailed entries, may significantly exceed the memory storage capabilities of typical mobile devices. Additionally, maintaining and updating information in a centralized dictionary data structure is often easier than managing multiple dictionaries distributed between multiple users.
The configuration shown in
In the remote server configuration, mobile device 26 comprises a client processor 42 that accepts the speech input from the user via microphone 27 and sampler 32 (not shown in this figure). Processor 42 compresses the captured and digitized speech and transmits it, typically in a compact form, such as a stream of compressed feature vectors, to spelling processor 34 in server 40. The spelling processor decompresses the feature vectors, processes the decompressed speech and queries dictionary 36, according to the method of
Mobile device 26 and server 40 are linked by a communication channel. The channel is used to send compressed speech to the server, send result lists to the mobile device and exchange miscellaneous control information. The communication channel may comprise any suitable medium, such as an Internet connection, a telephone line, a wireless data network, a cellular network, or a combination of several such media.
Typically, spelling processor 34 and client processor 42 comprise general-purpose computer processors, which are programmed in software to carry out the functions described herein. The software may be downloaded to the computers in electronic form, over a network, for example, or it may alternatively be supplied to the computers on tangible media, such as CD-ROM. Further alternatively, the spelling processor may be a standalone unit, or it may alternatively be integrated with other computing functions of mobile device 26 or server 40. Additionally or alternatively, at least some of the functions of the spelling processor may be implemented using dedicated hardware. Client processor 42 may also be integrated with other computing functions of mobile device 26.
(If the disclosed method is implemented using a remote server configuration, as shown in
Speech recognizer 39 and post processor 41 in spelling processor 34 (
However, in specific cases, such as users with speech impairments or users with heavy accents, the use of learned user-specific speech characteristics may improve the quality of recognition. In some embodiments, speech recognizer 39 extracts additional information from the digitized speech, to be used in the recognition process as will be explained below.
In some embodiments, the speech recognizer uses a suitable acoustic model for assigning a likelihood score to each identified spelled letter. Each likelihood score quantifies the likelihood that the particular letter was indeed uttered by the user.
The speech recognizer uses a language model, which may be based in whole or in part on the dictionary being used. Using the language model, the speech recognizer generates one or more letter sequences that represent possibly-recognized words in response to the captured input speech.
In some embodiments, the language model comprises a graph representing the dictionary, which is commonly referred to as a Finite State Grammar (FSG). Finite state grammars (sometimes also referred to as finite-state networks) are described, for example, by Rabiner and Juang in “Fundamentals of Speech Recognition,” Prentice Hall, April 1993, pages 414-416. The nodes of the FSG represent letters of the alphabet. (In typical implementations, each letter of the alphabet appears several times in the graph.) Arcs between nodes represent adjacent letters in legitimate words. In other words, each word in the dictionary is represented as a trajectory or path through the graph.
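By way of illustration only, the graph described above may be sketched as a prefix tree, which is one simple form of such a graph and not necessarily the structure used in any particular embodiment. The function names `build_fsg` and `is_word`, and the `"$"` end-of-word marker, are hypothetical and are introduced solely for this example. As in the description above, each letter may appear several times in the graph, and each dictionary word corresponds to one path.

```python
def build_fsg(words):
    """Build a letter graph (a prefix tree) in which every word is one path."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            # Follow the arc for this letter, creating it if it does not exist.
            node = node.setdefault(letter, {})
        node["$"] = True  # end-of-word marker (hypothetical convention)
    return root


def is_word(fsg, letters):
    """Return True if the letter sequence traces a complete path in the graph."""
    node = fsg
    for letter in letters:
        if letter not in node:
            return False
        node = node[letter]
    return "$" in node


fsg = build_fsg(["cat", "cab", "bat"])
print(is_word(fsg, "cat"), is_word(fsg, "ca"), is_word(fsg, "cad"))
```

In this sketch, "ca" is rejected because it is only a prefix of a legitimate word, and "cad" is rejected because no arc for the letter "d" follows the path "ca".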
In some embodiments, only part of the dictionary is represented as an FSG. In many practical cases, FSG-based models are used for small to medium size vocabularies and dictionaries, typically up to several thousand words.
When using the FSG, the speech recognizer typically compares the sequence of spelled letters of the digitized speech to the different trajectories through the FSG. In some embodiments, the speech recognizer assigns likelihood scores to the trajectories. The speech recognizer produces the letter sequences and the associated likelihood scores.
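The scoring of trajectories may be sketched as follows, under simplifying assumptions that are not taken from the description: the acoustic model is assumed to yield exactly one set of letter likelihoods per spelled letter, letter insertions and deletions are ignored, and the dictionary is given as a flat word list rather than as a graph. The function `score_paths`, the `floor` value and the likelihood figures are hypothetical.

```python
import math


def score_paths(letter_likelihoods, words, floor=1e-6):
    """Score each same-length word by the product of its per-position
    letter likelihoods, computed as a sum of logarithms."""
    scored = []
    for word in words:
        if len(word) != len(letter_likelihoods):
            continue  # this sketch does not handle inserted or deleted letters
        log_score = sum(
            math.log(position.get(letter, floor))
            for position, letter in zip(letter_likelihoods, word)
        )
        scored.append((word, log_score))
    # Highest likelihood first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Illustrative acoustic likelihoods for a user who spelled "c", "a", "t".
letter_likelihoods = [
    {"c": 0.6, "b": 0.3, "p": 0.1},
    {"a": 0.9, "k": 0.1},
    {"t": 0.7, "d": 0.3},
]
scored = score_paths(letter_likelihoods, ["bat", "cat", "cad", "dog", "beat"])
print([word for word, _ in scored])
```

With these illustrative figures, "cat" ranks first, followed by "bat" and "cad", while "beat" is skipped because its length differs from the number of spelled letters.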
In other embodiments, the language model comprises a probabilistic language model, which assigns probabilities to different letter sequences in the vocabulary. Probabilistic language models are described, for example, by Young in “A Review of Large-Vocabulary Continuous-Speech Recognition,” IEEE Signal Processing Magazine, September 1996, pages 45-57. Probabilistic language models are typically used when the size of the dictionary is very large, making it difficult to represent every word in the model explicitly. In these embodiments, speech recognizer 39 produces one or more letter sequences that resemble the sequence of spelled letters, with associated likelihood scores in accordance with the probabilistic language model.
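One possible form of a probabilistic language model over letters is a letter-bigram model, sketched below. The choice of a bigram model, the add-one smoothing, the boundary symbols `^` and `$`, and the names `train_bigram` and `log_prob` are all assumptions made for this example; the description does not specify the model at this level of detail.

```python
import math
from collections import Counter


def train_bigram(words):
    """Count letter pairs, using ^ and $ as word-boundary symbols."""
    pair_counts = Counter()
    context_counts = Counter()
    for word in words:
        padded = "^" + word + "$"
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts


def log_prob(letters, model, alphabet_size=27):
    """Add-one smoothed log-probability of a letter sequence.

    alphabet_size assumes 26 letters plus the end-of-word symbol.
    """
    pair_counts, context_counts = model
    padded = "^" + letters + "$"
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        numerator = pair_counts[(prev, cur)] + 1
        denominator = context_counts[prev] + alphabet_size
        total += math.log(numerator / denominator)
    return total


model = train_bigram(["cat", "cab", "bat", "tab"])
print(log_prob("cat", model) > log_prob("tcz", model))
```

Because the model scores letter transitions rather than whole words, it can assign a probability to any letter sequence without representing every dictionary word explicitly.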
In yet another embodiment, the speech recognizer represents the different letter sequences produced by the probabilistic language model in terms of a lattice. The lattice is a graph comprising the possible sequences of letters, with each sequence assigned a respective likelihood score, according to the probabilistic language model.
Following the speech recognition process, speech recognizer 39 provides to post processor 41 one or more letter sequences with associated likelihood scores, as described above.
In one embodiment, when speech recognizer 39 uses an FSG as the language model, the letter sequences provided to post processor 41 are already legitimate words that appear in dictionary 36.
In another embodiment, in which speech recognizer 39 uses a probabilistic language model with lattice output, as described above, post processor 41 selects a subset of the letter sequences in the lattice, having the highest likelihood scores. Since not all of the possible letter sequences in the lattice necessarily correspond to legitimate dictionary words, post processor 41 typically queries dictionary 36 with the selected letter sequences, and discards words that do not appear in the dictionary.
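The selection and filtering described above may be sketched as follows. For simplicity, the lattice is assumed to have been flattened into a list of (letter sequence, likelihood score) pairs, and the dictionary is modeled as a set of words. The function `select_words`, the parameter `n_best` and the scores are hypothetical.

```python
import heapq


def select_words(hypotheses, dictionary, n_best=5):
    """Keep the n_best highest-scoring sequences, then discard non-words."""
    top = heapq.nlargest(n_best, hypotheses, key=lambda h: h[1])
    return [(seq, score) for seq, score in top if seq in dictionary]


# Illustrative hypotheses; "cjt" is a plausible misrecognition but not a word.
hypotheses = [
    ("cat", 0.41), ("bat", 0.22), ("cad", 0.17), ("cjt", 0.12), ("pat", 0.08),
]
dictionary = {"cat", "bat", "cad", "pat"}
results = select_words(hypotheses, dictionary, n_best=4)
print(results)
```

In this sketch the sequence "cjt" is among the four highest-scoring hypotheses but is discarded because it does not appear in the dictionary, while "pat" is a legitimate word that falls outside the selected subset.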
In yet another embodiment, in which speech recognizer 39 uses a probabilistic language model, speech recognizer 39 outputs only the letter sequence having the maximum likelihood score (referred to hereinbelow as the highest ranking sequence). Post processor 41 constructs a list of alternative letter sequences based on the highest ranking sequence by using letter classes, as explained below.
Spelled letters can be classified into letter classes based on their pronunciation characteristics. During speech recognition, some spelled letters may be mistaken for one another. For example, the spelled letters /b/, /c/, /d/, /e/, /g/, /p/, /t/, /v/ and /z/ all belong to the same letter class (referred to as the “e-class”). These letters all have similar vowel sounds when spelled. In some cases, the speech recognizer may mistake one such letter for another. Similarly, the speech recognizer may erroneously interchange letters belonging to the “a-class” (/a/, /h/, /j/, /k/), the “i-class” (/i/, /y/) and the “u-class” (/u/, /q/).
The probabilities of mistaking one letter for another are typically represented as a matrix, which is called a “confusion matrix.” The probability of interchanging letters belonging to different letter classes is assumed to be small. When using letter classes, the post processor constructs the list of alternative letter sequences by replacing each letter of the highest ranking sequence with similarly-sounding letters, according to the letter classes described above. The post processor typically ranks the list, for example by computing likelihood scores based on the confusion matrix.
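Ranking by means of a confusion matrix may be sketched as follows. The matrix entries below are invented for illustration and are not measured values; entries that are not listed are assigned a small floor probability, reflecting the assumption that confusions across letter classes are rare. The names `CONFUSION`, `FLOOR`, `confusion_log_score` and `rank` are hypothetical.

```python
import math

# Hypothetical P(recognized letter | spoken letter), keyed as (spoken, recognized).
CONFUSION = {
    ("c", "c"): 0.70, ("b", "c"): 0.10, ("p", "c"): 0.05,
    ("a", "a"): 0.80, ("k", "a"): 0.10,
    ("t", "t"): 0.75, ("d", "t"): 0.15,
}
FLOOR = 1e-4  # assumed probability for any pair not listed above


def confusion_log_score(candidate, recognized):
    """Log-likelihood that `candidate` was spoken, given what was recognized."""
    if len(candidate) != len(recognized):
        return float("-inf")  # this sketch compares equal-length sequences only
    return sum(
        math.log(CONFUSION.get((spoken, heard), FLOOR))
        for spoken, heard in zip(candidate, recognized)
    )


def rank(candidates, recognized):
    """Sort candidate words from most to least likely."""
    return sorted(
        candidates,
        key=lambda word: confusion_log_score(word, recognized),
        reverse=True,
    )


ranked = rank(["pad", "bat", "cat"], "cat")
print(ranked)
```

With these illustrative probabilities, the candidate identical to the recognized sequence ranks first, followed by candidates that differ from it in one letter and then in two.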
For example, assume that the user has spelled the word “cat” and that the highest ranking sequence, as recognized by speech recognizer 39, is /c/, /a/, /t/. Using the letter classes described above, the post processor constructs a list of alternative letter sequences defined by [{e-class}, {a-class}, {e-class}] (i.e., all 9×4×9=324 three-letter strings, in which the first letter belongs to the e-class, the second letter belongs to the a-class and the third letter again belongs to the e-class). In some embodiments, the alternative letter sequences may also comprise a different number of letters, or letters from other letter classes. For example, the query word “cat” can also be recognized as “beat.”
Obviously, only a few of the alternative letter sequences produced in the above example (such as “bat”, “the”, “pad” and the original “cat”) are meaningful words. Most are meaningless strings. Note also that the pronunciation of the entire words may be very different from the pronunciation of the query word. As an extreme example, the sound of the word “the” is very different from the sound of the word “cat”. Nevertheless, these two words are both considered legitimate alternative letter sequences by the spelling processor, because the spelled sequence /t/, /h/, /e/ does sound similar to the spelled sequence /c/, /a/, /t/. The post processor maintains (or produces in the first place) only the letter sequences that correspond to meaningful words. The post processor may differentiate between meaningful and meaningless letter sequences by querying dictionary 36 or by using any suitable grammatical rules, which are part of the language model being used.
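The construction and filtering of alternative letter sequences in the above example may be sketched as follows. The letter classes are those listed in the description; the treatment of letters that belong to no class (they are kept unchanged), the small stand-in dictionary, and the names `class_of`, `alternatives` and `meaningful` are assumptions made for this example. The sketch does not cover alternatives having a different number of letters.

```python
from itertools import product

LETTER_CLASSES = {
    "e-class": "bcdegptvz",
    "a-class": "ahjk",
    "i-class": "iy",
    "u-class": "uq",
}


def class_of(letter):
    """Return all letters that sound similar to `letter` when spelled."""
    for members in LETTER_CLASSES.values():
        if letter in members:
            return members
    return letter  # assumption: a letter outside every class is kept as-is


def alternatives(sequence):
    """All strings obtained by substituting letters within their classes."""
    pools = (class_of(letter) for letter in sequence)
    return ["".join(letters) for letters in product(*pools)]


def meaningful(sequence, dictionary):
    """Keep only the alternatives that appear in the dictionary."""
    return sorted(word for word in alternatives(sequence) if word in dictionary)


candidates = alternatives("cat")
words = meaningful("cat", {"cat", "bat", "the", "pad", "dog"})
print(len(candidates), words)
```

The sketch generates all 324 candidate strings and retains "bat", "cat", "pad" and "the", while "dog" is excluded because its second letter does not belong to the a-class.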
In order to minimize the probability of false recognition, the spelling processor may request the user's assistance in determining which one of the recognized letter sequences, or recognized words, is the original query word entered by the user. For this purpose, the post processor prepares a list of results, at a list preparation step 56. In some embodiments, the post processor produces the list of results in accordance with one of the language models described above. In some embodiments, the post processor sorts the list of results in descending order of relevance. The relevance score of a particular recognized word is typically determined in accordance with the language model being used, as described above. Alternatively, the list can be sorted alphabetically, or using any other suitable criterion.
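The preparation of the list of results may be sketched as follows, assuming that each recognized word already carries a relevance score and that the dictionary maps words to entries. The function `prepare_results`, the `order` parameter, the scores and the abbreviated dictionary entries are hypothetical.

```python
def prepare_results(scored_words, dictionary, order="relevance"):
    """Pair each recognized word with its dictionary entry and sort the list."""
    entries = [
        (word, score, dictionary[word])
        for word, score in scored_words
        if word in dictionary
    ]
    if order == "alphabetical":
        entries.sort(key=lambda entry: entry[0])
    else:
        # Descending order of relevance: most likely word first.
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return [(word, definition) for word, _, definition in entries]


dictionary = {
    "cat": "a small domesticated feline",
    "bat": "a nocturnal flying mammal",
    "pad": "a cushion-like mass of soft material",
}
scored = [("bat", 0.06), ("cat", 0.42), ("pad", 0.006)]
by_relevance = prepare_results(scored, dictionary)
by_alpha = prepare_results(scored, dictionary, order="alphabetical")
print(by_relevance[0], [word for word, _ in by_alpha])
```

Sorting by relevance places the most likely word at the head of the list, which suits presentations in which only the first result is shown initially.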
(If the disclosed method is implemented using a remote server configuration, as shown in
The spelling processor presents the list of results to the user, at a presentation step 60. Typically, the list of recognized words is displayed as text on display 30 of the mobile device. The user may scroll through the list using keypad 28 until he or she finds the intended query word and the corresponding dictionary entry. Alternatively, only the first word on the list is displayed together with its dictionary entry. If the first recognized word on the result list is incorrect, the user may scroll down and select the next word. Any other suitable presentation method can be used, depending upon the particular application and the capabilities of keypad 28 and display 30 of the mobile device. Additionally, the user can also edit the displayed recognized words at any time using the keypad, so as to enter part or all of the intended query word.
In another embodiment, the list of results is converted to speech using TTS generator 38 and played to the user through speaker 31. The user can indicate, either using the keypad or by uttering a voice command, when the correct word is being played. After the user selects the correct word, the TTS generator plays the corresponding dictionary entry.
Although the disclosed methods mainly address spelling-based dictionary lookup in mobile devices, the same methods can be used in a variety of additional applications. For example, the disclosed methods can also be used in desktop or mainframe computer applications that require high quality word recognition. Such applications include, for example, directory assistance services and name dialing applications.
It will thus be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art.