This application relates to an enhanced human-machine interface (HMI), and more specifically to two methods for improving the user experience when interacting through voice and/or text. The two disclosed methods are a hybrid approach to human input transcription and a robust text-to-speech (TTS) method capable of dynamically tuning the speech synthesis process.
Automatic transcription of human input, such as voice or text, is challenging due to the seemingly infinite domain of possible word combinations, slang phrases, abbreviations, invented or derived phrases, and cultural dialects. Modern cloud-based recognition tools provide a powerful and affordable solution to these problems. Nonetheless, they are typically inadequate when applied within a specific application domain. As a result, efficient post-processing methods are required to map the recognition output of such tools to a subset of words in a specific domain of interest.
Modern text-to-speech (TTS) technologies offer fairly accurate results where the targeted vocabulary comes from a well-established and constrained domain. However, they may perform poorly when applied to more challenging domains containing new or infrequently used words, proper names, or derived phrases. Incorrect pronunciations of such words and phrases can make the product appear simple and naïve. Furthermore, many application domains, such as entertainment and sports, contain words that are transient and short-lived in nature. Such volatile environments make it infeasible to employ manual tuning to keep pronunciation vocabularies up-to-date. Accordingly, automatic updating of the pronunciation vocabulary of TTS methods can significantly improve their flexibility and robustness in these application domains.
Two methods for improving the user experience while interacting through voice and/or text are presented. The first disclosed method is a hybrid word look-up approach that matches the potential words produced by a recognizer with a set of possible words in a domain database. The second disclosed method enables dynamic, on-demand updating of the pronunciation vocabulary for words that are unknown to a speech synthesis system. Together, the two disclosed methods yield a more accurate match for words input by a user, as well as more appropriate pronunciation for words spoken by the voice interface, and thus a significantly more user-friendly and natural human-machine interaction experience.
An ensemble of word matching methods 16 computes the distance between each potential word and each of the possible words. In an exemplary embodiment of the disclosed method, the distance is computed as a weighted aggregate of word distances in multiple spaces, including phonetic encodings, such as metaphone and double metaphone, and string metrics, such as the Levenshtein distance. The words are then sorted according to their computed aggregate distances, and only a predefined number of top words are output as the set of candidate words fed to a clustering method 20.
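As an illustrative sketch (not a definitive implementation of the claimed embodiment), the weighted aggregate distance could be computed as below. Soundex is used here merely as a simple, self-contained stand-in for the metaphone/double metaphone encodings named above, and the weights and top-k count are arbitrary example parameters:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic string edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def soundex(word: str) -> str:
    """Simplified phonetic code (stand-in for metaphone); assumes a
    non-empty, ASCII word."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, last = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            out += code
        last = code
    return (out + "000")[:4]

def rank_candidates(potential, possible, w_str=0.5, w_phon=0.5, top_k=5):
    """Weighted aggregate of string-space and phonetic-space distances;
    returns the top-k closest possible words as the candidate set."""
    scored = []
    for word in possible:
        d = (w_str * levenshtein(potential, word)
             + w_phon * levenshtein(soundex(potential), soundex(word)))
        scored.append((d, word))
    scored.sort()
    return [w for _, w in scored[:top_k]]
```

In a production embodiment the Soundex stand-in would be replaced with the metaphone/double metaphone encodings, and the weights tuned for the domain of interest.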
The set of candidate words is grouped into two segments by the clustering method 20. The first segment includes candidate words considered a likely match for the user's voice input, whereas the second segment contains the unlikely matches. Words in the former category are identified based on their previously computed aggregate distances, by selecting the words whose distances are distinctly smaller; the remaining words are placed in the second category. In a preferred embodiment of the clustering method, a well-known image segmentation approach, the Otsu method, can be used to identify the distinct set of words.
Before being presented to the user as the set of recognized words, the set of distinct words may be rearranged according to one or more items of associated metadata. The metadata are stored along with the set of possible words in the domain database 18 and include features such as frequency of usage and a user-defined or dynamically computed priority/importance for each word. This rearrangement is particularly useful for disambiguating distinct words with very close distinction levels.
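The metadata-based rearrangement could be sketched as follows; the metadata keys "frequency" and "priority" are hypothetical names for the features described above, not identifiers from the disclosure:

```python
def rerank(words, metadata):
    """Reorder near-tied candidate words by usage frequency, then by
    priority/importance (higher values first). `metadata` maps each word
    to a dict with hypothetical keys "frequency" and "priority"."""
    return sorted(words, key=lambda w: (-metadata.get(w, {}).get("frequency", 0),
                                        -metadata.get(w, {}).get("priority", 0)))
```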
For those words identified as alien, a cloud-based resource 14, such as the Collins online dictionary interface, is queried to obtain one or more pronunciation phonetics suggestions. The obtained phonetics may be represented using a phonetic notation such as IPA or SAMPA. The suggested phonetics are presented to a human agent 46, e.g. a word is displayed on a screen while its suggested pronunciation is played aloud, to verify their validity. Alternatively, the suggested phonetic pronunciations can be validated by a software agent running on a local server 48. The confirmed pronunciation phonetics, along with their corresponding (previously) alien words, are then added to the domain database 18. This may be done in real time (i.e., with the user possibly waiting a few seconds while the system confirms the pronunciation with the human agent 46, if there are not already sufficient words to be read to the user while the human verification is performed). Alternatively, this may be done offline, in which case the user is presented with the best phonetic pronunciation available at the time, which is later validated by the human agent 46 and stored in the domain database 18.
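A minimal sketch of this update loop is given below. All of the callables (`fetch_phonetics` standing in for the cloud-based resource 14, `validate` standing in for the human agent 46 or software agent 48) and the dictionary used as the domain database are hypothetical placeholders, not actual interfaces from the disclosure:

```python
def update_pronunciations(alien_words, domain_db, fetch_phonetics, validate):
    """For each out-of-vocabulary word, fetch candidate phonetics (e.g. IPA
    strings) from a cloud resource, validate them via a human or software
    agent, and store the first confirmed entry in the domain database."""
    for word in alien_words:
        for ipa in fetch_phonetics(word):   # placeholder for querying resource 14
            if validate(word, ipa):         # placeholder for agent 46 or 48
                domain_db[word] = ipa       # persist for future TTS requests
                break
```

In the real-time variant this loop would run while the user waits; in the offline variant, `validate` would be deferred and the best available phonetics used immediately.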
The word-lookup system 10 may be a computer, smartphone or other electronic device with a suitably programmed processor, storage, and appropriate communication hardware. The cloud services 14 and domain database 18 may be a server or groups of servers in communication with the word-lookup system 10, such as via the Internet.
In accordance with the provisions of the patent statutes and jurisprudence, exemplary configurations described above are considered to represent a preferred embodiment of the invention. However, it should be noted that the invention can be practiced otherwise than as specifically illustrated and described without departing from its spirit or scope.
Number | Date | Country
---|---|---
61830789 | Jun 2013 | US