Speech synthesis, or text-to-speech (TTS) conversion, requires that a pronunciation be determined for each word in the text. The process controlling the conversion, known as a speech engine, typically has access to one or more pronunciation dictionaries, or lexical files, that store pronunciations of text words that are expected to be processed by the speech engine. For example, one pronunciation dictionary may be a dictionary of common words, and another may be provided to the speech engine by a particular software application, while the application is running, for words that are unique to that application. However, it can be expected that some words are not in a given set of pronunciation dictionaries, so the speech engine also includes methods for generating pronunciations for unknown words without using a pronunciation dictionary. These methods are error-prone.
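The lookup order described above can be sketched as follows. This is a minimal illustration, not the speech engine of the specification: the names `pronounce` and `naive_letter_to_sound`, the phoneme strings, and the letter-by-letter fallback are all assumptions made for the example. The fallback stands in for whatever error-prone rule-based method a real engine would use.

```python
from typing import Callable, Dict, Sequence, Tuple

# A pronunciation dictionary maps a lower-case text word to a phoneme string.
Lexicon = Dict[str, str]


def naive_letter_to_sound(word: str) -> str:
    """Hypothetical fallback: spell the word out letter by letter.

    A real engine would apply letter-to-sound rules here; this stand-in
    only marks where dictionary-free (error-prone) generation happens.
    """
    return " ".join(word.upper())


def pronounce(
    word: str,
    lexicons: Sequence[Lexicon],
    fallback: Callable[[str], str] = naive_letter_to_sound,
) -> Tuple[str, bool]:
    """Return (pronunciation, found_in_dictionary) for one text word.

    Dictionaries are consulted in order, for example a common-word
    dictionary first and an application-supplied dictionary second.
    """
    key = word.lower()
    for lexicon in lexicons:
        if key in lexicon:
            return lexicon[key], True
    # Word is unknown to every dictionary: generate a pronunciation instead.
    return fallback(word), False
```

The boolean in the result makes it easy for a caller to tell a dictionary pronunciation from a generated one, which is the distinction the paragraph above draws.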
TTS is a highly desirable feature in many situations, two examples being when a cellular telephone is used by a driver and when a cellular telephone is used by a sight-impaired person. TTS is therefore valuable in electronic devices having limited resources, and the challenge is to minimize the size of the pronunciation dictionaries used in such resource-limited devices while at the same time minimizing pronunciation errors for unknown words.
The two examples described above involve a client device (a cellular telephone) that is typically operated in a radio communication system, by which the client device can be connected to the world-wide-web. The World Wide Web Consortium (W3C) is developing a standard for pronunciation dictionaries for speech applications written using such tools as VoiceXML (located at URL www.w3.org/TR/lexicon-reqs).
The present invention is illustrated by way of example and not limitation in the accompanying figures, in which like references indicate similar elements, and in which:
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
Before describing in detail the text to speech (TTS) conversion techniques in accordance with the present invention, it should be observed that the present invention resides primarily in combinations of method steps and apparatus components related to TTS conversion. Accordingly, the apparatus components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
Referring to
The client device 105 comprises a processor 155 that is coupled to a memory 150, a speaker 160, a network interface 165, and a user interface 170. The processor 155 may be a microprocessor, a digital signal processor, or any other processor appropriate for use in the client device 105. The memory 150 stores program instructions that control the operation of the processor 155, and may use conventional instructions to do so, in a manner that provides a plurality of largely independent functions. Some of the functions are those typically classified as applications. Many of the functions may be conventional, but certain of them described herein are unique at least in some aspects. The memory 150 also stores information of temporary, short-lived, and long-lived duration, such as cache memory and tables. Thus, the memory 150 may comprise storage devices of differing hardware types, such as Random Access Memory, Programmable Read Only Memory, Flash memory, etc. The speaker 160 may be a speaker as is found in conventional client devices such as cellular telephones. The network interface 165 may be a radio transceiver as found in a cellular telephone or, when the client device is, for example, a Bluetooth-connected device, a Bluetooth transceiver. The network interface 165 could alternatively be a wireline interface for a client device that operates via a personal area network to another client device (not shown) that is connected by a radio network 110 to the world-wide-web, or a wireline interface for a client device that is connected directly to the world-wide-web 115. The world-wide-web 115 could alternatively be a sizable private network, such as a corporate network supporting several thousand users in a local area. The user interface 170 may be a small or large display and a small or large keyboard.
The server device 120 is preferably a device with substantial memory capacity in relationship to the client device 105. For example, the server typically will have a large hard drive or drives (for example, 20 gigabytes of storage).
Referring to
In one embodiment of the present invention, an application may present a set of text words (without associated pronunciations) to the word synthesis dictionary 220 in the memory 150. The set of text words may be a set of text words commonly used by the application, which are expected to be used by the application within a relatively short period of time while the application is running (for example, anywhere from a part of a minute to many minutes), or, alternatively, they may be a set of text words that comprises a speech text. A speech text in the context of this application is a set of text words that are planned for imminent sequential presentation through the speaker 160. For example, the sentence “The number entered is 847-576-9999”, prepared for presentation to a user in response to the user's entry of a phone number, would be a speech text. The digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, on the other hand, are examples of text words that would more likely belong to a set anticipated for use by an address application. By a technique described below, the pronunciations of words not in the client device's word synthesis dictionary 220 are obtained remotely. For this purpose, the speech engine 210 is coupled to the network transmission function 225, which transmits over the network those words that are not in the client device's word synthesis dictionary 220.
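As a rough illustration of the selection described above, the following sketch splits a speech text into text words and collects those that are absent from the local word synthesis dictionary, which are the words that would be handed to the network transmission function. The function names, the tokenization rule (letter runs and single digits), and the dictionary contents are assumptions made for this example; the specification does not define them.

```python
import re
from typing import Dict, Iterable, List


def tokenize(speech_text: str) -> List[str]:
    """Split a speech text into text words.

    Assumption for this sketch: a text word is either a run of letters
    or a single digit, so "847" yields the words "8", "4", "7".
    """
    return re.findall(r"[A-Za-z]+|\d", speech_text)


def words_to_request(
    words: Iterable[str], local_dictionary: Dict[str, str]
) -> List[str]:
    """Return the distinct words that have no local pronunciation.

    Order of first appearance is preserved and duplicates are dropped,
    so each unknown word would be transmitted only once.
    """
    missing: List[str] = []
    seen = set()
    for word in words:
        key = word.lower()
        if key not in local_dictionary and key not in seen:
            seen.add(key)
            missing.append(key)
    return missing
```

With a local dictionary that already holds the ten digits and a few common words, only the remaining words of the example sentence would need to be sent to the server.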
Referring to
When the client device 105 receives the set of word pronunciations at step 335, the client device 105 determines at step 337 whether the set of word pronunciations is associated with a speech text. If so, a determination is made at step 340 whether the speech text has already been presented (synthesized). When the speech text has not yet been synthesized, the set of word pronunciations is used by the speech engine 210 at step 345 to provide a synthesis of the speech text, thereby reducing interpretation errors. When the speech text has already been synthesized at step 340 (as in the case in which the delay to receive the set of word pronunciations exceeds a minimum specified delay time, or the case in which a command to present the speech text is received before the set of word pronunciations is received), or when the set of word pronunciations is determined not to be for a speech text at step 337, the client device 105 at step 350 determines whether the set of pronunciations is to be stored in the memory 150 of the client device 105 as an addition to the word synthesis dictionary of the client device 105. Such storage may be for a predetermined time (for example, while the application that requested the set of word pronunciations is active), or may be based on limits of the memory 150, or on a priority of the application together with memory limits and/or time, etc. When the set of pronunciations is to be stored in the memory 150, it is stored at step 355. The process ends at step 360.
It will be appreciated that the present invention provides a unique technique for providing pronunciations of text words in a client device having a restricted word synthesis dictionary capacity (e.g., less than one megabyte), thereby reducing misinterpretation errors.
In the foregoing specification, the invention and its benefits and advantages have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all of the claims.
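The decisions at steps 337 through 355 can be sketched as a single handler. This is an illustrative outline under stated assumptions, not the claimed method: `ClientState`, `handle_pronunciations`, the `synthesize` callback, and the `should_store` policy are hypothetical names, and the sketch assumes that the process ends after synthesis at step 345, which the description above does not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set


@dataclass
class ClientState:
    """Hypothetical client-side state for this sketch."""
    dictionary: Dict[str, str] = field(default_factory=dict)
    synthesized_texts: Set[str] = field(default_factory=set)


def handle_pronunciations(
    state: ClientState,
    pronunciations: Dict[str, str],
    speech_text_id: Optional[str],
    synthesize: Callable[[str, Dict[str, str]], None],
    should_store: Callable[[Dict[str, str]], bool],
) -> str:
    """Process a received set of word pronunciations (step 335 onward)."""
    # Step 337: is the set associated with a speech text?
    # Step 340: if so, has that speech text already been synthesized?
    if speech_text_id is not None and speech_text_id not in state.synthesized_texts:
        # Step 345: synthesize the speech text with the received pronunciations.
        synthesize(speech_text_id, pronunciations)
        state.synthesized_texts.add(speech_text_id)
        return "synthesized"  # assumption: the process ends here (step 360)

    # Step 350: decide whether to keep the set in the local dictionary.
    # The policy (time limit, memory limit, application priority) is left
    # to the caller through should_store.
    if should_store(pronunciations):
        # Step 355: add the pronunciations to the word synthesis dictionary.
        state.dictionary.update(pronunciations)
        return "stored"
    return "discarded"
```

Passing the storage policy in as a callable keeps the handler independent of whichever criterion a device uses, since the description lists several alternatives without fixing one.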
As used herein, the terms “comprises,” “comprising,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
A “set”, as used in the following claims, means a non-empty set. The term “another”, as used herein, is defined as at least a second or more. The terms “including” and/or “having”, as used herein, are defined as comprising. The term “coupled”, as used herein with reference to electro-optical technology, is defined as connected, although not necessarily directly, and not necessarily mechanically. The term “program”, as used herein, is defined as a sequence of instructions designed for execution on a computer system. A “program”, or “computer program”, may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system.
Number | Date | Country
---|---|---
60503685 | Sep 2003 | US