As communication technologies improve, long distance travel becomes more affordable, and the economies of the world become more globalized, contact between people who have different native languages has increased. As such contact increases, new communication difficulties can arise. Even when both persons can communicate in a common language, problems can remain. One such problem is that it may be difficult to determine how a person's name is pronounced merely by reading it, because different languages can apply different pronunciation rules to the same spelling. In situations such as business meetings, conferences, interviews, and the like, mispronouncing a person's name can be embarrassing. Conversely, providing a correct pronunciation of a person's name can be a sign of respect, particularly when the name is not easy to pronounce for someone who does not speak that person's native tongue.
Part of the problem, as discussed above, is that different languages do not necessarily follow the same pronunciation rules for written text. For example, a native English speaker may be able to read the name of a person from China, Germany, or France, to name a few examples, but unless that speaker is aware of the differing pronunciation rules of those languages, it may still be difficult to pronounce the other person's name correctly. To further complicate matters, names that are common in one language can be pronounced differently in another language, despite having an identical spelling. Furthermore, knowing all of the pronunciation rules may not lead to a correct pronunciation of a name that is pronounced differently from what those rules would suggest. What is needed, then, is a way to provide an indication of the correct pronunciation of a name.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
In one illustrative embodiment, an automated method of providing a pronunciation of a word to a remote device is disclosed. The method includes receiving an input indicative of the word to be pronounced. A database having a plurality of records, each having an indication of a textual representation and an associated indication of an audible representation, is searched. The method further includes providing at least one output to the remote device of an audible representation of the word to be pronounced.
In another illustrative embodiment, a method of providing a database of pronunciation information for use in an automated pronunciation system is disclosed. The method includes receiving an indication of a textual representation of a given word. The method further includes creating an indication of an audio representation of the given word. The indication of the audio representation is associated with the indication of the textual representation, and the associated indications are then stored in a record.
In yet another embodiment, a system adapted to provide an audible indication of a proper pronunciation of a word to a remote device is disclosed. The system includes a database having a plurality of records. Each of the records has a first data element indicative of a textual representation of a given word and a second data element indicative of an audible representation of the given word. The system further includes a database manager for communicating information with the database. A text-to-speech engine capable of receiving a textual representation of a word and providing an audible representation of the input is included in the system. In addition, the system has a communication device capable of receiving an input from the remote device indicative of a textual representation of a word and providing the remote device an output indicative of an audible representation of the input.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
System 10 includes a text-to-speech (TTS) engine 16, which, in one embodiment, is configured to synthesize a textual input into an audio file. The TTS engine 16 illustratively receives a textual input from the database manager 14. The textual input, in one illustrative embodiment, is a phoneme string received from the database 12 as a result of a query of the database 12 by the database manager 14. Alternatively, the textual input may be a phoneme string generated by the database manager 14 or a textual string representing the spelling of a name. The TTS engine 16 provides an audio file that represents a pronunciation of the given name for each entry provided to it by the database manager 14. Alternatively, the TTS engine 16 can provide a phoneme string as an output from a textual input. The database manager 14 may receive that output, associate it with the textual input, and store it in the database 12.
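By way of a non-limiting illustration only, the following Python sketch models the two roles the TTS engine 16 plays in this discussion: synthesizing a phoneme string into an audio file, and producing a phoneme string from a textual input. The class and its methods are hypothetical placeholders, not an implementation of any particular engine.

```python
import io
import wave

class TTSEngine:
    """Hypothetical stand-in for the TTS engine 16 (illustrative only)."""

    def synthesize(self, phoneme_string: str) -> bytes:
        # A real engine would render speech from the phoneme string; this
        # placeholder emits one second of silent WAV so the data path among
        # database manager 14, TTS engine 16, and client device 20 can be traced.
        buf = io.BytesIO()
        with wave.open(buf, "wb") as wav:
            wav.setnchannels(1)      # mono
            wav.setsampwidth(2)      # 16-bit samples
            wav.setframerate(16000)  # 16 kHz
            wav.writeframes(b"\x00\x00" * 16000)
        return buf.getvalue()

    def to_phonemes(self, text: str) -> str:
        # A real engine applies letter-to-sound rules; this placeholder merely
        # spaces out the letters so the round trip is visible. The database
        # manager 14 may store such output alongside the textual input.
        return " ".join(text.lower())
```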
The data communication link 17 of system 10 is illustratively configured to communicate over a wide area network (WAN) 18, such as the Internet, to send and receive data between the system 10 and externally located devices such as the client device 20. In one illustrative embodiment, the client device 20 is a mobile telephone. Alternatively, the client device 20 can be any type of device that is capable of accessing the system 10, including, without limitation, personal computing devices such as desktop computers, personal digital assistants, set top boxes, and the like. The client device 20, in one illustrative embodiment, communicates with the system 10 via the WAN 18 to provide the system 10 with information as required. The types of information provided to the system 10 can include a request for a pronunciation of a specific name or information related to the pronunciation of a specific name. Details of the types of information that can be provided from the client device 20 to the system 10 are provided below.
System 10 illustratively provides, in response to a request from the client device 20, information related to the pronunciation of a particular name to the client device 20. In one illustrative embodiment, the system 10 provides the audio file created by the TTS engine 16 that represents a pronunciation of the particular name. The client device 20 can then play the audio file to provide an indication of a suggested pronunciation of the particular name. In some cases, one name can have more than one suggested pronunciation. For example, the text representation of a name may be pronounced one way in one language while the identical text representation is pronounced differently in another language. As another example, the same text representation of a name can have more than one pronunciation within a single language.
An example of a screen view 300 of the visual display 28 of the client device 20 illustratively allows the user to input a name to be pronounced.
Once the user has provided an input indicative of a desire to send the inputted information to the system 10, the client device 20 sends such information to the system 10 as is detailed in block 104. The input is compared against information stored in the system 10, as is detailed in block 106. The name input into the client device 20 and sent to the system 10 is compared against entries in the database 12 to determine whether there are any entries that match the name provided.
Referring now to the structure of the database 12, each record 50 illustratively includes a name field 52 containing a textual representation of a name, an origin field 54 containing an indication of a language or location of origin associated with the pronunciation, a pronunciation field 56 containing an indication of the pronunciation itself, such as a phoneme string, and a meta field 58.
A meta field 58 can include information related to the record 50 itself. For example, the meta field 58 can include information as to how many times the particular record 50 has been chosen by users as an acceptable pronunciation for the name in question. The meta field 58 can also illustratively include information about the source of the pronunciation provided. For example, the meta field 58 may have information about a user who provided the information, when the information was provided, and how the user provided the information. Such information, in one embodiment, is used to pre-determine a priority of pronunciations when a particular name has more than one possible pronunciation.
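For illustration only, one hypothetical way to model a record 50 in code is sketched below; the attribute names mirror the name field 52, origin field 54, pronunciation field 56, and meta field 58 described above, though any schema could be used.

```python
from dataclasses import dataclass, field

@dataclass
class PronunciationRecord:
    """Hypothetical sketch of a record 50 in the database 12."""
    name: str            # name field 52: textual representation of the name
    origin: str          # origin field 54: language or location of origin
    pronunciation: str   # pronunciation field 56: e.g. a phoneme string
    meta: dict = field(default_factory=dict)  # meta field 58: e.g. times chosen, source, date
```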
Reviewing the exemplary database 12, it can be seen that the same name string can appear in the name field 52 of a plurality of records 50, with each such record 50 associating the name with a different origin and pronunciation.
Records 50d, 50e, and 50f each have the name3 name string located in their respective name fields 52. In addition, it can be seen that records 50e and 50f have the same data in their origin fields 54. Thus, more than one pronunciation is associated with the same location. This is represented in the pronunciation fields 56 of records 50e and 50f. Information in the meta field 58 of each record 50 provides an indication of the popularity of one pronunciation relative to another. These indications can be used to order the pronunciations associated with a particular name that are provided to the client device 20 or, alternatively, to determine whether a particular pronunciation is, in fact, provided to the client device 20.
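A minimal sketch of how such meta field 58 information might be used follows, assuming the PronunciationRecord sketch above and a hypothetical "chosen" counter kept in the meta field.

```python
def prioritize(matches: list) -> list:
    """Order matching records 50 so that the pronunciations users have
    chosen most often are offered to the client device 20 first."""
    return sorted(matches, key=lambda r: r.meta.get("chosen", 0), reverse=True)

def filter_unpopular(matches: list, min_chosen: int = 1) -> list:
    """Optionally withhold pronunciations that no user has ever selected."""
    return [r for r in matches if r.meta.get("chosen", 0) >= min_chosen]
```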
It is to be understood that the representation of the database 12 provided here is illustrative only. Other structures, fields, and arrangements of the data can be employed without departing from the scope of the discussion.
Returning again to the method, if one or more records 50 in the database 12 are found to match the name provided, the matching records 50 are illustratively prioritized, for example based upon the information stored in their meta fields 58 as discussed above.
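Combining the comparison of block 106 with the prioritization just described, a sketch under the same assumptions might read as follows; the case-insensitive comparison is an assumption.

```python
def matching_records(db: list, name: str) -> list:
    """Sketch of block 106: compare the name sent by the client device 20
    against the name fields 52 of the records 50, then order the matches."""
    return prioritize([r for r in db if r.name.lower() == name.lower()])
```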
Once the matching records 50 are prioritized, if any of the matching records 50 have phoneme strings in their pronunciation fields 56, those phoneme strings are sent to the TTS engine 16, which illustratively synthesizes each phoneme string into an audio file. Alternatively, of course, the information in the pronunciation field 56 can be associated with an audio file that was either previously synthesized by the TTS engine 16 from a phoneme string or received as an input from the client device 20. The input of an audio file from the client device 20 is discussed in more detail below.
Once any phoneme strings are synthesized into an audio file by the TTS engine 16, the one or more audio files associated with the one or more records 50 are sent to the client device 20, as is illustrated by block 116. In one illustrative embodiment, the audio files and associated data are provided to the client device 20 in order of their priority. Origin data from origin field 54 related to the origin of the pronunciation is also illustratively sent to the client device 20, although alternatively, such origin data need not be sent.
Alternatively, if it is determined that no entries in the database 12 match the name input by the user into the client device 20, the database manager 14 illustratively attempts to determine the nationality or language of the name provided by employing an algorithm within the database manager 14. In one illustrative embodiment, the database manager 14 determines one or more possible locations of origin for the inputted name. The name and the pronunciation rules associated with the locations of origin are illustratively employed by the database manager 14 to create a phoneme string for the name in each language or location of origin determined by the database manager 14, as is illustrated in block 120. Each of the phoneme strings is stored in the database 12, as is shown in block 122.
Each of the phoneme strings generated by the database manager 14 is then illustratively provided to the TTS engine 16, as is shown in block 124. The TTS engine 16 illustratively creates an audio file that provides an audio representation of a pronunciation of the name, using the pronunciation rules of the given language or location, for each provided phoneme string. The resulting audio file for each phoneme string is illustratively associated with the text string of the given record 50 and provided back to the client device 20. This is illustrated by block 116.
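The fallback path of blocks 120 through 124 might be sketched as follows; guess_origins and letter_to_sound are stand-ins for the origin-determination algorithm and the per-language pronunciation rules, neither of which is specified in this discussion.

```python
def pronounce_unknown_name(name, guess_origins, letter_to_sound, tts, db):
    """Hypothetical sketch of blocks 120-124: no record 50 matched, so one
    or more locations of origin are determined for the name, a phoneme
    string is created under each location's pronunciation rules and stored
    (block 122), and audio is synthesized for the client device 20 (block 116)."""
    results = []
    for origin in guess_origins(name):            # e.g. ["German", "French"]
        phonemes = letter_to_sound(name, origin)  # rules of that language/location
        db.append(PronunciationRecord(name, origin, phonemes,
                                      meta={"source": "generated"}))
        results.append((origin, tts.synthesize(phonemes)))
    return results
```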
Given the list of possible pronunciations illustratively shown in display 302, the user selects one of them and the client device 20 plays the audio file associated with the selection through the audio output device 26 for the user. The user can then choose whether to select that audio file as a pronunciation for the given name.
Once the user has chosen a pronunciation, the client device 20 illustratively queries whether the user is satisfied with the pronunciation provided. This is represented by decision block 154.
If the user determines that the pronunciation is incorrect, the user illustratively provides feedback indicating a proper pronunciation, shown in block 156 and discussed in more detail below. The information provided by the user is stored in the database 12 as a new record, including the name field 52, the origin field 54 (determined by the previous selection as discussed above), and the new pronunciation field 56. In addition, data related to the user who provides the information and when the information is provided can be stored in the meta field 58. In one illustrative embodiment, any user of the system 10 will be queried to provide feedback information relative to the quality of a pronunciation. Alternatively, the system 10 may allow only select users to provide such feedback. Once the new pronunciation is created, it is stored in the database 12. This is indicated by block 158.
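One hypothetical way to capture such a feedback record, reusing the sketches above, is shown below; the user identifier and timestamp handling are assumptions.

```python
from datetime import datetime, timezone

def store_feedback(db, name, origin, new_pronunciation, user_id, method):
    """Sketch of block 158: persist a user-supplied pronunciation as a new
    record 50, noting in the meta field 58 who provided it, when, and how."""
    db.append(PronunciationRecord(
        name=name,
        origin=origin,
        pronunciation=new_pronunciation,
        meta={
            "source": user_id,
            "provided_at": datetime.now(timezone.utc).isoformat(),
            "method": method,  # e.g. "phoneme_edit", "similar_word", "recording"
            "chosen": 0,
        },
    ))
```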
Once it has been determined that the user wishes to provide feedback relative to the pronunciation of a previously chosen name (as is shown in block 156), the user illustratively selects a method of amending the pronunciation, as is represented by block 202. At block 204, it is determined whether the method selected by the user is the method of amending the phoneme string associated with the name. If so, the user is illustratively presented with the phoneme string to edit, and the amended phoneme string is synthesized by the TTS engine 16 and played back for the user's evaluation.
Returning to block 204, if it is determined that the method selected by the user is not the method of amending the phoneme string, the method next determines whether the method selected is choosing a similar sounding word. This can be an advantageous method when the user is not proficient with providing phoneme strings representative of a given word or phone. If it is determined at block 214 that the method of choosing a similar sounding word is the chosen method, the user is prompted to provide a similar sounding word, as shown in block 216 and screen 312. The similar word is illustratively converted to a phoneme string, synthesized into an audio file by the TTS engine 16, and played back for the user's evaluation.
If it is determined at block 210 that the audio file is sufficiently “accurate”, the database manager 14 saves the phoneme string associated with the similar word in the database 12, which is shown in block 212. Conversely, if the user determines that the audio file is not sufficiently close to the desired word (as determined at decision block 210), the method 200 returns to block 202 to determine a method of amending the pronunciation.
As an example of the use of a similar word to create a proper pronunciation, consider the Chinese surname “Xin”. The user can enter the word “shin” and, using English pronunciation rules, the database manager 14 converts the word “shin” to a phoneme string and provides the phoneme string to the TTS engine 16. The resultant audio file is so similar to the correct pronunciation of the name Xin that it is, for all intents and purposes, a “correct” pronunciation.
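The “Xin”/“shin” example might pass through the sketches above as follows; english_letter_to_sound is a stand-in for the English pronunciation rules the database manager 14 applies, and the stored phoneme string and origin value shown are illustrative only.

```python
def pronounce_via_similar_word(db, tts, english_letter_to_sound,
                               name="Xin", similar_word="shin"):
    """Sketch of the similar-word method: 'shin', rendered under English
    rules, approximates the Chinese surname 'Xin' closely enough to serve
    as its stored pronunciation."""
    phonemes = english_letter_to_sound(similar_word)  # e.g. "SH IH N"
    audio = tts.synthesize(phonemes)                  # played for the user's approval
    # If the user accepts the audio (decision block 210), the phoneme string
    # is stored under the *name* being pronounced, not under the similar word.
    db.append(PronunciationRecord(name, origin="Chinese",
                                  pronunciation=phonemes,
                                  meta={"source": "similar_word"}))
    return audio
```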
Returning to block 214, if it is determined that the method selected is not the similar word method, it is assumed that the method to be implemented is to have the user record a pronunciation.
As discussed above with respect to method 200, method 250 provides three different possible methods for the user to provide input to change the pronunciation of the textual string: editing the phoneme string, providing a word similar in pronunciation, or recording an audio file of the pronunciation. The methods for editing the phoneme string and providing a word similar in pronunciation are illustratively the same for method 250 as for method 200. It should be understood, of course, that variations in either the method for editing the phoneme string or the method for providing a word similar in pronunciation can be made in method 250 without departing from the scope of the discussion.
Method 250 illustratively provides an alternative method incorporating a recorded audio file of the pronunciation of a textual string. At block 220, the user records a pronunciation for the textual string. The recording is then provided by the client device to the server. At block 252, the server performs voice recognition to convert the recording into a textual string. Any acceptable method of performing voice recognition may be employed. The textual string is then converted to a sound file and the sound file is returned to the client device. The user then evaluates the sound file to determine whether the sound file is accurate. This is illustrated at block 210. Based on the user's evaluation, the phoneme string is either provided to the database as at block 212, or the user selects a new method of amending the pronunciation of the textual input as at block 202. It should be appreciated that in any of the methods of changing the pronunciation of a textual string discussed above, additional steps may be added. For example, if the speech recognition provides an unacceptable result, rather than returning to block 202, the client device can alternatively attempt to provide another audible recording or modify the textual string to provide a more acceptable sound file.
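Method 250's recording path could be sketched as below; recognize_phonemes stands in for whatever speech recognition the server employs, since the discussion notes that any acceptable method may be used.

```python
def pronounce_via_recording(recording: bytes, recognize_phonemes, tts):
    """Sketch of method 250: the user's recording is converted into a
    textual (phoneme) string on the server (block 252), re-synthesized
    into a sound file, and returned for the user to evaluate (block 210)."""
    phonemes = recognize_phonemes(recording)  # server-side speech recognition
    audio = tts.synthesize(phonemes)          # textual string back to a sound file
    return phonemes, audio                    # accepted -> stored per block 212
```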
The embodiments discussed above provide important advantages. Systems and methods discussed above provide a way for users to receive an audio indication of the correct pronunciation of a name that may be difficult to pronounce. In addition, the system can be modified by some or all users to provide additional information to the database 12. The system is accessible via a WAN through mobile devices or computers, thereby providing access to users in almost any situation.
Embodiments are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with various embodiments include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
Embodiments may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Some embodiments are designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices.
With reference to the accompanying figure, an exemplary system for implementing some embodiments includes a general-purpose computing device in the form of a computer 410. Components of the computer 410 may include, but are not limited to, a processing unit 420, a system memory 430, and a system bus 421 that couples various system components including the system memory to the processing unit 420.
Computer 410 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 410 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 410. The database 12 discussed in the embodiments above may be stored in any of the storage media listed above. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 430 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 431 and random access memory (RAM) 432. A basic input/output system 433 (BIOS), containing the basic routines that help to transfer information between elements within computer 410, such as during start-up, is typically stored in ROM 431. RAM 432 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 420. For example, program modules related to the database manager 14 or the TTS engine 16 may be resident in ROM 431 or execute out of RAM 432.
The computer 410 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, such media can include a hard disk drive that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive that reads from or writes to a removable, nonvolatile magnetic disk, and an optical disk drive that reads from or writes to a removable, nonvolatile optical disk.
The drives and their associated computer storage media discussed above provide storage of computer readable instructions, data structures, program modules and other data for the computer 410.
A user may enter commands and information into the computer 410 through input devices such as a keyboard 462, a microphone 463, and a pointing device 461, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 420 through a user input interface 460 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 491 or other type of display device is also connected to the system bus 421 via an interface, such as a video interface 490. In some embodiments, the visual display 28 can be a monitor 491. In addition to the monitor, computers may also include other peripheral output devices such as speakers 497, which may be used as the audio output device 26, and a printer 496, which may be connected through an output peripheral interface 495.
The computer 410 is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 480. The remote computer 480 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 410. The logical connections include a local area network (LAN) 471 and a wide area network (WAN), but may also include other networks.
When used in a LAN networking environment, the computer 410 is connected to the LAN 471 through a network interface or adapter 470. The network interface can function as the data communication link 32 on the client device 20 or the data communication link 17 on the system 10. When used in a WAN networking environment, such as, for example, the WAN 18 discussed above, the computer 410 typically includes a modem or other means for establishing communications over the WAN, such as the Internet.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.