1. Field
The disclosed embodiments generally relate to voice synthesis and, more particularly, to voice conversion for audio user interfaces in communication devices.
2. Brief Description of Related Developments
Voice conversion can be defined as the modification of speaker-identity related features of a speech signal. Commercial usage of voice conversion techniques is in its infancy. In one application, voice conversion may be utilized to extend the language portfolio of Text-To-Speech (TTS) systems using branded voices in a cost-efficient manner. In this context, voice conversion may, for instance, be used to make a branded synthetic voice speak in languages that the original voice talent cannot speak. In addition, voice conversion may be deployed in several types of entertainment applications, games, and communication.
A plurality of voice conversion techniques are known in the art. In many of them, a speech signal is represented by a source-filter model of speech. In these contexts, speech is understood to consist of a source component originating from the vocal cords, which is then shaped by a filter imitating the effect of the vocal tract. The source component is frequently denoted as an excitation signal, as it excites the vocal tract filter. A separation (or de-convolution) of a speech signal into the excitation signal, on the one hand, and the vocal tract filter, on the other hand, can for instance be accomplished by cepstral analysis or Linear Predictive Coding (LPC). A voice conversion platform is described in the above-referenced application, US Publication No. 2006/0235685, incorporated herein by reference.
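As an illustration of the source-filter separation mentioned above, the following is a minimal sketch of LPC analysis by the autocorrelation method, assuming a single frame of samples and a fixed prediction order. It is not taken from the referenced publication; the function names and the synthetic test signal are hypothetical and serve only to show how a frame may be split into filter coefficients and an excitation signal, and then recombined.

```python
import random


def autocorrelation(frame, max_lag):
    """Autocorrelation of a frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion; returns [1, a1, ..., ap] describing A(z)."""
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def excitation(frame, a):
    """Inverse-filter the frame with A(z) to obtain the excitation signal."""
    return [sum(a[k] * frame[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(frame))]


def synthesize(residual, a):
    """Pass an excitation signal through the all-pole filter 1/A(z)."""
    out = []
    for n, e in enumerate(residual):
        out.append(e - sum(a[k] * out[n - k]
                           for k in range(1, len(a)) if n - k >= 0))
    return out


def synthetic_frame(n=3000, seed=0):
    """Hypothetical test signal: white noise shaped by a known two-pole filter."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n):
        x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]
```

In this sketch the coefficients stand in for the vocal tract filter and the residual stands in for the excitation. A conversion system of the kind discussed here would typically modify one or both before resynthesis; that step is outside the scope of the example.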
It would be advantageous to adapt such voice conversion techniques to enhance the audio user interface of communication devices by expanding the use of voice based presentations.
In the basic embodiment of this application, a voice conversion processing framework, as described in US Publication No. 2006/0235685, or other modules using similar techniques, is operatively associated with the central processing unit (CPU) and audio processor of a communication device. The voice conversion processor is used to enhance the features of the audio portion of the user interface of the communication device. This is accomplished by using source voice signals available in memory or from other speech processing features to convert a default source speech to a target speech. Such target voice signals may be provided by network supplied applications or by applications that are part of the operating system of the communication device.
In another embodiment of this application, a feature is provided that generates an audio presentation of a text message using text-to-speech (TTS) synthesis. According to this embodiment, an audio field is established as part of a contact listing or profile, using tools similar to the audio name identification tools that are currently available in communication devices, in particular mobile communication devices. The audio clip in this field is used to convert the default voice used for the text message reading feature into a target speech customized by the user, for example the voice of the sender.
In another embodiment of the application, voice conversion techniques are used to customize speech related ring tones, such as caller ID announcements, ring tones using in part TTS generated voice synthesis, ring tones generated by user recording, and other sources. The target speech source could be the user's voice or the voice of a friend or celebrity. A wide variety of target sources may be made available.
These aspects and other features of the embodiments are explained in the following description, with reference to the accompanying drawings, in which:
Although aspects of the disclosed embodiments will be described with reference to the embodiments shown in the drawings and described below, it should be understood that these aspects could be embodied in many alternate forms. In addition, any suitable size, shape or type of elements or materials could be used. Computer operated devices may be constructed having one or several processors and one or several program product modules stored in one or several memory elements. For illustration, computer components may be described as individual units by function. It should be understood that in some instances these functional components may be combined. The operation of the communication device of this application uses conventional stored-program processor elements and may include, for example, a processor and memory that perform processing and storage operations in connection with operation of the device.
An example of a framework 1 for accomplishing voice conversion is shown in
According to the embodiments of this application, communication device 2 is further equipped with a voice conversion unit 1, which may be implemented according to the frameworks 1a of
In the embodiment of
The voice conversion feature may be accessed through a menu item selectable by the user through interaction with the user interface or in particular with the audio portion of the user interface. Multiple software modules may be stored in memory 25 having program product code that causes one or more of the processors, i.e. CPU 22, audio processor 23 and voice conversion unit 1, to cooperate to convert a source speech signal to a target speech signal using available voice conversion techniques, where the target speech signal is selected from memory 25. A particular target speech would be selected from a listing of available alternatives to the default source speech signal. A series of prompts, which could be presented as audio speech clips, would direct the user in the selection process.
The basic embodiment of this application described above can be adapted for use with a variety of source voice signals that may be generated in conjunction with the operating software system controlling the function of a particular communication device, for example a cellular phone or other mobile communication device. In addition, this embodiment may be adapted to provide multiple choices for use as the target voice signals in converting the default source voice signals of the device. Alternate embodiments in which this flexibility is applied are discussed below.
The basic method of operation of the embodiment of
Another embodiment of this application is illustrated in
In one mode, the target voice signal is selected automatically from the contact listing audio clip 32. Messages from unknown contacts or from contacts that do not have any associated audio related information can be read using the default voice without conversion.
The audio related field in the contacts list can be a small audio clip. The clip could contain speech recorded from the particular contact, if the user wants to use the voice of the sender for reading the messages coming from her/him. Alternatively, the audio clip could contain speech from some other speaker that the user wants to link to the messages coming from the particular contact. The user may be prompted with respect to this feature, when the user is editing the information related to a particular contact.
In a further embodiment, the audio clip could be analyzed and used to choose between several generic voice target choices, for example based on gender. The audio clip could be used to identify the gender of the speaker and then a gender specific target voice signal could be sent to voice conversion unit 1. In other embodiments, the analysis could be more detailed, e.g. one possibility is to measure the average pitch of the speaker and to scale the pitch of the target voice signal accordingly in the message audio presentation. Another embodiment could measure the rough locations of the formants of the message and use that data in encoding the voice used in the message reader application. In another embodiment, the analysis could contain a full-scale training of a voice conversion model for the particular speaker. This might require a second user interface to allow training voice conversion models, based on large amounts of data, to be input to voice conversion unit 1. The user could be offered a chance to link such a model to the contacts list through the audio related field.
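The lighter forms of analysis described above can be sketched as follows, assuming a mono clip given as a list of samples. Pitch is estimated per frame by picking the autocorrelation peak, which is only one of several possible estimators. The threshold `PITCH_SPLIT_HZ`, the target names, and all function names are hypothetical, and choosing a generic target from average pitch alone is a crude heuristic used purely for illustration.

```python
import math

PITCH_SPLIT_HZ = 165.0  # assumed boundary between generic targets
GENERIC_TARGETS = {"low": "generic_male_voice", "high": "generic_female_voice"}


def estimate_pitch_hz(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Pitch of one frame from the strongest autocorrelation lag, or None."""
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_val = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else None


def average_pitch_hz(clip, sample_rate, frame_len=800):
    """Mean of the per-frame estimates; None if no frame yields a pitch."""
    estimates = []
    for start in range(0, len(clip) - frame_len + 1, frame_len):
        pitch = estimate_pitch_hz(clip[start:start + frame_len], sample_rate)
        if pitch is not None:
            estimates.append(pitch)
    return sum(estimates) / len(estimates) if estimates else None


def choose_generic_target(avg_pitch_hz):
    """Pick one of the generic target voices from the average pitch."""
    key = "low" if avg_pitch_hz < PITCH_SPLIT_HZ else "high"
    return GENERIC_TARGETS[key]


def pitch_scale_factor(speaker_pitch_hz, default_pitch_hz):
    """Factor by which the default voice's pitch would be scaled."""
    return speaker_pitch_hz / default_pitch_hz


def synthetic_tone(freq_hz, sample_rate, n):
    """Hypothetical stand-in for a recorded clip: a pure sine tone."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]
```

A real clip would contain unvoiced and silent frames, so a practical estimator would likely add a voicing decision before averaging; the sketch simply skips frames for which no positive autocorrelation peak is found.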
The analysis of the audio clip can be done right after the user has added this information to her/his contacts list. In this way, the message reader itself is not unduly complex and its operation is not slowed. The result of the analysis can be stored in many formats (e.g. as some parameter settings or as a full voice conversion model).
In one alternative embodiment, the same voice can be attached to many contacts without storing duplicate information. The field in the contacts list could be a link to an audio file, making it possible to link many contacts to the same file.
In operation, when the message processor 30 receives a new message to be read, it may be adapted to check the sender field of the message and look for a match in the contacts list 32, stored in memory 25. If a match is found and an audio speech clip exists, that information may be selected and used as a target speech signal for voice conversion of the default text reader voice. If there is no audio information, or if the sender is not included in the contacts list, the message is read using the default voice according to normal execution of the message reader feature.
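The selection logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the device's actual software: the contacts list is modeled as a dictionary keyed by sender address, the audio related field is modeled as a link into a separate clip store so that several contacts can share one clip, and the names `select_target`, `plan_reading`, and `DEFAULT_VOICE` are hypothetical. No speech synthesis or voice conversion is performed.

```python
DEFAULT_VOICE = "default"


def select_target(sender, contacts, clips):
    """Return the target clip linked to the sender, or None if unavailable."""
    contact = contacts.get(sender)
    if contact is None:  # sender is not in the contacts list
        return None
    link = contact.get("audio_link")
    if link is None:  # contact has no audio related information
        return None
    return clips.get(link)  # None if the linked clip is missing


def plan_reading(sender, text, contacts, clips):
    """Decide how a message would be read: converted or default voice."""
    target = select_target(sender, contacts, clips)
    return {
        "text": text,
        "voice": DEFAULT_VOICE if target is None else "converted",
        "target_clip": target,
    }


# Example data: two contacts link to the same clip, one has no clip.
example_clips = {"clips/alice.wav": b"alice-audio"}
example_contacts = {
    "+100": {"name": "Alice (mobile)", "audio_link": "clips/alice.wav"},
    "+101": {"name": "Alice (work)", "audio_link": "clips/alice.wav"},
    "+200": {"name": "Bob", "audio_link": None},
}
```

Storing a link rather than the clip itself is what lets both example entries for the same person resolve to a single stored clip, in line with the alternative embodiment that avoids duplicate information.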
In another embodiment of this application, voice conversion techniques are used to customize speech in ring tones. As shown in
This embodiment can be implemented using existing ring tone features and voice conversion techniques. Slightly different implementations may be needed for different use cases. The usage in the existing name synthesis may require performing the conversion in the parametric domain used by the formant synthesizer. The usage together with high-quality TTS synthesis is best handled by a voice conversion system that operates inside the acoustic synthesis module of the TTS system. The conversion of natural speech in ring tones may be handled using a voice conversion system that does not require any linguistic information. In any of these application modes, the conversion processing may be simplified by storing voice encoding parameters for use directly in the voice conversion software, thereby avoiding this step as part of the voice conversion platform. This could be accomplished at either or both ends, i.e. source and target, of the voice conversion process.
A ring tone application, according to this application is illustrated in
In operation, in the embodiment of
The ring tone application software is stored in memory 25 of communication device 2 and consists of program code adapted to cause the ring tone processor to generate a voice based signal, for use by the audio processor 23, that serves as a voice based source signal for conversion. The voice source signal is processed in voice conversion unit 1 according to voice conversion techniques. A target selection software module is stored in memory 25 and may be selected through user interaction with the user interface of the communication device. The target selection software contains program code adapted to cause the voice conversion unit 1 to convert the source voice signal from the ring tone application, i.e. ring tone processor 40, based on a selected target voice signal from target voice file 43. The converted target signal is broadcast by speaker 24.
Although several processor and software modules are described above for illustration, it should be understood that these features and functions can be combined into one or more processors adapted to run one or more program products accessible in one or more memory sources.
It should be understood that the foregoing description is only illustrative of the embodiments. Various alternatives and modifications can be devised by those skilled in the art without departing from the embodiments. Accordingly, the disclosed embodiments are intended to embrace all such alternatives, modifications and variances that fall within the scope of the appended claims.
This application is a continuation-in-part application based on U.S. patent application Ser. No. 11/107,344, filed Apr. 15, 2005, US Publication No. 2006/0235685, and claims priority from that application with respect to common subject matter. The disclosure of application Ser. No. 11/107,344 is incorporated herein by reference.
 | Number | Date | Country |
---|---|---|---|
Parent | 11107344 | Apr 2005 | US |
Child | 11963159 | | US |