The present invention relates generally to speech synthesis systems. More particularly, this invention relates to generating variations in synthesized speech to produce speech that sounds more natural.
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawings hereto: Copyright © 2002, Apple Computer, Inc., All Rights Reserved.
Speech is used to communicate information from a speaker to a listener. In a computer-user interface, the computer generates synthesized speech to convey an audible message to the user rather than just displaying the message as text with an accompanying “beep.” There are several advantages to conveying audible messages to the computer user in the form of synthesized speech. In addition to liberating the user from having to look at the computer's display screen, the spoken message conveys more information than the simple “beep” and, for certain types of information, speech is a more natural communication medium.
Due to the nature of computer systems, the same message may occur many times. For example, the message “Attention! The printer is out of paper” may be programmed to repeat several times over a short period of time until the user replenishes the printer's paper tray. Or the message “Are you sure you want to quit without saving?” may be repeated several times over the course of using a particular program. In human speech, when a person says the same words over and over again, he or she does not produce exactly the same acoustic signal each time the words are spoken. In synthesized speech, however, the opposite is true; a computer generates exactly the same acoustic signal each time the message is spoken. Users inevitably become annoyed at hearing the same predictable message spoken each time in exactly the same way. The more often a particular message is spoken in exactly the same way, the more unnaturally mechanical it sounds. In fact, studies have shown that listeners tune out repetitive sounds and, eventually, a repetitive spoken message will not be noticed.
One way to overcome the problems of sound repetition is to alter the way the computer produces the acoustic signal each time the message is spoken. Altering a computer-generated sound each time it is produced is known in the art. For example, alteration of the sound can be achieved by changing the sample playback rate, which shifts the overall spectrum and duration of the acoustic signal. While this approach works well for non-speech sounds, it does not work well when applied to speech sounds. In human speech, the overall spectrum of sound stays the same because a human speaker's vocal tract length does not vary. Thus, in order to sound like human speech, the overall spectrum of the sound of synthesized speech needs to stay the same as well. Another prior art example of altering a computer-generated sound each time it is produced is found in computer-generated music. In computer music a small random variation in the timing of each note is sometimes made to achieve a less mechanical sound. However, as with changing the sample playback rate, changing the timing of the components of speech does not work well for speech sounds because, unlike music, speech does not consist of easily identifiable note-onset and note-duration events. Rather, speech consists of tonal patterns of pitch, syllable stresses, overlapped gestures of the articulators (tongue, lips, jaw, etc.), and timing to form the rhythmic speech patterns that comprise the spoken message. Thus, it is not so clear exactly what parameters in speech synthesis should be varied to achieve a more natural sound. A more detailed analysis of the components of speech is required.
Speech is the acoustic output of a complex system whose underlying state consists of a known set of discrete phonemes that every human speaker produces. A phoneme is the basic theoretical unit for describing how speech conveys linguistic meaning. As such, the phonemes of a language comprise a minimal theoretical set of units that are sufficient to convey all meaning in the language. For American English, there are approximately 40 phonemes, which are made up of vowels and consonants. Each phoneme can be considered to be a code that consists of a unique set of articulatory gestures.
If speakers could exactly and consistently produce these phoneme sounds, speech would amount to a stream of underlying discrete codes. However, because of many different factors including, for example, accents, gender, and coarticulatory effects, every phoneme has a variety of acoustic manifestations in the course of flowing speech. Thus, from an acoustical point of view, the phoneme actually represents a class of sounds that convey the same meaning.
The variations in the way the phonemes are produced between people and even between utterances of the same person are referred to as prosody. Examples of prosody include tonal and rhythmic variations in speech, which provide a significant contribution to the formal linguistic structure of speech communication and are referred to as the prosodic features. The acoustic patterns of prosodic features are heard in changes in the duration, intensity, fundamental frequency, and spectral patterns of the individual phonemes that comprise the spoken message.
There are two distinctive components of prosody—i.e., linguistic components of prosody and paralinguistic components of prosody. The linguistic components of prosody are those that can change the meaning of a spoken phrase. In contrast, paralinguistic components of prosody are those that do not change the meaning of a series of spoken words. For example, when speaking the phrase “it's raining,” a rising intonation asks for a confirmation and, perhaps, conveys surprise or disbelief. On the other hand, a falling intonation may express confidence that the rain is indeed falling. The distinction between the rising and falling intonations is an example of varying a linguistic prosodic feature. By contrast, one could speak the phrase “it's raining” with a somewhat higher (or lower) overall pitch range, depending upon whether the listener is far away (or nearby), and this change in overall pitch range does not change the meaning of the spoken words. Such a change in pitch without altering meaning is an example of a paralinguistic prosodic feature.
The fundamental frequency contours of speech have been classified according to their communicative function. In English, a rising contour generally conveys to the listener that a question has been posed, that some response from the listener is required, or that more information is implied to follow within the current topic. Conversely, a falling contour generally conveys the opposite. Numerous subtle and not-so-subtle variations in the fundamental frequency contours signal other information to the listener as well, such as sarcasm, disbelief, excitement or anger. Unlike the phonemes, the prosodic features reflected in the acoustic patterns may not be discrete. In fact, it is often difficult or impossible to determine which features of prosody are discrete and which are not.
The human ear is extremely sensitive to minor changes in certain components of speech, and remarkably tolerant of others. For example, the tonal and rhythmic variations of speech are finely controlled by humans and, as noted above, convey considerable linguistic information. Thus, random variations in the pitch or duration of each phoneme, syllable or word of a spoken message can destructively interfere with the overall tonal and rhythmic pattern of the speech, i.e. the prosody. Even a 9-millisecond difference in the closure duration of an intervocalic stop can shift the perception from voiced to voiceless, changing, for example, the word “rapid” into “rabid.” Therefore, simply changing the parameters for the timing of sound components may undesirably alter the prosodic features of the phonemes that comprise the speech, and so cannot be successfully applied to speech synthesis.
Another example of altering computer-generated sounds is disclosed in U.S. Pat. No. 5,007,095 to Nara et al., which describes a system for synthesizing speech having improved naturalness.
A method and apparatus for generating speech that sounds more natural using paralinguistic variation is described herein. According to one aspect of the present invention, a method for generating speech that sounds more natural comprises generating synthesized speech having certain prosodic features and applying a paralinguistic variation to the acoustic sequence representing the synthesized speech without altering the linguistic prosodic features. According to one aspect of the present invention, the application of the paralinguistic variation is correlated with a previous randomly applied paralinguistic variation to reflect a gradual change in the computer voice, while still maintaining a random quality. According to one aspect of the present invention, the application of the paralinguistic variation is correlated over time. According to one aspect of the present invention, the application of the paralinguistic variation is correlated with other paralinguistic variations, sometimes in accordance with a predetermined paragraph prosody.
According to one aspect of the present invention, a machine-accessible medium has stored thereon a plurality of instructions that, when executed by a processor, cause the processor to alter synthesized speech by applying a paralinguistic variation to the acoustic sequence representing the synthesized speech without altering the linguistic prosodic features. According to another aspect of the invention, the application of the paralinguistic variation is correlated with a previous randomly applied paralinguistic variation to reflect a gradual change in the computer voice, while still maintaining a random quality. According to one aspect of the present invention, the instructions cause the processor to correlate the application of the paralinguistic variation over time. According to one aspect of the present invention, the instructions cause the processor to correlate the paralinguistic variation with other paralinguistic variations, sometimes in accordance with a predetermined paragraph prosody.
According to one aspect of the present invention, an apparatus for applying a paralinguistic variation to an acoustic sequence representing synthesized speech without altering the prosodic features of the synthesized speech includes a speech synthesizer and a paralinguistic variation processor. The speech synthesizer generates synthesized speech having certain prosodic features and the paralinguistic variation processor applies paralinguistic variations to the acoustic sequence representing the synthesized speech without altering the prosodic features. According to one aspect of the present invention, the paralinguistic variation processor correlates the paralinguistic variations with a previous randomly applied paralinguistic variation to reflect a gradual change in the computer voice, while still maintaining a random quality. According to one aspect of the present invention, the paralinguistic variation processor correlates the application of the paralinguistic variation over time. According to one aspect of the present invention, the paralinguistic variation processor correlates the paralinguistic variation with other paralinguistic variations, sometimes in accordance with a predetermined paragraph prosody.
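The correlation described in the aspects above, in which each new paralinguistic variation drifts gradually from the previously applied random variation while still retaining a random quality, can be sketched as a damped random process. The class name, the smoothing constant, and the variation range below are illustrative assumptions, not details taken from the specification:

```python
import random

class ParalinguisticVariation:
    """Sketch of a variation source that is random yet correlated over time.

    ALPHA and the variation range are illustrative assumptions: a fresh random
    draw is blended with the previous value, so successive utterances change
    gradually rather than jumping about, yet never repeat deterministically.
    """
    ALPHA = 0.7  # weight given to the previous value (0 = pure noise, 1 = frozen)

    def __init__(self):
        self.previous = 0.0

    def next_value(self, spread=1.0):
        # Draw a fresh random value, then blend it with the previous one.
        fresh = random.uniform(-spread, spread)
        self.previous = self.ALPHA * self.previous + (1.0 - self.ALPHA) * fresh
        return self.previous
```

Because each output is a weighted average of the previous output and a bounded random draw, consecutive values can differ by at most (1 − ALPHA) times the full range, which gives the "gradual change" quality described above.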
In yet another embodiment, an apparatus for applying a paralinguistic variation to an acoustic sequence representing synthesized speech without altering the prosodic features of the synthesized speech comprises analog circuitry.
A method and an apparatus for generating paralinguistic variations in a speech synthesis system to produce more natural sounding speech are provided. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be evident, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
In one embodiment the prosodic generation 112 uses a paragraph prosody 134 in conjunction with the phoneme duration model 128 and the phoneme pitch model 130 to provide an overall prosodic pattern for a set of text inputs 104 that comprise a dialog, or other sequence of computer-generated speech. An overall prosodic pattern is beneficial because it can be used to guide the user to respond to the computer-generated speech in a certain way. For example, in a computer-user interface, a task may be automated using a series of voice commands, such as changing the desktop background. The task may involve generating multiple occurrences of speech that prompt the user to enter several commands before the task is completed. The paragraph prosody 134 is used to provide prosodic features to the phonemes that result in speech that helps to guide the user through the task. The overall tonal and rhythmic pattern of the generated speech, i.e. the prosodic features, can help a user to determine whether an additional input is required, whether they must make a choice among alternatives, or when the task is complete.
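The idea of a paragraph prosody that signals dialog state, whether more input is needed, a choice must be made, or the task is complete, could be represented very roughly as a table mapping dialog states to intonational tunes. The state names and tune labels below are hypothetical and are not taken from the specification:

```python
# Hypothetical sketch of a paragraph-prosody table for a multi-step voice task.
# The specification describes the idea (contours that signal "more input
# needed" vs. "task complete"), but not this particular data structure.
PARAGRAPH_PROSODY = {
    "need_more_input": "rising",        # question-like contour invites a reply
    "choose_alternative": "list_rise",  # rise on each option, fall on the last
    "task_complete": "falling",         # final fall signals closure
}

def tune_for(dialog_state):
    # Default to a neutral declarative tune for states not in the table.
    return PARAGRAPH_PROSODY.get(dialog_state, "neutral_declarative")
```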
Referring again to
The speech synthesis system 100 may be hosted on a processor, but is not so limited. For an alternate embodiment, the speech synthesis system 100 may comprise some combination of hardware and software that is hosted on a number of different processors. For another alternate embodiment, a number of the components of the speech synthesis system 100 may be hosted on a number of different processors. Another alternate embodiment has a number of different components of the speech synthesis system 100 hosted on a single processor.
In yet another embodiment, the speech synthesis system 100 is implemented, at least in part, using analog circuitry. For example, the speech synthesis system 100 may be implemented as analog electronic circuits that produce a time-varying electric signal. In one embodiment, a voltage controlled oscillator (VCO) is coupled with one or more voltage controlled filters (VCFs), wherein the output of the VCO is provided to the VCFs. Control inputs to the VCFs can be used to produce different phonemes that represent a sentence that is to be spoken. A time-varying signal can be input to the VCO, and the pattern of voltage (as a function of time) represents the desired pitch contour for the spoken sentence. In such an embodiment, a second input could be provided to the VCO, this second input presenting a slowly-varying random value that is added to the pitch contour to change its overall pitch range in a paralinguistic manner. In a similar fashion, there may be slowly varying inputs to the VCFs that modify, for example, the center-frequency and/or bandwidths of the filter resonances to slightly vary the articulation in random ways.
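A digital simulation of the VCO arrangement just described might look like the following sketch, in which the desired pitch contour is summed with a slowly varying random offset that shifts the overall pitch range paralinguistically. The function name, frame rate, and damping constants are illustrative assumptions:

```python
import random

def synthesize_pitch_track(contour_hz, drift_semitones=0.5):
    """Sketch of the VCO idea in digital form (not the patented circuit).

    Each frame of the desired pitch contour is shifted by a heavily damped
    random walk, bounded by +/- drift_semitones, so the overall pitch range
    drifts slowly and randomly without disturbing the contour's shape.
    """
    track = []
    offset = 0.0
    for hz in contour_hz:
        # Heavily damped random walk: the offset changes only gradually.
        offset = 0.99 * offset + 0.01 * random.uniform(-drift_semitones,
                                                       drift_semitones)
        # Apply the offset as a semitone shift (multiplicative in frequency).
        track.append(hz * 2.0 ** (offset / 12.0))
    return track
```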
In yet a further embodiment, various components of the speech synthesis system 100 may be implemented mechanically. For example, the pitch could be generated by a mechanical model of a human larynx, where air is forced through two stretched pieces of rubber. This can produce a pitched buzzing sound having a frequency that is determined by the tightness of the stretched rubber pieces. The buzzing sound could then be passed through a series of tubes whose diameters can be varied over the lengths of the tubes. The tubes, which would resonate at frequencies determined by their respective cross-sectional areas, can produce audible speech. In such an implementation, paralinguistic variations may be achieved using a mechanism that adjusts the tension in the stretched rubber pieces and/or by a mechanism that varies the diameters of the acoustic tubes.
These elements 401-425 perform their conventional functions known in the art. Collectively, these elements are intended to represent a broad category of hardware systems, including but not limited to general purpose computer systems based on the PowerPC® family of processors available from Motorola, Inc. of Schaumburg, Ill., or the Pentium® family of processors available from Intel Corporation of Santa Clara, Calif.
It is to be appreciated that various components of hardware system 400 may be re-arranged, and that certain implementations of the present invention may not require nor include all of the above components. For example, a display device may not be included in system 400. Additionally, multiple buses (e.g., a standard I/O bus and a high performance I/O bus) may be included in system 400. Furthermore, additional components may be included in system 400, such as additional processors (e.g., a digital signal processor), storage devices, memories, network/communication interfaces, etc.
In the illustrated embodiment of
The first category is a slight random variation in the overall pitch range 710 within which the linguistically-motivated speech melody is mapped from its rule-generated symbolic transcription to the continuously-varying fundamental frequency values. The linguistically-motivated speech melody is a prosodic feature of the input text 104, and refers to the specific intonational tune of the spoken message, e.g. a question tune, a neutral declarative tune, an exclamation tune, and so on. The mapping of the rule-generated symbolic transcription to the continuously varying fundamental frequency values may include application of the prosody model 114 and, more specifically, the phoneme pitch model 130 and intonation rules 132 to provide pitch information for the phonemes that comprise the message. In one embodiment, a slight variation is achieved by raising the overall pitch range one semitone, i.e., by multiplying the fundamental frequency values of the acoustic sequence 601 of synthesized speech signals by the twelfth root of 2 (a constant shift of log(2)/12 in the logarithmic pitch domain). This logarithmic transformation of the signal alters the sound of the synthesized speech while preserving the prosodic features representative of the text input 104 such as the linguistically-motivated speech melody. Other types of transformations to the overall pitch range that preserve the linguistic prosodic features of the synthesized speech may be employed without exceeding the scope of the present invention.
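The one-semitone shift amounts to multiplying every fundamental-frequency value by the twelfth root of 2 (about 1.0595), which is a constant additive shift in the logarithmic pitch domain and therefore preserves the shape of the intonational tune. A minimal sketch, with a hypothetical helper name:

```python
SEMITONE = 2.0 ** (1.0 / 12.0)  # twelfth root of 2, approximately 1.0595

def shift_pitch_range(f0_contour_hz, semitones=1.0):
    """Scale every fundamental-frequency value by the same semitone factor.

    A multiplicative (log-domain additive) shift preserves the ratios between
    F0 values, so the linguistic speech melody is unchanged. This helper is an
    illustrative sketch, not the patented implementation.
    """
    factor = SEMITONE ** semitones
    return [hz * factor for hz in f0_contour_hz]
```

Note that twelve semitones compose to exactly one octave (a doubling of frequency), which is why the factor is defined as the twelfth root of 2.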
The second category is a random variation in the overall speaking rate 720 of the spoken message. The overall speaking rate of a spoken message can be modeled independently of the relative durations of the speech segments (e.g. phonemes) within that message. Moreover, it has been shown that listeners perceive the overall speaking rate independently of the relative durations of the speech segments within the message. Therefore, changes to the overall speaking rate of a spoken message may be achieved without altering the linguistic prosodic features of phoneme duration as generated according to the prosody model 114 and, more specifically, according to the phoneme duration model 128. In one embodiment a random variation is achieved by either slightly speeding up or slowing down the overall speaking rate of a spoken message by applying a mathematical transformation to the acoustic sequence 601 of synthesized speech signals. In one embodiment the mathematical transformation may be a linear transformation such as a factor of 1.25 to increase the speaking rate by 25 percent. The linear transformation of the signal alters the sound of the synthesized speech while preserving the prosodic features representative of the text input 104 such as the relative duration of the phonemes. Other types of transformations to the overall speaking rate that preserve the linguistic prosody components of the synthesized speech may be employed without exceeding the scope of the present invention.
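Scaling the overall speaking rate while preserving relative phoneme durations can be sketched as a single uniform division: with a rate factor of 1.25, every duration shrinks by the same proportion, so the rhythmic pattern, the linguistic component, is unchanged. The helper below is illustrative, not the patented implementation:

```python
def scale_speaking_rate(phoneme_durations_ms, rate_factor=1.25):
    """Uniformly divide every phoneme duration by the rate factor.

    A factor of 1.25 speeds speech up 25 percent; because the same factor is
    applied to every segment, the *relative* durations (the linguistic rhythm)
    are untouched. Illustrative sketch only.
    """
    return [d / rate_factor for d in phoneme_durations_ms]
```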
Thus, a method and apparatus for a speech synthesis system using random paralinguistic variation has been described. Whereas many alterations and modifications of the present invention will be comprehended by a person skilled in the art after having read the foregoing description, it is to be understood that the particular embodiments shown and described by way of illustration are in no way intended to be considered limiting. References to details of particular embodiments are not intended to limit the scope of the claims.
Number | Name | Date | Kind |
---|---|---|---|
4908867 | Silverman | Mar 1990 | A |
5007095 | Nara et al. | Apr 1991 | A |
5384893 | Hutchins | Jan 1995 | A |
5652828 | Silverman | Jul 1997 | A |
5732395 | Silverman | Mar 1998 | A |
5749071 | Silverman | May 1998 | A |
5751906 | Silverman | May 1998 | A |
5832433 | Yashchin et al. | Nov 1998 | A |
5832435 | Silverman | Nov 1998 | A |
5860064 | Henton | Jan 1999 | A |
5875427 | Yamazaki | Feb 1999 | A |
5890117 | Silverman | Mar 1999 | A |
6064960 | Bellegarda et al. | May 2000 | A |
6101470 | Eide et al. | Aug 2000 | A |
6185533 | Holm et al. | Feb 2001 | B1 |
6208971 | Bellegarda et al. | Mar 2001 | B1 |
6226614 | Mizuno et al. | May 2001 | B1 |
6289301 | Higginbotham et al. | Sep 2001 | B1 |
6334103 | Surace et al. | Dec 2001 | B1 |
6366884 | Bellegarda et al. | Apr 2002 | B1 |
6374217 | Bellegarda | Apr 2002 | B1 |
6397183 | Baba et al. | May 2002 | B1 |
6405169 | Kondo et al. | Jun 2002 | B1 |
6424944 | Hikawa | Jul 2002 | B1 |
6477488 | Bellegarda | Nov 2002 | B1 |
6499014 | Chihara | Dec 2002 | B1 |
6553344 | Bellegarda et al. | Apr 2003 | B2 |
6708153 | Brittan et al. | Mar 2004 | B2 |
6804649 | Miranda | Oct 2004 | B2 |
6970820 | Junqua et al. | Nov 2005 | B2 |
7065485 | Chong-White et al. | Jun 2006 | B1 |
7096183 | Junqua | Aug 2006 | B2 |
7103548 | Squibbs et al. | Sep 2006 | B2 |
7127396 | Chu et al. | Oct 2006 | B2 |
20010032080 | Fukada | Oct 2001 | A1 |
20020026315 | Miranda | Feb 2002 | A1 |
20020138270 | Bellegarda et al. | Sep 2002 | A1 |
20020193996 | Squibbs et al. | Dec 2002 | A1 |
20030078780 | Kochanski et al. | Apr 2003 | A1 |
20030163316 | Addison et al. | Aug 2003 | A1 |
20040193421 | Blass | Sep 2004 | A1 |
20040249667 | Oon | Dec 2004 | A1 |