The use of speech synthesis-based applications is becoming more and more prevalent. Such applications are used for handling information inquiries, by reservation and ordering systems, to perform email reading, and so forth. The generated speech used in such applications ordinarily comes from a pre-trained model, or pre-recordings. As a result, it is difficult to change the prosody of synthesized speech to meet a user's desired style.
However, in some applications, it is more powerful if the speech is synthesized according to a user's specific requirements. For example, Computer-Assisted Language Learning (CALL) systems output speech based on a user's own voice characteristics; consider using such a system to learn a language like Mandarin Chinese, where prosody, such as tonality, is essential to lexical access and to the disambiguation of homonyms. Prosody is thus important for the user to understand and to match when speaking. Other uses, such as post-editing synthesized speech to make it sound more natural, may likewise benefit from changed prosody.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which the prosody of speech may be changed by varying data associated with that speech. An interface or the like displays a visual representation of speech such as in the form of one or more waveforms and corresponding text. The interface allows changing prosody of the speech based on interaction with the visual representation to change data corresponding to the prosody, e.g., duration, pitch and/or loudness data, with respect to at least one part of the speech. The part of the speech that may be varied may comprise a phoneme, a morpheme, a syllable, a word, a phrase, and/or a sentence.
In one implementation, the changed speech can be played back to hear the change in prosody resulting from the interactive changes. The user can also change the text and hear newly synthesized speech, which may then be similarly edited to change data that corresponds to the prosody.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards controlling prosody, particularly for speech synthesis (e.g., text-to-speech) applications. In one aspect, there is provided a visual interface that shows a visual representation of speech, and includes an interactive mechanism for changing the pitch, duration and/or loudness of synthesized speech, e.g., in the framework of HMM-based speech synthesis. A set of speech may be interacted with as a whole (e.g., an entire sentence or paragraph), or as smaller portions thereof, e.g., a phoneme, morpheme, syllable, word or phrase.
While some of the examples described herein are directed towards text-to-speech applications, such as those related to speech synthesis and supervised machine learning (e.g., supervising a speech synthesis system to generate specific prosody as desired by a user, such as particular emotions, intonations and speaking styles), speech or tones rather than text may be directly input. For example, in computer-assisted language learning, a user may speak and view generated prosody with the user's own voice characteristics; singing voice synthesis can generate a singing voice by using (text or actual) speech data according to a given melody. Further, the technology has application in the study of speech perception, e.g., via perception tests for the research of phonetics and phonology in linguistics, and of cognition and perception in psychology, such as to examine the discriminative prosody area for the disambiguation of homonyms.
As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in speech and/or sound processing in general.
Turning to
As described below, the speech output 110 may be stored, whether in memory or a data store 112 (as exemplified in
In the training phase, a speech signal (e.g., from a database 226) is converted to a sequence of observed feature vectors through a feature extraction module 228, and modeled by a corresponding sequence of HMMs. Each observed feature vector consists of spectral parameters and excitation parameters, which are separated into different streams. The spectral feature comprises line spectrum pair (LSP) and log gain, and the excitation feature is the log of the fundamental frequency (F0). LSPs are modeled by continuous HMMs and F0s are modeled by multi-space probability distribution HMM (MSD-HMM), which provides a modeling of F0 without any heuristic assumptions or interpolations. Context-dependent phone models are used to capture the phonetic and prosody co-articulation phenomena. State tying based on a decision tree and the minimum description length (MDL) criterion is applied to overcome the problem of data sparseness in training. An HMM training mechanism 230 inputs the log F0, LSP and Gain, and decision data 234 to output stream-dependent models 236, which are built to cluster the spectral, prosodic and duration features into separate decision trees.
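As a minimal illustration of the stream separation described above, one observed feature vector might be represented as follows. The class name, field names and the convention of marking an unvoiced frame with `None` are assumptions made for this sketch; they are not taken from the described system.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """One frame's feature vector, split into streams (hypothetical layout)."""
    lsp: List[float]           # spectral stream: line spectrum pair coefficients
    log_gain: float            # spectral stream: log gain
    log_f0: Optional[float]    # excitation stream: log F0, None when unvoiced


def make_observation(lsp, gain, f0_hz):
    """Build an observation; f0_hz <= 0 is treated here as an unvoiced frame."""
    log_f0 = math.log(f0_hz) if f0_hz > 0 else None
    return Observation(lsp=list(lsp), log_gain=math.log(gain), log_f0=log_f0)
```

Keeping the unvoiced case as an explicit "no value" rather than an interpolated number mirrors the idea that F0 is modeled without interpolation, although how the MSD-HMM itself encodes that case is not shown here.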
In the synthesis phase, input text is converted first into a sequence of contextual labels through a text analysis component 240. The corresponding contextual HMMs are retrieved by traversing the decision trees (corresponding to the models 236) and the duration of each state is obtained from a duration model. The LSP, gain and F0 trajectories are generated by using a parameter generation algorithm 242 based on maximum likelihood criterion with dynamic feature and global variance constraints. A speech waveform is synthesized from the generated spectral and excitation parameters by LPC synthesis as generally known and referred to above. This waveform may be used, or stored for prosody manipulation as described herein, e.g., in some memory or storage (e.g., corresponding to the data store 112 of
In
With respect to duration, a user is able to change the duration of a phoneme, morpheme, syllable, word, phrase or sentence. For model generated speech, an adjustment factor ρ is first calculated by:
where u(k) and σ²(k) are the mean and variance of the duration density of state k, respectively. T is the duration as modifiable by the user, and may be specified at any level, i.e., phoneme, morpheme, syllable, word, phrase or sentence. Each state duration d(k) may be adjusted according to ρ as:
d(k) = u(k) + ρ·σ²(k)
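The following Python sketch illustrates the duration adjustment. The expression for ρ is not reproduced in the text above, so the sketch assumes that ρ is chosen so that the adjusted state durations sum to the user's target T, which gives ρ = (T − Σ u(k)) / Σ σ²(k) when summed over the states of the selected unit. The function name and argument names are hypothetical.

```python
def adjust_state_durations(means, variances, target_total):
    """Return (rho, durations) with d(k) = u(k) + rho * var(k).

    Assumption: rho is chosen so that the durations sum to target_total.
    """
    total_variance = sum(variances)
    if total_variance == 0:
        raise ValueError("cannot adjust durations when every variance is zero")
    rho = (target_total - sum(means)) / total_variance
    durations = [u + rho * var for u, var in zip(means, variances)]
    return rho, durations
```

For example, with means of 2 and 4 frames, variances of 1 and 3, and a target of 10 frames, ρ is 1 and the states become 3 and 7 frames long. States with larger variance absorb more of the change, and a practical system would presumably also round the results and keep them positive.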
For online recorded speech, the state duration is first obtained by forced alignment, with that duration linearly shrunk and/or expanded according to the user's input.
By way of example, a user may change the duration by dragging one of the bars in the area 332 to increase or decrease the duration value of its corresponding part of speech. To vary a full word at the same time, for example, a user may select some or all of the text in the box 332, and drag the last bar of that word, for example, proportionally increasing or decreasing the durations of each of the parts of that word. A syllable may be modified by selecting part of a word, and so forth. The duration of the entire sentence may be increased.
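The linear shrinking or expansion of aligned durations, and the proportional change applied when a whole word is dragged, can be sketched as one scaling operation over a selected range of parts. The function name and the half-open index convention are assumptions of this sketch.

```python
def scale_selection(durations, start, end, new_total):
    """Proportionally rescale durations[start:end] so they sum to new_total.

    Durations outside the selected range are returned unchanged.
    """
    selected_total = sum(durations[start:end])
    if selected_total <= 0:
        raise ValueError("selected durations must have a positive total")
    factor = new_total / selected_total
    return [
        d * factor if start <= i < end else d
        for i, d in enumerate(durations)
    ]
```

Selecting the entire list corresponds to lengthening or shortening the whole sentence, while a one-element range corresponds to dragging a single bar.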
To adjust pitch, the F0 trajectories are modifiable according to the user's input in the generation part of HMM-based speech synthesis. The user's input may comprise a local contour for a voiced region or a global schematic curve for intonation. For a local contour, the value of F0 is directly modifiable. For a global schematic curve, the overall trend of the F0 trajectory is made to approximate the curve as closely as possible while minimally changing the local fine structure of the F0 contour.
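One simple way to picture the global-curve case is to split a voiced log-F0 contour into a slow trend and a residual, replace the trend with the user's curve, and add the residual back. This is only an illustrative stand-in, not the generation algorithm described above; the moving-average trend, the window size and the function names are all assumptions.

```python
def moving_average(values, window):
    """Centered moving average, with the window truncated at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def apply_global_curve(log_f0, target_curve, window=5):
    """Follow target_curve as the trend while keeping local fine structure."""
    if len(log_f0) != len(target_curve):
        raise ValueError("contour and target curve must have the same length")
    trend = moving_average(log_f0, window)
    # residual = original contour minus its own trend (the fine structure)
    return [t + (x - m) for x, m, t in zip(log_f0, trend, target_curve)]


def set_local_value(log_f0, index, value):
    """Local-contour case: overwrite one F0 value directly."""
    edited = list(log_f0)
    edited[index] = value
    return edited
```

In this sketch only voiced frames are passed in; unvoiced frames would be left out and reinserted afterwards.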
By way of example, a user may change the pitch (of impulses) by interactively varying the waveforms shown in the displayed areas 333-345. The user may move each of the waveforms up and down as a whole, or all of the waveforms together, or a portion of one, e.g., by highlighting or pointing to that portion to move.
Loudness is adjustable by directly modifying the gain trajectories according to the user's input in the generation part of HMM-based speech synthesis. To vary the loudness, a user may interact in the area 338, for example.
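A direct modification of the gain trajectory might look like the following sketch, which raises or lowers a range of frames by a number of decibels. It assumes the trajectory holds natural-log amplitude gain; the function name, the decibel interface and the half-open frame range are hypothetical.

```python
import math


def adjust_gain(log_gain, start, end, delta_db):
    """Add delta_db decibels to frames [start, end) of a log-gain trajectory.

    Assumption: values are natural-log amplitude gains, so a change of
    delta_db decibels corresponds to adding delta_db * ln(10) / 20.
    """
    delta = delta_db * math.log(10.0) / 20.0
    return [
        g + delta if start <= i < end else g
        for i, g in enumerate(log_gain)
    ]
```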
Step 406 represents some user interaction taking place, such as to request speech playback, select some of the text, type in or otherwise edit/enter different text, move a duration bar, change the pitch, adjust the loudness, and so forth. If the interaction is such that an action needs to be taken, step 406 continues to step 408. (Note for example that simply selecting text is not shown herein as being such an action, and is represented by the wait/more loop at the right side of step 406.)
Steps 408 and later represent command processing. As can be readily appreciated, these steps need not be in any particular order, and indeed may be event driven rather than part of a loop as shown herein for purposes of simplicity.
Steps 408 and 409 handle the user requesting audio playback of whatever state the current speech is in, whether initially or after any prosody modifications. Note that the playback may be automatic (or user-configurable as to whether it is automatic) whenever the user makes a change to the prosody. For example, a user may make a change, and if the user stops interacting for a short time or moves to a different interaction area, automatically hear the changed speech played back.
Step 410 represents detecting a change to the text. If this occurs, the process returns to step 402 to convert the new text to speech via synthesis. As can be readily appreciated, new or changed speech may be similarly input, with text recognized from the speech.
Moreover, via step 411, the prosody may be automatically changed when appropriate to make a change to text sound more natural in the synthesized speech. For example, in the English language, changing a statement to a question, such as “This is a test.” to “This is a test?”, results in a pitch increase on the last word (and vice-versa). A relative pitch change may be automatically made upon detection of such a text change. Changing to an exclamation point may increase pitch and/or loudness, and/or shorten duration, relative to an original statement or question, for at least part of the sentence. Step 411 is shown as dashed to indicate that such a step is optional (and may branch to step 415, described below), and alternatively may be performed in the conversion step of step 402.
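The statement-to-question example can be sketched as a check on the terminal punctuation before and after an edit, returning a relative pitch factor for the last word. The factor of 1.2 is an arbitrary placeholder and the function names are hypothetical; neither value nor names come from the described system.

```python
def terminal_punctuation(text):
    """Return the final '.', '?' or '!' of the text, or '' if there is none."""
    stripped = text.rstrip()
    if stripped and stripped[-1] in ".?!":
        return stripped[-1]
    return ""


def pitch_scale_for_edit(old_text, new_text, question_scale=1.2):
    """Relative pitch factor for the last word after a punctuation change.

    question_scale is an illustrative placeholder, not a tuned value.
    """
    old_mark = terminal_punctuation(old_text)
    new_mark = terminal_punctuation(new_text)
    if old_mark == "." and new_mark == "?":
        return question_scale          # statement became a question: raise
    if old_mark == "?" and new_mark == ".":
        return 1.0 / question_scale    # question became a statement: lower
    return 1.0                         # no automatic change in this sketch
```

Exclamation points are left unhandled here; they would need additional rules for loudness and duration.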
Steps 412-414 represent the user making prosody changes, to duration, pitch or loudness, respectively, as described above. The change varies the prosody data (step 405) corresponding to the frequency waveforms or loudness waveform, which is redrawn as represented by step 404. Other steps such as reset to restore the initial data (steps 418 and 419), and done (steps 420 and 421, including an option to save changes) are shown. Step 422 represents other action handling, such as to change input modes, for example.
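Because the command steps may be event driven rather than arranged in a loop, the handling can be pictured as a dispatch table keyed by command name. The sketch below is an assumption-laden illustration: the class name, command names and the redraw counter (standing in for redrawing the waveforms) are all hypothetical.

```python
import copy


class ProsodyEditor:
    """Holds editable prosody data and dispatches edit commands."""

    def __init__(self, durations, pitch, loudness):
        self._initial = {
            "duration": list(durations),
            "pitch": list(pitch),
            "loudness": list(loudness),
        }
        self.data = copy.deepcopy(self._initial)
        self.redraws = 0  # stands in for redrawing the displayed waveforms

    def _set(self, track, index, value):
        self.data[track][index] = value
        self.redraws += 1

    def reset(self):
        """Restore the initial data."""
        self.data = copy.deepcopy(self._initial)
        self.redraws += 1

    def handle(self, command, *args):
        """Run the handler for a command; return False if it is unknown."""
        handlers = {
            "duration": lambda i, v: self._set("duration", i, v),
            "pitch": lambda i, v: self._set("pitch", i, v),
            "loudness": lambda i, v: self._set("loudness", i, v),
            "reset": self.reset,
        }
        handler = handlers.get(command)
        if handler is None:
            return False
        handler(*args)
        return True
```

A user interface would call `handle` from its event callbacks, so the order in which commands are checked no longer matters.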
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 510 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 510 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 510.
Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 530 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 531 and random access memory (RAM) 532. A basic input/output system 533 (BIOS), containing the basic routines that help to transfer information between elements within computer 510, such as during start-up, is typically stored in ROM 531. RAM 532 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 520. By way of example, and not limitation,
The computer 510 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 510 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 580. The remote computer 580 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 510, although only a memory storage device 581 has been illustrated in
When used in a LAN networking environment, the computer 510 is connected to the LAN 571 through a network interface or adapter 570. When used in a WAN networking environment, the computer 510 typically includes a modem 572 or other means for establishing communications over the WAN 573, such as the Internet. The modem 572, which may be internal or external, may be connected to the system bus 521 via the user input interface 560 or other appropriate mechanism. A wireless networking component 574 such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 510, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 599 (e.g., for auxiliary display of content) may be connected via the user interface 560 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 599 may be connected to the modem 572 and/or network interface 570 to allow communication between these systems while the main processing unit 520 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.