Modeling speech signals for applications such as automatic speech synthesis, speech coding, automatic speech recognition, and so forth, has been an active field of research. Speech synthesis is the artificial production of human speech. A computing system used for this purpose serves as a speech synthesizer, and may be implemented in a variety of hardware and software embodiments. It may be part of a text-to-speech system that takes text and converts it into synthesized speech.
One established framework for a variety of applications, such as automatic speech synthesis and automatic speech recognition, is based on pattern models known as hidden Markov models (HMMs), which provide state space models with latent variables describing interconnected states, for modeling data with sequential patterns. Units of a speech signal, such as phones, may be associated with one or more states of the pattern models. Typically, the pattern models incorporate classification parameters that must be trained to correspond accurately to a speech signal. However, it remains a challenge to effectively model speech signals, to achieve goals such as a synthesized speech signal that is easier to understand and more like natural human speech.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
Novel techniques for providing superior performance and sound quality in speech applications, such as speech synthesis, speech coding, and automatic speech recognition, are hereby disclosed. In one illustrative embodiment, a method includes modeling a speech signal with parameters comprising line spectrum pairs. Density parameters are provided based on the density of the line spectrum pairs. A speech application output, such as synthesized speech, is provided based at least in part on the line spectrum pair density parameters. The line spectrum pair density parameters use computing resources efficiently while providing improved performance and sound quality in the speech application output.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
According to one illustrative embodiment, in hidden Markov model (HMM)-based speech synthesis, a speech signal is provided that emulates the sound of natural human speech. The speech signal includes a speech frequency spectrum representing voiced vibrations of a speaker's vocal tract. It includes information content such as a fundamental frequency F0 (representing vocal fold or source information content), duration of various signal portions, patterns of pitch, gain or loudness, voiced/unvoiced distinctions, and potentially any other information content needed to provide a speech signal that emulates natural human speech, although different embodiments are not limited to including any combination of these forms of information content in a speech signal. Any or all of these forms of signal information content can be modeled simultaneously with hidden Markov modeling, or any of various other forms of modeling a speech signal.
A speech signal may be based on waveforms generated from a set of hidden Markov models, based on a universal, maximum likelihood function. HMM-based speech synthesis using line spectrum pair density parameters may be statistics-based and vocoded, and generate a smooth, natural-sounding speech signal. Characteristics of the synthetic speech can easily be controlled by transforming HMM modeling parameters, which may be done with a statistically tractable metric such as a likelihood function. HMM-based speech synthesis using line spectrum pair density parameters combines high clarity in the speech signal with efficient usage of computing resources, such as RAM, processing time, and bandwidth, and is therefore well-suited for a variety of implementations, including those for mobile and small devices.
In the training phase 101, a speech signal 113 from a speech database 111 is converted to a sequence of observed feature vectors through the feature extraction module 117, and modeled by a corresponding sequence of HMMs in HMM training module 123. The observed feature vectors may consist of spectral parameters and excitation parameters, which are separated into different streams. The spectral features 121 may comprise line spectrum pairs (LSPs) and log gain, and the excitation feature 119 may comprise the logarithm of the fundamental frequency F0. LSPs may be modeled by continuous HMMs, and fundamental frequencies F0 may be modeled by multi-space probability distribution HMMs (MSD-HMMs), which provide a cogent modeling of F0 without any heuristic assumptions or interpolations. Context-dependent phone models may be used to capture the phonetic and prosodic co-articulation phenomena 125. State tying based on a decision tree and the minimum description length (MDL) criterion may be applied to overcome any problem of data sparseness in training. Stream-dependent HMM models 127 may be built to cluster the spectral, prosodic, and duration features into separate decision trees.
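By way of illustration only, the following sketch shows one way the excitation stream described above might be organized, assuming a hypothetical per-frame F0 contour (with zeros marking unvoiced frames) as input; the function name and the input format are illustrative assumptions rather than elements of the system described above.

```python
import numpy as np

def build_msd_f0_stream(f0_hz):
    """Split a frame-level F0 contour into an MSD-style excitation stream.

    f0_hz: per-frame F0 values, with 0.0 marking unvoiced frames (a hypothetical
    extractor output; not part of the system description above).
    Returns (log_f0, voiced): voiced frames carry log F0 in the continuous space,
    while unvoiced frames carry no continuous value, so no heuristic
    interpolation through unvoiced regions is required.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz > 0.0
    log_f0 = np.full(f0_hz.shape, np.nan)
    log_f0[voiced] = np.log(f0_hz[voiced])
    return log_f0, voiced

# Example: three voiced frames and two unvoiced frames.
log_f0, voiced = build_msd_f0_stream([120.0, 0.0, 132.5, 140.0, 0.0])
```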
In the synthesis phase, input text 131 may first be converted into a sequence of contextual labels through the text analysis module 133. The corresponding contextual HMMs 129 may be retrieved by traversing the trees of spectral and pitch information, and the duration of each state may be obtained by traversing the duration tree. The LSP, gain, and F0 trajectories 141, 139 may then be generated by using the parameter generation algorithm 137, based on a maximum likelihood criterion with dynamic feature and global variance constraints. The fundamental frequency F0 trajectory and the corresponding statistical voiced/unvoiced information can be used to generate mixed excitation parameters 145 with the generation module 143. Finally, a speech waveform 150 may be synthesized from the generated spectral and excitation parameters 141, 145 by the LPC synthesis module 147.
A broad variety of different options can be used to implement automatic speech synthesis system 100. Illustrative examples of some of these implementation options are provided here, with the understanding that they do not imply limitations on other embodiments. For example, a speech corpus may be recorded by a single speaker to provide speech database 111, with training data composed of a relatively large number of phonetically and prosodically rich sentences, and a smaller number of sentences used for testing data. A speech signal may be sampled at any of a variety of selected rates; in one illustrative embodiment it may be sampled at 16 kilohertz and windowed by a 25 millisecond window with a five millisecond shift, although higher and lower sampling frequencies, other window and shift times, or other timing parameters may also be used. The windowed frames may be transformed into any of a broad range of LSP orders; in one illustrative embodiment, 24th-order LSPs may be used, 40th-order in another, or other numbers of LSPs above and below these values. For example, the order of LSPs, the speech sample frame sizes, and other gradations of the speech signal data and modeling parameters may be suited to the level of resources, such as RAM, processing speed, and bandwidth, that are to be available in a computing device used to implement the automatic speech synthesis system 100. An implementation with a server in contact with a client machine may use more intensive and higher-performance options, such as a shorter signal frame length and a higher LSP order, while the opposite may be true for a base-level mobile device or cellphone handset, in various illustrative embodiments.
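As an illustrative sketch of the framing arithmetic implied by the 16 kilohertz, 25 millisecond, five millisecond example above, the following fragment may be considered; the function name and the choice of analysis window are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, sample_rate_hz=16000, window_ms=25.0, shift_ms=5.0):
    """Slice a speech signal into overlapping analysis frames.

    With the illustrative settings above (16 kHz sampling, 25 ms window,
    5 ms shift), each frame holds 400 samples and consecutive frames are
    offset by 80 samples.
    """
    x = np.asarray(x, dtype=float)
    window = int(round(sample_rate_hz * window_ms / 1000.0))   # 400 samples
    shift = int(round(sample_rate_hz * shift_ms / 1000.0))     # 80 samples
    n_frames = 1 + max(0, (len(x) - window) // shift)
    frames = np.stack([x[i * shift:i * shift + window] for i in range(n_frames)])
    return frames * np.hamming(window)   # per-frame analysis window (assumed type)

# One second of audio at 16 kHz yields (16000 - 400) // 80 + 1 = 196 frames.
frames = frame_signal(np.random.randn(16000))
```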
Line spectrum pairs 321 provide a good, simple, salient indication of what portions of the frequency spectrum of the speech signal 311 correspond to the formants 331, 333, 335, 337. The formants, or dominant frequencies in a speech signal, are significantly more important to sound quality than the troughs in between them. This is particularly true of the lowest-frequency and highest-power formant, 331. The formants occupy portions of the frequency spectrum that have significantly higher power than their surrounding frequencies, and are therefore indicated as the peaks in the graphical curve representing the power as a function of frequency for the speech signal 311.
Because the line spectrum pairs 321 tend to cluster around the formants 331, 333, 335, 337, the positions of the line spectrum pairs 321 serve as effective and efficient indicators of the positions (in terms of portions of the frequency spectrum) of the formants. Furthermore, the density of the line spectrum pairs 321, that is, the spacing between adjacent pairs, with smaller spacings corresponding to higher density, provides a perhaps even more effective and efficient indicator of the frequencies and properties of the formants. By modeling the speech signal at least in part with parameters based on the density of the line spectrum pair frequencies, an automatic speech synthesis system, such as automatic speech synthesis system 100 of
Some of the advantages of using parameters based on line spectrum pairs and line spectrum pair density are described in further detail as follows, in accordance with one illustrative embodiment, by way of example and not by limitation. Line spectrum pairs provide information equivalent to linear predictive coefficients (LPC), but with certain advantages that lend themselves well to interpolation, quantization, search techniques, and speech applications in particular. Line spectrum pairs provide a more convenient parameterization of linear predictive coefficients, as a pair of symmetric and antisymmetric polynomials whose sum recovers, up to a constant factor, the denominator polynomial of the linear predictive filter.
In analyzing line spectrum pairs, a speech signal may be modeled as the output of an all-pole filter H(z) defined as:
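in one illustrative formulation (with the sign convention of the LPC coefficients chosen accordingly),

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 + \sum_{i=1}^{M} \alpha_i z^{-i}} \qquad \text{(EQ. 1)}$$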
where M is the order of linear predictive coefficient (LPC) analysis and $\{\alpha_i\}_{i=1}^{M}$ are the corresponding LPC coefficients. The LPC coefficients can be represented by the LSP parameters, which are mathematically equivalent (one-to-one) and more amenable to quantization. The LSP parameters may be calculated with reference to the symmetric polynomial P(z) and antisymmetric polynomial Q(z) as follows:
$$P(z) = A(z) + z^{-(M+1)} A(z^{-1}) \qquad \text{(EQ. 2)}$$

$$Q(z) = A(z) - z^{-(M+1)} A(z^{-1}) \qquad \text{(EQ. 3)}$$
The symmetric polynomial P(z) and antisymmetric polynomial Q(z) have the following properties: all zeros of P(z) and Q(z) are on the unit circle, and the zeros of P(z) and Q(z) are interlaced with each other around the unit circle. These properties are useful for finding the LSPs $\{\omega_i\}_{i=1}^{M}$, i.e., the roots of the polynomials P(z) and Q(z), which are ordered and bounded:
$$0 < \omega_1 < \omega_2 < \cdots < \omega_M < \pi \qquad \text{(EQ. 4)}$$
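For illustration only, the following sketch recovers the ordered LSP frequencies of EQ. 4 as the unit-circle root angles of P(z) and Q(z) from EQ. 2 and EQ. 3, assuming LPC coefficients in the 1 + Σ α_i z^{-i} convention; the function name and the toy example are illustrative assumptions.

```python
import numpy as np

def lpc_to_lsp(a):
    """Convert LPC coefficients to line spectrum pair (LSP) frequencies.

    a: array [1, a_1, ..., a_M] for A(z) = 1 + sum_i a_i z^{-i}
    (assumed sign convention; negate the taps if your LPC routine differs).
    Returns the M LSP frequencies in radians, ordered in (0, pi) as in EQ. 4.
    """
    a_ext = np.concatenate([np.asarray(a, dtype=float), [0.0]])
    p = a_ext + a_ext[::-1]      # P(z) = A(z) + z^{-(M+1)} A(z^{-1})   (EQ. 2)
    q = a_ext - a_ext[::-1]      # Q(z) = A(z) - z^{-(M+1)} A(z^{-1})   (EQ. 3)
    angles = np.concatenate([np.angle(np.roots(p)), np.angle(np.roots(q))])
    eps = 1e-6                   # drop the trivial roots at z = 1 and z = -1
    return np.sort(angles[(angles > eps) & (angles < np.pi - eps)])

# Toy 4th-order A(z) built from two resonances with poles inside the unit circle;
# the four resulting LSPs cluster around the two resonance frequencies.
r = 0.95
a1 = [1.0, -2 * r * np.cos(0.3 * np.pi), r ** 2]
a2 = [1.0, -2 * r * np.cos(0.6 * np.pi), r ** 2]
lsp = lpc_to_lsp(np.convolve(a1, a2))
```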
LSP-based parameters have many advantages for speech representation. For example, LSP parameters correlate well to formant or spectral peak location and bandwidth. Referring again to
As another advantage of LSP-based parameters, perturbation of an LSP parameter has a localized effect. That is, a perturbation in a given LSP frequency introduces a perturbation of the LPC power spectrum mainly in the neighborhood of the perturbed LSP frequency, and does not significantly disturb the rest of the spectrum. As a further advantage, LSP-based parameters have good interpolation properties.
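As an illustrative sketch of the interpolation property, a convex combination of two ordered LSP vectors preserves the ordering of EQ. 4 and therefore remains a valid, stable spectral description; the function name and the toy values below are illustrative assumptions.

```python
import numpy as np

def interpolate_lsp(lsp_a, lsp_b, t):
    """Linearly interpolate two ordered LSP vectors, with t in [0, 1].

    Because each input satisfies 0 < w_1 < ... < w_M < pi (EQ. 4), any convex
    combination of them does as well, so the interpolated frame remains a
    valid, stable spectral description; plain LPC coefficients do not share
    this property under direct interpolation.
    """
    return (1.0 - t) * np.asarray(lsp_a) + t * np.asarray(lsp_b)

# Halfway between two illustrative 4th-order LSP frames (values in radians).
mid_frame = interpolate_lsp([0.30, 0.50, 1.10, 1.30], [0.40, 0.60, 1.00, 1.20], 0.5)
```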
In the automatic speech synthesis system 100 depicted in
$$O = \left[C^{\top}, \Delta C^{\top}, \Delta^{2} C^{\top}\right]^{\top}, \qquad C = \left[c_1^{\top}, c_2^{\top}, \ldots, c_T^{\top}\right]^{\top},$$

$$\Delta C = \left[\Delta c_1^{\top}, \Delta c_2^{\top}, \ldots, \Delta c_T^{\top}\right]^{\top}, \qquad \Delta^{2} C = \left[\Delta^{2} c_1^{\top}, \Delta^{2} c_2^{\top}, \ldots, \Delta^{2} c_T^{\top}\right]^{\top}$$
which maximizes the probability for a speech parameter vector sequence O given the HMM λ, over a summation of state sequences Q:
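In one illustrative formulation, this may be expressed as

$$O^{*} = \arg\max_{O} P(O \mid \lambda) = \arg\max_{O} \sum_{Q} P(O, Q \mid \lambda) \qquad \text{(EQ. 5)}$$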
Given a state sequence $Q = \{q_1, q_2, q_3, \ldots, q_T\}$, EQ. 5 need only consider maximizing the logarithm of the probability of the speech parameter vector sequence O given the state sequence Q and the HMM λ, P(O|Q,λ), where O is expressed as a weighting matrix W applied to the static speech parameters C, O = WC, i.e.,
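in one illustrative formulation,

$$C^{*} = \arg\max_{C} \log P(WC \mid Q, \lambda), \qquad \log P(O \mid Q, \lambda) = -\tfrac{1}{2}\,(O - M)^{\top} U^{-1} (O - M) + K \qquad \text{(EQ. 6)}$$

where K collects terms that do not depend on O.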
From this, we may obtain:
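in one illustrative derivation, by setting the derivative of EQ. 6 with respect to C to zero,

$$W^{\top} U^{-1} W\, C = W^{\top} U^{-1} M, \qquad C^{*} = \left(W^{\top} U^{-1} W\right)^{-1} W^{\top} U^{-1} M$$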
where D is the dimension of the feature vector and T is the total number of frames in the sentence. W is a block matrix composed of three DT×DT matrices: the identity matrix $I_F$, the delta coefficient matrix $W_{\Delta F}$, and the delta-delta coefficient matrix $W_{\Delta\Delta F}$. M and U are the 3DT×1 mean vector and the 3DT×3DT covariance matrix, respectively.
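As an illustrative sketch of this generation step, the following fragment assembles toy-sized versions of W, M, and U and solves the linear system above for the static trajectory; the toy dimensions, unit variances, and specific delta windows are illustrative assumptions rather than values prescribed by the description above.

```python
import numpy as np

def delta_matrix(T, D, window):
    """Build a DT x DT matrix applying the time window `window`
    (e.g. [-0.5, 0, 0.5] for deltas) to every coefficient track."""
    half = len(window) // 2
    Wt = np.zeros((T, T))
    for t in range(T):
        for k, w in enumerate(window):
            tau = min(max(t + k - half, 0), T - 1)   # clamp at sentence boundaries
            Wt[t, tau] += w
    return np.kron(Wt, np.eye(D))                    # act frame-wise on D-dim vectors

def generate_parameters(mean, var, T, D):
    """Solve W^T U^-1 W C = W^T U^-1 M for the smooth static trajectory C.

    mean, var: 3DT-dimensional stacked static/delta/delta-delta statistics
    (toy values below; in the system above they come from the chosen HMM states).
    """
    I = np.eye(T * D)
    W_delta = delta_matrix(T, D, [-0.5, 0.0, 0.5])       # assumed delta window
    W_delta2 = delta_matrix(T, D, [1.0, -2.0, 1.0])      # assumed delta-delta window
    W = np.vstack([I, W_delta, W_delta2])                # 3DT x DT block matrix
    U_inv = np.diag(1.0 / np.asarray(var, dtype=float))  # diagonal covariance assumed
    A = W.T @ U_inv @ W
    b = W.T @ U_inv @ np.asarray(mean, dtype=float)
    return np.linalg.solve(A, b)                         # DT-dimensional static track

# Toy run: T = 5 frames of D = 2 static parameters with unit variances.
T, D = 5, 2
C = generate_parameters(np.random.randn(3 * T * D), np.ones(3 * T * D), T, D)
```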
As mentioned above, a cluster of (for example, two or three) LSPs marks a formant frequency, and the closeness of those LSPs indicates the magnitude and bandwidth of the formant. The differences between adjacent LSPs, that is, the density of the line spectrum pairs, therefore carry information beyond the absolute values of the individual LSPs. At the same time, all LSP frequencies are ordered and bounded, i.e., no two adjacent LSP trajectories cross each other. Modeling and generation that use static and dynamic LSPs alone may have difficulty ensuring this stability of the LSPs. This may be resolved by providing line spectrum pair density parameters, for example by adding the differences of adjacent LSP frequencies directly into spectral parameter modeling and generation. The weighting matrix W, which is used to transform the observation feature vector, may be modified to provide line spectrum pair density parameters, as:
$$W = \left[I_F,\; W_{DF},\; W_{\Delta F},\; W_{\Delta F} W_{DF},\; W_{\Delta\Delta F},\; W_{\Delta\Delta F} W_{DF}\right] \qquad \text{(EQ. 11)}$$
where F is the static LSP; DF is the difference between adjacent LSP frequencies; ΔF and ΔΔF are the dynamic LSPs, i.e., the first- and second-order time derivatives; and $W_{DF}$ is a (D−1)T×DT matrix constructed as:
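in one illustrative construction, a block-diagonal matrix with T identical (D−1)×D first-difference blocks,

$$W_{DF} = \begin{bmatrix} \Delta_D & & \\ & \ddots & \\ & & \Delta_D \end{bmatrix}, \qquad \Delta_D = \begin{bmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{bmatrix} \in \mathbb{R}^{(D-1) \times D}$$

so that the block for each frame maps that frame's ordered LSPs to the differences of adjacent LSP frequencies.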
In this way, the diagonal covariance structure is kept the same, while the correlation and frequency differences of adjacent LSPs can be modeled and used to provide line spectrum pair density parameters, based on a measure of the density of two or more of the line spectrum pairs. These line spectrum pair density parameters can then be used to provide speech application outputs, such as synthesized speech, with previously unavailable efficiency and sound clarity.
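For illustration only, the following sketch builds a block-diagonal difference operator of the kind described above and applies it to a short sequence of LSP frames to obtain the density (adjacent-spacing) features; the function name and the toy LSP values are illustrative assumptions.

```python
import numpy as np

def lsp_density_matrix(T, D):
    """Block-diagonal (D-1)T x DT operator mapping per-frame LSPs to the
    differences of adjacent LSP frequencies (the density features)."""
    delta_d = np.zeros((D - 1, D))
    for i in range(D - 1):
        delta_d[i, i], delta_d[i, i + 1] = -1.0, 1.0
    return np.kron(np.eye(T), delta_d)

# Two frames of four LSPs each (toy values); small spacings flag a strong,
# narrow formant, as discussed above.
lsp_frames = np.array([[0.30, 0.35, 1.10, 1.30],
                       [0.32, 0.38, 1.05, 1.25]])
T, D = lsp_frames.shape
density = lsp_density_matrix(T, D) @ lsp_frames.reshape(-1)   # shape ((D-1)*T,)
```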
Computing system environment 400 as depicted in
Embodiments are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with various embodiments include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
Embodiments may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Some embodiments are designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices. As described herein, such executable instructions may be stored on a medium such that they are capable of being read and executed by one or more components of a computing system, thereby configuring the computing system with new capabilities.
With reference to
Computer 410 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 410 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 410. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 430 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 431 and random access memory (RAM) 432. A basic input/output system 433 (BIOS), containing the basic routines that help to transfer information between elements within computer 410, such as during start-up, is typically stored in ROM 431. RAM 432 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 420. By way of example, and not limitation,
The computer 410 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 410 through input devices such as a keyboard 462, a microphone 463, and a pointing device 461, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 420 through a user input interface 460 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 491 or other type of display device is also connected to the system bus 421 via an interface, such as a video interface 490. In addition to the monitor, computers may also include other peripheral output devices such as speakers 497 and printer 496, which may be connected through an output peripheral interface 495.
The computer 410 may be operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 480. The remote computer 480 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 410. The logical connections depicted in
When used in a LAN networking environment, the computer 410 is connected to the LAN 471 through a network interface or adapter 470. When used in a WAN networking environment, the computer 410 typically includes a modem 472 or other means for establishing communications over the WAN 473, such as the Internet. The modem 472, which may be internal or external, may be connected to the system bus 421 via the user input interface 460, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 410, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Memory 504 is implemented as non-volatile electronic memory such as random access memory (RAM) with a battery back-up module (not shown) such that information stored in memory 504 is not lost when the general power to mobile device 500 is shut down. A portion of memory 504 is illustratively allocated as addressable memory for program execution, while another portion of memory 504 is illustratively used for storage, such as to simulate storage on a disk drive.
Memory 504 includes an operating system 512, application programs 514 as well as an object store 516. During operation, operating system 512 is illustratively executed by processor 502 from memory 504. Operating system 512, in one illustrative embodiment, is a WINDOWS® CE brand operating system commercially available from Microsoft Corporation. Operating system 512 is illustratively designed for mobile devices, and implements database features that can be utilized by applications 514 through a set of exposed application programming interfaces and methods. The objects in object store 516 are maintained by applications 514 and operating system 512, at least partially in response to calls to the exposed application programming interfaces and methods.
Communication interface 508 represents numerous devices and technologies that allow mobile device 500 to send and receive information. The devices include wired and wireless modems, satellite receivers and broadcast tuners to name a few. Mobile device 500 can also be directly connected to a computer to exchange data therewith. In such cases, communication interface 508 can be an infrared transceiver or a serial or parallel communication connection, all of which are capable of transmitting streaming information.
Input/output components 506 include a variety of input devices such as a touch-sensitive screen, buttons, rollers, and a microphone as well as a variety of output devices including an audio generator, a vibrating device, and a display. The devices listed above are by way of example and need not all be present on mobile device 500. In addition, other input/output devices may be attached to or found with mobile device 500.
Mobile computing system 500 also includes network 520. Mobile computing device 501 is illustratively in wireless communication with network 520—which may be the Internet, a wide area network, or a local area network, for example—by sending and receiving electromagnetic signals 599 of a suitable protocol between communication interface 508 and wireless interface 522. Wireless interface 522 may be a wireless hub or cellular antenna, for example, or any other signal interface. Wireless interface 522 in turn provides access via network 520 to a wide array of additional computing resources, illustratively represented by computing resources 524 and 526. Naturally, any number of computing devices in any locations may be in communicative connection with network 520. Computing device 501 is enabled to make use of executable instructions stored on the media of memory component 504, such as executable instructions that enable computing device 501 to implement various functions of using line spectrum pair density modeling for automatic speech applications, in an illustrative embodiment.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. As a particular example, while the terms "computer", "computing device", or "computing system" may herein sometimes be used alone for convenience, it is well understood that each of these could refer to any computing device, computing system, computing environment, mobile device, or other information processing component or context, and is not limited to any individual interpretation. As another particular example, while many embodiments are presented with illustrative elements that are widely familiar at the time of filing the patent application, it is envisioned that many new innovations in computing technology will affect elements of different embodiments, in such aspects as user interfaces, user input methods, computing environments, and computing methods, and that the subject matter defined by the claims may be embodied according to these and other innovative advances while still remaining consistent with and encompassed by the claims herein.