A hybrid approach recently explored in Text-to-Speech (TTS) synthesis concatenates natural speech segments with artificial segments generated from a statistical model. Herein, this approach is referred to as Multi-Form Segment (MFS) synthesis, the natural segments are referred to as template segments or templates, and the artificial segments generated from statistical models are referred to as model segments. A voice dataset of an MFS TTS system contains a templates database and a set of statistical models typically represented by states of Hidden Markov Models (HMM). Each statistical model corresponds to a distinct context-dependent phonetic element. A many-to-one mapping exists that establishes an association between the templates and the statistical models. At synthesis time, input text is converted to a sequence of the context-dependent phonetic elements. Then, each element can be represented by either a template or a model segment generated from the corresponding statistical model.
The motivation behind the MFS approach is to combine the advantages of the unit selection, or concatenative, TTS paradigm, which operates purely on template segments, and the statistical TTS paradigm to build a flexible system that produces natural sounding speech with stable quality for a wide range of system footprints. However, if the voice character differs significantly between the concatenated template and model segments, the switching between the template and model segments degrades the perceived quality of the output. The perceptual quality of the MFS synthesis output strongly depends on the representation type (template versus model) selected for each segment comprising the synthesized sentence. If the representation type decision is made off-line prior to synthesis for all of the segments available within the voice dataset, then the templates database can be pruned, resulting in system footprint reduction, as model segments can be stored more compactly than template segments.
In another context, there is the problem of how to select a speaker for building a statistical TTS system. Voice dataset preparation for statistical TTS model training is a labor-intensive and time-consuming process. It typically includes the recording of several hours (e.g., 5-10 hours) of speech in a studio environment over several sessions, followed by several person-weeks of manual error correction in the speech transcripts and in the phonetic alignment. Characteristics of the recorded voice significantly influence the final quality of the generated speech. The models produced from one speaker may perform better than those built from another, even when the gender, recording conditions, and build process are very similar.
An embodiment according to the invention provides a capability of automatically predicting how favorable a given speech signal is for statistical modeling, which is advantageous in a variety of different contexts. In Multi-Form Segment (MFS) synthesis, for example, an embodiment according to the invention uses this capability to provide an automatic acoustic driven template versus model decision maker with an output quality that is high, stable, and depends gradually on the system footprint. In speaker selection for a statistical Text-to-Speech synthesis (TTS) system build, as another example context, an embodiment according to the invention enables a fast selection of the most appropriate speaker among several available ones for the full voice dataset recording and preparation, based on a small amount of recorded speech material. An embodiment according to the invention may be used in other contexts in which it is advantageous to determine suitability of a speech signal for statistical modeling automatically.
In accordance with an embodiment of the invention, there is provided a system (or corresponding method) for automatically determining suitability of at least a portion of a speech signal for statistical modeling. The system comprises a modelability estimator configured to determine a statistical modelability score of the at least a portion of the speech signal, the determining of the statistical modelability score being based at least in part on determining a temporal stationarity of the at least a portion of the speech signal; and a decision maker configured to determine suitability of the at least a portion of the speech signal for statistical modeling based at least in part on the statistical modelability score. As used herein, a “temporal stationarity” of a signal is a measure of the extent to which an instantaneous characteristic of the signal varies with respect to time.
In further, related embodiments, the modelability estimator may be further configured to determine the temporal stationarity based on variability of an instantaneous spectrum of the at least a portion of the speech signal. The modelability estimator may be still further configured to determine the variability of the instantaneous spectrum based on (i) a first moment of an instantaneous spectrum component distribution and (ii) a second moment of the instantaneous spectrum component distribution.
In further, related embodiments, the decision maker may be further configured to determine a segment representation type in a multi-form segment speech synthesis based on the statistical modelability score. The modelability estimator may be further configured to determine the statistical modelability score for at least one segment comprising at least a portion of an output speech signal being synthesized, and the decision maker may be further configured to determine the segment representation type, for the at least one segment, based on at least the statistical modelability score for the at least one segment. The modelability estimator may be further configured to determine, for at least one segment comprising at least a portion of an output speech signal being synthesized, the statistical modelability score for a segment cluster that includes the at least one segment, and the decision maker may be further configured to determine the segment representation type, for the at least one segment, based on at least the statistical modelability score of the segment cluster that includes the at least one segment. The system may further comprise a templates pruner configured to remove from a voice dataset at least one segment relative to its statistical modelability score. The statistical modelability score may be further based at least in part on a loudness score.
In another related embodiment, the decision maker may be further configured to determine a preferred speaker selection for building a statistical text-to-speech system based on the statistical modelability score determined for speech provided by each of a plurality of speakers.
The foregoing will be apparent from the following more particular description of example embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the present invention.
A description of example embodiments of the invention follows.
Open questions in Multi-Form Segment (MFS) synthesis are whether devising an automatic acoustic driven template versus model decision maker is possible so that the output quality is highly natural, homogeneous and depends gradually on the system footprint, and, if possible, how to devise such a decision maker.
In another context, i.e., the context of selecting a speaker for building a statistical TTS system, it would be useful to have a method for the final statistical TTS quality prediction based on a small amount of recorded speech material provided by a candidate speaker. Such a method would enable a fast selection of the most appropriate speaker among several available ones for the full voice dataset recording and preparation.
At first glance, the two above mentioned problems seem different from each other. However, the solutions to both problems require the same capability: an automatic acoustic properties based prediction of how favorable a given speech signal is for statistical modeling in terms of human perception.
Embodiments according to the present invention provide:
1) A method of estimating statistical modelability of a given speech segment. As used herein, a “statistical modelability” or simply “modelability” is favorability of a given speech segment for statistical modeling, or, in other words, how accurately the speech segment can be represented by a statistical model trained on similar segments from a human perception viewpoint. The method is based on temporal stationarity estimation of the speech segment. As used herein, a “temporal stationarity” of a signal is a measure of the extent to which an instantaneous characteristic of the signal varies with respect to time.
2) A method of determining a temporal stationarity score for a given speech segment using an instantaneous spectrum in the form of a Short-Time Fourier Transform transformed to a perceptual scale as the instantaneous characteristic above. For example, the score may be based on the first and second moments of the instantaneous spectrum component distribution within the segment. The score is indicative of the segment modelability.
3) A method for speech segment representation type selection in multi-form speech synthesis. As used herein, a “segment” is a contiguous portion of a speech signal representing a basic context-dependent phonetic element, e.g., one third of a phoneme, which may for example be used by a target MFS system.
In one embodiment, a statistical modelability score combining the stationarity and loudness is computed and stored for each template segment available in the templates database. The scores can be used at synthesis time for dynamic selection of the representation type (model versus template).
In another embodiment, aiming at system footprint reduction, the method operates on a cluster of segments derived from a plurality of speech signals rather than on an individual segment. A cluster is associated with a distinct statistical model of the MFS system. Typically the model is given in the form of a Hidden Markov Model (HMM) state. The clustering procedure is commonly implemented using a contextual decision tree built for the spectral parameter stream. The clusters are associated with the leaves of the tree and are referred to herein as “acoustic leaves” or simply “leaves.” Depending on the target footprint, each leaf is classified as template or model based on a statistical modelability score combining the stationarity and loudness statistics of its constituent segments. The template versus model classification above may be based on the statistical modelability score combined with phonological information. The natural segments associated with those leaves classified as model are removed from the voice dataset, which leads to the footprint reduction of the final MFS system. At synthesis time, the template versus model representation selection for a segment depends on whether the corresponding leaf still contains templates.
4) A method of selecting a preferred speaker for building a statistical TTS system. The selection process employs a small number of sentences (e.g., fewer than 100) from each candidate speaker. The speech data is segmented through an HMM-state level alignment process using an existing statistical acoustic model. The segmental stationarity statistics are compared between the candidate speakers. The speaker with the most stationary speech is selected.
For example, the modelability scores may be input to a segment representation type decision maker 110 for Multi-Form Segment (MFS) synthesis. In this case, the collection of segments 101 is the templates database. Such segments are typically provided by a single speaker. Depending on an embodiment as described below, the labels 102 may be provided in the form of acoustic leaf identifiers available in the MFS voice dataset. The modelability estimator 103, in accordance with one embodiment, comprises the following blocks: 1) a segmental stationarity estimator 104; 2) a segmental augmenting info extractor 105 configured to estimate loudness of a speech segment; 3) a normalizer 107 configured to map stationarity and loudness scores to the interval [0,1]; 4) a mixer 108 configured, for example, to calculate a linear combination of stationarity and loudness scores. Depending on the embodiment, as described below, if the labels 102 are provided, the modelability estimator 103 may further comprise a statistical analyzer 106 configured to calculate a percentile of segmental stationarity information and segmental augmenting information within clusters.
In another example, the modelability scores may be input to a preferred speaker decision maker 111 for statistical TTS. In this case, the input segments 101 are derived from speech signals provided by two or more candidate speakers and associated textual transcripts. The segments are preferably derived in the same way as they would be during TTS voice building. A technique known in the art can be employed to segment the transcribed speech signals into segments using a grapheme-to-phoneme converter and pre-existing statistical acoustic models. The segments are labeled by the respective candidate speaker identity. The modelability estimator 103, in accordance with one embodiment, comprises a segmental stationarity estimator 104 only.
In accordance with an embodiment of the invention, a segmental temporal stationarity score may be used as at least a part of the basis for an objective measure of the statistical modelability of a speech signal.
In accordance with an embodiment of the invention, an analyzed speech segment is divided into overlapping frames at a high frame rate, e.g., 1000 Hz. The frame length is chosen to be as small as possible provided that the frame includes at least one pitch cycle when the segment contains a portion of voiced speech. The frame size may be kept constant or may be adapted to the pitch information associated with the analyzed segment. Typically the segment contains tens of frames. Each frame is converted to a Perceptual Loudness Spectrum (PLS) known in the art. A similar conversion is utilized in the popular Perceptual Linear-Predictive (PLP) analysis front-end for Automatic Speech Recognition (ASR) described for example in Hermansky, H., “Perceptual linear-predictive analysis of speech”, The Journal of the Acoustical Society of America, 1990, the entire teachings of which are hereby incorporated by reference. The conversion comprises the following steps: 1) time windowing followed by the Fourier transform; 2) calculating the power spectrum; 3) filtering the power spectrum by a filter bank specified on the Bark frequency scale and accommodating the known psychoacoustic masking phenomena; 4) raising the components of the filter-bank output to the power of 0.3. The resulting PLS is a vector (e.g., of order 23 for 22 kHz speech) whose components are proportional to perceptual loudness levels associated with respective critical frequency bands.
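The four conversion steps above can be illustrated by the following minimal sketch. It is not the front-end of any embodiment: the Hann window, the naive discrete Fourier transform, and the rectangular equal-width Bark bands are simplifying assumptions (in particular, the rectangular bands stand in for a masking-aware critical-band filter bank), and the function and parameter names are hypothetical.

```python
import cmath
import math


def hz_to_bark(freq_hz):
    """Bark-scale warping used in PLP analysis: 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(freq_hz / 600.0)


def perceptual_loudness_spectrum(frame, sample_rate, num_bands=23, power=0.3):
    """Return a simplified PLS vector for one frame of samples."""
    n = len(frame)
    # 1) Hann time window followed by a (naive) discrete Fourier transform.
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    half = n // 2 + 1
    spectrum = [sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                    for i in range(n))
                for k in range(half)]
    # 2) Power spectrum.
    power_spec = [abs(c) ** 2 for c in spectrum]
    # 3) Sum the power into equal-width bands on the Bark scale.
    #    Assumption: rectangular bands replace the masking-aware filters.
    max_bark = hz_to_bark(sample_rate / 2.0)
    bands = [0.0] * num_bands
    for k, p in enumerate(power_spec):
        bark = hz_to_bark(k * sample_rate / n)
        idx = min(int(bark / max_bark * num_bands), num_bands - 1)
        bands[idx] += p
    # 4) Compress each band output by raising it to the power of 0.3.
    return [b ** power for b in bands]
```

A production front-end would typically use a fast Fourier transform and overlapping critical-band filters; the sketch only makes the order of the steps concrete.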
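Let
$$V(t) = [V_1(t), V_2(t), \ldots, V_N(t)], \quad t = 1, \ldots, T$$
be the PLS vector of the t-th frame of the segment, where N is the number of frequency bands; and T is the number of frames in the segment.
Let M1k and M2k be respectively empirical first and second moments of the k-th component of the PLS vector distribution within the segment:
$$M1_k = \frac{1}{T}\sum_{t=1}^{T} V_k(t), \qquad M2_k = \frac{1}{T}\sum_{t=1}^{T} V_k^2(t) \tag{1}$$
In accordance with an embodiment of the invention, the segment non-stationarity measure R can be defined as integral relative variance of the PLS vector components:
$$R = \frac{\sum_{k=1}^{N}\left(M2_k - M1_k^2\right)}{\sum_{k=1}^{N} M1_k^2} \tag{2}$$
In accordance with an embodiment of the invention, the temporal stationarity score of the segment is defined as:
$$S = \frac{1}{1 + R} \tag{3}$$
which yields
$$S = \frac{\sum_{k=1}^{N} M1_k^2}{\sum_{k=1}^{N} M2_k} \tag{4}$$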
The stationarity score of Equation (4) has the range [0,1]. It receives the value of 1 for an ideally stationary segment with invariant Perceptual Loudness Spectrum. The score receives the value 0 for an extremely non-stationary (singular) segment that has δ-like temporal loudness distribution, e.g., only one non-silent frame. To give an intuitive insight of the matter, a stationary segment has: a) a slowly evolving spectral envelope; and b) an excitation represented by a mix of quasi periodic and random stationary components. Typically, a segment representing a stable part of a vowel or a fricative sound has a high stationarity score. Transient sounds and plosive onsets have a low stationarity score. Other techniques of determining stationarity than that given in Equation (4) may be used.
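As an illustration of these properties, the following sketch computes a stationarity score from per-frame PLS vectors. It assumes that the score takes the form of the summed squared first moments of the PLS components divided by their summed second moments, which is one form consistent with the range and limiting behavior described above; the function name and the handling of an all-silent segment are hypothetical choices.

```python
def stationarity_score(pls_frames):
    """Temporal stationarity of a segment from its per-frame PLS vectors.

    pls_frames: list of T frames, each a list of N non-negative band values.
    Returns sum_k(M1_k^2) / sum_k(M2_k), where M1_k and M2_k are the
    empirical first and second moments of band k over the frames.
    """
    num_frames = len(pls_frames)
    num_bands = len(pls_frames[0])
    sum_m1_sq = 0.0
    sum_m2 = 0.0
    for k in range(num_bands):
        m1 = sum(frame[k] for frame in pls_frames) / num_frames
        m2 = sum(frame[k] ** 2 for frame in pls_frames) / num_frames
        sum_m1_sq += m1 * m1
        sum_m2 += m2
    if sum_m2 == 0.0:
        # Assumption: an all-silent segment is treated as invariant.
        return 1.0
    return sum_m1_sq / sum_m2
```

Under this assumed form, a segment of T frames with a single non-silent frame scores 1/T, so the score approaches 0 as the segment grows longer, while a segment whose PLS vector never changes scores 1.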
In accordance with embodiments of the invention, such segment stationarity scores may be used for determining a selection of segment representation type (template versus model) in multi-form speech synthesis. Specifically, the more stationary the segment is, the more favorable it is for being replaced by a model-based representation. Applicants have found that the representation type selection based on a combination of the stationarity and loudness performs better than the one based on the stationarity only. Without being bound by theory, this can be explained by the fact that the most stationary segments typically represent the louder parts of vowels. Hence, when such segments are replaced by model segments, the template-model joints and the modeled character of the voice can become audible. To include this sensitivity in the modelability score, the stationarity score may be augmented with a loudness score as defined below.
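In embodiments according to the invention, the temporal stationarity score is determined for each segment available in the templates database. Additionally, a loudness score may be determined for each segment, for example as the mean over the frequency bands of the first moments of the PLS vector components:
$$L = \frac{1}{N}\sum_{k=1}^{N} M1_k \tag{5}$$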
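In accordance with a first, “dynamic,” embodiment of the method, the stationarity scores and loudness scores are normalized over the voice dataset as described below. Let Sj and Lj be respectively the stationarity and loudness scores of segment j and J be the number of segments in the templates database. The normalized scores NSj and NLj, each lying in the interval [0,1], are calculated as:
$$NS_j = \frac{S_j - \min_{1 \le i \le J} S_i}{\max_{1 \le i \le J} S_i - \min_{1 \le i \le J} S_i}, \qquad NL_j = \frac{L_j - \min_{1 \le i \le J} L_i}{\max_{1 \le i \le J} L_i - \min_{1 \le i \le J} L_i} \tag{6}$$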
Further, in accordance with the first embodiment of the method, the segmental modelability score (SMOD) may be defined as:
$$\mathrm{SMOD}_j = 0.5 \cdot (NS_j + 1 - NL_j) \tag{7}$$
Such a segmental modelability score, defined within the range [0,1], receives a higher value as the segment is more stationary and less loud. Other techniques of determining such a segmental modelability score may be used; for instance, a non-linear combination of NSj and NLj may be used.
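The linear combination of Equation (7) can be sketched as follows. The sketch assumes min-max normalization of the stationarity and loudness scores over the templates database, which is one way of mapping both scores to the interval [0,1]; the function names and the treatment of a degenerate (constant) score list are hypothetical.

```python
def min_max_normalize(values):
    """Map a list of scores linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant score list carries no ranking information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def segmental_modelability(stationarity, loudness):
    """Return SMOD_j = 0.5 * (NS_j + 1 - NL_j) for every segment j."""
    ns = min_max_normalize(stationarity)
    nl = min_max_normalize(loudness)
    return [0.5 * (s + 1.0 - l) for s, l in zip(ns, nl)]
```

For stationarity scores [0.25, 0.5, 0.75] and loudness scores [3.0, 1.0, 2.0], the sketch gives modelability scores of 0.0, 0.75, and 0.75: the least stationary and loudest segment is the least favorable for model-based representation.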
In accordance with the first embodiment of the method, the segmental modelability scores are stored and used at synthesis time for segment representation type selection in an MFS synthesis system. For example, as an addition to be used within the context of the framework described in V. Pollet, A. Breen, “Synthesis by generation and concatenation of multiform segments”, in Proc. Interspeech 2008 (the entire teachings of which are hereby incorporated herein by reference), the segmental modelability scores determined in accordance with an embodiment of the present invention can serve as the channel cues employed in combination with phonologic cues for segment representation type selection. As another example, as an addition to be used within the framework described in U.S. Patent Application Publication No. US 2009/0048841 A1 of Pollet et al. (the entire teachings of which are hereby incorporated herein by reference), the segmental modelability scores determined in accordance with an embodiment of the present invention can be used to augment the information used by the model-template sequencer. As another example, as an addition to be used in the system described in S. Tiomkin et al., “A hybrid text-to-speech system that combines concatenative and statistical synthesis units,” IEEE Trans on Audio, Speech and Language Processing, v 19, no 5, July 2011 (the entire teachings of which are hereby incorporated herein by reference), the segmental modelability scores determined in accordance with an embodiment of the present invention may be incorporated in the natural mode versus statistical mode decision.
In accordance with a second, “static,” embodiment of the method, for each acoustic leaf cluster, the empirical distribution of the segmental stationarity score and segmental loudness score may be analyzed. A leaf stationarity measure (LSM) and leaf loudness measure (LLM) may be derived as certain percentiles of the respective empirical distribution within the leaf cluster. Typically the LSM and LLM are set close respectively to the lower and upper bound of the respective segmental score distribution within the leaf cluster. For example: the leaf is assigned LSM=S if 90% of the segments comprising it have the stationarity score values above S; and the leaf is assigned LLM=L if 90% of the segments comprising it have the loudness score values below L.
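The percentile rule in the example above can be sketched as follows. The nearest-rank percentile convention is an assumption (other conventions interpolate between neighboring values), and the function names and the `coverage` parameter are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile, one of several common conventions."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def leaf_measures(stationarity_scores, loudness_scores, coverage=90.0):
    """Return (LSM, LLM) for the segments of one acoustic leaf.

    LSM is a low percentile of the stationarity scores, so that roughly
    `coverage` percent of the segments lie at or above it; LLM is a high
    percentile of the loudness scores, so that roughly `coverage` percent
    of the segments lie at or below it.
    """
    lsm = percentile(stationarity_scores, 100.0 - coverage)
    llm = percentile(loudness_scores, coverage)
    return lsm, llm
```

Taking a pessimistic bound on both scores means that a leaf is rated as modelable only when nearly all of its segments are stationary and quiet enough.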
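In accordance with an embodiment of the invention, the leaf stationarity and loudness measures defined above may be normalized over the voice as follows. Let LSi and LLi be the LSM and LLM of the leaf i respectively, and I be the number of the acoustic leaves in the system. The normalized values NLSi and NLLi, each lying in the interval [0,1], are calculated as:
$$NLS_i = \frac{LS_i - \min_{1 \le m \le I} LS_m}{\max_{1 \le m \le I} LS_m - \min_{1 \le m \le I} LS_m}, \qquad NLL_i = \frac{LL_i - \min_{1 \le m \le I} LL_m}{\max_{1 \le m \le I} LL_m - \min_{1 \le m \le I} LL_m} \tag{8}$$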
Further, in accordance with an embodiment of the invention, the leaf modelability score (LMOD) may be defined as:
$$\mathrm{LMOD}_i = 0.5 \cdot (NLS_i + 1 - NLL_i) \tag{9}$$
Such a leaf modelability score is defined within the range [0,1]. Other techniques of determining such a leaf modelability score may be used; for instance, a non-linear combination of NLSi and NLLi may be used.
In accordance with an embodiment of the invention, all of the acoustic leaves may be ordered by their modelability score values. A target footprint reduction percentage P % is achieved by marking the required number of the most modelable acoustic leaves and removing all the template segments that are associated with them from the templates database. The number of the leaves to be marked is calculated such that the durations of the template segments associated with those leaves are summed up to approximately P % of the total duration of all of the template segments in the original templates database. The reduced voice dataset is used for the synthesis. At synthesis time, segments associated with the marked (free of templates) leaves are generated from the respective statistical parametric models while other leaves are represented by templates.
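The leaf-marking procedure described above can be sketched as follows. The tuple layout of the input and the stopping rule (stop once the accumulated template duration reaches the target) are illustrative assumptions; the function name is hypothetical.

```python
def select_leaves_to_prune(leaves, target_percent):
    """Mark the most modelable leaves until the target duration is reached.

    leaves: list of (leaf_id, lmod_score, template_duration_seconds).
    target_percent: desired footprint reduction P, in percent of the total
    template duration in the original templates database.
    Returns the identifiers of the marked leaves, most modelable first.
    """
    total = sum(duration for _, _, duration in leaves)
    target = total * target_percent / 100.0
    marked = []
    removed = 0.0
    # Visit the leaves in order of decreasing modelability score.
    for leaf_id, _, duration in sorted(leaves, key=lambda x: x[1], reverse=True):
        if removed >= target:
            break
        marked.append(leaf_id)
        removed += duration
    return marked
```

With leaves `[("a", 0.9, 10.0), ("b", 0.2, 30.0), ("c", 0.7, 20.0), ("d", 0.5, 40.0)]` and a target of 30%, the sketch marks leaves "a" and "c", whose templates together account for 30 of the 100 seconds.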
In accordance with an embodiment of the invention, generation of model segments is carried out in a way that reduces discontinuities at template-model joints using techniques known in the art, for example the boundary constrained model generation described in S. Tiomkin et al., “A hybrid text-to-speech system that combines concatenative and statistical synthesis units”, IEEE Trans on Audio, Speech and Language Processing, v 19, no 5, July 2011, the entire teachings of which are hereby incorporated herein by reference.
The method disclosed above, in accordance with an embodiment of the invention, produces high quality speech within a wide range of footprints.
In accordance with an embodiment of the invention, the segment or leaf representation type decision may also be based on other contributing factors, such as phonologic cues. The final decision may be based on both phonologic and signal-based cues. Alternatively, it is also possible to use only the modelability scores described above as the basis for the segment or leaf representation type decision. This may be useful where, for example, there is little or no phonologic knowledge available (for example, with a new language).
It will be appreciated that a combination of the first (dynamic) and second (static) embodiments of the method described above can be devised. In such a combined embodiment the modelability estimator is configured to provide both segmental and leaf modelability scores. The MFS voice dataset is pruned by removing entire leaf clusters and individual segments based on the leaf modelability scores and segmental modelability scores respectively. At synthesis time, the segments associated with the “empty” leaves are generated from statistical models while a dynamic selection of representation type is applied to the other segments.
In accordance with another embodiment of the invention, the segment stationarity scores described above may be used for determining a preferred speaker selection for a statistical TTS system build. A relatively small number (e.g., 50) of sentences read out by each candidate speaker is recorded. The following process is applied to the recording set associated with each candidate speaker. An HMM-state level alignment and segmentation is applied to each speech signal using a pre-existing acoustic model. The temporal stationarity score of Equation (4) is calculated for each segment. The empirical distribution of the segmental stationarity scores is analyzed and a speaker voice modelability score is derived, e.g., as the empirical mean or median value. The modelability scores associated with the speakers are compared to each other and the speaker having the highest one is selected.
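The comparison step of this process can be sketched as follows, assuming that the segmental stationarity scores have already been computed for each candidate speaker and that the median is used as the speaker voice modelability score. The function name and the dictionary layout are hypothetical.

```python
from statistics import median


def select_speaker(scores_by_speaker):
    """Pick the speaker whose segmental stationarity scores are highest.

    scores_by_speaker: dict mapping a speaker identifier to the list of
    stationarity scores of that speaker's segments.
    Returns (best_speaker, per_speaker_modelability_scores).
    """
    speaker_scores = {spk: median(scores)
                      for spk, scores in scores_by_speaker.items()}
    best = max(speaker_scores, key=speaker_scores.get)
    return best, speaker_scores
```

The median is less sensitive than the mean to a few badly aligned segments; replacing `median` with `statistics.mean` gives the other variant mentioned above.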
An embodiment according to the invention may be used in other contexts in which it is advantageous to automatically determine suitability of a speech signal for statistical modeling.
A system in accordance with the invention has been described in which the suitability of at least a portion of a speech signal for statistical modeling is determined. Components of such a system, for example a modelability estimator, decision maker, templates pruner and other systems discussed herein may, for example, be a portion of program code, operating on a computer processor.
Portions of the above-described embodiments of the present invention can be implemented using one or more computer systems, for example to determine suitability of at least a portion of a speech signal for statistical modeling. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be stored on any form of non-transient computer-readable medium and loaded and executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers.
Further, it should be appreciated that a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer. Additionally, a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smart phone or any other suitable portable or fixed electronic device.
Also, a computer may have one or more input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards, and pointing devices, such as mice, touch pads, and digitizing tablets. As another example, a computer may receive input information through speech recognition or in other audible format.
Such computers may be interconnected by one or more networks in any suitable form, including as a local area network or a wide area network, such as an enterprise network or the Internet. Such networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.
Also, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
In this respect, at least a portion of the invention may be embodied as a computer readable medium (or multiple computer readable media) (e.g., a computer memory, one or more floppy discs, compact discs, optical discs, magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement the various embodiments of the invention discussed above. The computer readable medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various aspects of the present invention as discussed above.
In this respect, it should be appreciated that one implementation of the above-described embodiments comprises at least one computer-readable medium encoded with a computer program (e.g., a plurality of instructions), which, when executed on a processor, performs some or all of the above-discussed functions of these embodiments. As used herein, the term “computer-readable medium” encompasses only a non-transient computer-readable medium that can be considered to be a machine or a manufacture (i.e., article of manufacture). A computer-readable medium may be, for example, a tangible medium on which computer-readable information may be encoded or stored, a storage medium on which computer-readable information may be encoded or stored, and/or a non-transitory medium on which computer-readable information may be encoded or stored. Other non-exhaustive examples of computer-readable media include a computer memory (e.g., a ROM, a RAM, a flash memory, or other type of computer memory), a magnetic disc or tape, an optical disc, and/or other types of computer-readable media that can be considered to be a machine or a manufacture.
The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of computer-executable instructions that can be employed to program a computer or other processor to implement various aspects of the present invention as discussed above. Additionally, it should be appreciated that according to one aspect of this embodiment, one or more computer programs that when executed perform methods of the present invention need not reside on a single computer or processor, but may be distributed in a modular fashion amongst a number of different computers or processors to implement various aspects of the present invention.
Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically the functionality of the program modules may be combined or distributed as desired in various embodiments.
The teachings of all patents, published applications and references cited herein are incorporated by reference in their entirety.
While this invention has been particularly shown and described with references to example embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.