The present disclosure relates to a voice analysis apparatus, a voice synthesis apparatus, and a voice analysis synthesis system.
Speech synthesis methods are classified into a unit-selection speech synthesis method and a statistical parametric speech synthesis method.
While the unit-selection speech synthesis method may synthesize high quality speech, it has limitations, such as excessive database dependency and difficulty in voice characteristics transformation. The statistical parametric speech synthesis method has advantages such as low database dependency, a small database size, and easy voice characteristics transformation, whereas it has a disadvantage, such as low quality of synthesized speech. Based on those characteristics, any one of the above two methods is selectively used for speech synthesis.
As a kind of statistical parametric speech synthesis, the Hidden Markov Model (HMM)-based speech synthesis system has been well known. In the HMM-based speech synthesis system, core factors determining speech quality are representation/reconstruction of a speech signal, training accuracy of sentence database, and smoothing intensity of output parameters generated in a training model.
Meanwhile, as related-art speech modeling methods for representation/reconstruction of a speech signal, a Pulse or Noise (PoN) model and a speech transformation and representation using adaptive interpolation of weighted spectrum (STRAIGHT) model have been proposed. The PoN model synthesizes speech by dividing it into an excitation part and a spectral part. The STRAIGHT model represents speech using three parameters: a pitch value F0, a spectrum smoothed in the frequency region, and an aperiodicity for reconstructing the aperiodicity of the signal that disappears in the course of spectral smoothing.
Since the STRAIGHT model uses a small number of parameters, degradation of the reconstructed speech is small. However, the STRAIGHT model has drawbacks such as difficulty in the F0 search and an increase in the complexity of signal representation due to extraction of the aperiodicity spectrum. Thus, a new model for representation/reconstruction of a speech signal is required.
Embodiments provide a speech analysis apparatus, a speech synthesis apparatus, and a speech analysis synthesis system that enable synthesis of speech closer to the original voice.
Embodiments also provide a speech analysis apparatus, a speech synthesis apparatus, and a speech analysis synthesis system that enable speech to be represented with less data.
In one embodiment, a speech analysis apparatus includes: an F0 extraction part extracting a pitch value from speech information; a spectrum extraction part extracting spectrum information from the speech information; and an MVF extraction part extracting a maximum voiced frequency and allowing boundary information for respectively filtering a harmonic component and a non-harmonic component to be obtained.
In another embodiment, a speech synthesis apparatus that synthesizes speech after separately generating a harmonic component and a non-harmonic component includes: a low-pass filter performing filtering when the harmonic component is generated; and a high-pass filter performing filtering when the non-harmonic component is generated.
In still another embodiment, a speech analysis synthesis system includes: a speech signal analysis part analyzing a speech signal; a statistical model training part training a parameter analyzed by the speech signal analysis part; a database storing the parameter trained by the statistical model training part; a parameter generating part extracting the parameter corresponding to a specific character from the database when a character is inputted; and a synthesis part synthesizing speech by using the parameter, wherein the parameter comprises a pitch value, spectrum information, and an MVF value which is defined as a boundary frequency value between a section having a relatively large harmonic component and a section having a relatively small harmonic component.
According to the speech analysis apparatus, speech synthesis apparatus, and speech analysis synthesis system of the present invention, speech that is closer to the original voice and is more natural may be synthesized. Also, speech may be represented with less data capacity.
The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
Reference will now be made in detail to the embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings.
A speech modeling method according to an embodiment will now be described.
It is known that a speech signal consists of a harmonic component and a non-harmonic component. A speech modeling method according to an embodiment analyzes the harmonic component and the non-harmonic component, respectively, based on such a fact. Equation 1 indicates that an arbitrary given speech signal consists of a harmonic component and a non-harmonic component.
s(n)=sh(n)+snh(n), [Equation 1]
where s(n) is a given speech signal, sh(n) is a harmonic signal, and snh(n) is a non-harmonic signal. A speech representation model according to the embodiment is characterized by separately processing and synthesizing the harmonic signal and the non-harmonic signal. The speech representation model defined in the embodiment may be named a harmonic non-harmonic (HNH) model. In the description below, it is referred to as either the harmonic non-harmonic speech model or the HNH model.
Herein, sh(n) is a periodic accumulation of unit speech components fm(n), and may be represented as Equation 2.
where m is the index of the F0 (pitch) value, l is an accumulation index, and S is the sampling rate. Also, f(n,m), which denotes one frame, takes different values on the time axis for each m, and its length is always N. m may be defined as a predetermined range that is represented by one F0 value. In the present embodiment, N is 1024. p(m) indicates the F0 value for each m, and the F0 value may represent pitch information. When p(m)=0, sh(n) is 0, and thus the corresponding region may be treated as an unvoiced region free of a harmonic component without calculating Equation 2.
In Equation 2, it is noted that the range of l is subject to the condition of Equation 3.
where M is the duration of p(m) in samples, i.e., the number of samples over which the same p(m) is held. In this embodiment, M is set to 80, which corresponds to 5 ms at a sampling frequency (S) of 16 kHz. For example, under the above condition, when p(m) is 200 Hz, l takes only the value 0, so f(n,m) is added only once; when p(m) is 201 Hz, l takes the values 0 and 1, so the value one step earlier on the time axis and the current value may be added; and when p(m) is 401 Hz, l takes the values 0, 1, and 2, so the values one and two steps earlier and the current value may be added. This process is necessary for acquiring an accurate speech signal in relation to the subsequent processing in the frequency region.
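The worked examples for Equation 3 can be checked with a short sketch. Equation 3 itself is not reproduced in this text, so the rule used below (l runs over the integers satisfying 0 ≤ l < p(m)·M/S) is only an assumption chosen because it matches the three examples given above; the function names are hypothetical.

```python
from typing import List


def block_duration_ms(M: int = 80, S: int = 16000) -> float:
    """Duration in milliseconds of M samples at sampling rate S."""
    return 1000 * M / S


def accumulation_indices(p_hz: int, M: int = 80, S: int = 16000) -> List[int]:
    """Values of the accumulation index l for an integer pitch value p_hz.

    Assumed rule (inferred from the worked examples, not taken from
    Equation 3): l takes the integers with 0 <= l < p_hz * M / S.
    """
    if p_hz <= 0:
        return []  # unvoiced region: no harmonic accumulation
    count = (p_hz * M + S - 1) // S  # integer ceiling of p_hz * M / S
    return list(range(count))


if __name__ == "__main__":
    print(block_duration_ms())          # 5.0
    for pitch in (200, 201, 401):
        print(pitch, accumulation_indices(pitch))
```

With M = 80 and S = 16 kHz, this gives one, two, and three accumulation steps for 200 Hz, 201 Hz, and 401 Hz, respectively, as stated in the description.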
Meanwhile, in Equation 2, h(n,m) acts as a low pass filter with a specific cut-off frequency, and the cut-off frequency may be defined by v(m), which is the boundary between the harmonic and non-harmonic components. In other words, v(m) may mean a boundary value between a section having a sufficiently high harmonic energy and a section not having a sufficiently high harmonic energy.
In Equation 1, the non-harmonic speech signal snh(n) may be modeled as Equation 4, similarly to the harmonic speech signal.
The non-harmonic speech signal may also be provided based on the harmonic speech signal. In Equation 4, f(n,m) consists of different values for each m on the time axis, as in Equation 2, and its length is always N. r(n) is white noise, i.e., a Gaussian-distributed random sequence. As represented in the lower part of Equation 4, it becomes 4p(m) when p(m) is greater than 0, and 800 otherwise. Also, hH(n,m) is a high pass filter and may use v(m), the boundary value between the harmonic component and the non-harmonic component, as the cut-off frequency.
Also, G is a gain value of the non-harmonic speech signal for controlling the power ratio of the harmonic component to the non-harmonic component so that it is similar to that of the input speech.
As described previously, real speech signals contain both harmonic and non-harmonic components in a voiced region. To more fully realize such a characteristic in the speech modeling method according to the embodiment, the filter values contained in Equations 2 and 4 may be defined as Equation 5.
where v(m) is a maximum voiced frequency (MVF). Therefore, when analysis is performed in the frequency region, the absolute value of HL(k,m) decreases when k is greater than v(m), and the absolute value of HL(k,m) becomes 1 when k is less than v(m). The absolute value of HH(k,m) is a value obtained by subtracting the absolute value of HL(k,m) from 1.
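The relationship between the two filter magnitudes can be sketched as follows. Equation 5 is not reproduced in this text, so the exact roll-off above v(m) is unknown; the linear decay and the `rolloff_bins` parameter below are assumptions made only for illustration. What is taken from the description is that |HL| is 1 below v(m), decreases above it, and that |HH| equals 1 minus |HL|.

```python
from typing import List, Tuple


def boundary_filter_magnitudes(
    num_bins: int, v_bin: int, rolloff_bins: int = 8
) -> Tuple[List[float], List[float]]:
    """Return (|HL|, |HH|) per frequency bin for a boundary at v_bin.

    The linear roll-off over `rolloff_bins` bins is an assumed shape,
    not the one defined by Equation 5.
    """
    low = []
    for k in range(num_bins):
        if k <= v_bin:
            low.append(1.0)  # pass band of the low pass filter
        else:
            # assumed linear decay above the boundary, clipped at zero
            low.append(max(0.0, 1.0 - (k - v_bin) / rolloff_bins))
    high = [1.0 - a for a in low]  # |HH| = 1 - |HL|
    return low, high


if __name__ == "__main__":
    hl, hh = boundary_filter_magnitudes(num_bins=32, v_bin=10)
    print([round(a, 3) for a in hl[8:20]])
    print([round(b, 3) for b in hh[8:20]])
```

Because the two magnitudes are complementary, the energy removed from the harmonic band at each bin is exactly the share assigned to the non-harmonic band at that bin.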
Equation 5 may be provided in the form of a graph as shown in
According to the above description, when real speech is represented using the HNH model, the speech modeling method according to the embodiment may represent and reconstruct speech by using four parameters.
1. p(m): Pitch Value
First, the pitch value p(m) is given as F0. This value may be obtained by applying the well-known robust algorithm for pitch tracking (RAPT). The RAPT technology is to be construed as included in the description of the present application, and p(m) may naturally also be found through methods other than RAPT.
2. F(k,m): Spectral Information
Secondly, spectral information F(k,m) may be obtained by a fast Fourier transform (FFT) of f(n,m), and is represented as Equation 6.
where w(n,m) refers to an F0-adaptive window function, and this function may smooth high harmonics to inhibit frequency interference between adjacent spectra.
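As a rough illustration of the step described for Equation 6, the sketch below windows a frame and transforms it to the frequency region. The window shape is an assumption: the text only says that w(n,m) adapts to F0, so a Hann window spanning about two pitch periods is used here as a stand-in. A direct DFT is used instead of an FFT to keep the example dependency-free; all function names are hypothetical.

```python
import cmath
import math
from typing import List


def f0_adaptive_window(n_len: int, p_hz: float, S: int = 16000) -> List[float]:
    """Hann window about two pitch periods wide, centred in the frame.

    Assumed shape only; the actual w(n,m) is not given in this text.
    """
    width = n_len if p_hz <= 0 else min(n_len, max(2, round(2 * S / p_hz)))
    start = (n_len - width) // 2
    w = [0.0] * n_len
    for i in range(width):
        w[start + i] = 0.5 - 0.5 * math.cos(2 * math.pi * i / (width - 1))
    return w


def dft(x: List[float]) -> List[complex]:
    """Direct discrete Fourier transform (O(N^2)); an FFT gives the same values."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def spectral_information(frame: List[float], p_hz: float, S: int = 16000) -> List[complex]:
    """Windowed transform of one frame f(n,m), standing in for F(k,m)."""
    w = f0_adaptive_window(len(frame), p_hz, S)
    return dft([a * b for a, b in zip(frame, w)])


if __name__ == "__main__":
    N = 64
    frame = [math.cos(2 * math.pi * 4 * n / N) for n in range(N)]
    F = spectral_information(frame, p_hz=0.0)
    mags = [abs(F[k]) for k in range(N // 2)]
    print(max(range(N // 2), key=mags.__getitem__))  # index of the strongest bin
```

For frames of the length used in the embodiment (N = 1024), an FFT routine would be the practical choice; the direct transform is shown only because it is short.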
3. v(m): Maximum Voiced Frequency (MVF)
Thirdly, maximum voiced frequency (MVF) may be calculated through two steps. A method of calculating the MVF will be described with reference to
Referring to
According to Equation 7, v(m) may be obtained for the frame at the specific time represented by m. Here, argmax is a function that returns the value of j at which its argument is highest.
When the value of v(m) is found, HL(n,m) and HH(n,m) may be obtained using Equation 5.
4. G: Gain Value
Fourthly, the gain value may be obtained by respectively obtaining the gain value (Gh) of the harmonic component and the gain value (Gnh) of the non-harmonic component and then obtaining their ratio. Equation 8 is used to obtain the gain value of each of the harmonic component and the non-harmonic component.
In Equation 8, s(n) is an input speech signal, and ŝh and ŝnh are speech signals which are arbitrarily reconstructed by the pseudo-synthesis part (see 24 of
Meanwhile, a large portion of the energy of the speech signal is positioned in the low frequency region, i.e., the harmonic region, and for the harmonic speech signal, the reconstructed speech signal almost corresponds to the input speech signal. In contrast, the reconstructed non-harmonic signal is not accurate because of the randomness in the number of overlap-add (OLA) operations. Therefore, the final gain value may be represented as a relative ratio (Gnh/Gh) of the gain value of the non-harmonic component to the gain value of the harmonic component. By obtaining the gain values as above, the proportions of the harmonic component and the non-harmonic component may be maintained even without an additional operation.
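The final ratio Gnh/Gh can be illustrated with a small sketch. Equation 8 is not reproduced in this text, so the per-component gain used below (the square root of the energy ratio between a reference signal and its pseudo-synthesized counterpart) is a stand-in assumption and not the formula of the embodiment; only the final relative ratio follows the description. All names are hypothetical.

```python
import math
from typing import Sequence


def energy(x: Sequence[float]) -> float:
    """Sum of squared samples."""
    return sum(v * v for v in x)


def component_gain(reference: Sequence[float], reconstructed: Sequence[float]) -> float:
    """Assumed stand-in for Equation 8: root energy ratio of the two signals."""
    e = energy(reconstructed)
    return math.sqrt(energy(reference) / e) if e > 0 else 0.0


def relative_gain(g_h: float, g_nh: float) -> float:
    """Final gain G expressed as the ratio Gnh / Gh, as in the description."""
    return g_nh / g_h if g_h > 0 else 0.0


if __name__ == "__main__":
    g_h = component_gain([2.0, 2.0], [1.0, 1.0])    # harmonic component
    g_nh = component_gain([1.0, 1.0], [1.0, 1.0])   # non-harmonic component
    print(g_h, g_nh, relative_gain(g_h, g_nh))
```

Storing only the ratio means a single value per frame carries the balance between the two components, which is consistent with the gain being one of the four parameters.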
As suggested in the above description, the HNH model according to the embodiment may analyze and synthesize speech by using the parameters denoted as the pitch value (p(m)), the spectrum information (F(k,m)), the maximum voiced frequency (MVF) (v(m)), and the gain value (G). An apparatus for specifically analyzing and synthesizing speech will be understood with reference to the explanation to be described later.
Referring to
Through the above processes, the pitch value (F0), the spectrum information (sp), the maximum voiced frequency (MVF), and the gain value (G) for the specific speech signal (s(n)) are extracted. Thereafter, a training process is performed by a statistical parametric speech synthesis method, such as the Hidden Markov Model (HMM). Through the training process, the four parameters representing the specific speech signal (s(n)) may be deduced and stored in a database. The specific speech signal may be provided as phonemes, syllables, words, or the like.
A speech analysis synthesis system according to an embodiment will be described in more detail with reference to the block diagram of
Referring to
The four parameters may be the pitch value (p(m)), the spectrum information (F(k,m)), the maximum voiced frequency (MVF) (v(m)), and the gain value (G). It may be understood that a detailed configuration of the harmonic non-harmonic analysis part 2 includes the block diagram of
Referring to
In detail, the harmonic non-harmonic parameter synthesis part includes a time region transforming part 51 transforming the spectrum information sp′ in a frequency region to a time region to output frame information (f′(n,m)), and a harmonic boundary filter generating part 52 generating a boundary filter according to Equation 5 by using the maximum voiced frequency (MVF′). The harmonic boundary filter generating part 52 generates a harmonic boundary filter (h′H(n,m)) applied to the synthesis harmonic speech signal, and a non-harmonic boundary filter (h′NH(n,m)) applied to the synthesis non-harmonic speech signal. The pitch value, the boundary filters, the frame information, and the gain value are transmitted to a harmonic component generating part 53 and a non-harmonic component generating part 54 to synthesize a synthesis harmonic speech signal and a synthesis non-harmonic speech signal, respectively. The synthesized harmonic speech signal and non-harmonic speech signal are combined in a synthesis part 56 for output.
In detail, the harmonic component generating part 53 may synthesize the harmonic component by using the pitch value, the frame information, the gain value, and the boundary filter provided as a low pass filter. The non-harmonic component generating part 54 may synthesize the non-harmonic component by using the pitch value, the frame information, the gain value, and the boundary filter provided as a high pass filter. The harmonic component generating part 53 and the non-harmonic component generating part 54 may perform the synthesis according to Equations 2 and 4, respectively.
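The periodic accumulation performed for the harmonic component can be sketched as an overlap-add of one frame at intervals of one pitch period. This is a simplified illustration under stated assumptions: Equation 2 is not reproduced in this text, the frame is assumed to be already low-pass filtered, the spacing of S/p(m) samples is an assumption consistent with the 200 Hz example (one addition per 80-sample block), and the gain, the noise path of Equation 4, and frame-to-frame changes are omitted. The function names are hypothetical.

```python
from typing import List


def overlap_add(frame: List[float], hop: int, count: int) -> List[float]:
    """Sum `count` copies of `frame`, each shifted by `hop` samples."""
    out = [0.0] * (hop * (count - 1) + len(frame))
    for i in range(count):
        for n, value in enumerate(frame):
            out[i * hop + n] += value
    return out


def harmonic_from_frame(
    frame: List[float], p_hz: float, count: int, S: int = 16000
) -> List[float]:
    """Periodic accumulation of one frame at the pitch period (assumed spacing)."""
    if p_hz <= 0:
        return [0.0] * len(frame)  # unvoiced: the harmonic signal is zero
    hop = max(1, round(S / p_hz))  # pitch period in samples
    return overlap_add(frame, hop, count)


if __name__ == "__main__":
    print(harmonic_from_frame([1.0, 0.5], p_hz=8000, count=3))
    print(harmonic_from_frame([1.0, 0.5], p_hz=16000, count=3))
```

When the pitch period is shorter than the frame, consecutive copies overlap and their samples add, which is why higher pitch values lead to more accumulation steps.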
Hereinafter, speech signals synthesized using the HNH model according to the embodiment are analyzed and compared with those obtained using the PoN model and the STRAIGHT model.
<Size Comparison>
First, data sizes used in the modeling method are compared.
Referring to Table 1, it may be seen that the harmonic non-harmonic model according to the embodiment has a larger total data size than the PoN model, but has a smaller total data size than the STRAIGHT model. Considering that in the case of the PoN model, the synthesized speech quality is coarse and thus it is difficult to compare the data sizes, it may be seen that the harmonic non-harmonic model decreases in total data size by 3, compared with the STRAIGHT model.
<Quality Evaluation 1>
In quality evaluation 1, after reference speeches were analyzed and synthesized by the PoN model, the STRAIGHT model, and the HNH model, both objective and subjective speech quality measurements were performed in order to evaluate the quality of the synthesized speech and its similarity to the original speech. Sample data were prepared as follows. Ten samples were used for reference from each of the CMU-ARCTIC-SLT and CMU-ARCTIC-AWB speech databases.
First, the subjective speech quality evaluation was performed via a Mean Opinion Score (MOS) listening test using a PCM reference speech and the speeches synthesized by the PoN model, the STRAIGHT model, and the HNH model. Eleven listeners participated in the test. For each sample, scores were recorded on a 1 to 4.5 scale; hidden references were also included in the test set.
The objective evaluation was performed via PESQ (Perceptual Evaluation of Speech Quality). Here, the four sets of 20 samples used in the MOS listening test were reused in the objective evaluation. Note that the results were averaged separately over the samples from the CMU-ARCTIC-SLT and CMU-ARCTIC-AWB speech databases.
Both the MOS and PESQ results are presented in
Referring to
Referring to
<Quality Evaluation 2>
In the quality evaluation 2, the qualities of speeches synthesized from text labels by using the PoN model, the STRAIGHT model, and the HNH model were compared. The HMM-based speech synthesis systems were used for comparison.
The specifications of the systems for the evaluation are as follows.
First, the CMU-ARCTIC-SLT and CMU-ARCTIC-AWB speech databases, each having 1132 utterances, were used as training data. The systems having the STRAIGHT model and the HNH model were built for both the SLT and AWB databases as speaker-dependent systems. Hence, four speech synthesis systems were set up for quality evaluation 2. Secondly, the speaker-dependent demo scripts for the HMM-based speech synthesis system (version 2.2) were used for acoustic model training and parameter generation. Thirdly, the global variance option in the scripts was turned off to inhibit unnatural prosody in the synthesized results. Instead, conventional post-filtering using a coefficient was performed on the generated MFCC parameters. Fourthly, the parameter types and their sizes for the HTS systems were set identically to those in Table 1.
Quality comparison was then conducted via an MOS test on the results from the three systems, using the same database for each. In this test, 20 English utterances were converted into corresponding label sequences. All systems then generated the output parameters from the given text labels, after which speech reconstruction was performed. Note that the same 11 participants as in quality evaluation 1 took part in the tests.
Referring to
According to comments common among the participants, the speech synthesized with the HNH model sounded natural and smooth, though slightly less intelligible. In contrast, the speech synthesized with the STRAIGHT model sounded artificial, but more intelligible. Thus, from the test results and the participants' perceptions of the synthesized speech, it may be concluded that naturalness is treated as a more important factor than intelligibility in perceptual measurements of synthesized speech.
The present invention may include other embodiments as well as the above embodiment. For example, the gain value is used for maintaining the ratio of the harmonic and non-harmonic components. However, even when the gain value is not applied, it is possible to maintain the quality above a predetermined level. Therefore, an embodiment in which the gain value is not separately used as a data value is to be construed as included in the embodiments of the present invention.
According to the present invention, since the harmonic and non-harmonic components are separately synthesized, the synthesized speech sounds more natural. This advantage is particularly desirable for synthesized speech. Also, the present invention is advantageous in that it may represent speech with less data.
Although embodiments have been described with reference to a number of illustrative embodiments thereof, it should be understood that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure. More particularly, various variations and modifications are possible in the component parts and/or arrangements of the subject combination arrangement within the scope of the disclosure, the drawings and the appended claims. In addition to variations and modifications in the component parts and/or arrangements, alternative uses will also be apparent to those skilled in the art.
This application claims the benefit under 35 U.S.C. §119 of U.S. Patent Application No. 61/615,903, filed Mar. 27, 2012, which is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
20130262098 A1 | Oct 2013 | US |
Number | Date | Country | |
---|---|---|---|
61615903 | Mar 2012 | US |