1. Technical Field of the Invention
The present invention relates to a technology for interconnecting a plurality of phonetic pieces to synthesize a voice, such as a speech voice or a singing voice.
2. Description of the Related Art
In a voice synthesis technology of phonetic piece connection type for interconnecting a plurality of phonetic pieces to synthesize a desired voice, it is necessary to expand and contract a phonetic piece to a target time length. Japanese Patent Application Publication No. H7-129193 discloses a construction in which a plurality of kinds of phonetic pieces are classified into a normal (stable) part and a transition part, and the time length of each phonetic piece is separately adjusted in the normal part and the transition part. For example, the normal part is more greatly expanded and contracted than the transition part.
In the technology of Japanese Patent Application Publication No. H7-129193, the time length is adjusted at a fixed expansion and contraction rate within the range of a phonetic piece classified into the normal part or the transition part. In real pronunciation, however, the degree of expansion may change from section to section even within the range of a single phonetic piece (phoneme). With the technology of Japanese Patent Application Publication No. H7-129193, therefore, an aurally unnatural voice (that is, a voice different from a really pronounced sound) may be synthesized in a case in which a phonetic piece is expanded.
The present invention has been made in view of the above problems, and it is an object of the present invention to synthesize an aurally natural voice even in a case in which a phonetic piece is expanded.
Means adopted by the present invention so as to solve the above problems will be described. Meanwhile, in the following description, elements of embodiments, which will be described below, corresponding to those of the present invention are shown in parentheses for easy understanding of the present invention; however, the scope of the present invention is not limited to illustration of the embodiments.
A voice synthesis apparatus according to a first aspect of the present invention is designed for synthesizing a voice signal using a plurality of phonetic piece data each indicating a phonetic piece which contains at least two phoneme sections (for example, a phoneme section S1 and a phoneme section S2) corresponding to different phonemes. The apparatus comprises: a phonetic piece adjustment part (for example, a phonetic piece adjustment part 26) that forms a target section (for example, a target section WA) from a first phonetic piece (for example, a phonetic piece V1) and a second phonetic piece (for example, a phonetic piece V2) so as to connect the first phonetic piece and the second phonetic piece to each other such that the target section is formed of a rear phoneme section of the first phonetic piece corresponding to a consonant phoneme and a front phoneme section of the second phonetic piece corresponding to the consonant phoneme, and that carries out an expansion process for expanding the target section by a target time length to form an adjustment section (for example, an adjustment section WB) such that a central part of the target section is expanded at an expansion rate higher than that of a front part and a rear part of the target section, to thereby create synthesized phonetic piece data (for example, synthesized phonetic piece data DB) of the adjustment section having the target time length and corresponding to the consonant phoneme; and a voice synthesis part (for example, a voice synthesis part 28) that creates a voice signal from the synthesized phonetic piece data created by the phonetic piece adjustment part.
In the above construction, the expansion rate is changed in the target section corresponding to a phoneme of a consonant, and therefore, it is possible to synthesize an aurally natural voice as compared with the construction of Japanese Patent Application Publication No. H7-129193 in which an expansion and contraction rate is fixedly maintained within a range of a phonetic piece.
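The formation of the target section in the first aspect can be sketched as follows; this is a minimal illustration in Python, where the dict layout of the phonetic pieces and the field names are hypothetical stand-ins for the phonetic piece data:

```python
def form_target_section(piece1, piece2):
    """Join two phonetic pieces at a shared consonant phoneme.

    The target section is the rear phoneme section of the first piece
    followed by the front phoneme section of the second piece, both of
    which must correspond to the same consonant phoneme.
    """
    if piece1["rear_phoneme"] != piece2["front_phoneme"]:
        raise ValueError("pieces do not share the connecting phoneme")
    return piece1["rear_frames"] + piece2["front_frames"]
```

For example, joining a piece ending in the phoneme "s" with a piece starting with "s" simply concatenates the two phoneme sections' frames into one target section.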
In a preferred aspect of the present invention, each phonetic piece data comprises a plurality of unit data corresponding to a plurality of frames arranged on a time axis. In case that the target section corresponds to a voiced consonant phoneme, the phonetic piece adjustment part expands the target section to the adjustment section such that the adjustment section contains a time series of unit data corresponding to the front part (for example, a front part σ1) of the target section, a time series of a plurality of repeated unit data which are obtained by repeating unit data corresponding to a central point (for example, a time point tAc) of the target section, and a time series of a plurality of unit data corresponding to the rear part (for example, a rear part σ2) of the target section.
In the above aspect, a time series of a plurality of unit data corresponding to the front part of the target section and a time series of a plurality of unit data corresponding to the rear part of the target section are applied as unit data of each frame of the adjustment section, and therefore, the expansion process is simplified as compared with, for example, a construction in which both the front part and the rear part are expanded. The expansion of the target section according to the above aspect is particularly preferable in a case in which the target section corresponds to a phoneme of a voiced consonant.
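The voiced-consonant expansion described in this aspect can be sketched as follows (a minimal Python illustration; the frame list and the `make_center_frame` generator are hypothetical stand-ins for the unit data and for the central-point interpolation of the preferred aspects):

```python
def expand_voiced(frames, target_len, make_center_frame):
    """Expand a target section for a voiced consonant phoneme.

    The front and rear frames are kept as-is; the missing frames are
    filled by repeating a single frame generated at the central point
    from the two frames around it, so the center of the section is
    expanded at a higher rate than the front and rear parts.
    """
    n = len(frames)
    if target_len <= n:
        return list(frames[:target_len])
    mid = n // 2
    center = make_center_frame(frames[mid - 1], frames[mid])
    return frames[:mid] + [center] * (target_len - n) + frames[mid:]
```

With a simple averaging stand-in for the center-frame generator, `expand_voiced([1.0, 2.0, 3.0, 4.0], 7, lambda a, b: (a + b) / 2)` yields `[1.0, 2.0, 2.5, 2.5, 2.5, 3.0, 4.0]`: only the central frame is repeated.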
In a preferred aspect of the present invention, the unit data of the frame of the voiced consonant phoneme comprises envelope data designating shape characteristics of an envelope line of a spectrum of a voice and spectrum data indicating the spectrum of the voice. The phonetic piece adjustment part generates the unit data corresponding to the central point of the target section such that the generated unit data comprises envelope data obtained by interpolating the envelope data of the unit data before and after the central point of the target section and spectrum data of the unit data immediately before or after the central point.
In the above aspect, the envelope data created by interpolating the envelope data of the unit data before and after the central point of the target section are included in the unit data after expansion, and therefore, it is possible to synthesize a natural voice in which a voice component of the central point of the target section is properly expanded.
In a preferred aspect of the present invention, the phonetic piece data comprises a plurality of unit data corresponding to a plurality of frames arranged on a time axis. In case that the target section corresponds to an unvoiced consonant phoneme, the phonetic piece adjustment part sequentially selects the unit data of each frame of the target section as unit data of each frame of the adjustment section to create the synthesized phonetic piece data, wherein velocity (for example, progress velocity ν), at which each frame in the target section corresponding to each frame in the adjustment section is changed according to passage of time in the adjustment section, is decreased from a front part to a central point (for example, a central point tBc) of the adjustment section and increased from the central point to a rear part of the adjustment section.
The expansion of the target section according to the above aspect is particularly preferable in a case in which the target section corresponds to a phoneme of an unvoiced consonant.
In a preferred aspect of the present invention, the unit data of the frame of an unvoiced sound comprises spectrum data indicating a spectrum of the unvoiced sound. The phonetic piece adjustment part creates the unit data of the frame of the adjustment section such that the created unit data comprises spectrum data of a spectrum containing a predetermined noise component (for example, a noise component μ) adjusted according to an envelope line (for example, an envelope line ENV) of a spectrum indicated by spectrum data of unit data of a frame in the target section.
For example, preferably the phonetic piece adjustment part sequentially selects the unit data of each frame of the target section and creates the synthesized phonetic piece data such that the unit data thereof comprises spectrum data of a spectrum containing a predetermined noise component adjusted based on an envelope line of a spectrum indicated by spectrum data of the selected unit data of each frame in the target section (second embodiment).
Alternatively, the phonetic piece adjustment part selects the unit data of a specific frame of the target section (for example, one frame corresponding to a central point of the target section) and creates the synthesized phonetic piece data such that the unit data thereof comprises spectrum data of a spectrum containing a predetermined noise component adjusted based on an envelope line of a spectrum indicated by spectrum data of the selected unit data of the specific frame in the target section (third embodiment).
In the above aspect, unit data of a spectrum in which a noise component (typically, a white noise) is adjusted based on the envelope line of the spectrum indicated by the unit data of the target section are created, and therefore, it is possible to synthesize a natural voice, acoustic characteristics of which are changed for every frame, even in a case in which a frame in the target section is repeated over a plurality of frames in the adjustment section.
Meanwhile, the manner of expansion of really pronounced phonemes differs depending upon the type of phoneme. In the technology of Japanese Patent Application Publication No. H7-129193, however, expansion rates merely differ between the normal part and the transition part, with the result that it may not be possible to synthesize a natural voice according to the type of phoneme. In view of the above problems, a voice synthesis apparatus according to a second aspect of the present invention is designed for synthesizing a voice signal using a plurality of phonetic piece data each indicating a phonetic piece which contains at least two phoneme sections corresponding to different phonemes, the apparatus comprising a phonetic piece adjustment part that uses different expansion processes based on the types of phonemes indicated by the phonetic piece data. In the above aspect, an appropriate expansion process is selected according to the type of the phoneme to be expanded, and therefore, it is possible to synthesize a natural voice as compared with the technology of Japanese Patent Application Publication No. H7-129193.
For example, in a preferred example in which the first aspect and the second aspect are combined: a phoneme section (for example, a phoneme section S2) corresponding to a phoneme of a consonant of a first type (for example, a type C1a or a type C1b), which is positioned at the rear of a phonetic piece and pronounced through temporary deformation of a vocal tract, includes a preparation process (for example, a preparation process pA1 or a preparation process pB1) just before deformation of the vocal tract; a phoneme section (for example, a phoneme section S1) which is positioned at the front of a phonetic piece and corresponds to the phoneme of the consonant of the first type includes a pronunciation process (for example, a pronunciation process pA2 or a pronunciation process pB2) in which the phoneme is pronounced as the result of temporary deformation of the vocal tract; a phoneme section corresponding to a phoneme of a consonant of a second type (for example, a second type C2), which is positioned at the rear of a phonetic piece and can be normally continued, includes a process (for example, a front part pC1) in which pronunciation of the phoneme is commenced; and a phoneme section which is positioned at the front of a phonetic piece and corresponds to the phoneme of the consonant of the second type includes a process (for example, a rear part pC2) in which pronunciation of the phoneme is ended.
Under the above circumstance, the phonetic piece adjustment part carries out the already described expansion process, for expanding the target section by a target time length to form an adjustment section such that a central part of the target section is expanded at an expansion rate higher than that of a front part and a rear part of the target section, in case that the consonant phoneme of the target section belongs to one type (namely, the second type C2) including fricative sounds and semivowel sounds, and carries out another expansion process, for inserting an intermediate section between the rear phoneme section of the first phonetic piece and the front phoneme section of the second phonetic piece in the target section, in case that the consonant phoneme of the target section belongs to another type (namely, the first type C1) including plosive sounds, affricate sounds, nasal sounds and liquid sounds.
In the above aspect, the same effects as the first aspect are achieved, and, in addition, it is possible to properly expand a phoneme of the first type pronounced through temporary deformation of the vocal tract.
For example, in a case in which the phoneme of the consonant corresponding to the target section is a phoneme (for example, a plosive sound or an affricate) of the first type in which an air current is stopped at the preparation process (for example, the preparation process pA1), the phonetic piece adjustment part inserts a silence section as the intermediate section.
Also, in a case in which the phoneme of the consonant corresponding to the target section is a phoneme (for example, a liquid sound or a nasal sound) of the first type in which pronunciation is maintained through ventilation at the preparation process (for example, the preparation process pB1), the phonetic piece adjustment part inserts an intermediate section containing repetition of a frame selected from the rear phoneme section of the first phonetic piece or the front phoneme section of the second phonetic piece. For example, the phonetic piece adjustment part inserts the intermediate section containing repetition of the last frame of the rear phoneme section of the first phonetic piece. Alternatively, the phonetic piece adjustment part inserts the intermediate section containing repetition of the top frame of the front phoneme section of the second phonetic piece.
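The two insertion processes can be sketched as follows; a minimal Python illustration in which the frame lists, the `SILENCE` marker, and the `pad` length are hypothetical stand-ins for the actual unit data and the computed intermediate-section length:

```python
SILENCE = None  # hypothetical marker for a silent frame

def insert_intermediate(rear_frames, front_frames, phoneme_type, pad):
    """Insert an intermediate section of `pad` frames between the rear
    phoneme section of the first piece and the front phoneme section of
    the second piece, according to the subtype of the first-type consonant:
    silence for plosives/affricates (air flow stops at the preparation
    process), or a repeated frame for nasals/liquids (pronunciation is
    maintained through ventilation)."""
    if phoneme_type == "C1a":          # plosive or affricate sound
        middle = [SILENCE] * pad
    elif phoneme_type == "C1b":        # nasal or liquid sound
        middle = [rear_frames[-1]] * pad   # repeat the last rear frame
    else:
        raise ValueError("insertion applies only to first-type consonants")
    return rear_frames + middle + front_frames
```

Repeating the top frame of the second piece's front section instead would be the variant mentioned last in the paragraph above.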
The voice synthesis apparatus according to each aspect described above is realized by hardware (an electronic circuit), such as a digital signal processor (DSP) which is exclusively used to synthesize a voice, and, in addition, is realized by a combination of a general processing unit, such as a central processing unit (CPU), and a program. A program (for example, a program PGM) of the present invention is executed by a computer to perform a method of synthesizing a voice signal using a plurality of phonetic piece data each indicating a phonetic piece which contains at least two phoneme sections corresponding to different phonemes, the method comprising: forming a target section from a first phonetic piece and a second phonetic piece so as to connect the first phonetic piece and the second phonetic piece to each other such that the target section is formed of a rear phoneme section of the first phonetic piece corresponding to a consonant phoneme and a front phoneme section of the second phonetic piece corresponding to the consonant phoneme; carrying out an expansion process for expanding the target section by a target time length to form an adjustment section such that a central part of the target section is expanded at an expansion rate higher than that of a front part and a rear part of the target section, to thereby create synthesized phonetic piece data of the adjustment section having the target time length and corresponding to the consonant phoneme; and creating a voice signal from the synthesized phonetic piece data.
The program as described above realizes the same operation and effects as the voice synthesis apparatus according to the present invention. The program according to the present invention is provided to users in a form in which the program is stored in computer readable recording media so that the program can be installed in a computer, and, in addition, is provided from a server in a form in which the program is distributed via a communication network so that the program can be installed in the computer.
The central processing unit (CPU) 12 executes a program PGM stored in the storage unit 14 to perform a plurality of functions (a phonetic piece selection part 22, a phoneme length setting part 24, a phonetic piece adjustment part 26, and a voice synthesis part 28) for creating a voice signal VOUT indicating the waveform of a synthesized sound. Meanwhile, the respective functions of the central processing unit 12 may be separately realized by a plurality of integrated circuits, or a designated electronic circuit, such as a DSP, may realize some of the functions. The sound output unit 16 (for example, a headphone or a speaker) outputs a sound wave corresponding to the voice signal VOUT created by the central processing unit 12.
The storage unit 14 stores the program PGM, which is executed by the central processing unit 12, and various kinds of data (phonetic piece group GA and synthesis information GB), which are used by the central processing unit 12. Well-known recording media, such as semiconductor recording media or magnetic recording media, or a combination of a plurality of kinds of recording media may be adopted as the storage unit 14.
As shown in
As shown in
As shown in
The excitation waveform envelope (excitation curve) r1 is a variable approximate to an envelope line of a spectrum of vocal cord vibration. The chest resonance r2 designates a bandwidth, a central frequency, and an amplitude value of a predetermined number of resonances (band pass filters) approximate to chest resonance characteristics. The vocal tract resonance r3 designates a bandwidth, a central frequency, and an amplitude value of each of a plurality of resonances approximate to vocal tract resonance characteristics. The difference spectrum r4 means the difference (error) between a spectrum approximate to the excitation waveform envelope r1, the chest resonance r2 and the vocal tract resonance r3, and a spectrum of a voice.
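For illustration only, the four variables could be held in a container such as the following (Python; the field layout is an assumption for this sketch, not the actual format of the phonetic piece group GA):

```python
from dataclasses import dataclass
from typing import List, Tuple

# (bandwidth, central frequency, amplitude) of one resonance (band-pass filter)
Resonance = Tuple[float, float, float]

@dataclass
class EnvelopeData:
    """Illustrative container for the envelope variables r1 to r4."""
    excitation_curve: List[float]            # r1: envelope of vocal cord vibration spectrum
    chest_resonances: List[Resonance]        # r2: predetermined number of resonances
    vocal_tract_resonances: List[Resonance]  # r3: plural vocal tract resonances
    difference_spectrum: List[float]         # r4: residual against the voice spectrum
```

Representing r2 and r3 as lists of (bandwidth, central frequency, amplitude) triples mirrors the per-resonance parameters named in the paragraph above.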
As shown in
As shown in
time domain waveforms of phonemes of the respective types C1a, C1b and C2 are illustrated in parts (A) of
In a case in which a phoneme section S2 at the rear of a phonetic piece V corresponds to a phoneme of the type C1a, as shown in a part (B) of
As shown in a part (A) of
As shown in a part (A) of
As shown in
The phonetic piece selection part 22 of
The phoneme length setting part 24 of
The phonetic piece adjustment part 26 of
The voice synthesis part 28 of
Upon commencing the process of
As shown in
In a case in which the target phoneme does not belong to the type C1a (SA1: NO), the phonetic piece adjustment part 26 determines whether or not the target phoneme belongs to the type C1b (a liquid sound or a nasal sound) (SA3). The determination method of step SA3 is identical to that of step SA1. In a case in which the target phoneme belongs to the type C1b (SA3: YES), the phonetic piece adjustment part 26 carries out a second insertion process to create synthesized phonetic piece data DB of the adjustment section WB (SA4).
As shown in
In a case in which the target phoneme belongs to the first type C1 (C1a and C1b) as described above, the phonetic piece adjustment part 26 inserts the intermediate section M (MA and MB) between the phoneme section S2 at the rear of the phonetic piece V1 and the phoneme section S1 at the front of the phonetic piece V2 to create synthesized phonetic piece data DB of the adjustment section WB. Meanwhile, the frame at the endmost part of the preparation process pA1 (the phoneme section S2 of the phonetic piece V1) of the phoneme belonging to the type C1a is almost silence, and therefore, in a case in which the target phoneme belongs to the type C1a, it is also possible to carry out a second insertion process of inserting a time series of unit data UA of the frame at the endmost part of the phoneme section S2 as the intermediate section MB in the same manner as step SA4.
In a case in which the target phoneme belongs to the second type C2 (SA1: NO and SA3: NO), the phonetic piece adjustment part 26 carries out an expansion process of expanding the target section WA, so that an expansion rate of the central part in the time axis direction of the target section WA of the target phoneme is higher than that of the front part and the rear part of the target section WA (the central part of the target section WA is much more expanded than the front part and the rear part of the target section WA), to create synthesized phonetic piece data DB of the adjustment section WB of the time length LB (SA5).
Hereinafter, the time length (distance on the time axis) in the target section WA corresponding to a predetermined unit time in the adjustment section WB will be expressed as progress velocity ν. That is, the progress velocity ν is velocity at which each frame in the target section WA corresponding to each frame in the adjustment section WB is changed according to passage of time in the adjustment section WB. Consequently, in a section in which the progress velocity ν is 1 (for example, the front part and the rear part of the adjustment section WB), each frame in the target section WA and each frame in the adjustment section WB correspond to each other one to one, and, in a section in which the progress velocity ν is 0 (for example, the central part in the adjustment section WB), a plurality of frames in the adjustment section WB correspond to a single frame in the target section WA (that is, the frame in the target section WA is not changed according to passage of time in the adjustment section WB).
A graph showing time-based change of the progress velocity ν in the adjustment section WB is also shown in
Specifically, the progress velocity ν is maintained at 1 from the start point tBs to a specific time point tB1 of the adjustment section WB, is then decreased over time from the time point tB1, and reaches 0 at the central point tBc of the adjustment section WB. After the central point tBc, the progress velocity ν is changed in a trajectory obtained by reversing the section from the start point tBs to the central point tBc with respect to the central point tBc in the time axis direction in line symmetry. As the result that the progress velocity ν is increased and decreased as above, the target section WA is expanded so that an expansion rate of the central part in the time axis direction of the target section WA of the target phoneme is higher than that of the front part and the rear part of the target section WA as previously described.
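The trajectory described above can be sketched as follows (a minimal Python illustration; the linear fall of the velocity between the time point tB1 and the central point is an assumption for this sketch, and discrete frames stand in for continuous time):

```python
def progress_velocity(n, num_frames, t_b1):
    """Progress velocity ν for frame n of the adjustment section: 1.0 up
    to frame t_b1, falling linearly to 0 at the central point tBc, then
    mirrored about the center (line symmetry), as described above."""
    c = (num_frames - 1) / 2.0      # central point tBc as a frame index
    ramp = c - t_b1                 # length of the falling segment
    d = abs(n - c)                  # distance from the central point
    return 1.0 if d >= ramp else d / ramp

def frame_mapping(num_target, num_adjust, t_b1):
    """Integrate the velocity to map each adjustment-section frame onto a
    target-section frame; near the center, many adjustment frames map to
    the same target frame (the target frame 'stops moving')."""
    vel = [progress_velocity(n, num_adjust, t_b1) for n in range(num_adjust)]
    pos, acc = [], 0.0
    for v in vel:
        pos.append(acc)
        acc += v
    scale = (num_target - 1) / pos[-1] if pos[-1] else 0.0
    return [min(num_target - 1, round(p * scale)) for p in pos]
```

For a 7-frame target section expanded to 9 frames, the resulting map is monotonic, spans the whole target section, and repeats one target frame around the central point, which realizes the center-weighted expansion rate.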
As shown in
As can be understood from
First, as shown in
Next, as shown in
As previously described, unit data UA of a voiced sound include envelope data R and spectrum data Q. The envelope data R can be interpolated between the frames for respective variables r1 to r4. On the other hand, a spectrum indicated by the spectrum data Q is changed moment by moment for every frame with the result that, in a case in which the spectrum data Q are interpolated between the frames, a spectrum having characteristics different from those of the spectrum before interpolation may be calculated. That is, it is difficult to properly interpolate the spectrum data Q.
In consideration of the above problems, the phonetic piece adjustment part 26 of the first embodiment calculates the envelope data R of the unit data UA of the frame FA[K+0.5] of the central point tAc of the target section WA by interpolating the respective variables r1 to r4 of the envelope data R between the frame FA[K] just before the central point tAc and the frame FA[K+1] just after the central point tAc. For example, in an illustration of
Also, the phonetic piece adjustment part 26 appropriates the spectrum data Q of the unit data UA of the frame FA[K+1] just after the central point tAc of the target section WA (or the spectrum data Q of the frame FA[K] just before the central point tAc of the target section WA) as the spectrum data Q of the unit data UA of the frame FA[K+0.5] corresponding to the central point tAc of the target section WA. For example, in an illustration of
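The construction of the unit data of the central-point frame FA[K+0.5] can be sketched as follows (Python; modeling unit data as dicts of named envelope variables plus a spectrum is an assumption about the data layout):

```python
def center_unit_data(frame_before, frame_after):
    """Unit data for the central point tAc of the target section: the
    envelope variables (r1 to r4) are interpolated halfway between the
    frames just before and just after the center, while the spectrum
    data, which cannot be safely interpolated, are copied from the frame
    just after the central point."""
    envelope = {
        name: (frame_before["envelope"][name] + frame_after["envelope"][name]) / 2.0
        for name in frame_before["envelope"]
    }
    return {"envelope": envelope, "spectrum": frame_after["spectrum"]}
```

Copying the spectrum rather than interpolating it reflects the difficulty, noted above, that an interpolated spectrum may have characteristics different from either neighbor.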
On the other hand, in a case in which the target phoneme is an unvoiced sound (SB1: NO), the phonetic piece adjustment part 26 expands the target section WA, so that the adjustment section WB and the target section WA satisfy a relationship of the trajectory z2, to create synthesized phonetic piece data DB of the adjustment section WB (SB3). As previously described, the unit data UA of the unvoiced sound include the spectrum data Q but do not include the envelope data R. The phonetic piece adjustment part 26 selects unit data UA of a frame nearest the trajectory z2 with respect to the respective frames in the adjustment section WB of a plurality of frames constituting the target section WA as unit data UB of each of N frames of the adjustment section WB to create synthesized phonetic piece data DB including N unit data UB.
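The unvoiced-sound branch can be sketched as follows (Python; `trajectory` is a hypothetical stand-in for the z2-like mapping from an adjustment-section frame index to a fractional target-section frame index):

```python
def expand_unvoiced(frames, target_len, trajectory):
    """For each frame FB[n] of the adjustment section, select the unit
    data of the target-section frame nearest the time point given by
    trajectory(n); frames near the center of the trajectory are selected
    repeatedly, realizing the center-weighted expansion."""
    out = []
    for n in range(target_len):
        i = round(trajectory(n))
        out.append(frames[max(0, min(len(frames) - 1, i))])
    return out
```

Because only whole frames are selected (never interpolated), this branch needs no envelope data, matching the statement above that unit data UA of an unvoiced sound carry spectrum data Q only.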
A time point tAn in the target section WA corresponding to an arbitrary frame FB[n] of the adjustment section WB is shown in
As described above, in the first embodiment, an expansion rate is changed in a target section WA corresponding to a phoneme of a consonant, and therefore, it is possible to synthesize an aurally natural voice as compared with Japanese Patent Application Publication No. H7-129193 in which the expansion rate is uniformly maintained within a range of a phonetic piece.
Also, in the first embodiment, an expansion method is changed according to types C1a, C1b and C2 of phonemes of consonants, and therefore, it is possible to expand each phoneme without excessively changing characteristics (particularly, a section important when a listener distinguishes a phoneme) of each phoneme.
For example, for a phoneme (a plosive sound or an affricate sound) of the type C1a, an intermediate section MA of silence is inserted between a preparation process pA1 and a pronunciation process pA2, and therefore, it is possible to expand a target section WA while hardly changing the characteristics of the pronunciation process pA2, which are particularly important when a listener distinguishes a phoneme. In the same manner, for a phoneme (a liquid sound or a nasal sound) of the type C1b, an intermediate section MB, in which the final frame of a preparation process pB1 is repeated, is inserted between the preparation process pB1 and a pronunciation process pB2, and therefore, it is possible to expand a target section WA while hardly changing the characteristics of the pronunciation process pB2. For a phoneme (a fricative sound or a semivowel) of the second type C2, a target section WA is expanded so that an expansion rate of the central part of the target section WA is higher than that of the front part and the rear part, and therefore, it is possible to expand the target section WA without excessively changing the characteristics of the front part or the rear part, which are particularly important when a listener distinguishes a phoneme.
Also, in the expansion process of a phoneme of the second type C2, for spectrum data Q, which are difficult to interpolate, spectrum data Q of unit data UA in phonetic piece data DA are applied to synthesized phonetic piece data DB, and, for envelope data R, envelope data R calculated through interpolation of frames before and after the central point tAc in a target section WA are included in unit data UB of the synthesized phonetic piece data DB. Consequently, it is possible to synthesize an aurally natural voice as compared with a construction in which envelope data R are not interpolated.
Meanwhile, for example, a method of calculating envelope data R of each frame in an adjustment section WB so that the envelope data R follow a trajectory z1 through interpolation and of selecting spectrum data Q so that the spectrum data Q follow a trajectory z2 from phonetic piece data D (hereinafter, referred to as a ‘comparative example’) may be assumed as a method of expanding a phoneme of a voiced consonant. In the method of the comparative example, however, characteristics of the envelope data R and the spectrum data Q are different from each other with the result that a synthesized sound may be aurally unnatural. In the first embodiment, each piece of unit data of the synthesized phonetic piece data DB is created so that both the envelope data R and the spectrum data Q follow the trajectory z2, and therefore, it is possible to synthesize an aurally natural voice as compared with the comparative example. However, it is not intended that the comparative example is excluded from the scope of the present invention.
Hereinafter, a second embodiment of the present invention will be described. Meanwhile, elements of embodiments which will be described below equal in operation or function to those of the first embodiment are denoted by the same reference numerals used in the above description, and a detailed description thereof will be properly omitted.
In the first embodiment, in a case in which the target phoneme is an unvoiced sound, unit data UA of a frame satisfying a relationship of the trajectory z2 with respect to each frame in the adjustment section WB of a plurality of frames constituting the target section WA are selected. In the construction of the first embodiment, unit data UA of a frame in the target section WA are repeatedly selected over a plurality of frames (repetition sections τ of
First, the phonetic piece adjustment part 26 selects a frame FA nearest a time point tAn corresponding to a frame FB[n] in the adjustment section WB of a plurality of frames FA of the target section WA in the same manner as in the first embodiment, and, as shown in
As described above, in the second embodiment, in a case in which the target phoneme is an unvoiced sound, a frequency characteristic (envelope line ENV) of the spectrum prescribed by the unit data UA of the target section WA is added to the noise component μ to create unit data UB of the synthesized phonetic piece data DB. The intensity of the noise component μ at each frequency changes randomly on the time axis, and therefore, the characteristics of the synthesized sound change moment by moment over time (every frame) even in a case in which a piece of unit data UA in the target section WA is repeatedly selected over a plurality of frames in the adjustment section WB. According to the second embodiment, therefore, it is possible to reduce the unnaturalness of a synthesized sound caused by repetition of a piece of unit data UA as compared with the first embodiment, in addition to achieving the same effects as the first embodiment.
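Under a crude magnitude-spectrum model (an assumption for this sketch; `shape_noise` and the per-frame random draw are illustrative stand-ins for the adjustment of the noise component μ by the envelope line ENV), the second embodiment can be sketched as:

```python
import random

def shape_noise(envelope, rng):
    """Multiply a freshly drawn random noise magnitude spectrum by the
    envelope line of the selected frame's spectrum; because the noise is
    redrawn for every frame, repeated selections of one target frame
    still differ from each other."""
    return [e * rng.random() for e in envelope]

def expand_unvoiced_shaped(envelopes, mapping, seed=0):
    """Build adjustment-section spectra: for each output frame, take the
    envelope of the mapped target frame and shape fresh noise with it."""
    rng = random.Random(seed)
    return [shape_noise(envelopes[i], rng) for i in mapping]
```

Even when `mapping` repeats a single target frame, consecutive output frames carry different spectra, which is the effect the paragraph above attributes to the time-varying noise component.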
As also described in the second embodiment, for an unvoiced consonant, a piece of unit data UA of the target section WA can be repeated over a plurality of frames in the adjustment section WB. On the other hand, each frame of the unvoiced consonant is basically an unvoiced sound, but frames of a voiced sound may be mixed in. In a case in which a frame of a voiced sound is repeated in a synthesized sound of the phoneme of the unvoiced consonant, a periodic noise (a buzzing sound) which is very harsh to the ear may be produced. The third embodiment is provided to solve the above problem.
A phonetic piece adjustment part 26 of the third embodiment selects unit data UA of a frame corresponding to the central point tAc in a target section WA with respect to each frame in a repetition section τ continuously corresponding to a frame in the target section WA at a trajectory z2 of an adjustment section WB. Subsequently, the phonetic piece adjustment part 26 calculates an envelope line ENV of a spectrum indicated by spectrum data Q of the piece of unit data UA corresponding to the central point tAc of the target section WA and creates, as unit data UB of each frame in the repetition section τ of the adjustment section WB, unit data including spectrum data Q of a spectrum in which a predetermined noise component μ is adjusted based on the envelope line ENV. That is, the envelope line ENV of the spectrum is common to a plurality of frames in the repetition section τ. Meanwhile, the reason that the unit data UA corresponding to the central point tAc of the target section WA are selected as a calculation source of the envelope line ENV is that the unvoiced consonant is stably and easily pronounced in the vicinity of the central point tAc of the target section WA (there is a strong possibility of an unvoiced sound).
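Under the same crude magnitude-spectrum model assumed earlier (an illustrative sketch, not the actual unit data format), the third embodiment differs only in computing the envelope once, from the frame at the central point of the target section, and sharing it across the repetition section:

```python
import random

def expand_repetition_section(envelopes, num_out, seed=0):
    """Third-embodiment variant: the envelope line ENV is taken once from
    the unit data at the central point tAc of the target section and is
    common to all frames of the repetition section; only the noise
    component is redrawn per frame, so no (possibly voiced) frame
    spectrum of the target section is ever repeated verbatim."""
    center_env = envelopes[len(envelopes) // 2]   # frame at central point tAc
    rng = random.Random(seed)
    return [[e * rng.random() for e in center_env] for _ in range(num_out)]
```

Because every output frame is shaped from the central-point envelope, a stray voiced frame elsewhere in the target section can no longer be repeated into the synthesized sound.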
The third embodiment also has the same effects as the first embodiment. In addition, in the third embodiment, the unit data UB of each frame in the repetition section τ are created using the envelope line ENV specified from a single piece of unit data UA (specifically, the unit data UA corresponding to the central point tAc) in the target section WA, and therefore, the possibility of a frame of a voiced sound being repeated in a synthesized sound of a phoneme of an unvoiced consonant is reduced. Consequently, it is possible to restrain the occurrence of a periodic noise caused by repetition of a frame of a voiced sound.
Each of the above embodiments may be modified in various ways. Hereinafter, concrete modifications will be illustrated. Two or more modifications arbitrarily selected from the following illustration may be appropriately combined.
(1) Although different methods of expanding the target section WA are used according to types C1a, C1b and C2 of phonemes of consonants in each of the above embodiments, it is also possible to expand the target section WA of a phoneme of each type using a common method. For example, it is also possible to expand a target section WA of a phoneme of a type C1a or a type C1b using an expansion process for expanding the target section WA (step SA5 of
(2) The expansion process carried out at step SA5 of
(3) In the second insertion process of the above described embodiments, the intermediate section MB is generated by repeatedly arranging unit data UA of the last frame of the phonetic piece V1 (hatched portion of
(4) Although the envelope line ENV of the spectrum indicated by a piece of unit data UA selected from the target section WA is used to adjust the noise component μ in the second embodiment, it is also possible to adjust the noise component μ based on an envelope line ENV calculated through interpolation between the frames. For example, in a case in which a frame of the time point tAn satisfying a relationship of the trajectory z1 with respect to the frame FB[n] of the adjustment section WB does not exist in the target section WA, as described with reference to
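The interpolation suggested in modification (4) can be sketched minimally. This assumes simple linear interpolation between the envelopes of the two frames that bracket the missing time point; the function name and the weighting scheme are illustrative assumptions, not the claimed method.

```python
import numpy as np

def interpolated_envelope(env_prev, env_next, alpha):
    """Linearly interpolate between the envelope lines ENV of the two frames
    bracketing a time point tAn that has no frame of its own in the target
    section WA; alpha in [0, 1] is the fractional position between them."""
    return (1.0 - alpha) * env_prev + alpha * env_next

env0 = np.array([1.0, 2.0, 3.0])   # envelope of the preceding frame
env1 = np.array([3.0, 2.0, 1.0])   # envelope of the following frame
mid = interpolated_envelope(env0, env1, 0.5)   # midpoint envelope
```

The interpolated envelope is then used to adjust the noise component μ exactly as a directly extracted envelope would be.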
(5) The form of the phonetic piece data DA or the synthesized phonetic piece data DB is optional. For example, although a time series of unit data U indicating a spectrum of each frame of the phonetic piece V is used as the phonetic piece data DA in each of the above embodiments, it is also possible to use a sample series of the phonetic piece V on the time axis as the phonetic piece data DA.
(6) Although the storage unit 14 for storing the phonetic piece data group GA is mounted on the voice synthesis apparatus 100 in each of the above embodiments, another configuration is possible in which an external device (for example, a server device) independent of the voice synthesis apparatus 100 stores the phonetic piece data group GA. In such a case, the voice synthesis apparatus 100 (the phonetic piece selection part 22) acquires the phonetic piece V (phonetic piece data DA) from the external device through, for example, a communication network so as to generate the voice signal VOUT. In a similar manner, it is possible to store the synthesis information GB in an external device independent of the voice synthesis apparatus 100. As understood from the above description, a device such as the aforementioned storage unit 14 for storing the phonetic piece data DA and the synthesis information GB is not an indispensable element of the voice synthesis apparatus 100.
Number | Date | Country | Kind
---|---|---|---
2011-123770 | Jun 2011 | JP | national
2012-110358 | May 2012 | JP | national