The present invention relates to voice synthesis techniques.
Heretofore, various techniques have been proposed for synthesizing voices imitative of real human voices. In Japanese Patent Application Laid-open Publication No. 2003-255974, for example, there is disclosed a technique for synthesizing a desired voice by cutting out a real human voice (hereinafter referred to as “input voice”) on a phoneme-by-phoneme basis to thereby sample voice segments of the human voice and then connecting together the sampled voice segments. Each voice segment (particularly, a voice segment including a voiced sound, such as a vowel) is extracted out of the input voice with a boundary set at a time point where the waveform amplitude becomes substantially constant.
However, because the voice segment [s_a] has the end point T3 set after the stationary point T0, the conventional technique cannot always synthesize a natural voice. Since the stationary point T0 corresponds to a time point when the person has gradually opened his or her mouth into a fully-opened position for utterance of the voice, a voice synthesized using the voice segment extending over the entire region including the stationary point T0 would inevitably become imitative of a voice uttered by the person fully opening his or her mouth. However, when actually uttering a voice, a person does not necessarily do so by fully opening the mouth. For example, in singing a fast-tempo music piece, it is sometimes necessary for a singing person to utter a next word before fully opening the mouth to utter a given word. Also, to enhance a singing expression, a person may sing without sufficiently opening the mouth at an initial stage immediately after the beginning of a music piece and then gradually increase the opening degree of the mouth as the tune rises or livens up. Despite such circumstances, the conventional technique is arranged to merely synthesize voices fixedly using voice segments corresponding to fully-opened mouth positions, and thus it cannot appropriately synthesize subtle voices like those uttered with the mouth insufficiently opened.
It is possible, after a fashion, to synthesize voices corresponding to various opening degrees of the mouth by sampling a plurality of voice segments from different input voices uttered with various opening degrees of the mouth and selectively using any of the sampled voice segments. In this case, however, a multiplicity of voice segments must be prepared, involving a great amount of labor to create the voice segments; in addition, a storage device of a great capacity is required to hold the multiplicity of voice segments.
In view of the foregoing, it is an object of the present invention to appropriately synthesize a variety of voices without increasing the necessary number of voice segments.
To accomplish the above-mentioned object, the present invention provides an improved voice synthesis apparatus, which comprises: a phoneme acquisition section that acquires a voice segment including one or more phonemes; a boundary designation section that designates a boundary intermediate between start and end points of a vowel phoneme included in the voice segment acquired by the phoneme acquisition section; and a voice synthesis section that synthesizes a voice for a region of the vowel phoneme that precedes the designated boundary, or for a region of the vowel phoneme that succeeds the designated boundary.
According to the present invention, a boundary is designated intermediate between start and end points of a vowel phoneme included in a voice segment, and a voice is synthesized based on a region of the vowel phoneme that precedes the designated boundary or a region that succeeds it. Thus, as compared to the conventional technique where a voice is synthesized merely on the basis of an entire region of a voice segment, the present invention can synthesize diversified and natural voices. For example, by synthesizing a voice for a region, of a vowel phoneme included in a voice segment, before a waveform of the region reaches a stationary state, it is possible to synthesize a voice imitative of a real voice uttered by a person without sufficiently opening the mouth. Further, because the region to be used to synthesize a voice for a voice segment is variably designated, there is no need to prepare a multiplicity of voice segments with regions different among the segments. This does not mean, however, that the present invention excludes from its scope the idea or construction of, for example, preparing, for a same phoneme, a plurality of voice segments differing in pitch or dynamics (e.g., the construction disclosed in Japanese Patent Application Laid-open Publication No. 2002-202790).
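Purely as an illustration of how these three sections might interact, the following minimal sketch shows the acquisition, boundary designation, and synthesis stages in sequence; every name in it (acquire_segment, designate_boundary, synthesize, vowel_span) is a hypothetical stand-in, not the actual implementation.

```python
# Minimal sketch of the three-section flow described above; all names are
# hypothetical stand-ins, not the actual implementation.

def acquire_segment(store, key):
    """Phoneme acquisition section: fetch a stored voice segment."""
    return store[key]

def designate_boundary(segment, note_frames):
    """Boundary designation section: place a boundary between the start
    and end points of the vowel phoneme, never outside the vowel."""
    vowel_start, vowel_end = segment["vowel_span"]
    return min(vowel_end, vowel_start + note_frames)

def synthesize(segment, boundary):
    """Voice synthesis section: here, use only the region of the segment
    preceding the boundary (the rear phoneme is a vowel, as in [s_a])."""
    return segment["frames"][:boundary]

store = {"s_a": {"frames": list(range(100)), "vowel_span": (30, 100)}}
segment = acquire_segment(store, "s_a")
region = synthesize(segment, designate_boundary(segment, note_frames=40))
print(len(region))  # 70: the vowel is cut off before reaching its end point
```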
The “voice segment” used in the context of the present invention is a concept embracing both a “phoneme” that is an auditorily-distinguishable minimum unit obtained by dividing a voice (typically, a real voice of a person), and a phoneme sequence obtained by connecting together a plurality of such phonemes. The phoneme is either a consonant phoneme (e.g., [s]) or a vowel phoneme (e.g., [a]). The phoneme sequence, on the other hand, is obtained by connecting together a plurality of phonemes, representing a vowel or consonant, on the time axis, such as a combination of a consonant and a vowel (e.g., [s_a]), a combination of a vowel and a consonant (e.g., [i_t]) and a combination of successive vowels (e.g., [a_i]). The voice segment may be used in any desired form, e.g. as a waveform in the time domain (on the time axis) or as a spectrum in the frequency domain (on the frequency axis).
How or from which source the voice segment acquisition section acquires a voice segment may be chosen as desired. More specifically, a readout section for reading out a voice segment stored in a storage section may be employed as the voice segment acquisition section. For example, where the present invention is applied to synthesize singing voices, the voice segment acquisition section, employed in arrangements which include a storage section storing a plurality of voice segments and a lyric data acquisition section (corresponding to the “data acquisition section” in each embodiment to be detailed below) for acquiring lyric data designating lyrics or words of a music piece, acquires, from among the plurality of voice segments stored in the storage section, voice segments corresponding to the lyric data acquired by the lyric data acquisition section. Further, the voice segment acquisition section may be arranged to either acquire, through communication, voice segments retained by another communication terminal, or acquire voice segments by dividing or segmenting each voice input by the user. The boundary designation section designates a boundary at a time point intermediate between the start and end points of a vowel phoneme; it may also be interpreted as a means for designating a specific range defined by the boundary (e.g., a region between the start or end point of the vowel phoneme and the boundary).
For a voice segment where a region including an end point is a vowel phoneme (e.g., a voice segment comprising only a vowel phoneme, such as [a], or a phoneme sequence where the last phoneme is a vowel, such as [s_a] or [a_i]), a range of the voice segment is defined such that a time point at which a voice waveform of the vowel has reached a stationary state becomes the end point. When such a voice segment has been acquired by the voice segment acquisition section, the voice synthesis section synthesizes a voice based on a region preceding a boundary designated by the boundary designation section. With such arrangements, it is possible to synthesize a voice imitative of a real voice uttered by a person who has started gradually opening his or her mouth to utter the voice but has not yet fully opened it. For a voice segment where a region including a start point is a vowel phoneme (e.g., a voice segment comprising only a vowel phoneme, such as [a], or a phoneme sequence where the first phoneme is a vowel, such as [a_s] or [i_a]), a range of the voice segment is defined such that a time point at which a voice waveform of the vowel has reached a stationary state becomes the start point. When such a voice segment has been acquired by the voice segment acquisition section, the voice synthesis section synthesizes a voice based on a region succeeding a boundary designated by the boundary designation section. With such arrangements, it is possible to synthesize a voice imitative of a real voice uttered by a person while gradually closing his or her mouth after having opened the mouth partway.
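The two cases can be summarized in a short sketch; the frame-list representation and function name below are illustrative assumptions only.

```python
# Illustrative region-selection rule (frame lists and names are assumed).

def select_region(frames, boundary, vowel_at_end):
    if vowel_at_end:
        # e.g. [s_a]: keep the region preceding the boundary, imitating a
        # voice cut off before the mouth is fully opened.
        return frames[:boundary]
    # e.g. [a_s]: keep the region succeeding the boundary, imitating a
    # voice entered partway through closing the mouth.
    return frames[boundary:]
```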
The above-identified embodiments may be combined as desired. Namely, in one embodiment, the voice segment acquisition section acquires a first voice segment where a region including an end point is a vowel phoneme (e.g., a voice segment [s_a] as shown in
It is desirable that a time length of a region of a voice segment to be used in voice synthesis by the voice synthesis section be chosen in accordance with a duration time length of a voice to be synthesized. Thus, in one embodiment, there is further provided a time data acquisition section that acquires time data designating a duration time length of a voice (corresponding to the “data acquisition section” in the embodiments to be described later), and the boundary designation section designates a boundary in a vowel phoneme, included in the voice segment, at a time point corresponding to the duration time length designated by the time data. Where the present invention is applied to synthesize singing voices, the time data acquisition section acquires data indicative of a duration time length (i.e., note length) of a note constituting a music piece, as time data (corresponding to note data in the embodiments to be detailed below). Such arrangements can synthesize a natural voice corresponding to a predetermined duration time length. More specifically, when the voice segment acquisition section has acquired a voice segment where a region including an end point is a vowel, the boundary designation section designates, as a boundary, a time point of the vowel phoneme, included in the voice segment, closer to the end point as a longer time length is indicated by the time data, and the voice synthesis section synthesizes a voice on the basis of a region preceding the designated boundary. Further, when the voice segment acquisition section has acquired a voice segment where a region including a start point is a vowel, the boundary designation section designates, as a boundary, a time point of the vowel phoneme, included in the voice segment, closer to the start point as a longer time length is indicated by the time data, and the voice synthesis section synthesizes a voice on the basis of a region succeeding the designated boundary.
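A hedged sketch of such note-length-dependent placement follows; the linear mapping and the clamping to the vowel span are assumptions for illustration, not the expression disclosed in the embodiments.

```python
# Assumed linear mapping from note length to boundary position; the real
# expression used by the embodiments may differ.

def boundary_from_note_length(vowel_start, vowel_end, note_ms, vowel_at_end):
    offset = min(note_ms, vowel_end - vowel_start)  # stay inside the vowel
    if vowel_at_end:
        # Longer note -> boundary closer to the vowel's end point.
        return vowel_start + offset
    # Longer note -> boundary closer to the vowel's start point.
    return vowel_end - offset

print(boundary_from_note_length(30, 100, 40, vowel_at_end=True))   # 70
print(boundary_from_note_length(0, 70, 40, vowel_at_end=False))    # 30
```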
However, in the present invention, any desired way may be chosen to designate a boundary in a vowel phoneme. For example, in one embodiment, the voice synthesis apparatus further includes an input section that receives a parameter input thereto, and the boundary designation section designates a boundary at a time point of a vowel phoneme, included in a voice segment acquired by the voice segment acquisition section, corresponding to the parameter input to the input section. In this embodiment, each region of a voice segment, to be used for voice synthesis, is designated in accordance with a parameter input by the user via the input section, so that a variety of voices with the user's intent precisely reflected therein can be synthesized. Where the present invention is applied to synthesize singing voices, it is desirable that time points corresponding to a tempo of a music piece be set as boundaries. For example, when the voice segment acquisition section has acquired a voice segment where a region including an end point is a vowel phoneme, the boundary designation section designates, as a boundary, a time point of the vowel phoneme closer to the end point as a slower tempo of a music piece is designated, and the voice synthesis section synthesizes a voice on the basis of a region of the vowel phoneme preceding the boundary. When the voice segment acquisition section has acquired a voice segment where a region including a start point is a vowel phoneme, the boundary designation section designates, as a boundary, a time point of the vowel phoneme closer to the start point as a slower tempo of a music piece is designated, and the voice synthesis section synthesizes a voice on the basis of a region of the vowel phoneme succeeding the boundary.
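For the tempo-controlled variant, a sketch under stated assumptions could look as follows; the 120 BPM reference and the base offset are hypothetical tuning constants invented for illustration.

```python
# Assumed tempo-to-boundary mapping: a slower tempo (smaller BPM) yields a
# larger offset into the vowel; reference_bpm and base_offset_ms are
# hypothetical tuning constants, not values from the disclosure.

def boundary_from_tempo(vowel_start, vowel_end, tempo_bpm, vowel_at_end,
                        reference_bpm=120.0, base_offset_ms=60.0):
    offset = min(base_offset_ms * reference_bpm / tempo_bpm,
                 vowel_end - vowel_start)
    return vowel_start + offset if vowel_at_end else vowel_end - offset

print(boundary_from_tempo(30, 100, tempo_bpm=60, vowel_at_end=True))   # 100.0
print(boundary_from_tempo(30, 100, tempo_bpm=240, vowel_at_end=True))  # 60.0
```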
The voice synthesis apparatus may be implemented not only by hardware, such as a DSP (Digital Signal Processor) dedicated to voice synthesis, but also by a combination of a personal computer or other computer and a program. For example, the program causes the computer to perform: a phoneme acquisition operation for acquiring a voice segment including one or more phonemes; a boundary designation operation for designating a boundary intermediate between start and end points of a vowel phoneme included in the voice segment acquired by the phoneme acquisition operation; and a voice synthesis operation for synthesizing a voice for a region, of the vowel phoneme included in the voice segment acquired by the phoneme acquisition operation, preceding the boundary designated by the boundary designation operation, or a region of the vowel phoneme succeeding the designated boundary. This program too can achieve the benefits as set forth above in relation to the voice synthesis apparatus of the invention. The program of the invention may be supplied to the user in a transportable storage medium and then installed in a computer, or may be delivered from a server apparatus via a communication network and then installed in a computer.
The present invention is also implemented as a voice synthesis method comprising: a phoneme acquisition step of acquiring a voice segment including one or more phonemes; a boundary designation step of designating a boundary intermediate between start and end points of a vowel phoneme included in the voice segment acquired by the phoneme acquisition step; and a voice synthesis step of synthesizing a voice for a region, of the vowel phoneme included in the voice segment acquired by the phoneme acquisition step, preceding the boundary designated by the boundary designation step, or a region of the vowel phoneme succeeding the designated boundary. This method too can achieve the benefits as stated above in relation to the voice synthesis apparatus.
The following will describe embodiments of the present invention, but it should be appreciated that the present invention is not limited to the described embodiments and various modifications of the invention are possible without departing from the basic principles. The scope of the present invention is therefore to be determined solely by the appended claims.
For better understanding of the objects and other features of the present invention, its preferred embodiments will be described hereinbelow in greater detail with reference to the accompanying drawings, in which:
Now, a detailed description will be made about embodiments of the present invention where the basic principles of the invention are applied to synthesis of singing voices of a music piece.
First, a description will be given about a general setup of a voice synthesis apparatus in accordance with a first embodiment of the present invention, with reference to
The data acquisition section 10 of
The storage section 20 is a means for storing data indicative of voice segments (hereinafter referred to as “voice segment data”). The storage section 20 is in the form of any of various storage devices, such as a hard disk device containing a magnetic disk or a device for driving a removable or transportable storage medium typified by a CD-ROM. In the instant embodiment, the voice segment data is indicative of frequency spectra of a voice segment, as will be later described. Procedures for creating such voice segment data will be described with primary reference to
In (a1) of
In (b1) of
A voice segment, having its time-axial range demarcated in the above-described manner, is divided into frames F each having a predetermined time length (e.g., in a range of 5 ms to 10 ms). As seen in (a1) of
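As a rough illustration of this frame division, assuming 16 kHz mono samples and an 8 ms frame (both values picked for the example, within the stated 5 ms to 10 ms range), each frame's frequency spectrum might be computed and stored as one piece of unit data:

```python
# Assumption-level sketch: divide a segment into fixed-length frames F and
# store one frequency spectrum per frame as the unit data D.
import numpy as np

def segment_to_unit_data(samples, sr=16000, frame_ms=8):
    frame_len = int(sr * frame_ms / 1000)        # e.g. 128 samples per frame F
    n_frames = len(samples) // frame_len
    unit_data = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        unit_data.append(np.fft.rfft(frame))     # one frequency spectrum per frame
    return unit_data                             # the sequence of unit data D

waveform = np.random.randn(16000)                # stand-in for 1 s of voice
print(len(segment_to_unit_data(waveform)))       # 125 frames of 8 ms each
```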
As shown in
The boundary designation section 33 is a means for designating a boundary (hereinafter referred to as “phoneme segmentation boundary”) Bseg in the voice segments acquired by the voice segment acquisition section 31. As seen in (a1) and (a2) or (b1) and (b2) of
The voice synthesis section 35 shown in
Sometimes, merely connecting together a plurality of voice segments cannot provide a desired note length. Further, if voice segments of different tone colors are connected, there is a possibility of noise unpleasant to the ear being produced at a connection between the voice segments. To avoid such inconveniences, the voice synthesis section 35 in the instant embodiment includes an interpolation section 351 that is a means for filling or interpolating a gap Cf between the voice segments. For example, the interpolation section 351, as shown in (c) of
Further, the output processing section 41 shown in
Next, a description will be given about the behavior of the voice synthesis apparatus D.
The voice segment acquisition section 31 of the voice processing section 30 sequentially reads out voice segment data, corresponding to lyric data supplied from the data acquisition section 10, from the storage section 20 and outputs the thus read-out voice segment data to the boundary designation section 33. Here, let it be assumed that letters “sa” have been designated by the lyric data. In this case, the voice segment acquisition section 31 reads out, from the storage section 20, voice segment data corresponding to voice segments, [#_s], [s_a] and [a_#], and outputs the read-out voice segment data to the boundary designation section 33 in the order mentioned.
In turn, the boundary designation section 33 designates phoneme segmentation boundaries Bseg for the voice segment data sequentially supplied from the voice segment acquisition section 31.
If, on the other hand, the voice segment includes a vowel phoneme as determined at step S1, the boundary designation section 33 makes a determination, at step S3, as to whether the front phoneme of the voice segment indicated by the voice segment data is a vowel phoneme. If answered in the affirmative at step S3, the boundary designation section 33 designates, at step S4, a phoneme segmentation boundary Bseg such that the time length from the end point of the vowel phoneme, as the front phoneme, of the voice segment to the phoneme segmentation boundary Bseg corresponds to the note length indicated by the note data. For example, the voice segment [a_#] to be used for synthesizing the voice “sa” has a vowel as the front phoneme, and thus, when the voice segment data indicative of the voice segment [a_#] has been supplied from the voice segment acquisition section 31, the boundary designation section 33 designates a phoneme segmentation boundary Bseg through the operation of step S4. Specifically, with a longer note length, an earlier time point on the time axis, i.e. earlier than the end point Tb2 of the vowel phoneme [a], is designated as a phoneme segmentation boundary Bseg, as shown in (b1) and (b2) of
Then, the boundary designation section 33 determines, at step S5, whether the rear phoneme of the voice segment indicated by the voice segment data is a vowel. If answered in the negative, the boundary designation section 33 jumps over step S6 to step S7. If, on the other hand, the rear phoneme of the voice segment indicated by the voice segment data is a vowel as determined at step S5, the boundary designation section 33 designates, at step S6, a phoneme segmentation boundary Bseg such that the time length from the start point of the vowel as the rear phoneme of the voice segment to the phoneme segmentation boundary Bseg corresponds to the note length indicated by the note data. For example, the voice segment [s_a] to be used for synthesizing the voice “sa” has a vowel as the rear phoneme, and thus, when the voice segment data indicative of the voice segment [s_a] has been supplied from the voice segment acquisition section 31, the boundary designation section 33 designates a phoneme segmentation boundary Bseg through the operation of step S6. Specifically, with a longer note length, a later time point on the time axis, i.e. later than the start point Ta2 of the rear phoneme [a], is designated as a phoneme segmentation boundary Bseg, as shown in (a1) and (a2) of
Once the boundary designation section 33 designates the phoneme segmentation boundary Bseg through the above-described procedures, it adds a marker, indicative of the position of the phoneme segmentation boundary Bseg, to the voice segment data and then outputs the thus-marked voice segment data to the voice synthesis section 35, at step S7. Note that, for each voice segment where the front and rear phonemes are each a vowel (e.g., [a_i]), both of the operations at steps S4 and S6 are carried out. Thus, for such a type of voice segment, a phoneme segmentation boundary Bseg (e.g., Bseg1, Bseg2) is designated for each of the front and rear phonemes, as illustrated in
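The flow of steps S1 to S7 might be sketched as follows; the segment layout (per-phoneme spans in milliseconds) and the direct use of the note length as an offset are simplifying assumptions, not the embodiment's exact expression.

```python
# Hedged sketch of the boundary-designation procedure (steps S1 to S7).

def designate_boundaries(segment, note_ms):
    markers = {}
    if not (segment["front_is_vowel"] or segment["rear_is_vowel"]):  # S1
        return markers          # S2: the whole segment, up to its end point, is used
    if segment["front_is_vowel"]:                                    # S3
        start, end = segment["front_span"]                           # S4
        markers["front"] = max(start, end - note_ms)  # longer note -> earlier boundary
    if segment["rear_is_vowel"]:                                     # S5
        start, end = segment["rear_span"]                            # S6
        markers["rear"] = min(end, start + note_ms)   # longer note -> later boundary
    return markers              # S7: markers are attached to the voice segment data

seg = {"front_is_vowel": False, "rear_is_vowel": True,
       "front_span": (0, 30), "rear_span": (30, 100)}  # e.g. [s_a] in milliseconds
print(designate_boundaries(seg, note_ms=40))           # {'rear': 70}
```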
Then, the voice synthesis section 35 connects together the plurality of voice segments to generate voice synthesizing data. Namely, the voice synthesis section 35 first selects a subject data group from the voice segment data supplied from the boundary designation section 33. The way to select the subject data groups will be described in detail individually for a case where the supplied voice segment data represents a voice segment including no vowel, a case where the supplied voice segment data represents a voice segment whose front phoneme is a vowel, and a case where the supplied voice segment data represents a voice segment whose rear phoneme is a vowel.
For the voice segment including no vowel, the end point of the voice segment is set, at step S2 of
Namely, once the voice segment data of the voice segment, where the rear phoneme is a vowel, is supplied along with the marker, the voice synthesis section 35 extracts, as a subject data group, the unit data D belonging to a region that precedes the phoneme segmentation boundary Bseg indicated by the marker. Now consider a case where voice segment data, including unit data D1 to Dl corresponding to a front phoneme [s] and unit data D1 to Dm corresponding to a rear phoneme [a] (vowel phoneme) as illustratively shown in (a2) of
Once the voice segment data of the voice segment, where the front phoneme is a vowel, is supplied along with the marker, the voice synthesis section 35 extracts, as a subject data group, the unit data D belonging to a region that succeeds the phoneme segmentation boundary Bseg indicated by the marker. Now consider a case where voice segment data, including unit data D1 to Dn corresponding to a front phoneme [a] of a voice segment [a_#] as illustratively shown in (b2) of
Further, for the voice segment where the front and rear phonemes are each a vowel, unit data D belonging to a region from a phoneme segmentation boundary Bseg, designated for the front phoneme, to the end point of the front phoneme and unit data D belonging to a region from the start point of the rear phoneme to a phoneme segmentation boundary Bseg designated for the rear phoneme are extracted as a subject data group. For example, for a voice segment [a_i] comprising a combination of the front and rear phonemes [a] and [i] that are each a vowel as illustratively shown in
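Selection of the subject data group for these cases can be illustrated as below; the per-phoneme unit-data lists and index-valued markers are hypothetical representations of the structures described above.

```python
# Illustrative subject-data-group selection; names and layout are assumed.

def extract_subject_group(units_front, units_rear, markers):
    # A front vowel keeps only the unit data succeeding its boundary.
    front = units_front[markers["front"]:] if "front" in markers else units_front
    # A rear vowel keeps only the unit data preceding its boundary.
    rear = units_rear[:markers["rear"]] if "rear" in markers else units_rear
    return front + rear

# [s_a]: the consonant [s] is kept whole; the vowel [a] is truncated at Bseg.
group = extract_subject_group(list(range(10)), list(range(20)), {"rear": 12})
print(len(group))  # 22 unit data: 10 for [s] plus 12 for the head of [a]
```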
Once the subject data groups of successive voice segments are designated through the above-described operations, the interpolation section 351 of the voice synthesis section 35 generates interpolating unit data Df for filling a gap Cf between the voice segments. More specifically, the interpolation section 351 generates interpolating unit data Df through linear interpolation using the last unit data D in the subject data group of the preceding voice segment and the first unit data D in the subject data group of the succeeding voice segment. In a case where the voice segments [s_a] and [a_#] are to be interconnected as shown in
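A minimal sketch of this linear interpolation, assuming each unit data D is a spectrum vector and the gap Cf spans a known number of frames:

```python
# Linear interpolation filling the gap Cf between two subject data groups.
import numpy as np

def interpolate_gap(last_unit, first_unit, n_frames):
    """Generate n_frames of interpolating unit data Df between two spectra."""
    return [last_unit + (first_unit - last_unit) * (i + 1) / (n_frames + 1)
            for i in range(n_frames)]

a = np.array([1.0, 0.5])         # last unit data D of the preceding segment
b = np.array([3.0, 1.5])         # first unit data D of the succeeding segment
for df in interpolate_gap(a, b, 3):
    print(df)                    # steps linearly from a toward b across the gap Cf
```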
Then, the voice synthesis section 35 performs predetermined operations on the individual unit data D, including the interpolating unit data Df generated by the interpolation operation, to generate voice synthesizing data. The predetermined operations performed here include an operation for adjusting a voice pitch, indicated by the individual unit data D, to a pitch designated by the note data. The pitch adjustment may be performed using any one of the conventionally-known schemes. For example, the pitch may be adjusted by displacing the frequency spectra, indicated by the individual unit data D, along the frequency axis by an amount corresponding to the pitch designated by the note data. Further, the voice synthesis section 35 may perform an operation for imparting any of various effects to the voice represented by the voice synthesizing data. For example, when the note length is relatively long, slight fluctuation or vibrato may be imparted to the voice represented by the voice synthesizing data. The voice synthesizing data generated in the above-described manner is output to the output processing section 41. The output processing section 41 outputs the voice synthesizing data after converting the data into an output voice signal of the time domain.
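The frequency-axis displacement mentioned above might, in a crude bin-shift form, be sketched as follows; practical schemes displace spectral peaks more carefully, so this is an assumption-level illustration only.

```python
# Crude illustration of pitch adjustment by displacing a frame's spectrum
# along the frequency axis (zero-filled at the vacated bins).
import numpy as np

def shift_spectrum(spectrum, bins):
    """Displace a spectrum by `bins` along the frequency axis."""
    shifted = np.zeros_like(spectrum)
    if bins >= 0:
        shifted[bins:] = spectrum[:len(spectrum) - bins]
    else:
        shifted[:bins] = spectrum[-bins:]
    return shifted

spec = np.array([0.0, 1.0, 0.2, 0.0, 0.0])  # a spectral peak at bin 1
print(shift_spectrum(spec, 2))              # peak displaced up to bin 3
```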
As set forth above, the instant embodiment can vary the position of the phoneme segmentation boundary Bseg that defines a region of a voice segment to be supplied for the subsequent voice synthesis processing. Thus, as compared to the conventional technique where a voice is synthesized merely on the basis of an entire region of a voice segment, the present invention can synthesize diversified and natural voices. For example, when a time point, of a vowel phoneme included in a voice segment, before a waveform reaches a stationary state has been designated as a phoneme segmentation boundary Bseg, it is possible to synthesize a voice imitative of a real voice uttered by a person without sufficiently opening the mouth. Further, because a phoneme segmentation boundary Bseg can be variably designated for one voice segment, there is no need to prepare a multiplicity of voice segment data with different regions (e.g., a multiplicity of voice segment data corresponding to various different opening degrees of the mouth of a person).
In many cases, lyrics of a music piece where each tone has a relatively short note length vary at a high pace. It is necessary for a singer of such a music piece to sing at high speed, e.g. by uttering a next word before sufficiently opening his or her mouth to utter a given word. On the basis of such a tendency, the instant embodiment is arranged to designate a phoneme segmentation boundary Bseg in accordance with a note length of each tone constituting a music piece. Where each tone has a relatively short note length, such arrangements of the invention allow a synthesized voice to be generated using a region of each voice segment whose waveform has not yet reached a stationary state, so that it is possible to synthesize a voice imitative of a real voice uttered by a person (singing person) as the person sings at high speed without sufficiently opening his or her mouth. Where each tone has a relatively long note length, on the other hand, the arrangements of the invention allow a synthesized voice to be generated by also using a region of each voice segment whose waveform has reached the stationary state, so that it is possible to synthesize a voice imitative of a real voice uttered by a person as the person sings with his or her mouth sufficiently opened. Thus, the instant embodiment can synthesize natural singing voices corresponding to a music piece.
Further, according to the instant embodiment, a voice is synthesized on the basis of both a region, of a voice segment whose rear phoneme is a vowel, extending up to an intermediate or along-the-way point of the vowel and a region, of another voice segment whose front phoneme is a vowel, extending from an along-the-way point of the vowel. As compared to the technique where a phoneme segmentation boundary Bseg is designated for only one voice segment, the inventive arrangements can reduce differences between characteristics at and near the end point of a preceding voice segment and characteristics at and near the start point of a succeeding voice segment, so that the successive voice segments can be smoothly interconnected to synthesize a natural voice.
Next, a description will be made about a voice synthesis apparatus D in accordance with a second embodiment of the present invention, with reference
As shown in the figure, the voice synthesis apparatus D in accordance with the second embodiment differs from the first embodiment in that it further includes an input section 38 via which the user inputs a parameter.
Once voice segment data is supplied to the voice segment acquisition section 31 in the voice synthesis apparatus D, a time point, in a vowel of the voice segment indicated by the supplied voice segment data, corresponding to a parameter input via the input section 38, is designated as a phoneme segmentation boundary Bseg. More specifically, at step S4 of
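Under the assumption of a normalized 0-to-1 parameter and the rear-vowel case, the parameter-to-boundary relation could be sketched as below; the range and the linear scaling are illustrative, not the disclosed expression.

```python
# Assumed parameter-controlled boundary for a segment whose rear phoneme is
# a vowel; the 0-1 parameter range and linear scaling are illustrative.

def boundary_from_parameter(vowel_start, vowel_end, opening):
    """opening = 0: barely opened mouth; opening = 1: fully stationary waveform."""
    return vowel_start + opening * (vowel_end - vowel_start)

print(boundary_from_parameter(30, 100, 0.25))  # 47.5: early cut, subdued voice
```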
The second embodiment too allows the position of the phoneme segmentation boundary Bseg to be variable and thus can achieve the same benefits as the first embodiment; that is, the second embodiment too can synthesize a variety of voices without having to increase the number of voice segments. Further, because the position of the phoneme segmentation boundary Bseg can be controlled in accordance with a parameter input by the user, a variety of voices can be synthesized with the user's intent precisely reflected therein. For example, there is a singing style where a singer sings without sufficiently opening the mouth at an initial stage immediately after a start of a music piece performance and then increases the opening degree of the mouth as the tune rises or livens up. The instant embodiment can reproduce such a singing style by varying the parameter in accordance with progression of a music piece performance.
The above-described embodiments may be modified variously as explained by way of example below, and the modifications to be explained may be combined as necessary.
(1) The arrangements of the above-described first and second embodiments may be used in combination. Namely, the position of the phoneme segmentation boundary Bseg may be controlled in accordance with both a note length designated by note data and a parameter input via the input section 38. However, the position of the phoneme segmentation boundary Bseg may be controlled in any desired manner; for example, it may be controlled in accordance with a tempo of a music piece. Namely, for a voice segment where the front phoneme is a vowel, the faster the tempo of the music piece, the later the time point on the time axis that is designated as a phoneme segmentation boundary Bseg, while, for a voice segment where the rear phoneme is a vowel, the faster the tempo of the music piece, the earlier the time point on the time axis that is designated as a phoneme segmentation boundary Bseg. Further, data indicative of a position of a phoneme segmentation boundary Bseg may be provided in advance for each tone of a music piece so that the boundary designation section 33 designates a phoneme segmentation boundary Bseg on the basis of the data. Namely, in the present invention, it is only necessary that the phoneme segmentation boundary Bseg to be designated in a vowel phoneme be variable in position, and each phoneme segmentation boundary Bseg may be designated in any desired manner.
(2) In the above-described embodiments, the boundary designation section 33 outputs voice segment data to the voice synthesis section 35 after attaching the above-mentioned marker to the segment data, and the voice synthesis section 35 discards unit data D other than a selected subject data group. In an alternative, the boundary designation section 33 may discard the unit data D other than the selected subject data group. Namely, in the alternative, the boundary designation section 33 extracts the subject data group from the voice segment data on the basis of a phoneme segmentation boundary Bseg, and then supplies the extracted subject data group to the voice synthesis section 35, discarding the unit data D other than the subject data group. Such inventive arrangements can eliminate the need for attaching the marker to the voice segment data.
(3) The form of the voice segment data may be other than the above-described. For example, data indicative of spectral envelopes of individual frames F of each voice segment may be stored and used as voice segment data. In another alternative, data indicative of a waveform, on the time axis, of each voice segment may be stored and used as voice segment data. In still another alternative, the waveform of the voice segment may be divided, by the SMS (Spectral Modeling Synthesis) technique, into a deterministic component and a stochastic component, and data indicative of the individual components may be stored and used as voice segment data. In this case, both the deterministic component and the stochastic component are subjected to various operations by the boundary designation section 33 and the voice synthesis section 35, and the thus-processed deterministic and stochastic components are added together by an adder provided at a stage following the voice synthesis section 35. Alternatively, after each voice segment is divided into frames F, amounts of a plurality of characteristics related to spectral envelopes of the individual divided frames F of the voice segment, such as frequencies and gains at peaks of the spectral envelopes or overall inclinations of the spectral envelopes, may be extracted so that a set of parameters indicative of these amounts of characteristics is stored and used as voice segment data. Namely, in the present invention, the voice segments may be stored or retained in any desired form.
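For instance, the SMS-style form might store per-frame component pairs along the following lines; this dataclass layout is a guess at one possible representation, not the actual data format.

```python
# Hypothetical layout for SMS-style voice segment data: each frame F keeps
# a deterministic (harmonic) and a stochastic (residual) component, which
# are processed separately and summed by an adder after synthesis.
from dataclasses import dataclass
import numpy as np

@dataclass
class SmsFrame:
    deterministic: np.ndarray  # harmonic component spectrum of the frame
    stochastic: np.ndarray     # residual noise spectrum of the frame

def adder_stage(frame: SmsFrame) -> np.ndarray:
    """Recombine the two processed components into one output spectrum."""
    return frame.deterministic + frame.stochastic
```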
(4) Whereas the embodiments have been described as including the interpolation section 351 for interpolating a gap Cf between voice segments, such interpolation is not necessarily essential. For example, there may be prepared a voice segment [a] to be inserted between the voice segments [s_a] and [a_#], and the time length of the voice segment [a] may be adjusted in accordance with a note length so as to adjust a synthesized voice. Further, although the embodiments have been described as linearly interpolating a gap Cf between voice segments, the interpolation may be performed in any other desired manner. For example, curve interpolation, such as spline interpolation, may be performed. In another alternative, interpolation may be performed on extracted parameters indicative of spectral envelope shapes of the voice segments (e.g., peaks and overall inclinations of the spectral envelopes).
(5) The first embodiment has been described above as designating phoneme segmentation boundaries Bseg for both a voice segment where the front phoneme is a vowel and a voice segment where the rear phoneme is a vowel on the basis of the same or common mathematical expression ({(t−40)/2}). The way to designate the phoneme segmentation boundaries Bseg may differ between two such voice segments.
(6) Further, whereas the embodiments have been described as applied to an apparatus for synthesizing singing voices, the basic principles of the invention are of course applicable to any other apparatus. For example, the present invention may be applied to an apparatus which reads out a string of letters on the basis of document data (e.g., a text file). Namely, the voice segment acquisition section 31 may read out voice segment data from the storage section 20, on the basis of letter codes included in the text file, so that a voice is synthesized on the basis of the read-out voice segment data. This type of apparatus cannot use the factor “note length” to designate a phoneme segmentation boundary Bseg, unlike in the case where a singing voice of a music piece is synthesized; however, if data designating a duration time length of each letter is prepared in advance in association with the document data, the apparatus can control the phoneme segmentation boundary Bseg in accordance with the time length indicated by the data. The “time data” used in the context of the present invention thus represents a concept embracing all types of data designating duration time lengths of voices, including not only data designating note lengths of tones constituting a music piece (“note data” in the above-described first embodiment) but also data designating sounding times of letters as explained in this modified example. Note that, in the above-described document reading apparatus too, there may be employed arrangements for controlling the position of the phoneme segmentation boundary Bseg on the basis of a user-input parameter, as in the second embodiment.
Foreign Application Priority Data

Number | Date | Country | Kind
---|---|---|---
2004-209033 | Jul 2004 | JP | national
U.S. Patent Documents

Number | Name | Date | Kind
---|---|---|---
4278838 | Antonov | Jul 1981 | A |
6029131 | Bruckert | Feb 2000 | A |
6308156 | Barry et al. | Oct 2001 | B1 |
6332123 | Kaneko et al. | Dec 2001 | B1 |
6785652 | Bellegarda et al. | Aug 2004 | B2 |
6836761 | Kawashima et al. | Dec 2004 | B1 |
20010032079 | Okutani et al. | Oct 2001 | A1 |
20020184006 | Yoshioka et al. | Dec 2002 | A1 |
20030009336 | Kenmochi et al. | Jan 2003 | A1 |
20030009344 | Kayama et al. | Jan 2003 | A1 |
20030093280 | Oudeyer | May 2003 | A1 |
20030159568 | Kemmochi et al. | Aug 2003 | A1 |
20030221542 | Kenmochi et al. | Dec 2003 | A1 |
20050137871 | Capman et al. | Jun 2005 | A1 |
20060085196 | Kayama et al. | Apr 2006 | A1 |
Foreign Patent Documents

Number | Date | Country
---|---|---
0 144 731 | Jun 1985 | EP |
1 220 194 | Jul 2002 | EP |
2002-73069 | Mar 2002 | JP |
2002-202790 | Jul 2002 | JP |
2003-255974 | Sep 2003 | JP |
Publication

Number | Date | Country
---|---|---
20060015344 A1 | Jan 2006 | US