The present disclosure relates to a technology to synthesize a sound.
A fragment connection type sound synthesizing technology has conventionally been proposed in which the duration and the utterance content (for example, lyrics) are specified for each unit of synthesis such as a musical note (hereinafter, referred to as “unit sound”) and a plurality of sound fragments corresponding to the utterance content of each unit sound are interconnected to thereby generate a desired synthesized sound. According to JP-B-4265501, a sound fragment corresponding to a vowel phoneme among a plurality of phonemes corresponding to the utterance content of each unit sound is prolonged, whereby a synthesized sound which is the utterance content of each unit sound uttered over a desired duration can be generated.
There are cases where a polyphthong (a diphthong or a triphthong) consisting of a plurality of vowels coupled together is specified as the utterance content of one unit sound. To ensure a sufficient duration for such a unit sound, a configuration is conceivable in which the sound fragment of the first vowel of the polyphthong is prolonged. However, when the object to be prolonged is fixed to the first vowel of the unit sound, the synthesized sounds that can be generated are limited. For example, assume that the utterance content “fight” (one syllable), which contains a polyphthong where a vowel phoneme /a/ and a vowel phoneme /I/ are continuous in one syllable, is specified as one unit sound. A synthesized sound “[fa:It]” where the first phoneme /a/ of the polyphthong is prolonged can be generated, but a synthesized sound “[faI:t]” where the rear phoneme /I/ is prolonged cannot (the symbol “:” denotes a prolonged sound). While a polyphthong is used as the example above, a similar problem can occur whenever a plurality of phonemes are continuous in one syllable, irrespective of whether they are vowels or consonants. In view of the above circumstances, an object of the present disclosure is to generate a variety of synthesized sounds by easing this restriction on the prolongation of sound fragments.
In order to achieve the above object, according to the present invention, there is provided a sound synthesizing method comprising:
acquiring synthesis information which specifies a duration and an utterance content for each unit sound;
setting whether prolongation is permitted or inhibited for each of a plurality of phonemes corresponding to the utterance content of the each unit sound; and
generating a synthesized sound corresponding to the synthesis information by connecting a plurality of sound fragments corresponding to the utterance content of the each unit sound,
wherein in the generating process, a sound fragment corresponding to the phoneme the prolongation of which is permitted, among a plurality of phonemes corresponding to the utterance content of the each unit sound, is prolonged in accordance with the duration of the unit sound.
For example, in the setting process, whether the prolongation of each of the phonemes is permitted or inhibited is set in response to an instruction from a user.
For example, the sound synthesizing method further comprises: displaying a set image which provides a plurality of phonemes corresponding to the utterance content of a unit sound selected by the user among a plurality of unit sounds specified by the synthesis information for accepting from the user an instruction as to whether the prolongation of each of the phonemes is permitted or inhibited.
For example, the sound synthesizing method further comprises: displaying on a display device a phonemic symbol of each of the plurality of phonemes corresponding to the utterance content of the each unit sound so that a phoneme the prolongation of which is permitted and a phoneme the prolongation of which is inhibited are displayed in different display modes.
For example, in the display modes, a phonemic symbol having at least one of highlighting, an underlined part, a circle, and a dot is applied to the phoneme the prolongation of which is permitted.
For example, in the setting process, whether the prolongation is permitted or inhibited is set for a sustained phoneme, that is, a phoneme which is sustainable timewise, among the plurality of phonemes corresponding to the utterance content of the each unit sound.
For example, the sound synthesizing method further comprises: displaying a set image which provides a plurality of phonemes corresponding to the utterance content of a unit sound selected by the user among a plurality of unit sounds specified by the synthesis information for accepting from the user an instruction as to durations of the phonemes, wherein in the setting process, the sound fragments corresponding to the utterance content of the unit sound are prolonged so that the durations of the phonemes corresponding to the utterance content of the unit sound conform with a ratio among the durations of the phonemes specified by the instruction accepted in the set image.
According to the present invention, there is also provided a sound synthesizing apparatus comprising:
a processor coupled to a memory, the processor configured to execute computer-executable units comprising:
wherein the sound synthesizer prolongs, among a plurality of phonemes corresponding to the utterance content of the each unit sound, a sound fragment corresponding to the phoneme the prolongation of which is permitted, in accordance with the duration of the unit sound.
According to the present invention, there is also provided a computer-readable medium having stored thereon a program for causing a computer to implement the sound synthesizing method.
According to the present invention, there is also provided a sound synthesizing method comprising:
acquiring synthesis information which specifies a duration and an utterance content for each unit sound;
setting whether prolongation is permitted or inhibited for at least one of a plurality of phonemes corresponding to the utterance content of the each unit sound; and
generating a synthesized sound corresponding to the synthesis information by connecting a plurality of sound fragments corresponding to the utterance content of the each unit sound,
wherein in the generating process, a sound fragment corresponding to the phoneme the prolongation of which is permitted, among a plurality of phonemes corresponding to the utterance content of the each unit sound, is prolonged in accordance with the duration of the unit sound.
The above objects and advantages of the present disclosure will become more apparent by describing in detail preferred exemplary embodiments thereof with reference to the accompanying drawings, wherein:
The arithmetic processing unit 12 executes a program PGM stored in the storage device 14, thereby implementing a plurality of functions (a display controller 32, an information acquirer 34, a prolongation setter 36 and a sound synthesizer 38) for generating the sound signal S. The following configurations may also be adopted: a configuration in which the functions of the arithmetic processing unit 12 are distributed to a plurality of apparatuses; and a configuration in which a dedicated electronic circuit (for example, DSP) implements some of the functions of the arithmetic processing unit 12.
The display device 22 (for example, a liquid crystal display panel) displays an image specified by the arithmetic processing unit 12. The input device 24 is a device (for example, a mouse or a keyboard) that accepts instructions from the user. A touch panel structured integrally with the display device 22 may be adopted as the input device 24. The sound emitting device 26 (for example, a headphone or a speaker) reproduces a sound corresponding to the sound signal S generated by the arithmetic processing unit 12.
The storage device 14 stores the program PGM executed by the arithmetic processing unit 12 and various pieces of data (a sound fragment group DA, synthesis information DB) used by the arithmetic processing unit 12. A known recording medium such as a semiconductor storage medium or a magnetic recording medium, or a combination of a plurality of kinds of recording media can be freely adopted as the storage device 14.
The sound fragment group DA is a sound synthesis library constituted by the pieces of fragment data P of a plurality of kinds of sound fragments used as sound synthesis materials. The pieces of fragment data P each define, for example, the sample series of the waveform of the sound fragment in the time domain and the spectrum of the sound fragment in the frequency domain. The sound fragments are each an individual phoneme (for example, a vowel or a consonant) which is the minimum unit when a sound is divided from a linguistic point of view (monophone), or a phoneme chain where a plurality of phonemes are coupled together (for example, a diphone or a triphone). The fragment data P of the sound fragment of the individual phoneme expresses the section, in which the waveform is stable, of the sound of continuous utterance of the phoneme (the section during which the acoustic feature is maintained stationary). On the other hand, the fragment data P of the sound fragment of the phoneme chain expresses the utterance of transition from a preceding phoneme to a succeeding phoneme.
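The relationship between individual-phoneme fragments and phoneme-chain fragments can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each fragment is identified by a string such as "f-V" for a phoneme chain and "V" for an individual phoneme, that silence is written "Sil", and that only the phonemes listed in `stationary` have an individual-phoneme fragment. The function name `fragment_sequence` and these identifiers are hypothetical and are not part of the sound fragment group DA itself.

```python
def fragment_sequence(phonemes, stationary, silence="Sil"):
    """Return the ordered fragment names for one unit sound.

    phonemes   -- phonemic symbols of the unit sound, in utterance order
    stationary -- phonemes assumed to have an individual-phoneme fragment
    """
    chain = [silence] + list(phonemes) + [silence]
    fragments = []
    for left, right in zip(chain, chain[1:]):
        # Phoneme-chain fragment: transition from `left` to `right`.
        fragments.append(f"{left}-{right}")
        # Individual-phoneme fragment: the stable section of `right`.
        if right in stationary:
            fragments.append(right)
    return fragments


if __name__ == "__main__":
    print(fragment_sequence(["f", "V", "n"], stationary={"V", "n"}))
```

In this sketch the stable sections "V" and "n" sit between the transitions that lead into and out of them, which is where a prolongation would be applied.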
Phonemes are divided into phonemes the utterance of which is sustainable timewise (hereinafter, referred to as “sustained phonemes”) and phonemes the utterance of which is not sustained (or is difficult to sustain) timewise (hereinafter, referred to as “non-sustained phonemes”). While a typical example of the sustained phonemes is vowels, consonants such as affricates, fricatives and liquids (nasals) (voiced consonants, voiceless consonants) can be included in the sustained phonemes. On the other hand, the non-sustained phonemes are phonemes the utterance of which is momentarily executed (for example, a phoneme uttered through a temporary deformation of the vocal tract that is in a closed state). For example, plosives are a typical example of the non-sustained phonemes. There is a difference that the sustained phonemes can be prolonged timewise whereas the non-sustained phonemes are difficult to prolong timewise with an auditorily natural sound being maintained.
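The division into sustained and non-sustained phonemes can be sketched as a simple lookup. The table `MANNER` below is a hypothetical, deliberately tiny mapping from phonemic symbols to manners of articulation, chosen only to cover the symbols used in the examples of this description; a practical table would cover the full phoneme set of the language.

```python
# Manners of articulation treated as sustainable in this sketch.
SUSTAINED_MANNERS = {"vowel", "affricate", "fricative", "liquid", "nasal"}

# Hypothetical lookup table; it covers only a few example phonemes.
MANNER = {
    "a": "vowel",
    "V": "vowel",
    "f": "fricative",
    "h": "fricative",
    "n": "nasal",
    "t": "plosive",
}


def is_sustained(phoneme):
    """Return True when the phoneme can be sustained timewise."""
    return MANNER[phoneme] in SUSTAINED_MANNERS


def sustained_phonemes(phonemes):
    """Keep only the phonemes for which a prolongation setting is meaningful."""
    return [p for p in phonemes if is_sustained(p)]


if __name__ == "__main__":
    print(sustained_phonemes(["f", "a", "t"]))
```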
The synthesis information DB stored in the storage device 14 is data (score data) that chronologically (in a time-serial manner) specifies the synthesized sound as the object of sound synthesis, and as shown in
The pitch information XA of
The utterance information XC is information that specifies the utterance content (grapheme) of the unit sound, and includes grapheme information XC1 and phoneme information XC2. The grapheme information XC1 specifies the uttered letters (grapheme) expressing the utterance content of each unit sound. In the first embodiment, one syllable of uttered letters (for example, a letter string of lyrics) corresponding to one unit sound is specified by the grapheme information XC1. The phoneme information XC2 specifies the phonemic symbols of a plurality of phonemes corresponding to the uttered letters specified by the grapheme information XC1. The grapheme information XC1 is not an essential element for the synthesis of the unit sounds and may be omitted.
The prolongation information XD of
The display controller 32 of
The user can instruct the sound synthesizing apparatus 100 to dispose the sound indicator 52 (add a unit sound) in the musical score area 50 by operating the input device 24. The display controller 32 disposes the sound indicator 52 specified by the user in the musical score area 50, and the information acquirer 34 adds to the synthesis information DB the unit information U corresponding to the sound indicator 52 disposed in the musical score area 50. The pitch information XA of the unit information U corresponding to the sound indicator 52 disposed by the user is selected in accordance with the position of the sound indicator 52 in the direction of the pitch axis AF. The utterance time XB1 of the time information XB of the unit information U corresponding to the sound indicator 52 is selected in accordance with the position of the sound indicator 52 in the direction of the time axis AT, and the duration XB2 of the time information XB is selected in accordance with the display length of the sound indicator 52 in the direction of the time axis AT. In response to an instruction from the user on a previously disposed sound indicator 52 in the musical score area 50, the display controller 32 changes the position of the sound indicator 52 and the display length thereof on the time axis AT, and the information acquirer 34 changes the pitch information XA and the time information XB of the unit information U corresponding to the sound indicator 52.
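The selection of the pitch information XA and the time information XB from the geometry of a sound indicator 52 can be sketched as follows. The pixel scale `px_per_second`, the row-based pitch numbering with `top_pitch`, and the function name `unit_from_indicator` are assumptions made only for this illustration; the description does not fix a coordinate system or a pitch encoding.

```python
def unit_from_indicator(x_px, width_px, row, px_per_second=100.0, top_pitch=127):
    """Convert the geometry of a sound indicator into XA, XB1 and XB2.

    x_px     -- left edge of the indicator along the time axis AT, in pixels
    width_px -- display length of the indicator along the time axis AT
    row      -- row index along the pitch axis AF, counted from the top
    """
    if width_px <= 0:
        raise ValueError("a sound indicator needs a positive display length")
    return {
        "pitch": top_pitch - row,                 # pitch information XA
        "utterance_time": x_px / px_per_second,   # utterance time XB1
        "duration": width_px / px_per_second,     # duration XB2
    }


if __name__ == "__main__":
    print(unit_from_indicator(x_px=200, width_px=50, row=67))
```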
By appropriately operating the input device 24, the user can select the sound indicator 52 of a given unit sound in the musical score area 50 and specify a desired utterance content (uttered letters). The information acquirer 34 sets, as the unit information U of the unit sound selected by the user, the grapheme information XC1 specifying the uttered letters specified by the user and the phoneme information XC2 specifying the phonemic symbols corresponding to the uttered letters. The prolongation setter 36 sets the prolongation information XD of the unit sound selected by the user, as the initial value (for example, the numeric value to inhibit the prolongation of each phoneme).
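One possible in-memory layout of the unit information U is sketched below. The class name `UnitInfo`, the field names, the use of an integer for the pitch, and the use of one Boolean per phoneme for the prolongation information XD are assumptions of this sketch; the description only requires that XD start from an initial value, for example one that inhibits the prolongation of each phoneme.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UnitInfo:
    pitch: int                 # pitch information XA (encoding assumed)
    utterance_time: float      # utterance time XB1
    duration: float            # duration XB2
    graphemes: str             # grapheme information XC1
    phonemes: List[str]        # phoneme information XC2
    prolong: List[bool] = field(default_factory=list)  # prolongation information XD

    def __post_init__(self):
        # Initial value: prolongation inhibited for every phoneme.
        if not self.prolong:
            self.prolong = [False] * len(self.phonemes)
        if len(self.prolong) != len(self.phonemes):
            raise ValueError("XD must hold one entry per phoneme")

    def permit(self, index, allowed=True):
        """Record a user instruction for the phoneme at `index`."""
        self.prolong[index] = allowed


if __name__ == "__main__":
    unit = UnitInfo(60, 0.0, 1.0, "fun", ["f", "V", "n"])
    unit.permit(1)
    print(unit.prolong)
```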
The display controller 32 disposes, as shown in
When the user selects the sound indicator 52 of a desired unit sound (hereinafter, referred to as “selected unit sound”) and applies a predetermined operation to the input device 24, as shown in
As shown in
The display controller 32 displays on the display device 22 the phonemic symbol 56 of the phoneme the prolongation information XD of which indicates permission of the prolongation and the phonemic symbol 56 of the phoneme the prolongation information XD of which indicates inhibition of the prolongation in different modes (modes that the user can visually distinguish from each other).
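As a plain-text stand-in for the different display modes, the following sketch marks each phonemic symbol whose prolongation is permitted by wrapping it in underscores. The marker and the function name `render_symbols` are hypothetical; an actual display controller would instead vary a visual attribute such as highlighting or underlining.

```python
def render_symbols(phonemes, prolong):
    """Return a text line in which permitted phonemes are visibly marked."""
    if len(phonemes) != len(prolong):
        raise ValueError("one prolongation flag is required per phoneme")
    return " ".join(f"_{p}_" if ok else p for p, ok in zip(phonemes, prolong))


if __name__ == "__main__":
    print(render_symbols(["f", "V", "n"], [False, True, False]))
```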
The sound synthesizer 38 of
Part (A) of
When the prolongation information XD of the phoneme /a/ specifies permission of the prolongation and the prolongation information XD of each of the phoneme /f/ and the phoneme /I/ specifies inhibition of the prolongation, as shown in part (B) of
When the prolongation information XD of the phoneme /I/ specifies permission of the prolongation and the prolongation information XD of each of the phoneme /f/ and the phoneme /a/ specifies inhibition of the prolongation, as shown in part (C) of
When the prolongation information XD of each of the phoneme /a/ and the phoneme /I/ specifies permission of the prolongation and the prolongation information XD of the phoneme /f/ specifies inhibition of the prolongation, as shown in part (D) of
Part (A) of
When the prolongation information XD of the phoneme /V/ specifies permission of the prolongation and the prolongation information XD of each of the phoneme /f/ and the phoneme /n/ specifies inhibition of the prolongation, as shown in part (B) of
On the other hand, when the prolongation information XD of the phoneme /n/ specifies permission of the prolongation and the prolongation information XD of each of the phoneme /f/ and the phoneme /V/ specifies inhibition of the prolongation, as shown in part (C) of
When the prolongation information XD of each of the phoneme /V/ and the phoneme /n/ specifies permission of the prolongation and the prolongation information XD of the phoneme /f/ specifies inhibition of the prolongation, as shown in part (D) of
As is understood from the examples shown above, the sound synthesizer 38 prolongs, according to the duration XB2 of one unit sound, the sound fragment corresponding to each phoneme the prolongation of which is permitted by the prolongation setter 36 among the plurality of phonemes corresponding to the utterance content of that unit sound. Specifically, the sound fragment corresponding to an individual phoneme the prolongation of which is permitted by the prolongation setter 36 (the sound fragments [a] and [I] in the example shown in
As described above, according to the first embodiment, since whether the prolongation is permitted or inhibited is individually set for each of a plurality of phonemes corresponding to the utterance content of one unit sound, the restriction on the prolongation of the sound fragments can be eased, for example, compared with a configuration in which the sound fragment of the first one vowel of a polyphthong is prolonged. Consequently, an advantage that a variety of synthesized sounds can be generated is offered. For example, for the uttered letters “fight” shown as an example in
A second embodiment of the present disclosure will be described. In the modes shown below as examples, elements the action and function of which are similar to those in the first embodiment are also denoted by the reference designations referred to in the description of the first embodiment, and detailed descriptions thereof are omitted as appropriate.
The prolongation setter 36 of the second embodiment sets whether the prolongation of each phoneme is permitted or inhibited, in accordance with the positions of the operation images 74 in the set image 70. The sound synthesizer 38 prolongs each sound fragment so that the durations of the phonemes corresponding to one unit sound conform with the ratio among the durations of the phonemes specified on the set image 70. That is, in the second embodiment, as in the first embodiment, whether the prolongation is permitted or inhibited is individually set for each of a plurality of phonemes of each unit sound. Consequently, similar effects to those of the first embodiment are achieved in the second embodiment.
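The ratio-based prolongation of the second embodiment can be sketched as a proportional split of the unit sound's duration. The sketch assumes that the instruction accepted on the set image 70 has already been reduced to one non-negative weight per phoneme; how the positions of the operation images 74 are converted into such weights is not modelled here, and the function name `durations_from_ratio` is hypothetical.

```python
def durations_from_ratio(total_duration, weights):
    """Split `total_duration` among phonemes in proportion to `weights`."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("at least one weight must be positive")
    return [total_duration * w / weight_sum for w in weights]


if __name__ == "__main__":
    # A 2-second unit sound whose middle phoneme receives half of the time.
    print(durations_from_ratio(2.0, [1, 2, 1]))
```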
<Modifications>
The above-described modes may be modified variously. Concrete modifications will be shown below. Two or more modifications arbitrarily selected from among the modifications shown below may be merged as appropriate.
(1) While a case where a synthesized sound which is an utterance of English (the uttered letters “fight” and “fun”) is generated is shown as an example in the above-described embodiments, the language of the synthesized sound is arbitrary. In some languages, there are cases where a one-syllable phoneme chain of a first consonant, a vowel and a second consonant (C-V-C) can be specified as the uttered letters of one unit sound. For example, in Korean, a phoneme chain consisting of a first consonant, a vowel and a second consonant is present. The phoneme chain includes the second consonant (a consonant situated at the end of a syllable) called “patchim”. When the first consonant and the second consonant are sustained phonemes, as in the above-described first and second embodiments, a configuration is suitable in which whether the prolongation of each of the first consonant, the vowel and the second consonant is permitted or inhibited is individually set. For example, when one-syllable uttered letters “han” constituted by a phoneme /h/ of the first consonant, a phoneme /a/ of the vowel and a phoneme /n/ of the second consonant are specified as one unit sound, a synthesized sound “[ha:n]” where the phoneme /a/ is prolonged and a synthesized sound “[han:]” where the phoneme /n/ is prolonged can be selectively generated.
While
(2) While the information acquirer 34 generates the synthesis information DB in response to an instruction from the user in the above-described modes, the following configurations may be adopted: a configuration in which the information acquirer 34 acquires the synthesis information DB from an external apparatus, for example, through a communication network; and a configuration in which the information acquirer 34 acquires the synthesis information DB from a portable recording medium. That is, the configuration in which the synthesis information DB is generated or edited in response to an instruction from the user may be omitted. As is understood from the above description, the information acquirer 34 is embraced as an element that acquires the synthesis information DB (an element that acquires the synthesis information DB from an external apparatus or an element that generates the synthesis information DB by itself).
(3) While a case where one syllable of uttered letters are specified as one unit sound is shown in the above-described modes, one syllable of uttered letters may be assigned to a plurality of unit sounds. For example, as shown in
(4) While a configuration in which whether the prolongation is permitted or inhibited is not specified for the non-sustained phonemes is shown in the above-described embodiments, a configuration in which whether the prolongation is permitted or inhibited can be specified for the non-sustained phonemes may be adopted. The sound fragments of the non-sustained phonemes include the silent sections of the non-sustained phonemes before utterance. Therefore, when the prolongation is permitted for the non-sustained phonemes, the sound synthesizer 38 prolongs, for example, the silent sections of the sound fragments of the non-sustained phonemes.
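A minimal sketch of this modification follows. It assumes that the sound fragment of a non-sustained phoneme is described by two lengths in seconds, a silent section before utterance and the uttered remainder, and that only the silent section is lengthened. The function name `prolong_non_sustained` and this two-part description are assumptions of the sketch.

```python
def prolong_non_sustained(silent_len, uttered_len, target_len):
    """Lengthen only the silent section so the fragment reaches `target_len`.

    Returns the new (silent_len, uttered_len) pair. The uttered part is never
    stretched, and a fragment that is already long enough is left unchanged.
    """
    extra = max(0.0, target_len - (silent_len + uttered_len))
    return silent_len + extra, uttered_len


if __name__ == "__main__":
    print(prolong_non_sustained(0.25, 0.25, 1.0))
```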
Here, the details of the above embodiments are summarized as follows.
A sound synthesizing apparatus of the present disclosure includes: an information acquirer (for example, information acquirer 34) for acquiring synthesis information that specifies a duration and an utterance content for each unit sound; a prolongation setter (for example, prolongation setter 36) for setting whether prolongation is permitted or inhibited for each of a plurality of phonemes corresponding to the utterance content of each unit sound; and a sound synthesizer (for example, sound synthesizer 38) for generating a synthesized sound corresponding to the synthesis information by connecting a plurality of sound fragments corresponding to the utterance content of each unit sound. The sound synthesizer prolongs, among the plurality of phonemes corresponding to the utterance content of each unit sound, a sound fragment corresponding to a phoneme the prolongation of which is permitted by the prolongation setter, according to the duration of the unit sound.
According to this configuration, since whether the prolongation is permitted or inhibited is set for each of a plurality of phonemes corresponding to the utterance content of each unit sound, an advantage is offered that compared with the configuration in which, for example, the first phoneme of a plurality of phonemes (for example, a polyphthong) corresponding to each unit sound is prolonged at all times, the limitation on the prolongation of sound fragments at the time of synthesized sound generation is eased and a variety of synthesized sounds can be generated as a result.
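The effect of this configuration can be illustrated with a minimal Python sketch of the prolongation step. It assumes that each sound fragment has a known base length in seconds, that fragments are named as strings, and that any time remaining within the duration of the unit sound is divided equally among the fragments of the phonemes whose prolongation is permitted. The equal division, the function name `prolong_fragments` and the fragment names are assumptions of this sketch rather than a statement of how the sound synthesizer distributes the time.

```python
def prolong_fragments(base, permitted, duration):
    """Lengthen permitted fragments so the unit sound fills `duration`.

    base      -- list of (fragment_name, base_seconds) in utterance order
    permitted -- names of fragments whose prolongation is permitted
    duration  -- duration of the unit sound in seconds
    """
    remaining = max(0.0, duration - sum(sec for _, sec in base))
    targets = [name for name, _ in base if name in permitted]
    # Assumption: remaining time is shared equally among permitted fragments.
    share = remaining / len(targets) if targets else 0.0
    return [
        (name, (sec + share) if name in permitted else sec)
        for name, sec in base
    ]


if __name__ == "__main__":
    base = [("Sil-f", 0.25), ("f-V", 0.25), ("V", 0.25),
            ("V-n", 0.25), ("n", 0.25), ("n-Sil", 0.25)]
    for permitted in ({"V"}, {"n"}, {"V", "n"}):
        print(sorted(permitted), prolong_fragments(base, permitted, 2.5))
```

Permitting only "V" lengthens the vowel, permitting only "n" lengthens the final consonant, and permitting both lengthens each, so one unit sound can yield several distinct synthesized sounds.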
For example, the prolongation setter sets whether the prolongation of each phoneme is permitted or inhibited in response to an instruction from a user.
According to this configuration, since whether the prolongation of each phoneme is permitted or inhibited is set in response to an instruction from the user, an advantage is offered that a variety of synthesized sounds conforming to the user's intention can be generated. For example, a sound synthesizing apparatus is provided with a first display controller (for example, display controller 32) for providing a plurality of phonemes corresponding to the utterance content of a unit sound selected by the user among a plurality of unit sounds specified by the synthesis information, and displaying a set image (for example, set image 60 or set image 70) that accepts from the user an instruction as to whether the prolongation of each phoneme is permitted or inhibited.
According to this configuration, since the set image which provides a plurality of phonemes corresponding to a unit sound selected by the user and accepts an instruction from the user is displayed on a display device, an advantage is offered that the user can easily specify whether the prolongation of each phoneme is permitted or inhibited for each of a plurality of unit sounds.
A sound synthesizing apparatus is provided with a second display controller (for example, display controller 32) for displaying on a display device a phonemic symbol of each of a plurality of phonemes corresponding to the utterance content of each unit sound so that a phoneme the prolongation of which is permitted by the prolongation setter and a phoneme the prolongation of which is inhibited by the prolongation setter are displayed in different display modes. According to this configuration, since the phonemic symbols of the phonemes are displayed in different display modes according to whether the prolongation is permitted or inhibited, an advantage is offered that the user can easily check whether the prolongation of each phoneme is permitted or inhibited. The display mode means an image characteristic that the user can visually discriminate, and typical examples of the display mode are the brightness (gradation), the chroma, the hue and the format (the letter type, the letter size, the presence or absence of highlighting such as an underline). Moreover, in addition to the configuration in which the display modes of the phonemic symbols themselves are made different, a configuration may be adopted in which the display modes of the backgrounds (grounds) of the phonemic symbols are made different according to whether the prolongation of the phonemes is permitted or inhibited. For example, the following configurations may be adopted: a configuration in which the patterns of the backgrounds of the phonemic symbols are made different; and a configuration in which the backgrounds of the symbols are blinked.
Also, the prolongation setter sets whether the prolongation is permitted or inhibited for a sustained phoneme, that is, a phoneme that is sustainable timewise, among a plurality of phonemes corresponding to the utterance content of each unit sound.
According to this configuration, since whether the prolongation is permitted or inhibited is set for the sustained phoneme, an advantage is offered that a synthesized sound can be generated with an auditorily natural sound being maintained for each phoneme.
The sound synthesizing apparatus according to the above-described modes may be implemented by hardware (an electronic circuit) such as a DSP (digital signal processor) exclusively used for synthesized sound generation, or by cooperation between a general-purpose arithmetic processing unit such as a CPU (central processing unit) and a program. The program of the present disclosure causes a computer to execute: information acquiring processing for acquiring synthesis information that specifies a duration and an utterance content for each unit sound; prolongation setting processing for setting whether prolongation is permitted or inhibited for each of a plurality of phonemes corresponding to the utterance content of each unit sound; and sound synthesizing processing for generating a synthesized sound corresponding to the synthesis information by connecting a plurality of sound fragments corresponding to the utterance content of each unit sound, the sound synthesizing processing prolonging, of a plurality of phonemes corresponding to the utterance content of each unit sound, a sound fragment corresponding to the phoneme the prolongation of which is permitted by the prolongation setting processing, according to the duration of the unit sound. According to this program, workings and effects similar to those of the sound synthesizing apparatus of the present disclosure are realized. The program of the present disclosure may be provided by distribution through a communication network, or in the form of being stored in a computer-readable recording medium, and then installed on a computer.
Although the invention has been illustrated and described for the particular preferred embodiments, it is apparent to a person skilled in the art that various changes and modifications can be made on the basis of the teachings of the invention. It is apparent that such changes and modifications are within the spirit, scope, and intention of the invention as defined by the appended claims.
The present application is based on Japanese Patent Application No. 2012-074858 filed on Mar. 28, 2012, the contents of which are incorporated herein by reference.
Number | Date | Country | Kind |
---|---|---|---|
2012-074858 | Mar 2012 | JP | national |
Number | Name | Date | Kind |
---|---|---|---|
5479564 | Vogten et al. | Dec 1995 | A |
6088671 | Gould et al. | Jul 2000 | A |
6240384 | Kagoshima et al. | May 2001 | B1 |
6330538 | Breen | Dec 2001 | B1 |
6470316 | Chihara | Oct 2002 | B1 |
7031922 | Kalinowski et al. | Apr 2006 | B1 |
7546241 | Yamada et al. | Jun 2009 | B2 |
7877259 | Marple et al. | Jan 2011 | B2 |
8214216 | Sato | Jul 2012 | B2 |
8346557 | Kurzweil et al. | Jan 2013 | B2 |
8504368 | Katae | Aug 2013 | B2 |
20010037202 | Yamada et al. | Nov 2001 | A1 |
20040102973 | Lott | May 2004 | A1 |
20060015344 | Kemmochi | Jan 2006 | A1 |
20060136214 | Sato | Jun 2006 | A1 |
20080319754 | Nishiike | Dec 2008 | A1 |
20080319755 | Nishiike et al. | Dec 2008 | A1 |
Number | Date | Country |
---|---|---|
101334994 | Dec 2008 | CN |
1 617 408 | Jan 2006 | EP |
2001-343987 | Dec 2001 | JP |
2002-123281 | Apr 2002 | JP |
2004-258562 | Sep 2004 | JP |
2006-071931 | Mar 2006 | JP |
04-265501 | May 2009 | JP |
2011-023363 | Feb 2011 | JP |
2011128186 | Jun 2011 | JP |
Entry |
---|
JPO Machine Translation of JP 2011128186 A. |
Search Report by Registered Searching Organization (Feb. 15, 2016) from JPO prosecution of JP2012-074858A. |
Assessment on Search Report by Registered Searching Organization (Feb. 22, 2016) from JPO prosecution of JP2012-074858A. |
Demol, M. et al. (Oct. 17, 2005). “Efficient Non-Uniform Time-Scaling of Speech with WSOLA,” SPECOM, XX, XX, pp. 163-166. |
European Search Report mailed Jun. 7, 2013, for European Patent Application No. 13158187.8, 7 pages. |
Tihelka, D. et al. (Sep. 1, 2011). “Generalized Non-uniform Time Scaling Distribution Method for Natural-Sounding Speech Rate Change,” Text, Speech and Dialogue, Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 147-154. |
Chinese Search report dated Mar. 13, 2015 for CN Patent Application No. 201310104780, filed Mar. 28, 2013, with English translation, four pages. |
Notification of the first Office Action dated Mar. 13, 2015 for CN Patent Application No. 201310104780, filed Mar. 28, 2013, with English translation, 15 pages. |
Chinese Search report dated Oct. 23, 2015 for CN Patent Application No. 201310104780, filed Mar. 28, 2013, with English translation, four pages. |
Notification of the Second Office Action dated Oct. 23, 2015 for CN Patent Application No. 201310104780, filed Mar. 28, 2013, with English translation, 20 pages. |
Notification of Reasons for Refusal dated Mar. 8, 2016, for JP Patent Application No. 2012-074858, with English translation, ten pages. |
Number | Date | Country | |
---|---|---|---|
20130262121 A1 | Oct 2013 | US |