The present disclosure relates to voice synthesis.
Known in the art are voice synthesis techniques, such as those used for singing. To enhance expressiveness of a singing voice, attempts have been made to not only output a voice with given lyrics in a given scale, but also to impart musical expressivity to the singing voice. Japanese Patent Application Laid-Open Publication No. 2014-2338 (hereafter, Patent Document 1) discloses a technology for changing a voice quality of a synthesized voice to a target voice quality. This is achieved by adjusting a harmonic component of a voice signal of a voice having the target voice quality to be within a frequency band that is close to a harmonic component of a voice signal of a voice that has been synthesized (hereafter, “synthesized voice”).
In the technology disclosed in Patent Document 1, it may not be possible to impart to a synthesized voice a sufficient user-desired expressivity of a singing voice.
In contrast, the present disclosure provides a technology that is able to impart to a singing voice a richer variety of voice expression.
A voice synthesis method according to an aspect of the present disclosure includes altering a series of synthesis spectra in a partial period of a synthesis voice based on a series of amplitude spectrum envelope contours of a voice expression to obtain a series of altered spectra to which the voice expression has been imparted; and synthesizing a series of voice samples to which the voice expression has been imparted, based on the series of altered spectra.
In another aspect, a voice synthesis device includes: at least one processor and a memory coupled to the processor, the memory storing instructions executable by the processor that cause the processor to: alter a series of synthesis spectra in a partial period of a synthesis voice based on a series of amplitude spectrum envelope contours of a voice expression, to obtain a series of altered spectra to which the voice expression has been imparted; and synthesize a series of voice samples to which the voice expression has been imparted, based on the series of altered spectra.
In still another aspect, a non-transitory computer storage medium stores a computer program which when executed by a computer, causes the computer to perform a voice synthesis method of: altering a series of synthesis spectra in a partial period of a synthesis voice based on a series of amplitude spectrum envelope contours of a voice expression, to obtain a series of altered spectra to which the voice expression has been imparted; and synthesizing a series of voice samples to which the voice expression has been imparted, based on the series of altered spectra.
According to the present disclosure, it is possible to provide a richer variety of voice expression.
Various technologies for voice synthesis are known. Among voices, a voice having changes in scale and rhythm is referred to as a singing voice. Two approaches to singing voice synthesis are known: synthesis based on sample concatenation, and statistical synthesis. To carry out singing voice synthesis based on sample concatenation, a database in which a large number of recorded singing samples are stored can be used. A singing sample, which is an example of a voice sample, is mainly classified by lyrics consisting of a mono-phoneme or a phoneme chain. When singing voice synthesis is performed, singing samples are connected after a fundamental frequency, a timing, and a duration are adjusted based on a score. The score designates a start time, a duration (or an end time), and lyrics for each of a series of notes constituting a song.
It is necessary for a singing sample used for singing voice synthesis based on sample concatenation to have a voice quality that is as constant as possible for all lyrics registered in the database. If the voice quality is not constant, unnatural variances will occur when a singing voice is synthesized. Further, it is necessary that, from among the dynamic acoustic changes included in the samples, a part corresponding to a singing voice expression, which is an example of a voice expression, not be expressed in the synthesized voice when synthesis is carried out. The reason is that the expression of the singing voice is to be imparted to a singing voice in accordance with a musical context, and has no direct association with lyric types. If the same expression of the singing voice is repeatedly used for a specific lyric type, the result will be unnatural. Thus, in carrying out singing voice synthesis based on sample concatenation, the changes in fundamental frequency and volume included in the singing sample are not used directly; instead, changes in fundamental frequency and volume generated based on the score and one or more predetermined rules are used. If singing samples corresponding to all combinations of lyrics and expressions of singing voices were recorded in a database, a singing sample appropriate for a lyric type in a score and capable of imparting a natural expression to a singing voice in a specific musical context could be selected. In practice, such an approach is time and labor consuming, and if singing samples corresponding to all expressions of the singing voice for all lyric types were to be recorded, a huge storage capacity would be required. In addition, since the number of combinations of samples increases exponentially relative to the number of samples, there is no guarantee that an unnatural synthesized voice will be avoided for each and every combination of samples.
On the other hand, in statistical singing voice synthesis, a relationship between a score and features pertaining to a spectrum of a singing voice (hereafter, “spectral features”) is learned in advance as a statistical model using voluminous training data. When synthesis is carried out, the most likely spectral features are estimated with reference to the input score, and the singing voice is then synthesized using the spectral features. In statistical singing voice synthesis, a statistical model covering a richly expressive range of singing voices can be learned by using training data that spans a wide range of different singing styles. Notwithstanding, two specific problems arise in carrying out statistical singing voice synthesis. The first problem is excessive smoothing. Learning a statistical model from voluminous training data inherently averages the data, which reduces the variance of the spectral features in each dimension and inevitably causes the synthesized output to lack the expressivity of even an average single singing voice. As a result, the expressivity and realism of the synthesized voice are far from satisfactory. The second problem is that the types of spectral features from which a statistical model can be learned are limited. In particular, because phase information has a cyclic value range, it is difficult to model statistically in a satisfactory manner. For example, it is difficult to appropriately model a phase relationship between harmonic components, or between a specific harmonic component and a component proximate to that harmonic component, and modeling the temporal variation thereof is also difficult. However, if a richly expressive singing voice, including deep and husky characteristics, is to be synthesized, it is important that the phase information be used appropriately.
Patent Document 1 describes voice quality modification (VQM), a technology for synthesizing a variety of singing voice qualities. In VQM, a first voice signal of a voice corresponding to a particular singing voice expressivity is used together with a second voice signal of a synthesized singing voice. The second voice signal may be of a singing voice synthesized based on sample concatenation, or of a statistically synthesized voice. By using the two voice signals, singing voices with appropriate phase information are synthesized. As a result, a realistic singing voice that is rich in expressivity is synthesized, in contrast to an ordinarily synthesized singing voice. It is of note, however, that this technology does not enable a temporal change in the spectral features of the first voice signal to be adequately reflected in the synthesized singing voice. It is also of note that the temporal change of interest here includes not only the rapid change in spectral features that occurs with steady utterance of a deep voice or a husky voice, but also, for example, a transition in voice quality over a relatively long period of time (a macroscopic transition), in which a substantial amount of rapid variation occurs upon commencement of utterance, gradually diminishes over time, and then stabilizes with a further lapse of time. Depending on the expressivity of a voice, substantial changes in voice quality may occur.
In this example, a reference time for addition of the synthesized voice and the expressive sample is a head time of the note or an end time of the note. Hereafter, setting the head time of the note as a reference time is referred to as an “attack reference”, and setting the end time as a reference time is referred to as a “release reference”.
In this example, the storage device 103 stores a computer program that causes a computer device to function as the voice synthesis device 1 (hereafter, referred to as a “singing voice synthesis program”). By the CPU 101 executing the singing voice synthesis program, functions as shown in
The database 10 includes a database (a sample database) in which recorded singing samples are stored, and a database (a singing voice expression database) in which expressive samples are recorded and stored. Since the sample database is the same as a conventional database used for the singing voice synthesis based on sample concatenation, detailed description thereof will be omitted. Hereafter, the singing voice expression database is simply referred to as the database 10, unless otherwise specified. The spectral features of the expressive sample can be estimated in advance, and the estimated spectral features can be recorded in the database 10, to achieve both reduction in calculation load at the time of singing voice synthesis and prevention of an estimation error of the spectral features. The spectral features recorded in the database 10 may be corrected manually.
In the database 10, at least one sample per expression of the singing voice is recorded. Two or more samples may be recorded depending on lyrics. It is not necessary for a unique expressive sample to be recorded for each and every lyric. This is because the expressive sample is morphed with a synthesized voice, and the basic quality of the singing voice is thus already secured by the synthesized voice. For example, in order to obtain a singing voice of good quality in singing voice synthesis based on sample concatenation, it is necessary to record a sample for each lyric of a 2-phoneme chain (for example, a combination /a-i/ or /a-o/). However, a unique expressive sample may be recorded for each mono-phoneme (for example, /a/ or /o/), or the number may be reduced further so that only one expressive sample (for example, only /a/) is recorded per expression of the singing voice. A human database creator determines the number of samples to be recorded for each expression of the singing voice while balancing the amount of time required to create the singing voice expression database against the quality of the synthesized voice. An independent expressive sample is recorded for each lyric in order to obtain a higher quality (more realistic) synthesized voice; the number of samples per expression of the singing voice is reduced in order to reduce the amount of time required to create the singing voice expression database.
When two or more samples are recorded per expression of the singing voice, it is necessary to define mapping (association) between the sample and the lyrics. An example is given in which, for a certain expression of the singing voice, a sample file “S0000” is mapped to lyrics /a/ and /i/, and a sample file “S0001” is mapped to lyrics /u/, /e/, and /o/. Such mapping is defined for each expression of the singing voice. The number of recorded samples stored in the database 10 may be different for each of the expressions of the singing voice. For example, two samples may be recorded for a particular expression of the singing voice, while five samples may be recorded for another expression of the singing voice.
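As a concrete illustration of such a mapping, the following sketch expresses the example above as a simple lookup table. The file names, phoneme symbols, and data structure are assumptions used only for illustration; the text does not prescribe a storage format.

```python
# Hypothetical sample-file names and phonemes, mirroring the example above.
expression_lyric_map = {
    "S0000": ["a", "i"],        # sample file S0000 is mapped to lyrics /a/ and /i/
    "S0001": ["u", "e", "o"],   # sample file S0001 is mapped to lyrics /u/, /e/, /o/
}

def sample_for_lyric(lyric, mapping=expression_lyric_map):
    """Return the sample file mapped to a given lyric for one expression of the
    singing voice, or None if the lyric is not covered."""
    for sample_file, lyrics in mapping.items():
        if lyric in lyrics:
            return sample_file
    return None
```

A separate mapping of this kind would be held for each expression of the singing voice, and the number of entries may differ between expressions, as noted above.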
Information indicating an expression reference time is stored for each expressive sample in the database 10. This expression reference time is a feature point on the time axis in a waveform of the expressive sample. The expression reference time includes at least one of a singing voice expression start time, a singing voice expression end time, a note onset start time, a note offset start time, a note onset end time, or a note offset end time. For example, as shown in
As shown in
As shown in
A template of parameters to be applied to singing voice synthesis is recorded in the database 10. The parameters referred to herein include, for example, a temporal transition in an amount of morphing (a coefficient), a time length of morphing (hereinafter referred to as an “expression impartment length”), and a speed of the expression of the singing voice.
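A parameter template of this kind could be represented, for instance, as the structure sketched below. The field names, units, and the piecewise representation of the morphing-amount transition are assumptions, since the text does not prescribe a format.

```python
# One hypothetical template for a particular expression of the singing voice.
expression_template = {
    # temporal transition of the amount of morphing, as (relative time, coefficient) pairs
    "morphing_amount_curve": [(0.0, 0.0), (0.3, 1.0), (1.0, 0.6)],
    # time length of morphing (the "expression impartment length"), in seconds
    "expression_impartment_length": 0.8,
    # speed of the expression of the singing voice (1.0 = recorded speed)
    "expression_speed": 1.0,
}
```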
As shown in
The expression imparter 20B in
Using the expression reference time recorded for the expressive sample, the timing calculator 21 calculates an amount of timing adjustment for matching the expressive sample with a predetermined timing of the synthesized voice. The amount of timing adjustment corresponds to a position on a time axis on which the expressive sample is set for the synthesized voice.
An operation of the timing calculator 21 will be described with reference to
The temporal expansion/contraction mapper 22 calculates temporal expansion or contraction mapping of the expressive sample positioned on the synthesized voice on the time axis (performs an expansion process on the time axis). Here, the temporal expansion/contraction mapper 22 calculates a mapping function representing a time correspondence between the synthesized voice and the expressive sample. A mapping function to be used here is a nonlinear function in which each expressive sample expands or contracts differently for each section based on the expression reference time of an expressive sample. Using such a function, the expression of the singing voice can be added to the synthesized voice while minimizing loss of the nature of the expression of the singing voice included in the sample. The temporal expansion/contraction mapper 22 performs temporal expansion on feature portions in the expressive sample using an algorithm differing from an algorithm used for portions other than the feature portions (that is, using a different mapping function). The feature portions are, for example, a pre-section T1 and an onset section T2 in the expression of the singing voice with the attack reference, as will be described below.
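The sketch below illustrates one simple form such a mapping function could take: a piecewise-linear correspondence between the time axes of the synthesized voice and the expressive sample, in which each section bounded by expression reference times is expanded or contracted independently. It is only a sketch under stated assumptions; in particular, it does not reproduce the separate algorithm that the text applies to feature portions such as the pre-section T1 and the onset section T2.

```python
import numpy as np

def make_time_mapping(target_ref_times, sample_ref_times):
    """Build a piecewise-linear mapping from times on the synthesized-voice axis
    to times on the expressive-sample axis. Both argument lists hold matching
    expression reference times (e.g. expression start, onset end, expression end),
    in seconds and in increasing order."""
    target_ref = np.asarray(target_ref_times, dtype=float)
    sample_ref = np.asarray(sample_ref_times, dtype=float)

    def map_time(t):
        # each section between consecutive reference times is stretched linearly
        return np.interp(t, target_ref, sample_ref)

    return map_time

# Example: the sample's first 0.2 s is stretched over the first 0.5 s of the note.
mapping = make_time_mapping([0.0, 0.5, 1.0], [0.0, 0.2, 0.6])
```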
As shown in
In
The short-time spectrum operator 23 in
The amplitude spectrum envelope is a contour of the amplitude spectrum, and mainly relates to the perception of lyrics and individuality. A large number of methods of obtaining the amplitude spectrum envelope have been proposed. For example, cepstrum coefficients are estimated from the amplitude spectrum, and a group of low-order coefficients (coefficients having an order equal to or lower than a predetermined order a) among the estimated cepstrum coefficients is used as the amplitude spectrum envelope. An important point of this embodiment is that the amplitude spectrum envelope is treated independently of other components. When an expressive sample having lyrics or individuality different from those of the synthesized voice is used, if the amount of morphing for the amplitude spectrum envelope is set to zero, then 100% of the lyrics and individuality of the original synthesized voice appear in the synthesized voice to which the expression of the singing voice has been imparted. Therefore, the expressive sample can be applied even if it has lyrics or individuality different from those of the synthesized voice (for example, other lyrics of the same person, or samples of a completely different person). Conversely, if a user desires to intentionally change the lyrics or individuality of the synthesized voice, the amount of morphing for the amplitude spectrum envelope may be set to an appropriate non-zero amount, and morphing may be carried out independently of the morphing of other components of the expression of the singing voice.
The amplitude spectrum envelope contour is a contour in which the amplitude spectrum envelope is expressed more coarsely, and mainly relates to the brightness of a voice. The amplitude spectrum envelope contour can be obtained in various ways. For example, a group of coefficients of lower order than those of the amplitude spectrum envelope (coefficients having an order equal to or lower than an order b that is lower than the order a) among the estimated cepstrum coefficients is used as the amplitude spectrum envelope contour. Unlike the amplitude spectrum envelope, the amplitude spectrum envelope contour contains substantially no information on lyrics or individuality. Therefore, the brightness of the voice included in the expression of the singing voice, and a temporal variation thereof, can be imparted to the synthesized voice by morphing the amplitude spectrum envelope contour components, regardless of whether morphing of the amplitude spectrum envelope is carried out.
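As one illustration of the two envelope features just described, the following sketch estimates the amplitude spectrum envelope H(f) and the amplitude spectrum envelope contour G(f) of a single frame by cepstral liftering. The cutoff orders order_a and order_b and the function name are assumptions of this sketch; the text only requires that the order b be lower than the order a.

```python
import numpy as np

def cepstral_envelopes(magnitude_spectrum, order_a=60, order_b=10):
    """Estimate the amplitude spectrum envelope H(f) and the amplitude spectrum
    envelope contour G(f) of one frame by cepstral liftering. The input is the
    full-length (two-sided) magnitude spectrum; the outputs are in the
    log-amplitude domain."""
    log_amp = np.log(np.asarray(magnitude_spectrum, dtype=float) + 1e-12)
    cep = np.fft.ifft(log_amp).real               # real cepstrum
    n = len(cep)

    def lifter(order):
        kept = np.zeros(n)
        kept[:order + 1] = cep[:order + 1]        # low-quefrency coefficients
        kept[n - order:] = cep[n - order:]        # symmetric counterpart
        return np.fft.fft(kept).real              # smoothed log-amplitude envelope

    return lifter(order_a), lifter(order_b)       # H(f) and G(f)
```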
The phase spectrum envelope is a contour of the phase spectrum. The phase spectrum envelope can be obtained in various ways. For example, the short-time spectrum operator 23 first analyzes a short-time spectrum in a frame with a variable length and a variable amount of shift synchronized with a cycle of the signal. For example, a frame with a window width of n times the fundamental cycle T (=1/F0) and an amount of shift of m times the fundamental cycle T (m<n, where m and n are, for example, natural numbers) is used. A fine variation can be extracted with high temporal resolution by using a frame synchronized with the cycle. Thereafter, the short-time spectrum operator 23 extracts only the phase value of each harmonic component, discards the other values at this stage, and interpolates the phase at frequencies other than the harmonic components (between harmonics), so that a phase spectrum envelope, as distinct from the phase spectrum itself, is obtained. For the interpolation, nearest-neighbor interpolation, or linear or higher-order curve interpolation, can be used.
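The following sketch illustrates the interpolation step just described, building a phase spectrum envelope from the phases at harmonic bins by linear interpolation (one of the options mentioned above). The one-sided spectrum layout, the rounding of harmonic frequencies to the nearest bin, and the unwrapping across harmonics are assumptions of this sketch.

```python
import numpy as np

def phase_spectrum_envelope(phase_spectrum, f0, sample_rate):
    """Keep the phase only at (approximately) harmonic bins of a one-sided
    phase spectrum and interpolate linearly in between, yielding a phase
    spectrum envelope."""
    n_bins = len(phase_spectrum)
    bin_hz = sample_rate / (2.0 * (n_bins - 1))          # bin spacing of the one-sided grid
    harmonic_freqs = np.arange(f0, sample_rate / 2.0, f0)
    harmonic_bins = np.unique(np.round(harmonic_freqs / bin_hz).astype(int))
    harmonic_bins = harmonic_bins[harmonic_bins < n_bins]
    harmonic_phase = np.unwrap(phase_spectrum[harmonic_bins])  # unwrap across harmonics
    return np.interp(np.arange(n_bins), harmonic_bins, harmonic_phase)
```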
When both the amplitude spectrum envelope and the amplitude spectrum envelope contour are used as the spectral features, morphing of (a) amplitude spectrum envelope (for example,
A harmonic amplitude and a harmonic phase may be used in place of the amplitude spectrum envelope and the phase spectrum envelope. The harmonic amplitude is a sequence of amplitudes of respective harmonic components constituting a harmonic structure of a voice, and the harmonic phase is a sequence of phases of the respective harmonic components constituting the harmonic structure of the voice. Whether to use the amplitude spectrum envelope and the phase spectrum envelope or to use the harmonic amplitude and the harmonic phase depends on a selection of a synthesis scheme by the synthesizer 24. When synthesis of a pulse train or synthesis using a time-varying filter is performed, the amplitude spectrum envelope and the phase spectrum envelope are used, and the harmonic amplitude and the harmonic phase are used in a synthesis scheme based on a sinusoidal model like SMS, SPP, or WBHSM.
The fundamental frequency mainly relates to the perception of pitch. Unlike the other spectral features, the fundamental frequency cannot be obtained through simple interpolation between the two frequencies. This is because the pitch of a note in the expressive sample and the pitch of a note of the synthesized voice are generally different from each other, and even if the fundamental frequency of the expressive sample and that of the synthesized voice were combined at a simply interpolated fundamental frequency, a pitch completely different from the pitch to be synthesized would be obtained. Therefore, in the embodiment, the short-time spectrum operator 23 first shifts the fundamental frequency of the entire expressive sample by a certain amount so that the pitch of the expressive sample matches the pitch of the note of the synthesized voice. This process does not match the fundamental frequency of the expressive sample to that of the synthesized voice at each point in time. Therefore, the dynamic variation in the fundamental frequency included in the expressive sample is retained.
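A minimal sketch of this constant shift follows. It assumes the shift is applied as a constant ratio and that the median of the voiced frames serves as the representative pitch of the expressive sample; both are assumptions, since the text only specifies that the entire sample is shifted by a certain amount.

```python
import numpy as np

def shift_expression_f0(f0_expression, note_pitch_hz):
    """Shift the F0 trajectory of the entire expressive sample by one constant
    ratio so that its representative pitch matches the note pitch of the
    synthesized voice, while retaining the dynamic F0 variation of the sample."""
    f0 = np.asarray(f0_expression, dtype=float)
    voiced = f0 > 0.0                          # frames with F0 = 0 are treated as unvoiced
    ratio = note_pitch_hz / np.median(f0[voiced])
    shifted = f0.copy()
    shifted[voiced] *= ratio                   # one constant conversion ratio for all frames
    return shifted
```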
The first extractor 232 extracts, for each frame, an amplitude spectrum envelope H(f), an amplitude spectrum envelope contour G(f), and a phase spectrum envelope P(f) from spectra calculated by the frequency analyzer 231. The second extractor 233 calculates a difference between the amplitude spectrum envelopes H(f) of the temporally successive frames as a temporal fine variation I(f) of the amplitude spectrum envelope H(f) for each frame. Similarly, the second extractor 233 calculates a difference between the temporally successive phase spectrum envelopes P(f) as a temporal fine variation Q(f) of the phase spectrum envelope P(f). The second extractor 233 may calculate a difference between any one amplitude spectrum envelope H(f) and a smoothed value (for example, an average value) of amplitude spectrum envelopes H(f) as a temporal fine variation I(f). Similarly, the second extractor 233 may calculate a difference between any one phase spectrum envelope P(f) and a smoothed value of phase spectrum envelopes P(f) as a temporal fine variation Q(f). H(f) and G(f) extracted by the first extractor 232 are the amplitude spectrum envelope and the envelope contour from which the fine variation I(f) has been removed, and P(f) extracted by the first extractor 232 is the phase spectrum envelope from which the fine variation Q(f) has been removed.
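A sketch of the frame-difference computation performed by the second extractor 233 is shown below. The array layout (frames × frequency bins) and setting the first frame's fine variation to zero are assumptions of the sketch; the alternative of taking the difference from a smoothed envelope, also mentioned above, is not shown.

```python
import numpy as np

def temporal_fine_variations(amp_envelopes, phase_envelopes):
    """Compute the temporal fine variations I(f) and Q(f) as the difference
    between each frame's envelope and that of the preceding frame."""
    H = np.asarray(amp_envelopes, dtype=float)   # amplitude spectrum envelopes H(f)
    P = np.asarray(phase_envelopes, dtype=float) # phase spectrum envelopes P(f)
    I = np.zeros_like(H)
    Q = np.zeros_like(P)
    I[1:] = H[1:] - H[:-1]    # temporal fine variation I(f) of H(f)
    Q[1:] = P[1:] - P[:-1]    # temporal fine variation Q(f) of P(f)
    return I, Q
```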
It is of note that although the case in which the short-time spectrum operator 23 extracts the spectral features from the expressive sample has been described above as an example for convenience, the short-time spectrum operator 23 may extract the spectral features from the synthesized voice generated by the singing voice synthesizer 20A using the same method. Depending on the synthesis scheme of the singing voice synthesizer 20A, the short-time spectrum and/or a part or the entirety of the spectral features may already be included in the singing voice synthesis parameters; in this case, the short-time spectrum operator 23 may receive these pieces of data from the singing voice synthesizer 20A, and the calculation may be omitted. Alternatively, the short-time spectrum operator 23 may extract the spectral features of the expressive sample in advance, prior to the input of the synthesized voice, and store the spectral features in a memory; when the synthesized voice is input, the short-time spectrum operator 23 may read out the spectral features of the expressive sample from the memory and output them. This makes it possible to reduce the amount of processing per unit time when the synthesized voice is input.
The synthesizer 24 synthesizes the synthesized voice with the expressive sample to obtain a synthesized voice to which the expression of the singing voice has been imparted. There are various methods of synthesizing the synthesized voice with the expressive sample and obtaining a waveform of the resultant voice in the time domain in the end. These methods can be roughly classified into two types depending on how an input spectrum is expressed. One of the methods is a method based on harmonic components and the other is a method based on the amplitude spectrum envelope.
As a synthesis method based on harmonic components, SMS, for example, is known (Serra, Xavier, and Julius Smith. “Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition.” Computer Music Journal 14.4 (1990): 12-24). In SMS, the spectrum of a voiced sound is expressed by the frequencies, amplitudes, and phases of sinusoidal components at the fundamental frequency and at substantially integer multiples of the fundamental frequency. When the spectrum is generated by SMS and an inverse Fourier transformation is performed, a waveform corresponding to several periods multiplied by a window function is obtained. After dividing the waveform by the window function, only the vicinity of the center of the synthesis result is cut out by another window function and added in an overlapping manner to an output result buffer. This process is repeated at frame intervals so that a continuous waveform of a long duration can be obtained.
As a synthesis method based on the amplitude spectrum envelope, for example, NBVPM (Bonada, Jordi. “High quality voice transformations based on modeling radiated voice pulses in frequency domain.” Proc. Digital Audio Effects (DAFx). 2004) is known. In this method, the spectrum is expressed by the amplitude spectrum envelope and the phase spectrum envelope, and does not include the fundamental frequency or the frequency information of harmonic components. When this spectrum is subjected to an inverse Fourier transformation, a pulse waveform corresponding to one cycle of vocal cord vibration and the vocal tract response thereto is obtained. This waveform is added in an overlapping manner to an output buffer. In this case, when the phase spectrum envelopes in the spectra of adjacent pulses have substantially the same value, the reciprocal of the time interval used for the overlapping addition to the output buffer becomes the final fundamental frequency of the synthesized voice.
For synthesizing the synthesized voice with the expressive sample, there are a method of carrying out the synthesis in the frequency domain and a method of carrying out the synthesis in the time domain. In either method, the synthesis of the synthesized voice with the expressive sample is basically performed in accordance with the following procedure. First, the synthesized voice and the expressive sample are morphed with respect to the components other than the temporal fine variation components of the amplitude and the phase. Then, the synthesized voice to which the expression of the singing voice has been imparted is generated by adding the temporal fine variation components of the amplitudes and the phases of the respective harmonic components (or of frequency bands proximate to the harmonic components).
It is of note that, when the synthesized voice is synthesized with the expressive sample, temporal expansion/contraction mapping different from that for components other than the temporal fine variation component may be used only for the temporal fine variation component. This is effective, for example, in two cases below.
The first case is a case in which the user has intentionally changed the speed of the expression of the singing voice. The speed of the variation, or the periodicity, of the temporal fine variation component is closely related to the texture of a voice (for example, a texture such as “rustling”, “scratchy”, or “fizzy”), and when the variation speed is changed, the texture of the voice is altered. For example, when the user inputs an instruction to increase the speed of an expression of the singing voice in which the pitch decreases at the end, as shown in
The second case is a case where, in an expression of the singing voice, the cycle at which the temporal fine variation component varies should depend on the fundamental frequency. In an expression of the singing voice that includes periodic modulation in the amplitude and phase of harmonic components, it is empirically known that the voice may sound natural when the cycle at which the amplitude and phase vary maintains a temporal correspondence to the fundamental frequency. An expression of the singing voice having such texture is referred to, for example, as “rough” or “growl”. A scheme that can be used to maintain this temporal correspondence to the fundamental frequency is to apply, to the data readout speed of the temporal fine variation component, the same ratio as the conversion ratio of the fundamental frequency applied when the waveform of the expressive sample is synthesized.
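A sketch of this readout-speed adjustment is given below. Rounding to the nearest frame and wrapping past the end of the source frames are assumptions of the sketch, not prescribed behavior.

```python
import numpy as np

def fine_variation_readout(num_output_frames, f0_conversion_ratio, num_source_frames):
    """Scale the readout speed of the temporal fine variation frames by the same
    ratio as the F0 conversion ratio applied when the expressive-sample waveform
    is synthesized, so that the modulation cycle keeps its relationship to F0."""
    positions = np.arange(num_output_frames) * f0_conversion_ratio
    return np.mod(np.round(positions).astype(int), num_source_frames)
```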
The synthesizer 24 of
Further, the amplitude spectrum envelope is a spectral feature related to the lyrics. Accordingly, by setting the amount of morphing of the amplitude spectrum envelope to zero, the expression of the singing voice can be imparted without affecting the lyrics, because the amplitude spectrum envelope is thereby excluded from the spectral features to be morphed. For example, for an expression of the singing voice in which a sample is recorded for only specific lyrics (for example, /a/), when the amount of morphing of the amplitude spectrum envelope is set to zero, the expressive sample can be applied, through morphing, to a synthesized voice of lyrics other than the specific lyrics without problems.
Thus, the spectral features to be morphed can be limited for each type of expression of the singing voice. The user may limit the spectral features to be morphed as described above, or may set all spectral features as morphing targets regardless of the type of expression of the singing voice. When a large number of spectral features are morphed for a portion, a synthesized voice close to the original expressive sample is obtained, so that the naturalness of that portion is improved. However, since a greater difference in voice quality will then result relative to portions to which the expression of the singing voice is not imparted, discomfort is likely to be perceived when the entire singing voice is heard. Therefore, in templating the spectral features to be morphed, the spectral features that are morphing targets are determined in consideration of the balance between naturalness and discomfort.
In step S1401, the acquirer 26 acquires a temporal change in the spectral features of the synthesized voice generated by the singing voice synthesizer 20A. The spectral features acquired here include at least one of the amplitude spectrum envelope H(f), the amplitude spectrum envelope contour G(f), the phase spectrum envelope P(f), the temporal fine variation I(f) of the amplitude spectrum envelope, the temporal fine variation Q(f) of the phase spectrum envelope, or the fundamental frequency F0. It is of note that the acquirer 26 may acquire, for example, the spectral features extracted by the short-time spectrum operator 23 from the singing sample to be used for generation of the synthesized voice.
In step S1402, the acquirer 26 acquires a temporal change in the spectral features used for impartment of the expression of the singing voice. The spectral features acquired here are basically of the same types as those used for generation of the synthesized voice. In order to distinguish the spectral features of the synthesized voice from those of the expressive sample, a subscript v is assigned to the spectral features of the synthesized voice, a subscript p is assigned to the spectral features of the expressive sample, and a subscript vp is assigned to the synthesized voice to which the expression of the singing voice has been imparted. The acquirer 26 acquires, for example, the spectral features that the short-time spectrum operator 23 has extracted from the expressive sample.
In step S1403, the acquirer 26 acquires the expression reference time set for the expressive sample to be imparted. The expression reference time acquired here includes at least one of the singing voice expression start time, the singing voice expression end time, the note onset start time, the note offset start time, the note onset end time, or the note offset end time, as described above.
In step S1404, the timing calculator 21 calculates a timing (a position on the time axis) at which the expressive sample is aligned with the note (synthesized voice), using data on the feature point of the synthesized voice determined by the singing voice synthesizer 20A and the expression reference time recorded with regard to the expressive sample. As will be understood from the above description, step S1404 is a process of positioning the expressive sample (for example, a series of amplitude spectrum envelope contours) with respect to the synthesized voice on the time axis so that the feature point (for example, the vowel start time, the vowel end time, and the pronunciation end time) of the synthesized voice on the time axis is aligned with the expression reference time of the sample.
In step S1405, the temporal expansion/contraction mapper 22 performs temporal expansion/contraction mapping on the expressive sample according to a relationship between a time length of the note and the time length of the expressive sample. As will be understood from the above description, step S1405 is a process of expanding or contracting the expressive sample (for example, a series of amplitude spectrum envelope contours) on the time axis to be matched with the time length of a period (for example, a note) of a part in the synthesized voice.
In step S1406, the temporal expansion/contraction mapper 22 shifts a pitch of the expressive sample so that the fundamental frequency F0v of the synthesized voice matches the fundamental frequency F0p of the expressive sample (that is, so that the pitches of the synthesized voice and the expressive sample match each other). As will be understood from the above description, step S1406 is a process of shifting a series of pitches of the expressive sample on the basis of a pitch difference between the fundamental frequency F0v (for example, a pitch designated in the note) of the synthesized voice and a representative value of the fundamental frequencies F0p of the expressive sample.
As shown in
Gvp(f)=(1−aG)Gv(f)+aG·Gp(f) (1)
Hvp(f)=(1−aH)Hv(f)+aH·Hp(f) (2)
Ivp(f)=(1−aI)Iv(f)+aI·Ip(f) (3),
where aG, aH, and aI are the amounts of morphing for the amplitude spectrum envelope contour G(f), the amplitude spectrum envelope H(f), and the temporal fine variation I(f) of the amplitude spectrum envelope, respectively. As described above, in the actual processing, the morphing of Equation (2) may be performed not on (a) the amplitude spectrum envelope H(f) itself, but instead on (a′) the difference between the amplitude spectrum envelope contour G(f) and the amplitude spectrum envelope H(f). Further, the synthesis of the temporal fine variation I(f) may be performed in the frequency domain as in Equation (3) (
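For reference, Equations (1) to (3) can be implemented directly as frame-wise linear interpolation, as in the sketch below; the array shapes and the function name are assumptions of this sketch.

```python
import numpy as np

def morph_amplitude_features(Gv, Hv, Iv, Gp, Hp, Ip, aG, aH, aI):
    """Frame-wise morphing of Equations (1)-(3) between the synthesized voice
    (subscript v) and the expressive sample (subscript p)."""
    Gvp = (1.0 - aG) * np.asarray(Gv) + aG * np.asarray(Gp)   # Equation (1)
    Hvp = (1.0 - aH) * np.asarray(Hv) + aH * np.asarray(Hp)   # Equation (2)
    Ivp = (1.0 - aI) * np.asarray(Iv) + aI * np.asarray(Ip)   # Equation (3)
    return Gvp, Hvp, Ivp
```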
In step S1408, the generation processor 2401B of the spectrum generator 2401 generates and outputs a spectrum defined by the spectral features obtained from the synthesis by the feature synthesizer 2401A. As will be understood from the above description, steps S1404 to S1408 of the embodiment correspond to an altering step of obtaining a series of spectra to which the expression of the singing voice has been imparted (an example of a series of changed spectra) by altering the series of spectra of the synthesized voice (an example of a series of synthesis spectra) based on the series of spectral features of the expressive sample of the expression of the singing voice.
When the spectrum generated by the spectrum generator 2401 is input, the inverse Fourier transformer 2402 performs an inverse Fourier transformation on the input spectrum (step S1409) and outputs a waveform in the time domain. When the waveform in the time domain is input, the synthesis window applier 2403 applies a predetermined window function to the input waveform (step S1410) and outputs the result. The overlapping adder 2404 adds the waveform to which the window function has been applied, in an overlapping manner (step S1411). By repeating this process at frame intervals, a continuous waveform of a long duration can be obtained. The obtained waveform of the singing voice is played back by the output device 107 such as a speaker. As will be understood from the above description, steps S1409 to S1411 of the embodiment correspond to a synthesizing step of synthesizing a series of voice samples to which the expression of the singing voice has been imparted, on the basis of a series of spectra to which the expression of the singing voice has been imparted (a series of changed spectra).
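Steps S1409 to S1411 amount to a standard inverse-transform, window, and overlap-add loop, sketched below. The full-length complex spectra, the hop size, and the window length equal to the frame length are assumptions of the sketch.

```python
import numpy as np

def overlap_add_synthesis(frame_spectra, hop_size, window):
    """For each frame, apply an inverse Fourier transformation to the generated
    spectrum, apply a synthesis window, and add the result to the output in an
    overlapping manner at frame intervals."""
    frame_len = len(window)
    output = np.zeros(hop_size * (len(frame_spectra) - 1) + frame_len)
    for i, spectrum in enumerate(frame_spectra):
        frame = np.fft.ifft(spectrum).real          # step S1409: time-domain frame
        frame = frame[:frame_len] * window          # step S1410: synthesis window
        start = i * hop_size
        output[start:start + frame_len] += frame    # step S1411: overlap-add
    return output
```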
The method of
The spectrum generator 2411 generates a spectrum of the synthesized voice to which the expression of the singing voice has been imparted. The spectrum generator 2411 of the embodiment includes a feature synthesizer 2411A and a generation processor 2411B. For each frame, the amplitude spectrum envelope H(f), the amplitude spectrum envelope contour G(f), the phase spectrum envelope P(f), and the fundamental frequency F0 for each of the synthesized voice and the expressive sample are input to the feature synthesizer 2411A. The feature synthesizer 2411A synthesizes (morphs) the input spectral features (H(f), G(f), P(f), and F0) between the synthesized voice and the expressive sample for each frame, and outputs the synthesized features. It is of note that the synthesized voice and the expressive sample are input and synthesized only in a section in which the expressive sample is positioned among the entire section of the synthesized voice, and in the remaining section, the feature synthesizer 2411A receives only the spectral features of the synthesized voice and outputs the spectral features as they are.
For each frame, the temporal fine variation Ip(f) of the amplitude spectrum envelope and the temporal fine variation Qp(f) of the phase spectrum envelope that the short-time spectrum operator 23 has extracted from the expressive sample are input to the generation processor 2411B. For each frame, the generation processor 2411B generates and outputs a spectrum whose shape follows the spectral features obtained from the synthesis by the feature synthesizer 2411A and which has fine variations according to the temporal fine variation Ip(f) and the temporal fine variation Qp(f).
The inverse Fourier transformer 2412 performs, for each frame, an inverse Fourier transformation on the spectrum generated by the generation processor 2411B to obtain a waveform in a time domain (that is, a series of voice samples). The synthesis window applier 2413 applies a predetermined window function to the waveform of each frame obtained through the inverse Fourier transformation. The overlapping adder 2414 adds the waveforms for a series of frames, to each of which waveforms the window function has been applied, in an overlapping manner. By repeating these processes at frame intervals, a continuous waveform A (a voice signal) of a long duration can be obtained. This waveform A shows a waveform in the time domain of the synthesized voice to which the expression of the singing voice has been imparted, where the fundamental frequency of the expression of the singing voice is shifted and the expression of the singing voice includes the fine variation.
The amplitude spectrum envelope Hvp(f), the amplitude spectrum envelope contour Gvp(f), the phase spectrum envelope Pvp(f), and the fundamental frequency F0vp of the synthesized voice are input to the singing voice synthesizer 2415. Using a known singing voice synthesis scheme, for example, the singing voice synthesizer 2415 generates a waveform B (a voice signal) in the time domain of the synthesized voice to which the expression of the singing voice has been imparted, where the fundamental frequency of the expression of the singing voice is shifted on the basis of these spectral features and the expression of the singing voice does not include the fine variation.
The multiplier 2416 multiplies the waveform A from the overlapping adder 2414 by an application coefficient a of the fine variation component. The multiplier 2417 multiplies the waveform B from the singing voice synthesizer 2415 by a coefficient (1−a). The adder 2418 adds together the waveform A from the multiplier 2416 and the waveform B from the multiplier 2417, to output a mixed waveform C.
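The mixing performed by the multipliers 2416 and 2417 and the adder 2418 amounts to a weighted sum of the two waveforms, as in the sketch below; the function name is illustrative only.

```python
import numpy as np

def mix_waveforms(waveform_a, waveform_b, a):
    """C = a * A + (1 - a) * B, where A is the waveform containing the fine
    variation, B is the waveform without it, and a is the application
    coefficient of the fine variation component."""
    A = np.asarray(waveform_a, dtype=float)
    B = np.asarray(waveform_b, dtype=float)
    return a * A + (1.0 - a) * B
```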
In the method of synthesizing the fine variations in the time domain (
For the fine variation component, the method of carrying out synthesis in the time domain handles only the portion in which the waveform A is synthesized within a short frame. According to this method, it is not necessary for the singing voice synthesizer 2415 to be of a scheme suited to frames synchronized with the fundamental cycle T. In this case, for the singing voice synthesizer 2415, a scheme such as spectral peak processing (SPP) (Jordi Bonada, Alex Loscos. “Sample-based singing voice synthesizer by spectral concatenation.” Proceedings of Stockholm Music Acoustics Conference. 2003) can be used, for example. SPP synthesizes a waveform that does not include a temporal fine variation and in which a component corresponding to the texture of a voice has been reproduced according to the spectrum shape around each harmonic peak. In a case where an expression of the singing voice is imparted to a voice synthesized by an existing singing voice synthesizer adopting such a method, it is simple and convenient to adopt the method of synthesizing the fine variation in the time domain, since the existing singing voice synthesizer can be used as it is. It is of note that, when the synthesis is carried out in the time domain, the waveforms cancel each other or beats are generated if the phases of the synthesized voice and the expressive sample differ. In order to avoid this problem, the same fundamental frequency and the same phase spectrum envelope are used in the synthesizer for the waveform A and the synthesizer for the waveform B, and the reference positions (so-called pitch marks) of the voice pulse for each cycle are matched between the synthesizers.
It is of note that, since the value of the phase spectrum obtained by analyzing a voice through short-time Fourier transformation or the like generally has an uncertainty of θ+2nπ for an integer n, morphing the phase spectrum envelope may sometimes involve difficulty. Since the influence of the phase spectrum envelope on the perception of the voice is smaller than that of other spectral features, the phase spectrum envelope need not necessarily be synthesized, and an arbitrary value may be imparted instead. The simplest and most natural method of determining the phase spectrum envelope is to use a minimum phase calculated from the amplitude spectrum envelope. In this case, an amplitude spectrum envelope H(f)+G(f) excluding the fine variation component is first obtained from the H(f) and G(f) in
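A common way to compute such a minimum phase from a log-amplitude envelope is via the real cepstrum, as sketched below; the even FFT length and the full (two-sided) envelope layout are assumptions of this sketch.

```python
import numpy as np

def minimum_phase_from_log_envelope(log_amp_envelope):
    """Compute a minimum-phase spectrum (in radians) from a log-amplitude
    envelope given on the full (two-sided) FFT grid of even length N."""
    N = len(log_amp_envelope)
    cep = np.fft.ifft(log_amp_envelope).real      # real cepstrum of the envelope
    folded = np.zeros(N)
    folded[0] = cep[0]
    folded[1:N // 2] = 2.0 * cep[1:N // 2]        # fold negative quefrencies
    folded[N // 2] = cep[N // 2]                  # Nyquist term (N even)
    return np.fft.fft(folded).imag                # phase of the minimum-phase spectrum
```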
The window 512 is an area in which there are displayed image objects indicating operators for imparting the singing voice expression with the attack reference to one or more notes selected in the score display area 511. The window 513 is an area in which there are displayed image objects indicating operators for imparting the singing voice expression with the release reference to one or more notes selected in the score display area 511. The selection of the note in the score display area 511 is performed by a predetermined operation (for example, left-button click of a mouse).
Accordingly, in the score display area 511, an icon 5116 and an icon 5117 are displayed proximate to a note 5111. The icon 5116 is an icon (an example of an image object) for instructing editing of the singing voice expression with the attack reference when the singing voice expression with the attack reference is imparted, and the icon 5117 is an icon for instructing editing of the singing voice expression with the release reference when the singing voice expression with the release reference is imparted. For example, when the user clicks the right button of the mouse in a state in which a mouse pointer is positioned on the icon 5116, a pop-up window 514 for selecting the singing voice expression with the attack reference is displayed, and thus the user is able to change the expression of the singing voice to be imparted.
In the example shown in
The UI unit 30 detects a rotation angle of the dial 5122 in response to a user operation. The UI unit 30 identifies six maximum values of the amount of morphing corresponding to the detected rotation angle by referring to the table shown in
The present disclosure is not limited to the embodiments described above, and various modifications can be made. In the following, several modifications will be described. Two or more of the following modifications may be used in combination.
(1) A target to which an expression is imparted is not limited to a singing voice and may be a voice that is not sung. That is, the expression of the singing voice may be an expression of a spoken voice. Further, a voice to which the voice expression is imparted is not limited to a voice synthesized by a computer device, and may be an actual human voice. Further, the target to which the expression of the singing voice is imparted may be a voice which is not based on a human voice.
(2) A functional configuration of the voice synthesis device 1 is not limited to the configuration shown in the embodiment. Some of the functions shown in the embodiment may be omitted. For example, at least some of the functions of the timing calculator 21, the temporal expansion/contraction mapper 22, or the short-time spectrum operator 23 may be omitted from the voice synthesis device 1.
(3) A hardware configuration of the voice synthesis device 1 is not limited to the configuration shown in the embodiment. The voice synthesis device 1 may be of any hardware configuration as long as the hardware configuration can realize required functions. For example, the voice synthesis device 1 may be a client device that works in cooperation with a server device on a network. That is, the functions of the voice synthesis device 1 may be distributed to the server device on the network and the local client device.
(4) A program that is executed by the CPU 101 or the like may be provided by a storage medium such as an optical disk, a magnetic disk, or a semiconductor memory, or may be downloaded via a communication means such as the Internet.
(5) The following are aspects of the present disclosure derivable from the specific forms exemplified above.
A voice synthesis method according to an aspect (a first aspect) of the present disclosure includes: altering a series (time series) of synthesis spectra in a partial period of a synthesis voice based on a series of amplitude spectrum envelope contours of a voice expression to obtain a series of changed spectra to which the voice expression has been imparted; and synthesizing a series of voice samples to which the voice expression has been imparted, based on the series of changed spectra.
A voice synthesis method according to a second aspect is the voice synthesis method according to the first aspect, in which the altering includes altering amplitude spectrum envelope contours of the synthesis spectrum through morphing performed based on the amplitude spectrum envelope contours of the voice expression.
A voice synthesis method according to a third aspect is the voice synthesis method according to the first aspect or the second aspect, in which the altering includes altering the series of synthesis spectra based on the series of amplitude spectrum envelope contours of the voice expression and a series of amplitude spectrum envelopes of the voice expression.
A voice synthesis method according to a fourth aspect is the voice synthesis method according to any one of the first to the third aspects, in which the altering includes positioning the series of amplitude spectrum envelope contours of the voice expression so that a feature point of the synthesized voice on a time axis aligns with an expression reference time that is set for the voice expression, and altering the series of synthesis spectra based on the positioned series of amplitude spectrum envelope contours.
A voice synthesis method according to a fifth aspect is the voice synthesis method according to the fourth aspect, in which the feature point of the synthesized voice is a vowel start time of the synthesized voice. Further, a voice synthesis method according to a sixth aspect is the voice synthesis method according to the fourth aspect, in which the feature point of the synthesized voice is a vowel end time of the synthesized voice or a pronunciation end time of the synthesized voice.
A voice synthesis method according to a seventh aspect is the voice synthesis method according to the first aspect, in which the altering includes expanding or contracting the series of amplitude spectrum envelope contours of the voice expression on a time axis to match a time length of the period of the part of the synthesized voice, and altering the series of synthesis spectra based on the expanded or contracted series of amplitude spectrum envelope contours.
A voice synthesis method according to an eighth aspect is the voice synthesis method according to the first aspect, in which the altering includes shifting a series of pitches of the voice expression based on a pitch difference between a pitch in the period of the part of the synthesized voice, and a representative value of the pitches of the voice expression, and altering the series of synthesis spectra based on the shifted series of pitches and the series of amplitude spectrum envelope contours of the voice expression.
A voice synthesis method according to a ninth aspect is the voice synthesis method according to the first aspect, in which the altering includes altering the series of synthesis spectra based on a series of at least one of amplitude spectrum envelopes or phase spectrum envelopes in the voice expression.
(6) A voice synthesis method according to a first viewpoint of the present disclosure includes the following steps:
It is of note that step 1 may be performed before step 2, after step 3, or between step 2 and step 3. Further, a specific example of the “first spectrum envelope” is the amplitude spectrum envelope Hv(f), the amplitude spectrum envelope contour Gv(f), or the phase spectrum envelope Pv(f), and a specific example of the “first fundamental frequency” is the fundamental frequency F0v. A specific example of the “second spectrum envelope” is the amplitude spectrum envelope Hp(f) or the amplitude spectrum envelope contour Gp(f), and a specific example of the “second fundamental frequency” is the fundamental frequency F0p. A specific example of the “third spectrum envelope” is the amplitude spectrum envelope Hvp(f) or the amplitude spectrum envelope contour Gvp(f), and a specific example of the “third fundamental frequency” is the fundamental frequency F0vp.
(7) As described above, there is a tendency for the amplitude spectrum envelope to contribute to the perception of lyrics or of a vocalizer, and for the amplitude spectrum envelope contour not to depend on the lyrics or the vocalizer. Given this tendency, for the transformation of the amplitude spectrum envelope Hv(f) of the synthesized voice, either the amplitude spectrum envelope Hp(f) or the amplitude spectrum envelope contour Gp(f) of an expressive sample may be used, with appropriate switching between the two. Specifically, when the lyric or the vocalizer is substantially the same in the synthesized voice and the expressive sample, the amplitude spectrum envelope Hp(f) can be used for the transformation of the amplitude spectrum envelope Hv(f), and when the lyric or the vocalizer is not substantially the same in the synthesized voice and the expressive sample, the amplitude spectrum envelope contour Gp(f) can be used for the transformation of the amplitude spectrum envelope Hv(f).
The voice synthesis method according to a viewpoint described above (hereafter, a “second viewpoint”) includes the following steps.
It is of note that in the second viewpoint, a specific example of the “first spectrum envelope” is the amplitude spectrum envelope Hv(f). A specific example of the “second spectrum envelope” is the amplitude spectrum envelope Hp(f), and a specific example of the “contour of the second spectrum envelope” is the amplitude spectrum envelope contour Gp(f). A specific example of the “third spectrum envelope” is the amplitude spectrum envelope Hvp(f).
In an example of the second viewpoint, determining whether the predetermined condition is satisfied includes determining that the predetermined condition is satisfied in a case where a vocalizer of the first voice and a vocalizer of the second voice are substantially the same. In another example of the second viewpoint, determining whether the predetermined condition is satisfied includes determining that the predetermined condition is satisfied in a case where lyrics of the first voice and lyrics of the second voice are substantially the same.
(8) A voice synthesis method according to a third viewpoint of the present disclosure includes the following steps.
The “first spectrum envelope” is, for example, the amplitude spectrum envelope Hvp(f) or the amplitude spectrum envelope contour Gvp(f) generated by the feature synthesizer 2411A in
In an example of the third viewpoint, the fine variation is extracted from the voice to which the voice expression has been imparted through frequency analysis in which the frame synchronized with the voice has been used.
In an example of the third viewpoint, in step 1, the first spectrum envelope is acquired by synthesizing (morphing) the second spectrum envelope of the voice with the third spectrum envelope of the voice to which the voice expression has been imparted, according to a second change amount. The “second spectrum envelope” is, for example, the amplitude spectrum envelope Hv(f) or the amplitude spectrum envelope contour Gv(f), and the “third spectrum envelope” is, for example, the amplitude spectrum envelope Hp(f) or the amplitude spectrum envelope contour Gp(f). The second change amount is, for example, the coefficient aG in Equation (1) or the coefficient aH in Equation (2) described above.
In an example of the third viewpoint, in step 1, the first fundamental frequency is acquired by synthesizing the second fundamental frequency of the voice with the third fundamental frequency of the voice to which the voice expression has been imparted, according to a third change amount. The “second fundamental frequency” is, for example, the fundamental frequency F0v, and the “third fundamental frequency” is, for example, the fundamental frequency F0p.
In an example of the third viewpoint, in step 5, the first voice signal and the second voice signal are mixed in a state in which a pitch mark of the first voice signal and a pitch mark of the second voice signal substantially match on the time axis. The “pitch mark” is a feature point, on the time axis, of a shape in a waveform of the voice signal in the time domain. For example, a peak and/or a valley of the waveform is a specific example of the “pitch mark”.
This application is a Continuation Application of PCT Application No. PCT/JP2017/040047, filed Nov. 7, 2017, and is based on and claims priority from Japanese Patent Application No. 2016-217378, filed Nov. 7, 2016, the entire contents of each of which are incorporated herein by reference.