The present application claims priority from Japanese application JP 2013-166311 filed on Aug. 9, 2013, the content of which is hereby incorporated by reference into this application.
1. Field of the Invention
The present invention relates to a voice analysis method, a voice analysis device, a voice synthesis method, a voice synthesis device, and a computer readable medium storing a voice analysis program.
2. Description of the Related Art
There is proposed a technology for generating a time series of a feature amount of a sound by using a probabilistic model for expressing a probabilistic transition between a plurality of statuses. For example, in a technology disclosed in Japanese Patent Application Laid-open No. 2011-13454, a probabilistic model using a hidden Markov model (HMM) is used to generate a time series (pitch curve) of a pitch. A singing voice for a desired music track is synthesized by driving a sound generator (for example, sine-wave generator) in accordance with the time series of the pitch generated from the probabilistic model and executing filter processing corresponding to phonemes of lyrics. However, in the technology disclosed in Japanese Patent Application Laid-open No. 2011-13454, a probabilistic model is generated for each combination of adjacent notes, and hence probabilistic models need to be generated for a large number of combinations of notes in order to generate singing voices for a variety of music tracks.
Japanese Patent Application Laid-open No. 2012-37722 discloses a configuration for generating a probabilistic model of a relative value (relative pitch) between the pitch of each of notes forming a music track and the pitch of the singing voice for the music track. In the technology disclosed in Japanese Patent Application Laid-open No. 2012-37722, the probabilistic model is generated by using the relative pitch, which is advantageous in that there is no need to provide a probabilistic model for each of the large number of combinations of notes.
However, in the technology disclosed in Japanese Patent Application Laid-open No. 2012-37722, the pitch of each of the notes of a music track fluctuates discretely (discontinuously), and hence the relative pitch fluctuates discontinuously at a time point of a boundary between notes that differ in pitch. Therefore, a synthesized voice generated by applying the relative pitch may sound auditorily unnatural. In view of the above-mentioned circumstances, an object of one or more embodiments of the present invention is to generate a time series of a relative pitch capable of generating a synthesized voice that sounds auditorily natural.
In one or more embodiments of the present invention, a voice analysis method includes a variable extraction step of generating a time series of a relative pitch. The relative pitch is a difference between a pitch generated from music track data, which continuously fluctuates on a time axis, and a pitch of a reference voice. The music track data designate respective notes of a music track in time series. The reference voice is a voice obtained by singing the music track. The pitch of the reference voice is processed by an interpolation processing for a voiceless section from which no pitch is detected. The voice analysis method also includes a characteristics analysis step of generating singing characteristics data that define a model for expressing the time series of the relative pitch generated in the variable extraction step.
In one or more embodiments of the present invention, a voice analysis device includes a variable extraction unit configured to generate a time series of a relative pitch. The relative pitch is a difference between a pitch generated from music track data, which continuously fluctuates on a time axis, and a pitch of a reference voice. The music track data designate respective notes of a music track in time series. The reference voice is a voice obtained by singing the music track. The pitch of the reference voice is processed by interpolation processing for a voiceless section from which no pitch is detected. The voice analysis device also includes a characteristics analysis unit configured to generate singing characteristics data that define a model for expressing the time series of the relative pitch generated by the variable extraction unit.
In one or more embodiments of the present invention, there is provided a non-transitory computer-readable recording medium having stored thereon a voice analysis program. The voice analysis program includes a variable extraction instruction for generating a time series of a relative pitch. The relative pitch is a difference between a pitch generated from music track data, which continuously fluctuates on a time axis, and a pitch of a reference voice. The music track data designate respective notes of a music track in time series. The reference voice is a voice obtained by singing the music track. The pitch of the reference voice is processed by interpolation processing for a voiceless section from which no pitch is detected. The voice analysis program also includes a characteristics analysis instruction for generating singing characteristics data that define a model for expressing the time series of the relative pitch generated by the variable extraction instruction.
In one or more embodiments of the present invention, a voice synthesis method includes a variable setting step of generating a relative pitch transition based on synthesis-purpose music track data and at least one singing characteristic data. The synthesis-purpose music track data designate respective notes of a first music track to be subjected to voice synthesis in time series. The at least one singing characteristic data define a model expressing a time series of a relative pitch. The relative pitch is a difference between a first pitch and a second pitch. The first pitch is generated from music track data for designating respective notes of a second music track in time series and continuously fluctuates on a time axis. The second pitch is a pitch of a reference voice that is obtained by singing the second music track. The second pitch is processed by interpolation processing for a voiceless section from which no pitch is detected. The voice synthesis method also includes a voice synthesis step of generating a voice signal based on the synthesis-purpose music track data, a phonetic piece group indicating respective phonemes, and the relative pitch transition.
In one or more embodiments of the present invention, a voice synthesis device includes a variable setting unit configured to generate a relative pitch transition based on synthesis-purpose music track data and at least one singing characteristic data. The synthesis-purpose music track data designate respective notes of a first music track to be subjected to voice synthesis in time series. The at least one singing characteristic data define a model expressing a time series of a relative pitch. The relative pitch is a difference between a first pitch and a second pitch. The first pitch is generated from music track data for designating respective notes of a second music track in time series and continuously fluctuates on a time axis. The second pitch is a pitch of a reference voice that is obtained by singing the second music track. The second pitch is processed by interpolation processing for a voiceless section from which no pitch is detected. The voice synthesis device also includes a voice synthesis unit configured to generate a voice signal based on the synthesis-purpose music track data, a phonetic piece group indicating respective phonemes, and the relative pitch transition.
In order to solve the above-mentioned problems, a voice analysis device according to one embodiment of the present invention includes a variable extraction unit configured to generate a time series of a relative pitch serving as a difference between a pitch which is generated from music track data for designating each of notes of a music track in time series and which continuously fluctuates on a time axis and a pitch of a reference voice obtained by singing the music track; and a characteristics analysis unit configured to generate singing characteristics data that defines a probabilistic model for expressing the time series of the relative pitch generated by the variable extraction unit. In the above-mentioned configuration, the time series of the relative pitch serving as the difference between the pitch which is generated from the music track data and which continuously fluctuates on the time axis and the pitch of the reference voice is expressed as a probabilistic model, and hence a discontinuous fluctuation of the relative pitch is suppressed compared to a configuration in which a difference between the pitch of each of the notes of the music track and the pitch of the reference voice is calculated as the relative pitch. Therefore, it is possible to generate the synthesized voice that sounds auditorily natural.
According to a preferred embodiment of the present invention, the variable extraction unit includes: a transition generation unit configured to generate the pitch that continuously fluctuates on the time axis from the music track data; a pitch detection unit configured to detect the pitch of the reference voice obtained by singing the music track; an interpolation processing unit configured to set a pitch for a voiceless section of the reference voice from which no pitch is detected; and a difference calculation unit configured to calculate a difference between the pitch generated by the transition generation unit and the pitch that has been processed by the interpolation processing unit as the relative pitch. In the above-mentioned configuration, the pitch is set for the voiceless section from which no pitch of the reference voice is detected, to thereby shorten a silent section. Therefore, there is an advantage in that the discontinuous fluctuation of the relative pitch can be effectively suppressed. According to a further preferred embodiment of the present invention, the interpolation processing unit is further configured to: set, in accordance with the time series of the pitch within a first section immediately before the voiceless section, a pitch within a first interpolation section of the voiceless section immediately after the first section; and set, in accordance with the time series of the pitch within a second section immediately after the voiceless section, a pitch within a second interpolation section of the voiceless section immediately before the second section. In the above-mentioned embodiment, the pitch within the voiceless section is approximately set in accordance with the pitches within a voiced section before and after the voiceless section, and hence the above-mentioned effect of suppressing the discontinuous fluctuation of the relative pitch within the voiced section of the music track designated by the music track data is remarkable.
According to a preferred embodiment of the present invention, the characteristics analysis unit includes: a section setting unit configured to divide the music track into a plurality of unit sections by using a predetermined duration as a unit; and an analysis processing unit configured to generate the singing characteristics data including, for each of a plurality of statuses of the probabilistic model: a decision tree for classifying the plurality of unit sections obtained by the division performed by the section setting unit into a plurality of sets; and variable information for defining a probability distribution of the time series of the relative pitch within each of the unit sections classified into the respective sets. In the above-mentioned embodiment, the probabilistic model is defined by using a predetermined duration as a unit, which is advantageous in that, for example, singing characteristics (relative pitch) can be controlled with precision irrespective of a length of a duration, compared to a configuration in which the probabilistic model is assigned by using the note as a unit.
When a completely independent decision tree is generated for each of a plurality of statuses of the probabilistic model, characteristics of the time series of the relative pitch within the unit section may differ between the statuses, with the result that the synthesized voice may become a voice that gives an impression of sounding unnatural (for example, a voice that cannot be pronounced in actuality or a voice different from an actual pronunciation). In view of the above-mentioned circumstances, the analysis processing unit according to the preferred embodiment of the present invention generates a decision tree for each status from a basic decision tree common across the plurality of statuses of the probabilistic model. In the above-mentioned embodiment, the decision tree for each status is generated from the basic decision tree common across the plurality of statuses of the probabilistic model, which is advantageous in that, compared to a configuration in which a mutually independent decision tree is generated for each of the statuses of the probabilistic model, a possibility that the characteristics of the transition of the relative pitch excessively differ between adjacent statuses is reduced, and the synthesized voice that sounds auditorily natural (for example, a voice that can be pronounced in actuality) can be generated. Note that, the decision trees for the respective statuses generated from the common basic decision tree are partially or entirely common to one another.
According to a preferred embodiment of the present invention, the decision tree for each status contains a condition corresponding to a relationship between each of phrases obtained by dividing the music track on the time axis and the unit section. In the above-mentioned embodiment, the condition relating to the relationship between the unit section and the phrase is set for each of nodes of the decision tree, and hence it is possible to generate the synthesized voice that sounds auditorily natural in which the relationship between the unit section and the phrase is taken into consideration.
(Voice Analysis Device 100)
As exemplified in
The storage device 14 according to the first embodiment stores reference music track data XB and reference voice data XA used to generate the singing characteristics data Z. As exemplified in
The processor unit 12 illustrated in
The variable extraction unit 22 acquires a time series of a feature amount of the reference voice expressed by the reference voice data XA. The variable extraction unit 22 according to the first embodiment successively calculates, as the feature amount, a difference (hereinafter referred to as “relative pitch”) R between a pitch PB of a voice (hereinafter referred to as “synthesized voice”) generated by the voice synthesis to which the reference music track data XB is applied and a pitch PA of the reference voice expressed by the reference voice data XA. That is, the relative pitch R may also be paraphrased as a numerical value of a pitch bend of the reference voice (fluctuation amount of the pitch PA of the reference voice with reference to the pitch PB of the synthesized voice). As exemplified in
The transition generation unit 32 sets a transition (hereinafter referred to as "synthesized pitch transition") CP of the pitch PB of the synthesized voice generated by the voice synthesis to which the reference music track data XB is applied. In concatenative voice synthesis to which the reference music track data XB is applied, the synthesized pitch transition (pitch curve) CP is generated in accordance with the pitches and the pronunciation periods designated by the reference music track data XB for the respective notes, and phonetic pieces corresponding to the lyrics on the respective notes are adjusted to the pitches PB of the synthesized pitch transition CP and concatenated with each other, thereby generating the synthesized voice. The transition generation unit 32 generates the synthesized pitch transition CP in accordance with the reference music track data XB on the reference music track. As understood from the above description, the synthesized pitch transition CP corresponds to a model (typical) trajectory of the pitch PB of a singing voice for the reference music track. Note that, the synthesized pitch transition CP may be used for the voice synthesis as described above, but the voice analysis device 100 according to the first embodiment does not need to actually generate the synthesized voice as long as the synthesized pitch transition CP corresponding to the reference music track data XB is generated.
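As an illustrative sketch only (the precise curve-generation algorithm of the transition generation unit 32 is not restricted to this, and the function and parameter names below are hypothetical), a continuously fluctuating pitch transition can be obtained by sampling the note pitches designated by the music track data onto a frame grid and smoothing the steps at the note boundaries:

import numpy as np

def synthesized_pitch_transition(notes, frame_rate=100.0, smooth_frames=15):
    # Sketch of a pitch curve CP: piecewise-constant note pitches smoothed so
    # that the transition becomes continuous on the time axis.
    # notes: list of (pitch, start_sec, duration_sec) tuples, assumed contiguous.
    end = max(start + dur for _, start, dur in notes)
    t = np.arange(0.0, end, 1.0 / frame_rate)
    cp = np.zeros_like(t)
    for pitch, start, dur in notes:
        cp[(t >= start) & (t < start + dur)] = pitch
    kernel = np.ones(smooth_frames) / smooth_frames      # simple moving average
    return np.convolve(cp, kernel, mode="same")

In practice the curve would also model note-to-note portamento and rests; the moving average above merely stands in for whatever continuous interpolation the voice synthesis engine applies.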
The pitch detection unit 34 illustrated in
Specifically, the interpolation processing unit 36 sets the time series of the pitch PA within an interpolation section (first interpolation section) ηA2, which has a predetermined length and is located on a start point side of the voiceless section σ0, in accordance with the time series of the pitch PA within a section (first section) ηA1, which has a predetermined length and is located on an end point side of the voiced section σ1. For example, each numerical value on an approximate line (for example, regression line) L1 of the time series of the pitch PA within the section ηA1 is set as the pitch PA within the interpolation section ηA2 immediately after the section ηA1. That is, the time series of the pitch PA within the voiced section σ1 is also extended to the voiceless section σ0 so that the transition of the pitch PA continues across from the voiced section σ1 (section ηA1) to the subsequent voiceless section σ0 (interpolation section ηA2).
Similarly, the interpolation processing unit 36 sets the time series of the pitch PA within an interpolation section (second interpolation section) ηB2, which has a predetermined length and is located at an end point side of the voiceless section σ0, in accordance with the time series of the pitch PA within a section (second section) ηB1, which has a predetermined length and is located on a start point side of the voiced section σ2. For example, each numerical value on an approximate line (for example, regression line) L2 of the time series of the pitch PA within the section ηB1 is set as the pitch PA within the interpolation section ηB2 immediately before the section ηB1. That is, the time series of the pitch PA within the voiced section σ2 is also extended to the voiceless section σ0 so that the transition of the pitch PA continues across from the voiced section σ2 (section ηB1) to the voiceless section σ0 (interpolation section ηB2) immediately before. Note that, the section ηA1 and the interpolation section ηA2 are set to a mutually equal time length, and the section ηB1 and the interpolation section ηB2 are set to a mutually equal time length. However, the time length may be different between the respective sections. Further, the time length may be either different or the same between the section ηA1 and the section ηB1, and the time length may be either different or the same between the interpolation section ηA2 and the interpolation section ηB2.
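As a concrete illustration of the above-mentioned interpolation, the following sketch (hypothetical function and parameter names; a simplified stand-in for the interpolation processing unit 36, assuming the pitch series and voiced mask are given as NumPy arrays) fits a regression line to the pitch PA within the section immediately before or after a voiceless section and extends it into the adjoining interpolation section of equal length:

import numpy as np

def interpolate_voiceless(pitch, voiced_mask, span=20):
    # pitch       : detected pitches PA per frame (values in voiceless frames are ignored)
    # voiced_mask : True where a pitch PA was detected
    # span        : frame length used for the sections eta_A1/eta_A2 and eta_B1/eta_B2
    out = pitch.astype(float).copy()
    n = len(pitch)
    idx = np.arange(n)
    changes = np.flatnonzero(np.diff(voiced_mask.astype(int)))
    starts = changes[voiced_mask[changes]] + 1        # voiced -> voiceless boundaries
    ends = changes[~voiced_mask[changes]] + 1         # voiceless -> voiced boundaries
    for s in starts:                                  # extend line L1 over eta_A1 into eta_A2
        a = max(0, s - span)
        m = voiced_mask[a:s]
        if m.sum() >= 2:
            coef = np.polyfit(idx[a:s][m], out[a:s][m], 1)
            out[s:min(n, s + span)] = np.polyval(coef, idx[s:min(n, s + span)])
    for e in ends:                                    # extend line L2 over eta_B1 into eta_B2
        b = min(n, e + span)
        m = voiced_mask[e:b]
        if m.sum() >= 2:
            coef = np.polyfit(idx[e:b][m], out[e:b][m], 1)
            out[max(0, e - span):e] = np.polyval(coef, idx[max(0, e - span):e])
    return out

Note that the second loop may overwrite part of what the first loop filled when a voiceless section is shorter than two interpolation sections; how such an overlap is resolved is left open in this sketch.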
As exemplified in
The characteristics analysis unit 24 illustrated in
The section setting unit 42 divides the time series of the relative pitch R generated by the variable extraction unit 22 into a plurality of sections (hereinafter referred to as “unit section”) UA on the time axis. Specifically, as understood from
The analysis processing unit 44 illustrated in
The analysis processing unit 44 generates the decision tree T[n] by machine learning (decision tree learning) for successively determining whether or not a predetermined condition (question) relating to the unit section UA is successful. The decision tree T[n] is a classification tree for classifying (clustering) the unit sections UA into a plurality of sets, and is expressed as a tree structure in which a plurality of nodes ν (νa, νb, and νc) are concatenated with one another over a plurality of tiers. As exemplified in
At the root node νa and the internal nodes νb, for example, it is determined whether conditions (contexts) are met, such as whether the unit section UA is a silent section, whether the note within the unit section UA is shorter than a sixteenth note, whether the unit section UA is located on the start point side of the note, and whether the unit section UA is located on the end point side of the note. A time point to stop the classification of the respective unit sections UA (time point to determine the decision tree T[n]) is determined in accordance with, for example, a minimum description length (MDL) criterion. A structure (for example, the number of internal nodes νb and the conditions thereof, and the number K of leaf nodes νc) of the decision tree T[n] is different between the respective statuses St of the probabilistic model M.
The variable information D[n] on the unit data z[n] illustrated in
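By way of illustration only, the unit data z[n] for one status St may be pictured with data structures like the following (hypothetical names and layout; the stored form of the singing characteristics data Z is not limited to this): a decision tree whose K leaf nodes each hold one variable group Ω[k] consisting of probability distributions for the relative pitch R, its time variation, its second derivative, and the duration of the status St.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Gaussian:
    mean: float
    var: float

@dataclass
class VariableGroup:              # one group Omega[k] per leaf node nu_c
    w0: Gaussian                  # occurrence probability of the relative pitch R
    w1: Gaussian                  # occurrence probability of the time variation of R
    w2: Gaussian                  # occurrence probability of the second derivative of R
    wd: Gaussian                  # duration of the status St

@dataclass
class Node:                       # decision tree T[n] of the n-th status St
    question: Optional[str] = None           # context condition at nu_a / nu_b
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    leaf: Optional[VariableGroup] = None     # set only on leaf nodes nu_c

    def classify(self, section, answer: Callable) -> VariableGroup:
        # Follow the tree for one unit section, using a caller-supplied
        # predicate answer(section, question) -> bool.
        node = self
        while node.leaf is None:
            node = node.yes if answer(section, node.question) else node.no
        return node.leaf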
On the other hand, the section setting unit 42 refers to the reference music track data XB, so as to divide the reference music track into the plurality of unit sections UA for each segment (SA5). The analysis processing unit 44 generates the decision tree T[n] for each status St of the probabilistic model M by the machine learning to which each of the unit sections UA is applied (SA6), and generates the variable information D[n] corresponding to the relative pitch R within each of the unit sections UA classified into each of the leaf nodes νc of the decision tree T[n] (SA7). Then, the analysis processing unit 44 stores, on the storage device 14, the singing characteristics data Z including the unit data z[n], which includes the decision tree T[n] generated in Step SA6 and the variable information D[n] generated in Step SA7, for each of the statuses St of the probabilistic model M (SA8). The above-mentioned operation is repeated for each combination of the reference singer (reference voice data XA) and the reference music track data XB, so as to accumulate, on the storage device 14, a plurality of pieces of the singing characteristics data Z corresponding to the mutually different reference singers.
(Voice Synthesis Device 200)
As described above, the voice synthesis device 200 illustrated in
The display device 56 (for example, liquid crystal display panel) displays an image as instructed by the processor unit 52. The input device 57 is an operation device for receiving an instruction issued to the voice synthesis device 200 by a user, and includes, for example, a plurality of operators to be operated by the user. Note that, a touch panel formed integrally with the display device 56 may be employed as the input device 57. The sound emitting device 58 (for example, speakers and headphones) reproduces, as a sound, the voice signal V generated by the voice synthesis to which the singing characteristics data Z is applied.
The storage device 54 stores programs (GB1, GB2, and GB3) executed by the processor unit 52 and various kinds of data (phonetic piece group YA and synthesis-purpose music track data YB) used by the processor unit 52. A known recording medium such as a semiconductor recording medium or a magnetic recording medium or a combination of a plurality of kinds of recording medium may be arbitrarily employed as the storage device 54. The singing characteristics data Z generated by the voice analysis device 100 is transferred from the voice analysis device 100 to the storage device 54 of the voice synthesis device 200 through the intermediation of, for example, a communication network such as the Internet or a portable recording medium. A plurality of pieces of singing characteristics data Z corresponding to separate reference singers may be stored in the storage device 54.
The storage device 54 according to the first embodiment stores the phonetic piece group YA and the synthesis-purpose music track data YB. The phonetic piece group YA is a set (library for voice synthesis) of a plurality of phonetic pieces used as materials for the concatenative voice synthesis. The phonetic piece is a phoneme (for example, vowel or consonant) serving as a minimum unit for distinguishing a linguistic meaning or a phoneme chain (for example, diphone or triphone) that concatenates a plurality of phonemes. Note that, an utterer of each phonetic piece and the reference singer may be either different or the same. The synthesis-purpose music track data YB expresses a musical notation of a music track (hereinafter referred to as “synthesis-purpose music track”) to be subjected to the voice synthesis. Specifically, the synthesis-purpose music track data YB is time-series data (for example, VSQ-format file) for designating the pitch, the pronunciation period, and the lyric for each of the notes forming the synthesis-purpose music track in time series.
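For orientation, the per-note content of the synthesis-purpose music track data YB can be pictured as a time series of note records such as the following (hypothetical field names chosen for this sketch; an actual VSQ-format file is more elaborate):

from dataclasses import dataclass

@dataclass
class Note:              # one note designated by the synthesis-purpose music track data YB
    pitch: int           # designated pitch (note number)
    start: float         # start of the pronunciation period, in seconds
    duration: float      # length of the pronunciation period, in seconds
    lyric: str           # lyric assigned to the note

synthesis_track_yb = [
    Note(pitch=60, start=0.0, duration=0.5, lyric="sa"),
    Note(pitch=62, start=0.5, duration=0.5, lyric="ku"),
    Note(pitch=64, start=1.0, duration=1.0, lyric="ra"),
]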
The storage device 54 according to the first embodiment stores an editing program GB1, a characteristics giving program GB2, and a voice synthesis program GB3. The editing program GB1 is a program (score editor) for creating and editing the synthesis-purpose music track data YB. The characteristics giving program GB2 is a program for applying the singing characteristics data Z to the voice synthesis, and is provided as, for example, plug-in software for enhancing a function of the editing program GB1. The voice synthesis program GB3 is a program (voice synthesis engine) for generating the voice signal V by executing the voice synthesis. Note that, the characteristics giving program GB2 may also be integrated partially with the editing program GB1 or the voice synthesis program GB3.
The processor unit 52 executes the programs (GB1, GB2, and GB3) stored in the storage device 54 and realizes a plurality of functions (an information editing unit 62, a variable setting unit 64, and a voice synthesis unit 66) for editing the synthesis-purpose music track data YB and for generating the voice signal V. The information editing unit 62 is realized by the editing program GB1, the variable setting unit 64 is realized by the characteristics giving program GB2, and the voice synthesis unit 66 is realized by the voice synthesis program GB3. Note that, a configuration in which the respective functions of the processor unit 52 are distributed to a plurality of devices or a configuration in which a part of the functions of the processor unit 52 is realized by a dedicated electronic circuit (for example, DSP) may also be employed.
The information editing unit 62 edits the synthesis-purpose music track data YB in accordance with an instruction issued through the input device 57 by the user. Specifically, the information editing unit 62 displays a musical notation image 562 illustrated in
The user appropriately operates the input device 57 so as to instruct the startup of the characteristics giving program GB2 (that is, application of the singing characteristics data Z) and select the singing characteristics data Z on a desired reference singer from among the plurality of pieces of singing characteristics data Z within the storage device 54. The variable setting unit 64 illustrated in
Specifically, the variable setting unit 64 refers to the synthesis-purpose music track data YB and divides the synthesis-purpose music track into a plurality of unit sections UB on the time axis. Specifically, as understood from
Then, the variable setting unit 64 applies each unit section UB to the decision tree T[n] of the unit data z[n] corresponding to the n-th status St of the probabilistic model M within the singing characteristics data Z, to thereby identify one leaf node νc to which each unit section UB belongs from among K leaf nodes νc of the decision tree T[n], and uses the respective variables ω (ω0, ω1, ω2, and ωd) of the variable group Ω[k] corresponding to the one leaf node νc within the variable information D[n] to identify the time series of the relative pitch R. The above-mentioned processing is successively executed for each of the statuses St of the probabilistic model M, to thereby identify the time series of the relative pitch R within the unit section UB. Specifically, the duration of each status St is set in accordance with the variable ωd of the variable group Ω[k], and each relative pitch R is calculated so as to maximize the simultaneous probability of the occurrence probability of the relative pitch R defined by the variable ω0, the occurrence probability of the time variation ΔR of the relative pitch R defined by the variable ω1, and the occurrence probability of the second derivative value Δ2R of the relative pitch R defined by the variable ω2. The relative pitch transition CR over the entire range of the synthesis-purpose music track is generated by concatenating the time series of the relative pitch R on the time axis across the plurality of unit sections UB.
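The joint maximization described above is the standard maximum-likelihood parameter generation of a smooth trajectory from static and dynamic features used with HMM-based synthesis. The following minimal sketch (assuming diagonal Gaussians, simple regression windows, and per-frame means and variances already expanded from the variable groups Ω[k] over the state durations set by ωd; not the exact implementation of the variable setting unit 64) shows the underlying computation:

import numpy as np

def generate_relative_pitch(means, variances):
    # means, variances: arrays of shape (T, 3); columns correspond to the
    # distributions defined by w0 (R), w1 (delta R), and w2 (delta-delta R).
    # Returns the T-point time series of R that maximizes the joint probability.
    T = means.shape[0]
    W = np.zeros((3 * T, T))                    # stacks static/delta/delta-delta windows
    for t in range(T):
        W[3 * t, t] = 1.0                       # static value R[t]
        if 0 < t < T - 1:
            W[3 * t + 1, t - 1], W[3 * t + 1, t + 1] = -0.5, 0.5                   # delta
            W[3 * t + 2, t - 1], W[3 * t + 2, t], W[3 * t + 2, t + 1] = 1.0, -2.0, 1.0  # delta-delta
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)          # diagonal precisions
    A = W.T @ (prec[:, None] * W)               # W' Sigma^-1 W
    b = W.T @ (prec * mu)                       # W' Sigma^-1 mu
    return np.linalg.solve(A, b)                # most probable trajectory of R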
The information editing unit 62 adds the relative pitch transition CR generated by the variable setting unit 64 to the synthesis-purpose music track data YB within the storage device 54, and as exemplified in
The voice synthesis unit 66 illustrated in
The singing style of the reference singer (for example, a way of singing, such as expression contours, unique to the reference singer) is reflected on the relative pitch transition CR generated from the singing characteristics data Z, and hence the reproduced sound of the voice signal V corresponding to the synthesized pitch transition CP corrected by the relative pitch transition CR is perceived as the singing voice (that is, such a voice as obtained by the reference singer singing the synthesis-purpose music track) for the synthesis-purpose music track to which the singing style of the reference singer is given.
The processor unit 52 determines whether or not the startup (giving of the singing style corresponding to the singing characteristics data Z) of the characteristics giving program GB2 has been instructed by the user (SB2). When the startup of the characteristics giving program GB2 is instructed (SB2: YES), the variable setting unit 64 generates the relative pitch transition CR corresponding to the synthesis-purpose music track data YB at the current time point and the singing characteristics data Z selected by the user (SB3). The relative pitch transition CR generated by the variable setting unit 64 is displayed on the display device 56 as the transition image 564 in the next Step SB1. On the other hand, when the startup of the characteristics giving program GB2 has not been instructed (SB2: NO), the generation (SB3) of the relative pitch transition CR is not executed. Note that, the relative pitch transition CR is generated above by using the user's instruction as a trigger, but the relative pitch transition CR may also be generated in advance (for example, on the background) irrespective of the user's instruction.
The processor unit 52 determines whether or not the start of the voice synthesis (startup of the voice synthesis program GB3) has been instructed (SB4). When the start of the voice synthesis is instructed (SB4: YES), the voice synthesis unit 66 first generates the synthesized pitch transition CP in accordance with the synthesis-purpose music track data YB at the current time point (SB5). Second, the voice synthesis unit 66 corrects each pitch PB of the synthesized pitch transition CP in accordance with each relative pitch R of the relative pitch transition CR generated in Step SB3 (SB6). Third, the voice synthesis unit 66 generates the voice signal V by adjusting the phonetic pieces corresponding to the lyrics designated by the synthesis-purpose music track data YB within the phonetic piece group YA to the respective pitches PB of the synthesized pitch transition CP subjected to the correction in Step SB6 and concatenating the respective phonetic pieces with each other (SB7). When the voice signal V is supplied to the sound emitting device 58, the singing voice for the synthesis-purpose music track to which the singing style of the reference singer is given is reproduced. On the other hand, when the start of the voice synthesis has not been instructed (SB4: NO), the processing from Step SB5 to Step SB7 is not executed. Note that, the generation of the synthesized pitch transition CP (SB5), the correction of each pitch PB (SB6), and the generation of the voice signal V (SB7) may be executed in advance (for example, on the background) irrespective of the user's instruction.
The processor unit 52 determines whether or not the end of the processing has been instructed (SB8). When the end has not been instructed (SB8: NO), the processor unit 52 returns the processing to Step SB1 to repeat the above-mentioned processing. On the other hand, when the end of the processing is instructed (SB8: YES), the processor unit 52 brings the processing of
As described above, in the first embodiment, the relative pitch R corresponding to a difference between each pitch PB of the synthesized pitch transition CP generated from the reference music track data XB and each pitch PA of the reference voice is used to generate the singing characteristics data Z on which the singing style of the reference singer is reflected. Therefore, compared to a configuration in which the singing characteristics data Z is generated in accordance with the time series of the pitch PA of the reference voice, it is possible to reduce the number of necessary probabilistic models (the number of variable groups Ω[k] within the variable information D[n]). Further, the respective pitches PB of the synthesized pitch transition CP are continuous on the time axis, which is also advantageous in that, as described below in detail, a discontinuous fluctuation of the relative pitch R at a time point of the boundary between the respective notes that are different in pitch is suppressed.
Further, in the first embodiment, the voiceless section σ0 from which the pitch PA of the reference voice is not detected is filled in with a significant pitch PA. That is, the time length of the voiceless section σ0 of the reference voice in which the pitch PA does not exist is shortened. Therefore, it is possible to effectively suppress the discontinuous fluctuation of the relative pitch R within a voiced section other than a voiceless section νX of the reference music track (the synthesized voice) designated by the reference music track data XB. Particularly in the first embodiment, the pitch PA within the voiceless section σ0 is approximately set in accordance with the pitches PA within the voiced sections (σ1 and σ2) before and after the voiceless section σ0, and hence the above-mentioned effect of suppressing the discontinuous fluctuation of the relative pitch R is remarkable. Note that, as understood from
Note that, in the first embodiment, the respective unit sections U (UA or UB) obtained by dividing the reference music track or the synthesis-purpose music track for each unit of segment are expressed by one probabilistic model M, but it is also conceivable to employ a configuration (hereinafter referred to as “Comparative Example 2”) in which one note is expressed by one probabilistic model M. However, in Comparative Example 2, the notes are expressed by a mutually equal number of statuses St irrespective of the duration, and hence it is difficult to precisely express the singing style of the reference voice for the note having a long duration by the probabilistic model M. In the first embodiment, one probabilistic model M is given to the respective unit sections U (UA or UB) obtained by dividing the music track for each unit of segment. In the above-mentioned configuration, as the note has a longer duration, a total number of statuses St of the probabilistic model M that expresses the note increases. Therefore, compared to Comparative Example 2, there is an advantage in that the relative pitch R is controlled with precision irrespective of a length of the duration.
A second embodiment of the present invention is described below. Note that, components of which operations and functions are the same as those of the first embodiment in each of the embodiments exemplified below are denoted by the same reference numerals referred to in the description of the first embodiment, and a detailed description of each thereof is omitted appropriately.
The decision tree T[n] generated for each status St by the analysis processing unit 44 according to the second embodiment includes nodes ν for which conditions relating to a relationship between respective unit sections UA and the phrase Q including the respective unit sections UA are set. Specifically, it is determined at each internal node νb (or root node νa) whether or not the condition relating to the relationship between a note within the unit section U and each of the notes within the phrase Q is successful, as exemplified below:
whether or not the note within the unit section UA is located on the start point side within the phrase Q;
whether or not the note within the unit section UA is located on the end point side within the phrase Q;
whether or not a distance between the note within the unit section UA and the highest sound within the phrase Q exceeds a predetermined value;
whether or not a distance between the note within the unit section UA and the lowest sound within the phrase Q exceeds a predetermined value; and
whether or not a distance between the note within the unit section UA and the most frequent sound within the phrase Q exceeds a predetermined value.
The “distance” in each of the above-mentioned conditions may be either a distance on the time axis (time difference) or a distance on the pitch axis (pitch difference), and when a plurality of notes within the phrase Q are concerned, it may be, for example, the shortest distance from the note within the unit section UA. Further, the “most frequent sound” means the note having the largest number of times of pronunciation within the phrase Q, the longest pronunciation time, or the largest value obtained by multiplying the two.
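By way of illustration, helper functions such as the following (hypothetical names; a sketch that assumes the phrase Q is given as a list of (pitch, duration) pairs) evaluate the pitch-axis reading of the distance conditions and one reading of the “most frequent sound”:

def distance_to_highest(note_pitch, phrase):
    # Pitch-axis distance between the note within the unit section UA and the
    # highest sound within the phrase Q.
    return abs(max(pitch for pitch, _ in phrase) - note_pitch)

def most_frequent_sound(phrase):
    # "Most frequent sound" taken here as the pitch with the largest product of
    # pronunciation count and total pronunciation time within the phrase Q.
    counts, durations = {}, {}
    for pitch, duration in phrase:
        counts[pitch] = counts.get(pitch, 0) + 1
        durations[pitch] = durations.get(pitch, 0.0) + duration
    return max(counts, key=lambda p: counts[p] * durations[p])

# example (assumed data): check the distance condition against a 7-semitone threshold
phrase_q = [(60, 0.5), (64, 0.5), (67, 1.0), (64, 0.5)]
print(distance_to_highest(62, phrase_q) > 7)   # False
print(most_frequent_sound(phrase_q))           # 64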
The variable setting unit 64 of the voice synthesis device 200 divides the synthesis-purpose music track into the plurality of unit sections UB in the same manner as in the first embodiment, and further divides the synthesis-purpose music track into the plurality of phrases Q on the time axis. Then, as described above, the variable setting unit 64 applies each unit section UB to a decision tree in which the condition relating to the phrase Q is set for each of the nodes ν, to thereby identify one leaf node νc to which the each unit section UB belongs.
The second embodiment also realizes the same effect as that of the first embodiment. Further, in the second embodiment, the condition relating to a relationship between the unit section U (UA or UB) and the phrase Q is set for each node ν of the decision tree T[n]. Accordingly, it is advantageous in that it is possible to generate the synthesized voice that sounds auditorily natural in which the relationship between the note of each unit section U and each note within the phrase Q is taken into consideration.
The variable setting unit 64 of the voice synthesis device 200 according to a third embodiment of the present invention generates the relative pitch transition CR in the same manner as in the first embodiment, and further sets a control variable applied to the voice synthesis performed by the voice synthesis unit 66 to be variable in accordance with each relative pitch R of the relative pitch transition CR. The control variable is a variable for controlling a musical expression to be given to the synthesized voice. For example, a variable such as a velocity of the pronunciation or a tone (for example, clearness) is preferred as the control variable, but in the following description, the dynamics Dyn is exemplified as the control variable.
As understood from
Dyn=tanh(R×β/8192)×64+64 (A)
The coefficient β of Expression (A) is a variable for causing the ratio of a change in the dynamics Dyn to the relative pitch R to differ between the positive side and the negative side of the relative pitch R. Specifically, the coefficient β is set to four when the relative pitch R is a negative number, and set to one when the relative pitch R is a non-negative number (zero or a positive number). Note that, the numerical value of the coefficient β and the contents of Expression (A) are merely examples for the sake of convenience, and may be changed appropriately.
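Transcribed directly into code, Expression (A) with the asymmetric coefficient β reads as follows (a sketch; the constants are the ones given above):

import math

def dynamics_from_relative_pitch(r):
    # Dyn = tanh(R * beta / 8192) * 64 + 64, with beta = 4 for a negative
    # relative pitch R and beta = 1 otherwise, so the dynamics drop more
    # steeply on the negative side of R.
    beta = 4.0 if r < 0 else 1.0
    return math.tanh(r * beta / 8192.0) * 64.0 + 64.0

print(dynamics_from_relative_pitch(0))       # 64.0 at a relative pitch of zero
print(dynamics_from_relative_pitch(-1000))   # roughly 35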
The third embodiment also realizes the same effect as that of the first embodiment. Further, in the third embodiment, the control variable (dynamics Dyn) is set in accordance with the relative pitch R, which is advantageous in that the user does not need to manually set the control variable. Note that, the control variable (dynamics Dyn) is set in accordance with the relative pitch R in the above description, but the time series of the numerical value of the control variable may be expressed by, for example, a probabilistic model. Note that, the configuration of the second embodiment may be employed for the third embodiment.
When the condition for each node ν of the decision tree T[n] is appropriately set, a temporal fluctuation of the relative pitch R on which the characteristics of a vibrato of the reference voice have been reflected appears in the relative pitch transition CR corresponding to the singing characteristics data Z. However, when the relative pitch transition CR is generated by using the singing characteristics data Z, the periodicity of the fluctuation of the relative pitch R is not always guaranteed, and hence, as exemplified in part (A) of
Specifically, the variable setting unit 64 calculates a zero-crossing number of the derivative value ΔR of the relative pitch R of the relative pitch transition CR. The zero-crossing number of the derivative value ΔR of the relative pitch R corresponds to a total number of crest parts (maximum points) and trough parts (minimum points) on the time axis within the relative pitch transition CR. In a section in which the vibrato is given to the singing voice, the relative pitch R tends to fluctuate alternately between a positive number and a negative number at a suitable frequency. In consideration of the above-mentioned tendency, the variable setting unit 64 identifies, as the correction section B, a section in which the zero-crossing number of the derivative value ΔR within a unit time (that is, the number of crest parts and trough parts per unit time) falls within a predetermined range. However, a method of identifying the correction section B is not limited to the above-mentioned example. For example, a second half section of a note that exceeds a predetermined length (that is, a section to which the vibrato is likely to be given) among the plurality of notes designated by the synthesis-purpose music track data YB may be identified as the correction section B.
When the correction section B is identified, the variable setting unit 64 sets a period (hereinafter referred to as “target period”) τ of the corrected vibrato (SC3). The target period τ is, for example, a numerical value obtained by dividing the time length of the correction section B by the number (wave count) of crest parts or trough parts of the relative pitch R within the correction section B. Then, the variable setting unit 64 corrects each relative pitch R of the relative pitch transition CR so that the interval between the respective crest parts (or respective trough parts) of the relative pitch transition CR within the correction section B is closer to (ideally, matches) the target period τ (SC4). As understood from the above description, the intervals between the crest parts and the trough parts are non-uniform in the relative pitch transition CR before the correction as shown in part (A) of
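A simplified sketch of this correction (assuming the correction section B has already been identified and extracted as an array of relative pitches, and using a piecewise-linear time warp so that successive crest parts and trough parts become evenly spaced; hypothetical function name, not the exact procedure of the variable setting unit 64):

import numpy as np

def regularize_vibrato(r_section):
    # Equalize the spacing of crest and trough parts of the relative pitch R
    # within a correction section B by warping time toward uniform extrema.
    dr = np.diff(r_section)
    extrema = np.flatnonzero(np.sign(dr[:-1]) != np.sign(dr[1:])) + 1
    if len(extrema) < 2:
        return r_section                       # nothing to regularize
    n = len(r_section)
    src = np.concatenate(([0], extrema, [n - 1])).astype(float)   # original anchor times
    tau_half = (n - 1) / (len(src) - 1)        # uniform half-period between extrema
    dst = np.arange(len(src)) * tau_half       # evenly spaced target anchor times
    warped_time = np.interp(np.arange(n), dst, src)
    return np.interp(warped_time, np.arange(n), r_section)

Making the intervals between adjacent extrema uniform drives the crest-to-crest interval toward the target period τ described above.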
The fourth embodiment also realizes the same effect as that of the first embodiment. Further, in the fourth embodiment, the intervals between the crest parts and the trough parts of the relative pitch transition CR on the time axis become uniform. Accordingly, it is advantageous in that the synthesized voice to which an auditorily natural vibrato has been given is generated. Note that, the correction section B and the target period τ are set automatically (that is, irrespective of the user's instruction) in the above description, but the characteristics (section, period, or amplitude) of the vibrato may also be set variably in accordance with an instruction issued by the user. Further, the configuration of the second embodiment or the third embodiment may be employed for the fourth embodiment.
In the first embodiment, the decision tree T[n] independent for each of the statuses St of the probabilistic model M has been taken as an example. As understood from
As described above, in the fifth embodiment, N decision trees T[1] to T[N] are derivatively generated from the common basic decision tree T0 serving as an origin, and hence conditions (hereinafter referred to as “common conditions”) set for the respective nodes ν (root node νa and internal node νb) located on an upper layer are common across the N decision trees T[1] to T[N].
The fifth embodiment also realizes the same effect as that of the first embodiment. Where the decision trees T[n] are generated completely independently for the respective statuses St of the probabilistic model M, the characteristics of the time series of the relative pitch R within the unit section U may differ between preceding and succeeding statuses St, with the result that the synthesized voice may be a voice that gives an impression of sounding unnatural (for example, a voice that cannot be pronounced in actuality or a voice different from an actual pronunciation). In the fifth embodiment, the N decision trees T[1] to T[N] corresponding to the mutually different statuses St of the probabilistic model M are generated from the common basic decision tree T0. Thus, it is advantageous in that, compared to a configuration in which each of the N decision trees T[1] to T[N] is generated independently, a possibility that the characteristics of the transition of the relative pitch R excessively differ between adjacent statuses St is reduced, and the synthesized voice that sounds auditorily natural (for example, a voice that can be pronounced in actuality) is generated. It should be understood that a configuration in which the decision tree T[n] is generated independently for each of the statuses St of the probabilistic model M may be included within the scope of the present invention.
Note that, in the above description, the configuration in which the decision trees T[n] of the respective statuses St are partially common has been taken as an example, but all the decision trees T[n] of the respective statuses St may also be common (the decision trees T[n] are completely common among the statuses St). Further, the configuration of any one of the second embodiment to the fourth embodiment may be employed for the fifth embodiment.
In the above-mentioned embodiments, a case where the decision trees T[n] are generated by using the pitch PA detected from the reference voice for one reference music track has been taken as an example for the sake of convenience, but in actuality, the decision trees T[n] are generated by using the pitches PA detected from the reference voices for a plurality of mutually different reference music tracks. In the configuration in which the respective decision trees T[n] are generated from a plurality of reference music tracks as described above, the plurality of unit sections UA included in the mutually different reference music tracks can be classified into one leaf node νc of the decision tree T[n] in a coexisting state and may be used for the generation of the variable group Ω[k] of the one leaf node νc. On the other hand, in a scene in which the relative pitch transition CR is generated by the variable setting unit 64 of the voice synthesis device 200, the plurality of unit sections UB included in one note within the synthesis-purpose music track are classified into the mutually different leaf nodes νc of the decision trees T[n]. Therefore, tendencies of the pitches PA of the mutually different reference music tracks may be reflected on each of the plurality of unit sections UB corresponding to one note of the synthesis-purpose music track, and the synthesized voice (in particular, characteristics of the vibrato or the like) may be perceived to give the impression of sounding auditorily unnatural.
In view of the above-mentioned circumstances, in a sixth embodiment of the present invention, the characteristics analysis unit 24 (analysis processing unit 44) of the voice analysis device 100 generates the respective decision trees T[n] so that each of the plurality of unit sections UB included in one note (note corresponding to a plurality of segments) within the synthesis-purpose music track is classified into a leaf node νc corresponding to a common reference music track within the decision trees T[n] (that is, a leaf node νc into which only the unit sections UA of a common reference music track are classified when the decision tree T[n] is generated).
Specifically, in the sixth embodiment, the condition (context) set for each internal node νb of the decision tree T[n] is divided into two kinds: a note condition and a section condition. The note condition is a condition (condition relating to an attribute of one note) whose success or failure is determined for one note as a unit, while the section condition is a condition (condition relating to an attribute of one unit section U) whose success or failure is determined for one unit section U (UA or UB) as a unit.
Specifically, the note condition is exemplified by the following conditions (A1 to A3).
A1: condition relating to the pitch or the duration of one note including the unit section U
A2: condition relating to the pitch or the duration of the note before and after one note including the unit section U
A3: condition relating to a position (position on the time axis or the pitch axis) of one note within the phrase Q
Condition A1 is, for example, a condition as to whether the pitch or the duration of one note including the unit section U falls within a predetermined range. Condition A2 is, for example, a condition as to whether the pitch difference between one note containing the unit section U and a note immediately before or immediately after the one note falls within a predetermined range. Further, Condition A3 is, for example, a condition as to whether one note containing the unit section U is located on the start point side of the phrase Q or a condition as to whether the one note is located on the end point side of the phrase Q.
On the other hand, the section condition is, for example, a condition relating to the position of the unit section U relative to one note. For example, a condition as to whether or not the unit section U is located on the start point side of a note or a condition as to whether or not the unit section U is located on the end point side of the note is preferred as the section condition.
The first classification processing SD1 is processing for generating a decision tree (hereinafter referred to as “temporary decision tree”) TA[n] of
The second classification processing SD2 is processing for further branching the respective leaf nodes νc of the temporary decision tree TA[n] by using the above-mentioned section condition, to thereby generate the final decision tree T[n]. Specifically, as understood from
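To make the two-stage procedure concrete, the following self-contained sketch (a toy variance-reduction criterion stands in for the actual MDL-based learning; unit sections are represented as attribute dictionaries and conditions as predicate functions, all of which are hypothetical) grows a tree with note conditions only and then branches each resulting leaf with section conditions only:

import numpy as np

def grow(sections, questions, min_gain=1.0):
    # Greedy growth: split while the best question reduces the variance of the
    # mean relative pitch of the unit sections assigned to the node.
    values = np.array([s["mean_r"] for s in sections])
    best_q, best_gain, best_mask = None, min_gain, None
    for q in questions:
        mask = np.array([q(s) for s in sections])
        if mask.all() or not mask.any():
            continue
        gain = values.var() * len(values) - (
            values[mask].var() * mask.sum() + values[~mask].var() * (~mask).sum())
        if gain > best_gain:
            best_q, best_gain, best_mask = q, gain, mask
    if best_q is None:
        return {"leaf": sections}              # leaf node
    return {"question": best_q,
            "yes": grow([s for s, m in zip(sections, best_mask) if m], questions, min_gain),
            "no": grow([s for s, m in zip(sections, best_mask) if not m], questions, min_gain)}

def leaves(tree):
    if "leaf" in tree:
        yield tree
    else:
        yield from leaves(tree["yes"])
        yield from leaves(tree["no"])

def build_two_stage_tree(sections, note_questions, section_questions):
    # First classification SD1: note conditions only (temporary decision tree TA[n]).
    # Every unit section of one note answers a note condition identically, so the
    # sections of the same note stay together at this stage.
    tree = grow(sections, note_questions)
    # Second classification SD2: branch each leaf further with section conditions only.
    for leaf in list(leaves(tree)):
        leaf.update(grow(leaf.pop("leaf"), section_questions))
    return tree

# example (assumed attributes): one note condition and one section condition
sections = [
    {"mean_r": -30.0, "note_duration": 1.0, "at_note_start": True},
    {"mean_r": 10.0, "note_duration": 1.0, "at_note_start": False},
    {"mean_r": -25.0, "note_duration": 0.25, "at_note_start": True},
    {"mean_r": 15.0, "note_duration": 0.25, "at_note_start": False},
]
note_q = [lambda s: s["note_duration"] >= 0.5]
sect_q = [lambda s: s["at_note_start"]]
tree = build_two_stage_tree(sections, note_q, sect_q)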
On the other hand, in the same manner as in the first embodiment, the variable setting unit 64 of the voice synthesis device 200 applies the respective unit sections UB obtained by dividing the synthesis-purpose music track designated by the synthesis-purpose music track data YB to each decision tree T[n] generated by the above-mentioned procedure, to thereby classify the respective unit sections UB into one leaf node νc, and generates the relative pitch R of the unit section UB in accordance with the variable group Ω[k] corresponding to the one leaf node νc. As described above, the note condition is determined preferentially to the section condition in the decision tree T[n], and hence each of the plurality of unit sections UB included in one note of the synthesis-purpose music track is classified into each leaf node νc into which only each unit section UA of the common reference music track is classified when the decision tree T[n] is generated. That is, the variable group Ω[k] corresponding to the characteristics of the reference voice for the common reference music track is applied for generating the relative pitch R within the plurality of unit sections UB included in one note of the synthesis-purpose music track. Therefore, there is an advantage in that the synthesized voice that gives the impression of sounding auditorily natural is generated compared to the configuration in which the decision tree T[n] is generated without distinguishing the note condition from the section condition.
The configurations of the second embodiment to the fifth embodiment may be applied to the sixth embodiment in the same manner. Note that, when the configuration of the fifth embodiment, in which the condition for the upper layer of the decision tree T[n] is fixed, is applied to the sixth embodiment, the common condition of the fifth embodiment is fixedly set in the upper layer of the tree structure irrespective of whether the note condition or the section condition is concerned, and the note condition or the section condition is set, by the same method as that of the sixth embodiment, for each node ν located in a lower layer of each node ν for which the common condition is set.
When the decision tree T1[n] is generated, a large number of unit sections U are classified into one leaf node νc, and the characteristics are leveled, which gives superiority to the singing characteristics data Z1 in that the relative pitch R is stably generated for a variety of synthesis-purpose music track data YB compared to the singing characteristics data Z2. On the other hand, the classification of the unit sections U is fragmented in the decision tree T2[n], which gives superiority to the singing characteristics data Z2 in that a fine feature of the reference voice is expressed by the probabilistic model M compared to the singing characteristics data Z1.
By appropriately operating the input device 57, the user not only can instruct the voice synthesis (generation of the relative pitch transition CR) using each of the singing characteristics data Z1 and the singing characteristics data Z2, but also can instruct to mix the singing characteristics data Z1 and the singing characteristics data Z2. When the mixing of the singing characteristics data Z1 and the singing characteristics data Z2 is instructed, as exemplified in
Specifically, the variable setting unit 64 generates the singing characteristics data Z by interpolating (for example, interpolating the mean and variance of the probability distribution) the probability distribution defined by the variable group Ω[k] of the mutually corresponding leaf nodes νc between the decision tree T1[n] of the singing characteristics data Z1 and the decision tree T2[n] of the singing characteristics data Z2 in accordance with the mixture ratio λ. The generation of the relative pitch transition CR using the singing characteristics data Z and other such processing are the same as those of the first embodiment. Note that, the interpolation of the probabilistic model M defined by the singing characteristics data Z is also described in detail in, for example, M. Tachibana, et al., “Speech Synthesis with Various Emotional Expressions and Speaking Styles by Style Interpolation and Morphing”, IEICE TRANS. Information and Systems, E88-D, No. 11, p. 2484-2491, 2005.
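As a minimal numerical illustration of this interpolation (assuming one-dimensional Gaussian distributions per leaf node, and a mixture ratio λ in which 0 corresponds to the singing characteristics data Z1 and 1 to the singing characteristics data Z2; the direction of λ and the values below are assumptions of this sketch):

def mix_gaussians(g1, g2, lam):
    # Interpolate the (mean, variance) pairs of mutually corresponding leaf
    # nodes of T1[n] and T2[n] in accordance with the mixture ratio lam.
    (mean1, var1), (mean2, var2) = g1, g2
    return ((1.0 - lam) * mean1 + lam * mean2,
            (1.0 - lam) * var1 + lam * var2)

# relative-pitch distribution of one corresponding leaf node under Z1 and Z2 (made-up numbers)
z1_leaf = (-35.0, 400.0)
z2_leaf = (20.0, 900.0)
print(mix_gaussians(z1_leaf, z2_leaf, lam=0.5))   # (-7.5, 650.0) -- intermediate singing style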
Note that, it is also possible to employ back-off smoothing for dynamic size adjustment at a time of synthesizing the decision tree T[n]. However, the configuration in which the probabilistic model M is interpolated without using the back-off smoothing is advantageous in that there is no need to cause the tree structure (condition or arrangement of the respective nodes ν) to be common between the decision tree T1[n] and the decision tree T2[n], and in that only the probability distributions of the leaf nodes νc are interpolated (there is no need to consider a statistic of the internal nodes νb), resulting in a reduced arithmetic operation load. Note that, the back-off smoothing is also described in detail in, for example, Kataoka et al., “Decision-Tree Backing-off in HMM-Based Speech Synthesis”, The Institute of Electronics, Information and Communication Engineers, TECHNICAL REPORT OF IEICE SP2003-76 (2003-08).
The seventh embodiment also realizes the same effect as that of the first embodiment. Further, in the seventh embodiment, mixing the singing characteristics data Z1 and the singing characteristics data Z2 generates the singing characteristics data Z that indicates a singing style intermediate between the two, which is advantageous in that synthesized voices in a wider variety of singing styles can be generated compared to a configuration in which the relative pitch transition CR is generated solely by using the singing characteristics data Z1 or the singing characteristics data Z2. Note that the configurations of the second embodiment to the sixth embodiment may be applied to the seventh embodiment in the same manner.
Each of the embodiments exemplified above may be modified in various manners. Specific modes of modification are exemplified below. Two or more modes arbitrarily selected from the following examples may also be combined as appropriate.
(1) In each of the above-mentioned embodiments, the relative pitch transition CR (pitch bend curve) is calculated from the reference voice data XA and the reference music track data XB that are provided in advance for the reference music track, but the variable extraction unit 22 may acquire the relative pitch transition CR by an arbitrary method. For example, the relative pitch transition CR estimated from an arbitrary reference voice by using a known singing analysis technology may be acquired by the variable extraction unit 22 and applied to the generation of the singing characteristics data Z performed by the characteristics analysis unit 24. As the singing analysis technology used to estimate the relative pitch transition CR (pitch bend curve), it is preferable to use, for example, the technology disclosed in T. Nakano and M. Goto, "VocaListener 2: A Singing Synthesis System Able to Mimic a User's Singing in Terms of Voice Timbre Changes as well as Pitch and Dynamics", in Proceedings of the 36th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011), pp. 453-456, 2011.
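As a purely illustrative sketch (not the processing of VocaListener 2 or of the embodiments), the relative pitch transition CR may be understood as a frame-by-frame difference of the following kind; the function name, the assumption that both pitch sequences are already expressed in cents on a common frame grid, and the simple linear interpolation of voiceless frames are all assumptions introduced here.

```python
import numpy as np

def relative_pitch_transition(note_pitch_cents, reference_f0_cents):
    """Rough sketch of a pitch bend curve: the per-frame difference between
    the pitch of the reference voice and the pitch derived from the music
    track data.

    reference_f0_cents may contain NaN for voiceless frames; they are filled
    here by simple linear interpolation, standing in for the interpolation
    processing of voiceless sections described earlier.
    """
    f0 = np.asarray(reference_f0_cents, dtype=float)
    note = np.asarray(note_pitch_cents, dtype=float)
    frames = np.arange(len(f0))
    voiced = ~np.isnan(f0)
    f0_filled = np.interp(frames, frames[voiced], f0[voiced])
    return f0_filled - note  # time series of the relative pitch R
```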
(2) In each of the above-mentioned embodiments, the concatenative voice synthesis for generating the voice signal V by concatenating phonetic pieces with each other has been taken as an example, but a known technology may be arbitrarily employed for generating the voice signal V. For example, the voice synthesis unit 66 may generate a basic signal (for example, a sinusoidal signal indicating an utterance sound of a vocal cord) adjusted to each pitch PB of the synthesized pitch transition CP, to which the relative pitch transition CR generated by the variable setting unit 64 has been added, and execute filter processing (for example, filter processing for approximating the resonance inside an oral cavity) corresponding to the phonetic piece of the lyric designated by the synthesis-purpose music track data YB on the basic signal, to thereby generate the voice signal V.
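The following is a minimal sketch of such source-filter style generation, assuming a per-sample pitch sequence (the synthesized pitch transition CP with the relative pitch transition CR added, converted to Hz) and a fixed filter standing in for the resonance of one phonetic piece; the function name and the filter coefficients are hypothetical and do not reproduce the actual processing of the voice synthesis unit 66.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_segment(pitch_hz, filter_b, filter_a, sample_rate=22050):
    """Drive a sinusoidal basic signal at the designated pitch and filter it.

    pitch_hz: one pitch value (Hz) per output sample.
    filter_b, filter_a: coefficients of a (hypothetical) filter approximating
    the resonance inside the oral cavity for one phonetic piece.
    """
    pitch_hz = np.asarray(pitch_hz, dtype=float)
    phase = 2.0 * np.pi * np.cumsum(pitch_hz) / sample_rate
    source = np.sin(phase)                      # basic (excitation) signal
    return lfilter(filter_b, filter_a, source)  # segment of the voice signal V

# Example: 0.5 seconds at a constant 220 Hz through a simple one-pole filter.
voice = synthesize_segment(np.full(11025, 220.0), [1.0], [1.0, -0.95])
```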
(3) As described above in the first embodiment, the user of the voice synthesis device 200 can give an instruction to change the relative pitch transition CR by appropriately operating the input device 57. Such an instruction to change the relative pitch transition CR may also be reflected in the singing characteristics data Z stored in the storage device 14 of the voice analysis device 100.
(4) In each of the above-mentioned embodiments, the relative pitch R has been taken as an example of the feature amount of the reference voice, but using the relative pitch R as the feature amount is not essential to a configuration (for example, a configuration characterized by the generation of the decision tree T[n]) that is not premised on the object of suppressing the discontinuous fluctuation of the relative pitch R. For example, the feature amount acquired by the variable extraction unit 22 is not limited to the relative pitch R in the configuration of the first embodiment in which the music track is divided into the plurality of unit sections U (UA or UB) for each segment, the configuration of the second embodiment in which the phrase Q is taken into consideration in the condition for each node ν, the configuration of the fifth embodiment in which the N decision trees T[1] to T[N] are generated from the basic decision tree T0, the configuration of the sixth embodiment in which the decision tree T[n] is generated in the two stages of the first classification processing SD1 and the second classification processing SD2, or the configuration of the seventh embodiment in which a plurality of pieces of singing characteristics data Z are mixed. For example, the variable extraction unit 22 may extract the pitch PA of the reference voice, and the characteristics analysis unit 24 may generate the singing characteristics data Z that defines the probabilistic model M corresponding to the time series of the pitch PA.
The voice analysis device according to each of the above-mentioned embodiments may be realized by hardware (an electronic circuit) such as a digital signal processor (DSP) dedicated to sound signal processing, or may be realized through cooperation between a general-purpose arithmetic processing device such as a central processing unit (CPU) and a program. The program according to the present invention may be installed on a computer by being provided in a form of being stored in a computer-readable recording medium. The recording medium is, for example, a non-transitory recording medium, preferred examples of which include an optical recording medium (optical disc) such as a CD-ROM, and may be a known recording medium of an arbitrary format such as a semiconductor recording medium or a magnetic recording medium. Further, for example, the program according to the present invention may be installed on the computer by being provided in a form of being distributed through a communication network. Further, the present invention is also defined as an operation method (voice analysis method) for the voice analysis device according to each of the above-mentioned embodiments.
Nakano, T. et al. (2011). "VocaListener 2: A Singing Synthesis System Able to Mimic a User's Singing in Terms of Voice Timbre Changes as well as Pitch and Dynamics," in Proceedings of the 36th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011), pp. 453-456.
Tachibana, M. et al. (2005). "Speech Synthesis with Various Emotional Expressions and Speaking Styles by Style Interpolation and Morphing," IEICE Trans. Information and Systems, E88-D, No. 11, pp. 2484-2491.
European Search Report dated Nov. 11, 2015, for EP Application No. 15185625.9, four pages.
European Search Report dated Dec. 1, 2015, for EP Application No. 15185624.2, six pages.
Stables, R. et al. (Mar. 9, 2011). "Fundamental Frequency Modulation in Singing Voice Synthesis," Speech, Sound and Music Processing: Embracing Research in India, Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 104-119.