Embodiments described herein relate generally to a speech synthesis device, a speech synthesis method, and a computer program product.
In speech synthesis, regarding the speaker of synthetic speech to be generated, aside from selecting a speaker from a small number of candidates provided in advance, there is a demand for newly creating the speaker individuality that is suitable for the contents to be read or for newly creating the speaker individuality that is unique to the user. As a way to meet such a demand, for example, a technology has been proposed that enables creation of new speaker individualities by manipulating the parameters related to speaker individuality.
As such technology grows more sophisticated and users become able to freely create various highly original speaker individualities, demand is expected to rise for exclusively using a newly-created speaker individuality as one's own distinctive speaker individuality. However, such a demand cannot be met, because a speaker individuality that is identical or similar to the one created by a particular user may be accidentally created by another user and used in actual products/services.
A speech synthesis device according to an embodiment includes a speech synthesizing unit, a speaker parameter storing unit, an availability determining unit, and a speaker parameter control unit. Based on a speaker parameter value representing a set of values of parameters related to the speaker individuality, the speech synthesizing unit is capable of controlling the speaker individuality of synthesized speech. The speaker parameter storing unit is used to store already-registered speaker parameter values. Based on the result of comparing an input speaker parameter value with each already-registered speaker parameter value, the availability determining unit determines the availability of the input speaker parameter value. The speaker parameter control unit prohibits or restricts the use of the input speaker parameter value that is determined to be unavailable (unusable) by the availability determining unit.
Exemplary embodiments of a speech synthesis device, a speech synthesis method, and a computer program product are described below in detail with reference to the accompanying drawings. In the following explanation, the constituent elements having identical functions are referred to by the same reference numerals, and the redundant explanation is not repeated.
The speech synthesizing unit 10 receives input of text information, and generates a speech waveform of the synthetic speech using various models and rules stored in the speech synthesis model storing unit 20. At that time, if a speaker parameter value representing the values of the parameters related to the speaker individuality is also input from the speaker parameter control unit 40, then the speech synthesizing unit 10 generates a speech waveform while controlling the speaker individuality according to the input speaker parameter value. The speaker individuality represents the features of the voice unique to the speaker and, for example, has a plurality of factors such as age, brightness, hardness, and clarity. The speaker parameter value represents the set of values corresponding to such factors of the speaker individuality.
The speech synthesis model storing unit 20 is used to store an acoustic model formed by modeling the acoustic features of speech; a prosody model formed by modeling the prosody such as intonation/rhythm; and a variety of other information required in speech synthesis. Moreover, in the speech synthesis device according to the first embodiment, a model required in controlling the speaker individuality is also stored in the speech synthesis model storing unit 20.
In a speech synthesis method based on the hidden Markov model (HMM), the prosody model and the acoustic model stored in the speech synthesis model storing unit 20 are formed by modeling the correspondence between text information, which is extracted from texts, and the prosodic or acoustic parameter sequences. Generally, the text information consists of phonological information, which corresponds to the reading and the accent of a text, and language information such as phrase boundaries and parts of speech. A model is configured with: a decision tree in which each parameter is clustered on a state-by-state basis according to the phonological/language environment; and the probability distribution of parameters assigned to each leaf node of the decision tree.
The prosody parameters include a pitch parameter indicating the pitch of the voice and the duration length indicating the length of the sound. The acoustic parameters include a spectral parameter indicating the features of the vocal tract and an aperiodic index indicating the extent of aperiodicity of the source signal. Herein, a state implies the internal state attained when the temporal change of each parameter is modeled using the HMM. Usually, since each phoneme section is modeled using the HMM of three to five states that make transition from left to right without any back tracking, the phoneme section includes three to five states. Herein, for example, in the decision tree for the first state of the pitch parameter, the probability distribution of the pitch values in the leading section of the phoneme sections is subjected to clustering according to the phonological/language environment and, by tracing the decision tree based on the phonological/language information related to the target phoneme section, the probability distribution of the pitch parameter of the leading section of those phonemes can be obtained. It is often the case that the normal distribution is used as the probability distribution of parameters and, in that case, the distribution is expressed using the average vector, which represents the center of the distribution, and the covariance matrix, which indicates the spread of the distribution.
In the speech synthesizing unit 10, based on the input text information, the probability distribution with respect to each state of each parameter is selected in the decision tree described above; parameter sequences having the highest probability are generated based on the probability distributions; and a speech waveform is generated based on those parameter sequences. In a typical HMM-based method, a source waveform is generated based on the generated pitch parameters and the aperiodic index, and then a speech waveform is generated by convolving the source waveform with a vocal tract filter whose characteristics change over time according to the generated spectral parameters.
In the speech synthesizing unit 10 of the speech synthesis device according to the first embodiment, the speaker individuality can be controlled as a result of specification of the speaker parameter value by the speaker parameter control unit 40. As a method for implementing that control, for example, as disclosed in Patent Literature 1, the desired speaker individuality can be achieved as follows: a plurality of acoustic models formed by modeling the voices of a plurality of speakers having different voice qualities is stored in the speech synthesis model storing unit 20; a few of the acoustic models are selected according to the specified speaker parameter value; and the acoustic parameters of the selected acoustic models are interpolated using the weighted sum.
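As a rough illustration of the interpolation approach, the sketch below blends the mean vectors of a few selected acoustic models by a weighted sum. The function name `interpolate_means`, the use of plain mean vectors, and the normalisation of the weights are assumptions made for illustration only; the method disclosed in Patent Literature 1 may define the interpolation differently.

```python
def interpolate_means(means, weights):
    """Blend per-speaker mean vectors by a normalised weighted sum (illustrative)."""
    if not means or len(means) != len(weights):
        raise ValueError("one weight per selected model is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(means[0])
    # Assumption: weights are normalised so that they sum to 1.
    return [
        sum((w / total) * m[d] for m, w in zip(means, weights))
        for d in range(dim)
    ]

# Two hypothetical speaker models with two-dimensional mean vectors.
blended = interpolate_means([[1.0, 2.0], [3.0, 6.0]], [1.0, 3.0])
```

In this sketch the second model receives three times the weight of the first, so the blended mean lies closer to the second model.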
Alternatively, the control of the speaker individuality can be implemented using the speech synthesizing unit 10 and the speech synthesis model storing unit 20 having a configuration as illustrated in
The base model can be a model called an average voice model that expresses the average speaker individuality of a plurality of speakers, or can be a model that expresses the speaker individuality of a particular speaker. As far as the specific configuration of the base model is concerned, for example, in an identical manner to the prosody model or the acoustic model obtained according to the method based on the HMM, the base model is configured with: a decision tree in which each parameter is clustered on a state-by-state basis according to the phonological/language environment; and the probability distribution of parameters assigned to each leaf node of the decision tree.
The speaker individuality control model can also be configured with a decision tree and the probability distribution assigned to each leaf node of the decision tree. However, in this model, the probability distribution represents the differences in the prosody/acoustic parameters attributed to the differences in the factors of the speaker individuality. More particularly, in this model, following sub-models are included: an age model obtained by modeling the differences in the prosody/acoustic parameters attributed to the differences in the age; a brightness model obtained by modeling the differences in the prosody/acoustic parameters attributed to the differences in the brightness of voice; a hardness model obtained by modeling the differences in the prosody/acoustic parameters attributed to the differences in the hardness of voice; and a clarity model obtained by modeling the differences in the prosody/acoustic parameters attributed to the differences in the clarity of voice.
The speech synthesizing unit 10 having the configuration illustrated in
Herein, the weight of a sub-model is obtained by the weight setting unit 15 by conversion of the speaker parameter value assigned by the speaker parameter control unit 40. A specific example is illustrated in
The adding unit 12 performs the abovementioned addition operation in each state of each parameter, and generates a sequence of probability distributions for which a weighted addition is performed.
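The weighted addition can be pictured with the following minimal sketch, which shifts the mean of a base-model distribution by the weighted difference means of the sub-models for a single state of a single parameter. The function name, the dictionary layout, and the decision to handle only means (not covariances) are assumptions for illustration; the exact addition rule is not reproduced here.

```python
def add_weighted_submodels(base_mean, submodel_means, weights):
    """Return the base mean shifted by weighted sub-model difference means."""
    result = list(base_mean)
    for name, weight in weights.items():
        diff_mean = submodel_means[name]  # difference attributed to one factor
        for d in range(len(result)):
            result[d] += weight * diff_mean[d]
    return result

# Hypothetical values for one state of one parameter.
base = [100.0, 0.5]
subs = {"age": [10.0, 0.0], "brightness": [0.0, 0.2]}
shifted = add_weighted_submodels(base, subs, {"age": 0.5, "brightness": -1.0})
```

A weight of zero leaves the base model unchanged for that factor, while a negative weight moves the distribution in the opposite direction of the modeled difference.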
Regarding each parameter such as the spectral parameter and the pitch parameter, based on the sequence of probability distributions assigned by the adding unit 12, the parameter generating unit 13 generates a parameter sequence having the maximum probability. Based on the generated parameter sequence, the waveform generating unit 14 generates a speech waveform of the synthetic speech.
As described above, the speech synthesizing unit 10 having the configuration illustrated in
Returning to the explanation with reference to
The speaker parameter control unit 40 performs operations related to the speaker parameter value in coordination with the display/input control unit 30 and the availability determining unit 60. For example, when the speaker parameter value that is input by the user is received from the display/input control unit 30, the speaker parameter control unit 40 sends the speaker parameter value and the user information to the availability determining unit 60 and instructs it to determine the availability of the speaker parameter value. If the input speaker parameter value is determined to be available (usable), then the speaker parameter control unit 40 sends the speaker parameter value to the speech synthesizing unit 10 and enables its use in speech synthesis. On the other hand, if the speaker parameter value input by the user is determined to be unavailable, then the speaker parameter control unit 40 prohibits or restricts the use of that speaker parameter value and sends information about the prohibition of use or restriction on use to the display/input control unit 30. Herein, restriction on use implies that the use is allowed subject to a condition. Meanwhile, when an instruction for calling the already-registered speaker parameter values is given by the display/input control unit 30, the speaker parameter control unit 40 identifies the user, retrieves the corresponding already-registered speaker parameter values from the speaker parameter storing unit 50, and sends them to the display/input control unit 30 or the speech synthesizing unit 10.
The speaker parameter storing unit 50 is used to store the already-registered speaker parameter values that are held by each user. In the first embodiment, it is assumed that the speaker parameter values are registered by a device other than the speech synthesis device illustrated in
In
The availability determining unit 60 receives from the speaker parameter control unit 40 the input of speaker parameter value and user information as input by a user; collates the input information with the already-registered speaker parameter values and the supplementary information; and determines the availability of the input speaker parameter value and sends the determination result to the speaker parameter control unit 40.
Explained below with reference to
Subsequently, the availability determining unit 60 refers to the speaker parameter storing unit 50 and obtains the already-registered speaker parameter value and the supplementary information for the speaker individuality ID “j” (Step S103). Then, the system control proceeds to Step S104. Herein, regarding the speaker individuality ID “j”, the speaker parameter value is assumed to be P(j)={pj(0), pj(1), pj(2), . . . , pj(C−1)}. Meanwhile, N represents the total number of already-registered speaker parameter values that are stored in the speaker parameter storing unit 50.
At Step S104, based on the user information obtained at Step S101 and the supplementary information obtained at Step S103, the availability determining unit 60 determines whether or not the user who inputs the speaker parameter value is the owner of the already-registered speaker parameter value corresponding to the speaker individuality ID “j” (Step S104). If the user who inputs the speaker parameter value is the owner of the already-registered speaker parameter value corresponding to the speaker individuality ID “j” (Yes at Step S104), then the system control proceeds to Step S109. On the other hand, if the user who inputs the speaker parameter value is not the owner of the already-registered speaker parameter value corresponding to the speaker individuality ID “j” (No at Step S104), then the system control proceeds to Step S105.
At Step S105, based on the supplementary information obtained at Step S103, the availability determining unit 60 determines whether or not the use of the speaker parameter value by the user goes against the usage condition set for the already-registered speaker parameter value corresponding to the speaker individuality ID “j”. If the use does not go against the usage condition (No at Step S105), then the system control proceeds to Step S109. However, if the use goes against the usage condition (Yes at Step S105), then the system control proceeds to Step S106. The method for determining whether or not the use goes against the usage condition depends on the usage condition that is stored as supplementary information in the speaker parameter storing unit 50 for the already-registered speaker parameter value. For example, if the usage condition for the already-registered speaker parameter value corresponding to the speaker individuality ID “j” is set to unavailable, then it is determined that the use goes against the usage condition. Moreover, if the usage condition indicates that the use is allowed only for a predetermined period of time, then it is determined that the use does not go against the usage condition as long as the current time is within that predetermined period of time. However, if the current time is outside of the predetermined period of time, then it is determined that the use goes against the usage condition.
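The two usage conditions mentioned for Step S105 can be sketched as follows. The dictionary keys `kind`, `start`, and `end` and the function name are hypothetical; the actual layout of the supplementary information is not specified here.

```python
from datetime import datetime

def violates_usage_condition(condition, now):
    """Return True when use by a non-owner goes against the usage condition."""
    kind = condition["kind"]
    if kind == "unavailable":
        # Use by others is never allowed.
        return True
    if kind == "period":
        # Use is allowed only while the current time is inside the period.
        return not (condition["start"] <= now <= condition["end"])
    raise ValueError("unknown usage condition: " + kind)

period = {"kind": "period", "start": datetime(2024, 1, 1), "end": datetime(2024, 12, 31)}
inside = violates_usage_condition(period, datetime(2024, 6, 1))
outside = violates_usage_condition(period, datetime(2025, 1, 1))
```

A result of True corresponds to "Yes at Step S105", after which the difference calculation at Step S106 is performed.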
At Step S106, from the speaker parameter value received at Step S101 (i.e., the speaker parameter value input by the user) and from the already-registered speaker parameter value obtained at Step S103 (i.e., the already-registered speaker parameter value corresponding to the speaker individuality ID “j”), the availability determining unit 60 calculates a Diff(Pin, P(j)), which represents the difference between the two speaker parameter values, using a predetermined evaluation function. Then, the system control proceeds to Step S107.
At Step S107, the availability determining unit 60 compares the value of Diff(Pin, P(j)) calculated at Step S106 with a first threshold value representing a boundary of the range of already-registered speaker parameter values. If the value of Diff(Pin, P(j)) is equal to or smaller than the first threshold value (Yes at Step S107), that is, if the speaker parameter value input by the user is similar to the already-registered speaker parameter value corresponding to the speaker individuality ID “j”, then the availability determining unit 60 determines at Step S108 that the speaker parameter value input by the user is “unavailable” and sends the determination result to the speaker parameter control unit 40. This marks the end of the operations. On the other hand, if the value of Diff(Pin, P(j)) is greater than the first threshold value (No at Step S107), then the system control proceeds to Step S109.
At Step S109, the availability determining unit 60 checks whether j=N holds true, that is, checks whether collation has been completed for all already-registered speaker parameter values and supplementary information stored in the speaker parameter storing unit 50. If j=N does not hold true (No at Step S109), then the availability determining unit 60 increments the counter j of the speaker individuality ID at Step S110, and again performs the operations from Step S103 onward. On the other hand, if j=N (Yes at Step S109), then at Step S111 the availability determining unit 60 determines that the speaker parameter value input by the user is “available” and sends the determination result to the speaker parameter control unit 40. This marks the end of the operations.
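The collation loop of Steps S103 to S111 can be summarised in the following sketch. The record keys (`owner`, `value`, `threshold`), the function name, and the two callables passed in for the difference calculation and the usage-condition check are assumptions made so that the example is self-contained.

```python
def determine_availability(p_in, user, registered, diff, violates):
    """Return "available" or "unavailable" for the input value p_in."""
    for entry in registered:                  # Steps S103, S109, S110
        if entry["owner"] == user:            # Step S104: owner may always use it
            continue
        if not violates(entry, user):         # Step S105: use is permitted
            continue
        # Steps S106 and S107: compare the difference with the first threshold.
        if diff(p_in, entry["value"]) <= entry["threshold"]:
            return "unavailable"              # Step S108
    return "available"                        # Step S111

# Hypothetical registry with a single entry owned by "alice".
registered = [{"owner": "alice", "value": [30.0, 0.5], "threshold": 1.0}]
squared = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
always_violates = lambda entry, user: True
result = determine_availability([30.5, 0.5], "bob", registered, squared, always_violates)
```

Here the input of user "bob" falls within the registered range of the entry owned by "alice", so the sketch reports it as unavailable; the same input from "alice" would be available.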
Given below is the explanation about the difference Diff(P1, P2) that is used at Step S106 as the difference between two speaker parameter values P1 and P2. For example, as given below in Equation (1), Diff(P1, P2) can be defined as the weighted sum of the differences of the individual factors of the speaker individuality that constitute the speaker parameter value.
Diff(P1, P2)=Σ λ(k)·d(k)(p1(k), p2(k)) (k=0, 1, . . . , C−1) (1)
Here, P1 is represented as {p1(0), p1(1), p1(2), . . . , p1(C−1)} and P2 is represented as {p2(0), p2(1), p2(2), . . . , p2(C−1)}. Moreover, λ(k) represents the weight of the k-th element, and d(k) represents the difference at the k-th element. Regarding an element expressed as a continuous value, d(k)(p1(k), p2(k)) can be defined as the square error of p1(k) and p2(k). Regarding an element expressed as a discrete category, d(k)(p1(k), p2(k)) can be defined as “0” if p1(k) and p2(k) are identical and as “1” otherwise. Regarding the weight λ(k), it is desirable that the elements that have a large effect on the subjective differences in the speaker individualities have a proportionally large weight. For example, the differences in speaker individuality between speech generated with various combinations of P1 and P2 can be assessed subjectively, and the results can be subjected to multiple linear regression analysis to obtain the relationship between d(0)(p1(0), p2(0)), . . . , d(C−1)(p1(C−1), p2(C−1)) and the subjective assessment value; the coefficients of the resulting multiple linear equation can then be used as the weights.
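The weighted-sum definition of Diff(P1, P2) can be sketched as below, using the square error for continuous elements and a 0/1 mismatch for categorical elements. The function name, the `categorical` argument, and the example values (including the third, categorical element) are hypothetical and chosen only for illustration.

```python
def diff(p1, p2, weights, categorical=frozenset()):
    """Weighted sum of per-element differences between two parameter values."""
    if not (len(p1) == len(p2) == len(weights)):
        raise ValueError("p1, p2 and weights must have the same length")
    total = 0.0
    for k, (a, b, lam) in enumerate(zip(p1, p2, weights)):
        if k in categorical:
            d = 0.0 if a == b else 1.0   # discrete category: match or mismatch
        else:
            d = (a - b) ** 2             # continuous value: square error
        total += lam * d
    return total

# Elements 0 and 1 are continuous; element 2 is a hypothetical category.
value = diff([30, 0.5, "A"], [40, 0.5, "B"], [0.01, 1.0, 2.0], categorical={2})
```

With these example weights, the first element contributes 0.01 × 100, the second contributes nothing, and the categorical mismatch contributes 2.0.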
In the above example of Diff(P1, P2), it is assumed that each element independently affects the differences in the speaker individualities. However, if a neural network for estimating the difference Diff(P1, P2) is trained using a deep learning method on a large number of combinations of d(0)(p1(0), p2(0)), . . . , d(C−1)(p1(C−1), p2(C−1)) and the subjective assessment values, obtained by performing the abovementioned subjective assessment on a large scale, then it becomes possible to estimate a difference Diff(P1, P2) in which the interaction among the elements is also reflected to some extent.
The first threshold value that is used in the determination at Step S107 either can be a common value for all already-registered speaker parameter values stored in the speaker parameter storing unit 50 or can be a different value for each already-registered speaker parameter value. In the latter case, the supplementary information stored in the speaker parameter storing unit 50 contains not only the information about the owners and the usage conditions but also the first threshold values indicating the registration ranges of the already-registered speaker parameter values. For example, if an owner wishes to exclusively use a particular already-registered speaker parameter value over a wider range, he or she can register a larger first threshold value corresponding to that already-registered speaker parameter value so that the range determined to be unavailable is widened.
Given below is the explanation of an example of the interactive operations performed by the speech synthesis device in response to the user operations, along with explaining a specific example of the user interface that is provided by the display/input control unit 30 to the user.
Once the speech synthesis device according to the first embodiment is activated and when a user performs login according to a predetermined procedure; for example, a screen 100 illustrated in
From the pulldown menu 102 of the screen 100 illustrated in
Meanwhile, from the pulldown menu 102 of the screen 100 illustrated in
The radar chart 111 has, on the axis corresponding to each factor of the speaker individuality, an operator for changing the value corresponding to that factor. The user can operate the operators provided on the radar chart 111 and input the desired speaker parameter value. The synthetic speech in which the input speaker parameter value is reflected can be checked by inputting a text for trial listening in the text box 113 and pressing the “trial listening” button 114.
Moreover, after inputting the desired speaker parameter value using the radar chart 111, when the user inputs the user information in the text box 112 and presses the “use current settings” button 115, the speaker parameter value and the user information as input by the user are transferred from the display/input control unit 30 to the speaker parameter control unit 40. Upon receiving the speaker parameter value and the user information from the display/input control unit 30, the speaker parameter control unit 40 sends the speaker parameter value and the user information to the availability determining unit 60 and requests availability determination. Then, the availability determining unit 60 implements, for example, the method described earlier to determine the availability of the speaker parameter value input by the user, and sends the determination result to the speaker parameter control unit 40.
If the determination result obtained by the availability determining unit 60 indicates unavailability, then the speaker parameter control unit 40 sends information related to the prohibition of use or restriction on use to the display/input control unit 30. Then, the display/input control unit 30 reflects the information, which is received from the speaker parameter control unit 40, on the screen of the user interface. For example, when the information related to the prohibition of use is received from the speaker parameter control unit 40, the display/input control unit 30 displays, on the screen 110, a popup error message 116 notifying the user that the input speaker parameter value is not available. When an “OK” button 116a provided in the error message 116 is pressed, the display returns to the screen 110 illustrated in
Meanwhile, if the determination result obtained by the availability determining unit 60 indicates availability, then the screen of the interface changes from the screen 110 illustrated in
Using the screen 120, the user inputs in the text box 101 the text information to be subjected to speech synthesis, adjusts the voice quality parameters by operating the slide bars 103a, 103b, and 103c as may be necessary, and then presses the “synthesize” button 104. As a result, a speech waveform of the synthetic speech to which the speaker parameter value input by the user is applied gets generated by the speech synthesizing unit 10. Moreover, if the user presses the “store” button 105, the speech waveform of the synthetic speech as generated by the speech synthesizing unit 10 gets stored at a predetermined storage location.
Meanwhile, if the user performs an operation for selecting the option of “registered speaker individuality” from the pulldown menu 102 of the screen 100 illustrated in
When a user inputs the user information in the text box 131, a list of the already-registered speaker parameter values held by that user is displayed in a selectable manner. Subsequently, when the user selects the desired already-registered speaker parameter value from the pulldown menu 132, selects a text for trial listening in the text box 133, and presses the “trial listening” button 134; he or she becomes able to check the synthetic speech in which the selected already-registered speaker parameter value is reflected. Moreover, after selecting the desired already-registered speaker parameter value from the pulldown menu 132, when the user presses the “use current settings” button 135, the already-registered speaker parameter value that is selected by the user is set in the speaker parameter control unit 40, and the screen 130 illustrated in
Using the screen 140, the user inputs in the text box 101 the text information to be subjected to speech synthesis, adjusts the voice quality parameters by operating the slide bars 103a, 103b, and 103c as may be necessary, and then presses the “synthesize” button 104. As a result, a speech waveform of the synthetic speech to which the already-registered speaker parameter value selected by the user is applied gets generated by the speech synthesizing unit 10. Moreover, if the user presses the “store” button 105, the speech waveform of the synthetic speech as generated by the speech synthesizing unit 10 gets stored at a predetermined storage location.
Meanwhile, the explanation above is given about an example in which an already-registered speaker parameter value is selected and used without modification. Alternatively, the selected already-registered speaker parameter value can be further adjusted in the screen 110, which is illustrated in
In this way, as described above in detail with reference to specific examples, according to the first embodiment, based on the result of comparison of the input speaker parameter value with the already-registered speaker parameter values, the availability of the input speaker parameter value is determined and the speaker parameter value determined to be unavailable is prohibited or restricted for use. Hence, when the speaker parameter value representing the desired speaker individuality is registered, it becomes possible to exclusively use that desired speaker individuality.
Given below is the explanation of a second embodiment. In the first embodiment, the explanation is given on the premise that a speaker parameter value is registered using a device other than the speech synthesis device. However, if a speaker parameter value can be registered using the same speech synthesis device that sets and uses the speaker parameter value, user-friendliness is enhanced. In that regard, in the second embodiment, the speech synthesis device is equipped with the function of registering speaker parameter values.
In the second embodiment, using the user interface provided by the display/input control unit 30, a user can check the registrability of the input speaker parameter value and can give a registration request. When a user gives an instruction for checking the registrability, the display/input control unit 30 sends to the speaker parameter control unit 40 the instruction for checking the registrability and information such as the speaker parameter value to be registered and the user information, and then the speaker parameter control unit 40 sends all that information to the availability determining unit 60. In the second embodiment, the availability determining unit 60 has a function for determining the registrability and a function for calculating the registration fee. When the determination of registrability is requested by the speaker parameter control unit 40, the availability determining unit 60 determines the registrability by referring to the speaker parameter storing unit 50, calculates the registration fee in the case in which the speaker parameter value is registrable, and sends the result to the speaker parameter control unit 40. Then, the determination result and the registration fee for a registrable value are sent from the speaker parameter control unit 40 to the display/input control unit 30, and are then notified to the user via the user interface provided by the display/input control unit 30.
Regarding the speaker parameter value determined to be registrable, the user can give a registration request using the user interface provided by the display/input control unit 30. If a registration fee needs to be paid, then the billing processing unit 80 is notified about the registration fee so that it can perform billing with respect to the user. When the receipt of the registration fee is confirmed, the billing processing unit 80 notifies the display/input control unit 30 about the same. Then, the display/input control unit 30 sends the speaker parameter value, the user information, and the information related to the usage condition to the speaker parameter control unit 40. Subsequently, the speaker parameter control unit 40 sends that information along with a registration instruction to the speaker parameter registering unit 70. In response to the registration instruction received from the speaker parameter control unit 40, the speaker parameter registering unit 70 stores the specified speaker parameter value along with the supplementary information such as the user information and the usage condition in the speaker parameter storing unit 50.
The determination method by which the availability determining unit 60 determines registrability of the speaker parameter value is fundamentally identical to the determination method for determining the availability, except that the registration range of the speaker parameter value to be registered is taken into account in the registrability determination. The difference between the availability determination and the registrability determination is explained with reference to
In the registrability determination, if overlapping of the registration ranges is not allowed; then, in the determination that is equivalent to Step S107 in the flowchart illustrated in
Diff(Pin, P(j))≤(THRE(j)+THREin) (2)
Meanwhile, when the registration ranges are overlapping, if the use by the owner of the already-registered speaker parameter value is to be given priority in the overlapped range; then, in an identical manner to the availability determination, the availability determining unit 60 determines registrability using the conditional expression given below in Equation (3). However, if the conditional expression given earlier in Equation (2) is satisfied despite the determination that the speaker parameter value is registrable, then the availability determining unit 60 determines that the speaker parameter value is registrable with a condition. In that case, the availability determining unit 60 gives a notification using the user interface provided by the display/input control unit 30, and makes an inquiry to the user about whether or not to perform registration after adjusting the speaker parameter value and the registration range.
Diff(Pin, P(j))≤(THRE(j)) (3)
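The three possible outcomes of the registrability determination can be sketched as follows. This is only an illustrative reading of Equations (2) and (3), not the implementation of the embodiment. It assumes that Diff is the Euclidean distance, that each already-registered entry is a pair (P(j), THRE(j)), and that satisfying Equation (2) when overlapping is not allowed means the value is not registrable. The names `Registrability`, `diff`, and `determine_registrability` are hypothetical.

```python
import math
from enum import Enum


class Registrability(Enum):
    REGISTRABLE = "registrable"
    CONDITIONAL = "registrable with a condition"
    NOT_REGISTRABLE = "not registrable"


def diff(p, q):
    # Assumption: Diff is the Euclidean distance between parameter vectors.
    return math.dist(p, q)


def determine_registrability(p_in, thre_in, registered, allow_overlap):
    """registered is a list of (P(j), THRE(j)) pairs (hypothetical layout)."""
    # Equation (2): the two registration ranges touch or overlap.
    eq2 = any(diff(p_in, p_j) <= thre_j + thre_in for p_j, thre_j in registered)
    # Equation (3): the input value lies inside an existing registration range.
    eq3 = any(diff(p_in, p_j) <= thre_j for p_j, thre_j in registered)

    if not allow_overlap:
        return Registrability.NOT_REGISTRABLE if eq2 else Registrability.REGISTRABLE
    if eq3:
        return Registrability.NOT_REGISTRABLE
    if eq2:
        return Registrability.CONDITIONAL
    return Registrability.REGISTRABLE
```

For instance, with one registered value at the origin with THRE(j) = 1.0 and an input value at distance 1.5 with THREin = 1.0, Equation (3) is not satisfied but Equation (2) is, so the sketch returns the conditional result when overlapping is allowed.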
For example, the availability determining unit 60 obtains a speaker parameter value Pinsubset that is adjusted to satisfy Equation (4) given below.
Diff(Pinsubset, P(j))>(THRE(j)+THREin) (j=0, 1, . . . , C−1) (4)
Then, the availability determining unit 60 sends the adjusted speaker parameter value Pinsubset to the speaker parameter control unit 40, and requests the speaker parameter control unit 40 to inquire about whether or not to register the adjusted speaker parameter value Pinsubset. In response to the request, the speaker parameter control unit 40 instructs the display/input control unit 30 to make an inquiry to the user about whether or not to register the adjusted speaker parameter value Pinsubset. As a result, an inquiry is made to the user via the user interface provided by the display/input control unit 30. If the user gives a request for registering the adjusted speaker parameter value Pinsubset, then the speaker parameter control unit 40 instructs the speaker parameter registering unit 70 to register the adjusted speaker parameter value Pinsubset.
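The embodiment does not specify how the adjusted value Pinsubset is obtained. One conceivable approach, shown purely as an assumption, is to push the input value away from each conflicting already-registered value until Equation (4) holds for every j. The sketch assumes that Diff is the Euclidean distance and ignores any valid range of the individual parameters. The function name `adjust_parameter` and the parameters `margin` and `max_iter` are hypothetical.

```python
import math


def adjust_parameter(p_in, thre_in, registered, margin=1e-6, max_iter=100):
    """Return an adjusted value satisfying Equation (4), or None on failure.

    registered is a list of (P(j), THRE(j)) pairs (hypothetical layout).
    """
    p = list(p_in)
    for _ in range(max_iter):
        moved = False
        for p_j, thre_j in registered:
            d = math.dist(p, p_j)
            required = thre_j + thre_in
            if d <= required:
                if d == 0.0:
                    # The push direction is undefined when the values coincide.
                    return None
                # Move p along the ray from P(j) through p to just outside
                # the combined range THRE(j) + THREin.
                scale = (required + margin) / d
                p = [c_j + (c - c_j) * scale for c, c_j in zip(p, p_j)]
                moved = True
        if not moved:
            # A full pass without movement means Equation (4) holds for all j.
            return tuple(p)
    return None
```

Because moving away from one registered value may move the input into the range of another, the loop is bounded and reports failure rather than assuming convergence.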
Alternatively, the availability determining unit 60 can obtain a substitute second threshold value THREinsubset that is lowered to satisfy Equation (5) given below (i.e., a substitute second threshold value that narrows the registration range of the speaker parameters).
Diff(Pin, P(j))>(THRE(j)+THREinsubset) (j=0, 1, . . . , C−1) (5)
In that case, the availability determining unit 60 sends the substitute threshold value THREinsubset to the speaker parameter control unit 40, and requests the speaker parameter control unit 40 to inquire about whether or not to register the speaker parameter value Pin with a narrower registration range. In response to the request, the speaker parameter control unit 40 instructs the display/input control unit 30 to make an inquiry to the user about whether or not to register the speaker parameter value Pin with a narrower registration range. As a result, an inquiry is made to the user via the user interface provided by the display/input control unit 30. If the user gives a request for registering the speaker parameter value Pin with a narrower registration range, then the speaker parameter control unit 40 instructs the speaker parameter registering unit 70 to register the speaker parameter value Pin with a narrower registration range.
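Rearranging Equation (5) gives THREinsubset < Diff(Pin, P(j)) − THRE(j) for every j, so any value below the minimum of the right-hand side over all j satisfies it. The following sketch computes such a value under the assumption that Diff is the Euclidean distance and that at least one value is already registered. The function name `narrowed_threshold` and the parameter `margin` are hypothetical.

```python
import math


def narrowed_threshold(p_in, registered, margin=1e-6):
    """Return a substitute threshold satisfying Equation (5), or None.

    registered is a non-empty list of (P(j), THRE(j)) pairs (hypothetical layout).
    """
    # Upper bound on THREinsubset implied by Equation (5) over all j.
    limit = min(math.dist(p_in, p_j) - thre_j for p_j, thre_j in registered)
    candidate = limit - margin
    # A non-positive value means the input lies inside an existing
    # registration range, so narrowing the range cannot resolve the conflict.
    return candidate if candidate > 0 else None
```

Returning None corresponds to the case in which Equation (3) is already satisfied for some j, where the value is not registrable regardless of its own registration range.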
When the speaker parameter value to be registered is determined to be registrable, the availability determining unit 60 calculates the registration fee of that speaker parameter value to be registered. For example, based on the distribution of the already-registered speaker parameter values stored in the speaker parameter storing unit 50, the availability determining unit 60 can calculate the registration fee that is higher in proportion to the popularity of the speaker individuality. That is, the availability determining unit 60 decides on the registration fee according to the number of already-registered speaker parameter values positioned in the surrounding area of the speaker parameter value to be registered. More particularly, regarding a predetermined difference Dadj, the number of such speaker parameter values P(j) is obtained for which Equation (6) given below is satisfied, and the registration fee is calculated using a function that monotonically increases with respect to the number of speaker parameter values P(j).
Diff(Pin, P(j))≤Dadj (6)
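The fee calculation based on Equation (6) can be sketched as follows. The embodiment only requires a function that increases monotonically with the number of neighboring values; the linear form, the amounts `base_fee` and `fee_per_neighbor`, and the function names are assumptions chosen for illustration, as is the use of the Euclidean distance for Diff.

```python
import math


def count_neighbors(p_in, registered_values, d_adj):
    # Number of already-registered values P(j) satisfying Equation (6).
    return sum(1 for p_j in registered_values if math.dist(p_in, p_j) <= d_adj)


def registration_fee(p_in, registered_values, d_adj,
                     base_fee=1000, fee_per_neighbor=200):
    # Hypothetical linear fee: monotonically increasing in the neighbor count.
    return base_fee + fee_per_neighbor * count_neighbors(p_in, registered_values, d_adj)
```

Any other monotonically increasing function, such as a stepwise or logarithmic one, would fit the description equally well.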
Alternatively, the registration fee can be calculated not only by taking into account the already-registered speaker parameter values but also by taking into account the usage frequency of the input speaker parameter value or the surrounding values thereof. In that case, history information of the parameter values used by all users is also stored in the speaker parameter storing unit 50.
Given below is the explanation of an example of the interactive operations related to the registration of speaker parameters as performed by the speech synthesis device, together with a specific example of the user interface that the display/input control unit 30 provides to the user.
In the second embodiment, when a user performs an operation for selecting the option of “created speaker individuality” from the pulldown menu 102 in the screen 100 illustrated in
After inputting the desired speaker parameter value using the radar chart 111 in the screen 210 illustrated in
If the determination result obtained by the availability determining unit 60 indicates that the speaker parameter value is registrable, then the speaker parameter control unit 40 notifies the display/input control unit 30 about the confirmation result indicating that the speaker parameter value is registrable; and the screen on the user interface changes from the screen 210 illustrated in
The user can input a variety of information, which is required in the registration of a speaker parameter value, in the screen 220 illustrated in
When the user presses the “registration fee calculation” button 228, the registration fee calculated by the availability determining unit 60 gets displayed in the registration fee display area 229. The user can refer to the registration fee displayed in the registration fee display area 229, and decide on whether or not to give a registration request. Subsequently, when the user presses the “register” button 230, the billing processing unit 80 performs billing. When the receipt of the registration fee is confirmed, the speaker parameter registering unit 70 performs a registration operation for registering the speaker parameter value in response to the registration instruction received from the speaker parameter control unit 40; and the speaker parameter value to be registered and the supplementary information are stored in the speaker parameter storing unit 50. Meanwhile, if the user presses the “cancel” button 231, the registration operation for registering the speaker parameter value is cancelled, and the screen returns to the screen 210 illustrated in
If the determination result obtained by the availability determining unit 60 indicates that the speaker parameter value is not registrable, then the speaker parameter control unit 40 notifies the display/input control unit 30 about the confirmation result indicating that the speaker parameter value is not registrable. In that case, for example, as illustrated in
If the determination result indicates that the speaker parameter value is registrable with a condition, the availability determining unit 60 calculates the adjusted speaker parameter value as described earlier, and requests the speaker parameter control unit 40 to inquire about whether or not to register the adjusted speaker parameter value. Then, the speaker parameter control unit 40 instructs the display/input control unit 30 to inquire about whether or not to register the adjusted speaker parameter value. In that case, for example, as illustrated in
Alternatively, if the determination result indicates that the speaker parameter value is registrable with a condition, the availability determining unit 60 can obtain a substitute plan for narrowing the registration range of the speaker parameters as described earlier, and can request the speaker parameter control unit 40 to inquire about whether or not to register the speaker parameter value with a narrower registration range. In that case, for example, as illustrated in
As described above, according to the second embodiment, a speaker parameter value can also be registered in response to user operations, which enhances user-friendliness. Moreover, the billing of the registration fee, which is required for the registration of the speaker parameters, can also be performed in an appropriate manner.
In the second embodiment related to the registration of a speaker parameter value, the explanation is given about the mechanism of billing performed at the time of registration. However, also in the first embodiment that is related to the use of the synthetic speech in which the speaker parameter value is used, a mechanism can be provided for enabling billing at the time of usage. In that case, the usage fee can be set by providing, in the registration conditions regarding the speaker parameter value, an item enabling usage fee setting by a different person. For example, in an identical manner to the registration range, a plurality of fee patterns including the charge-free option can be set, and can be made selectable or can be made freely settable by the registrant. The setting value of this item can be stored, for example, in the speaker parameter storing unit 50 as part of the information illustrated in
Given below is the explanation of a third embodiment. In the first embodiment described earlier, the difference between the input speaker parameter value and the already-registered speaker parameter value is obtained using the speaker parameter values themselves. However, in that case, if updating of the speech synthesis model results in changes in the definitions of the speaker parameters or changes in the types of the values, a speaker parameter value from before the changes cannot be compared with a speaker parameter value from after the changes, and the speaker parameter value registered before the changes becomes unusable after the changes. In that regard, in the third embodiment, at the time of obtaining the difference between the input speaker parameter value and the already-registered speaker parameter value, instead of using the actual values, the speaker parameter values to be compared are mapped onto some other common parameter space, and the difference is calculated in that parameter space.
The speech synthesis device according to the third embodiment has an identical configuration to the configuration illustrated in
If P1SA and P2SB (in parameter spaces SA and SB, respectively) represent the speaker parameter values to be compared, and if mapSA→SX( ) and mapSB→SX( ) represent the functions for mapping the speaker parameter values onto a common parameter space SX; then the difference Diff(P1SA, P2SB) between those speaker parameter values is calculated in the mapped space as given below in Equation (7).
Diff(P1SA, P2SB)=DiffSX(mapSA→SX(P1SA), mapSB→SX(P2SB)) (7)
Here, DiffSX represents the difference between the speaker parameters mapped onto the parameter space SX.
As a result of implementing such a method, the difference can be calculated even between the speaker parameter values having different definitions or different types of values. Moreover, also among the speaker parameter values having the same definition and the same type of values, if the mapping destination space represents the speaker individuality in a more direct manner than the original speaker parameter spaces, a more appropriate difference can be calculated according to this particular method. For example, as the speaker parameter space representing the mapping destination, a general-purpose parameter space such as the vector space of the logarithmic amplitude spectrum can be used that expresses the speaker individuality in a direct manner and can be calculated from various speaker parameter values.
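Equation (7) can be illustrated with a toy example. The two mapping functions below are hypothetical stand-ins: one parameter space is assumed to hold values in the range 0 to 1 and the other the same quantities as percentages, and the common space SX is assumed to be an ordinary vector space in which DiffSX is the Euclidean distance. An actual device could instead map onto, for example, the vector space of the logarithmic amplitude spectrum mentioned above.

```python
import math


def map_sa_to_sx(p_sa):
    # Hypothetical mapping: space SA already uses the scale of SX.
    return tuple(float(v) for v in p_sa)


def map_sb_to_sx(p_sb):
    # Hypothetical mapping: space SB stores percentages, so rescale to 0..1.
    return tuple(v / 100.0 for v in p_sb)


def diff_common_space(p1, map1, p2, map2):
    # Equation (7): map both values onto SX, then take the difference there.
    # Assumption: DiffSX is the Euclidean distance in SX.
    return math.dist(map1(p1), map2(p2))
```

With these mappings, a value (0.2, 0.5) in SA and a value (20, 50) in SB map onto the same point of SX, so their difference is zero even though the raw values cannot be compared directly.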
Supplementary Explanation
The speech synthesis device according to the embodiments described above can be implemented using, for example, a general-purpose computer as the fundamental hardware. That is, the functions of the speech synthesis device according to the embodiments described above can be implemented by making the processor installed in a general-purpose computer execute computer programs. At that time, the speech synthesis device can be implemented by installing the computer programs in advance in the computer, or can be implemented by storing the computer programs in a memory medium such as a CD-ROM or distributing the computer programs via a network, and then installing them in the computer.
When the speech synthesis device has the hardware configuration as illustrated in
Alternatively, the functions of some or all of the constituent elements of the speech synthesis device can be implemented using dedicated hardware such as an application specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) (i.e., using a dedicated processor instead of a general-purpose processor). Still alternatively, the functions of the constituent elements can be implemented using a plurality of processors.
Still alternatively, the speech synthesis device according to the embodiments can be configured as a system in which the functions of the constituent elements are implemented in a dispersed manner among a plurality of computers. Still alternatively, the speech synthesis device according to the embodiments can be a virtual machine that runs in a cloud system.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
2017-049801 | Mar 2017 | JP | national |
This application is a continuation of PCT International Application No. PCT/JP2017/034648 filed on Sep. 26, 2017, which designates the United States, incorporated herein by reference. The PCT International Application No. PCT/JP2017/034648 claims the benefit of priority from Japanese Patent Application No. 2017-049801, filed on Mar. 15, 2017, incorporated herein by reference.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP2017/034648 | Sep 2017 | US |
Child | 16561584 | | US |