This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2012-181616, filed Aug. 20, 2012, the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a prosody editing apparatus and method.
In recent years, with the development of speech synthesis techniques that synthesize speech from text, natural synthetic speech close to a human voice has become obtainable.
A recent speech synthesis system generally uses a method of learning a statistical model of prosody or voice quality from a speech corpus of recorded human speech data. For example, decision tree models, hidden Markov models, and the like are known as prosody statistical models. Using these statistical models, the intonation of arbitrary text which is not included in the learning corpus can be reproduced naturally to some extent.
However, since the statistical model learns average prosodic features from many utterances in the speech corpus, the intonation of synthetic speech generated from the statistical model tends to be monotonic. Hence, a system is known which visually presents a prosodic pattern generated by the statistical model to the user, and allows the user to graphically edit the pattern using a device such as a mouse.
Graphical editing allows the user to create arbitrary prosodies, as long as they can be output as synthetic speech. Hence, prosodic pattern editing has a high degree of freedom, but improper prosodic patterns can also be created. That is, it is very difficult for a user who has no knowledge about speech to create an intended prosodic pattern.
To address this degree-of-freedom problem, a method of compressing a parameter space having a very high degree of freedom onto a two-dimensional coordinate plane is available. However, since that method edits the voice quality of synthetic speech rather than the prosodic pattern of a phrase, the editing target is different, and the method cannot be used for the purpose of editing the fundamental frequency and duration of an arbitrary text phrase.
In general, according to one embodiment, a prosody editing apparatus includes a storage, a first selection unit, a search unit, a normalization unit, a mapping unit, a display, a second selection unit, a restoring unit and a replacing unit. The storage is configured to store attribute information items of phrases and one or more first prosodic patterns corresponding to each of the attribute information items of the phrases, the attribute information items each indicating an attribute associated with a phrase, the first prosodic patterns each including parameters which indicate a prosody type of the phrase and express prosody of the phrase, the parameters each including no fewer elements than the number of phonemes of the phrase. The first selection unit is configured to select a phrase including phonemes from text to obtain a selected phrase. The search unit is configured to search the storage for one or more second prosodic patterns corresponding to an attribute information item that matches an attribute information item of the selected phrase, to obtain a prosodic pattern set, the second prosodic patterns being included in the first prosodic patterns. The normalization unit is configured to normalize the second prosodic patterns respectively. The mapping unit is configured to map each of the normalized second prosodic patterns on a low-dimensional space represented by one or more coordinates, the number of which is smaller than the number of the elements, to generate mapping coordinates. The display is configured to display the mapping coordinates. The second selection unit is configured to obtain coordinates selected from the mapping coordinates as selected coordinates. The restoring unit is configured to restore a prosodic pattern according to the selected coordinates to obtain a restored prosodic pattern. The replacing unit is configured to replace prosody of synthetic speech generated based on the selected phrase by the restored prosodic pattern.
A prosody editing apparatus, method, and program according to this embodiment will be described hereinafter with reference to the drawings. Note that in the following embodiments, a redundant description will be avoided as needed under the assumption that parts denoted by the same reference numerals perform the same operations.
A prosody editing apparatus according to the first embodiment will be described below with reference to the block diagram shown in
A prosody editing apparatus 100 according to the first embodiment includes a speech synthesis unit 101, phrase selection unit 102, prosodic pattern database 103 (to be referred to as a prosodic pattern DB 103 hereinafter), prosodic pattern search unit 104, prosodic model database 105 (to be referred to as a prosodic model DB 105 hereinafter), prosodic pattern generation unit 106, prosodic pattern normalization unit 107, prosodic pattern mapping unit 108, coordinate selection unit 109, prosodic pattern restoring unit 110, prosodic pattern replacing unit 111, and display 112.
The speech synthesis unit 101 externally receives text, generates synthetic speech by applying speech synthesis to the text, and externally outputs the synthetic speech. As the speech synthesis method, concatenative speech synthesis which concatenates phoneme fragments, HMM speech synthesis which creates prosody and voice quality models using a hidden Markov model, and the like are generally known. In this embodiment, any speech synthesis method may be used as long as a prosodic pattern of synthetic speech can be acquired. A prosodic pattern indicates a format of prosody of a phrase, and means time-series changes of parameters such as fundamental frequency, duration, and power which express prosody of a phrase. Also, parameters which express a prosodic pattern have elements not less than the number of phonemes of a phrase.
The phrase selection unit 102 externally receives text, and selects a phrase as a prosody editing range from the text according to a user input, thus obtaining a selected phrase. The phrase can be selected using, for example, a mouse, keyboard, or touch panel; for example, a phrase range can be dragged over with the mouse. The phrase selection unit 102 acquires attribute information of synthetic speech corresponding to the selected phrase from the speech synthesis unit 101. Attribute information includes attributes associated with a phrase, such as the surface expression of the phrase, the arrangement order of its phoneme sequence, the number of morae, and the accent type.
The prosodic pattern DB 103 stores attribute information of a phrase and one or more prosodic patterns of the phrase in association with each other. Attribute information and prosodic patterns may be registered in the prosodic pattern DB 103 by general methods: for example, real-voice prosodic patterns extracted from recorded speech may be registered, prosodic patterns which have already been edited by the user may be registered, or prosodic patterns automatically generated from a prosody statistical model may be registered. The prosodic pattern DB 103 may also be referred to as a storage.
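For concreteness, the following is a minimal sketch of one possible in-memory layout for such a database, using the attribute items described later (surface expression, phoneme sequence, mora count and accent type); the class and field names are hypothetical and not prescribed by the embodiment.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProsodicPattern:
    f0: List[float]        # fundamental frequency, one value per frame (Hz)
    duration: List[int]    # duration, one frame count per phoneme

@dataclass
class PhraseEntry:
    surface: str           # surface expression of the phrase
    phoneme_seq: str       # e.g. "/K/U/D/A/S/A/I/"
    mora_accent: str       # e.g. "4 moras/type 3"
    patterns: List[ProsodicPattern] = field(default_factory=list)

# One entry per phrase ID; the stored pattern count corresponds to len(entry.patterns).
pattern_db = {1: PhraseEntry("KUDASAI", "/K/U/D/A/S/A/I/", "4 moras/type 3")}
```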
The prosodic pattern search unit 104 receives the selected phrase and attribute information from the phrase selection unit 102. The prosodic pattern search unit 104 searches the prosodic pattern DB 103 for a phrase whose attribute information matches that of the selected phrase, and obtains one or more prosodic patterns corresponding to the matched phrase as a prosodic pattern set.
The prosodic model DB 105 stores a statistical model. The statistical model is a decision tree model or hidden Markov model which has been trained using a speech corpus. When statistical models of a variety of utterance styles, emotions, and speakers are prepared, a variety of prosodic patterns can be generated in correspondence with the selected phrase designated by the user.
The prosodic pattern generation unit 106 receives the selected phrase and prosodic pattern set from the prosodic pattern search unit 104. The prosodic pattern generation unit 106 generates prosodic patterns associated with the selected phrase using the prosodic model DB 105, and adds the generated prosodic patterns to the prosodic pattern set.
Note that when the number of prosodic patterns included in the prosodic pattern set retrieved by the prosodic pattern search unit 104 is not less than a threshold, the prosodic pattern generation unit 106 need not generate a new prosodic pattern.
The prosodic pattern normalization unit 107 receives the prosodic pattern set from the prosodic pattern search unit 104. Note that when the prosodic pattern is added to the prosodic pattern set by the prosodic pattern generation unit 106, the prosodic pattern normalization unit 107 receives the prosodic pattern set from the prosodic pattern generation unit 106. The prosodic pattern normalization unit 107 normalizes respective prosodic patterns of the generated prosodic pattern set.
The prosodic pattern mapping unit 108 receives the normalized prosodic patterns from the prosodic pattern normalization unit 107, maps the normalized prosodic patterns onto a low-dimensional space expressed by coordinates fewer in number than the number of elements of the parameters, and obtains mapping coordinates for the respective prosodic patterns.
The coordinate selection unit 109 selects coordinates according to a user instruction, and obtains selected coordinates.
The prosodic pattern restoring unit 110 receives the mapping coordinates from the prosodic pattern mapping unit 108 and the selected coordinates from the coordinate selection unit 109. The prosodic pattern restoring unit 110 compares the mapping coordinates with the selected coordinates to restore a prosodic pattern corresponding to the selected coordinates, thus obtaining a restored prosodic pattern.
The prosodic pattern replacing unit 111 receives the restored prosodic pattern from the prosodic pattern restoring unit 110, and replaces a default prosodic pattern generated by the speech synthesis unit 101 by the restored prosodic pattern.
The display 112 receives a prosodic pattern from the speech synthesis unit 101, and displays the received prosodic pattern. Also, the display 112 receives the mapping coordinates from the prosodic pattern mapping unit 108, and displays the received mapping coordinates.
Note that this embodiment assumes the case in which the prosody editing apparatus 100 includes the speech synthesis unit 101. Alternatively, the prosody editing apparatus 100 may not include the speech synthesis unit 101, and may instead use an external speech synthesis device. In this case, the prosodic pattern replacing unit 111 may output the restored prosodic pattern corresponding to the selected phrase to the external speech synthesis device.
An example of attribute information of phrases stored in the prosodic pattern DB 103 will be described below with reference to
As shown in
The ID 201 indicates an identification number of a phrase. The surface expression 202 indicates a character string of a phrase. The phoneme sequence 203 indicates a character string of phonemes corresponding to the surface expression 202, and is delimited by “/” for each phoneme group. The mora count and accent type 204 indicate an accent when the surface expression 202 is uttered. The pattern count 206 indicates the number of prosodic patterns of the phoneme sequence 203. More specifically, for example, the ID 201 “1”, surface expression 202 “”, phoneme sequence 203 “/K/U/D/A/S/A/I/”, mora count and accent type 204 “4 moras/type 3”, and pattern count 206 “182” are stored in association with each other.
Note that when the language is English, the ID 201, surface expression 202, and phoneme sequence 203 are associated with each other as the attribute information 205, and the pattern count 206 of prosodic patterns is associated with the attribute information 205. More specifically, in the example of
An example of prosodic patterns stored in the prosodic pattern DB 103 will be described below with reference to
For one ID 201 shown in
For example, a phrase “(IKAGADESUKA)” of the ID 201 “9” in
As the aforementioned patterns, it is desirable to prepare patterns that are as varied as possible. For example, when prosodic patterns of various kinds of paralinguistic information, emotions, styles, and speakers can be prepared, the user can select a desired pattern from a wide variety of prosodic patterns. Note that in the example of
The relationship among the fundamental frequency, duration, and power in a prosodic pattern will be described below with reference to
The duration can be expressed as time-series data of respective phoneme widths 401. For example, a phoneme “/I/” is expressed by 12 frames, a phoneme “/K/” is expressed by 12 frames, and a phoneme “/A/” is expressed by 11 frames. Data obtained by arranging these phoneme widths along a time series are elements stored in the duration 303 shown in
One frequency value corresponds to each frame on this coordinate space, and the fundamental frequencies can be expressed as one contour 402 which connects the frequency values. In this case, assume that a frequency value is set for each frame. However, the frequency value may be set for various other units (for each phoneme, for each vowel, and the like). Data obtained by arranging these frequency values in turn along a time series are elements stored in the fundamental frequency 302 shown in
The power can be expressed as one contour 403 which connects power values for respective frames in the same manner as the contour 402 of the fundamental frequency.
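As a sketch, the two time series described above could be assembled as follows; the frame counts follow the "/I/ /K/ /A/" example above, and the flat 120 Hz contour is a placeholder value, not data from the embodiment.

```python
import numpy as np

# Duration: one element per phoneme (frame counts), as in the example above.
duration = np.array([12, 12, 11])      # frames for /I/, /K/, /A/

# Fundamental frequency: one value per frame; connecting them yields the contour.
total_frames = int(duration.sum())
f0 = np.full(total_frames, 120.0)      # placeholder flat 120 Hz contour

# A prosodic pattern couples the two series; its parameter vector therefore has
# at least as many elements as the phrase has phonemes.
pattern = np.concatenate([f0, duration.astype(float)])
```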
The operation of the prosody editing apparatus according to this embodiment will be described below with reference to the flowchart shown in
In step S501, the prosodic pattern search unit 104 receives a selected phrase from the user.
In step S502, the prosodic pattern search unit 104 searches the prosodic pattern DB 103 for a phrase whose attribute information matches that of the selected phrase, and obtains the prosodic patterns corresponding to the matching phrase as a prosodic pattern set. As a search method, using the surface expression as the attribute information, the DB may be searched for a phrase whose surface expression matches that of the selected phrase. Alternatively, using the phoneme sequence as the attribute information, the DB may be searched for a phrase whose phoneme sequence matches that of the selected phrase. Furthermore, using the mora count and accent type as the attribute information, the DB may be searched for a phrase whose mora count and accent type match those of the selected phrase.
Since prosodic patterns of phrases having the same mora count and accent type are normally similar to each other, even when few stored prosodic patterns match the surface expression, prosodic patterns whose surface expression differs but whose mora count and accent type match can be used as the prosodic pattern set, thus increasing the variation of available prosodic patterns.
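A minimal sketch of this search, assuming the hypothetical `PhraseEntry`/`pattern_db` layout from the earlier sketch; treating the three attribute keys as successive fallbacks is one possible policy suggested by the paragraph above, not the only one.

```python
def search_patterns(db, surface, phoneme_seq, mora_accent):
    """Collect a prosodic pattern set by attribute matching, relaxing the key
    from surface expression to phoneme sequence to mora count/accent type."""
    queries = [("surface", surface),
               ("phoneme_seq", phoneme_seq),
               ("mora_accent", mora_accent)]
    for key, value in queries:
        hits = [p for entry in db.values()
                if getattr(entry, key) == value
                for p in entry.patterns]
        if hits:
            return hits
    return []
```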
Note that the prosodic pattern generation unit 106 may generate prosodic patterns of the selected phrase using the statistical models stored in the prosodic model DB 105. Using these statistical models, prosodic patterns can be generated even when the attributes of the selected phrase match none of the prosodic patterns stored in the prosodic pattern DB 103.
In step S503, the prosodic pattern normalization unit 107 respectively normalizes prosodic patterns included in the prosodic pattern set. The normalization processing will be described later with reference to
In step S504, the prosodic pattern mapping unit 108 maps the normalized prosodic patterns of the prosodic pattern set on a low-dimensional space. The mapping processing onto the low-dimensional space can use, for example, principal component analysis. The practical mapping processing will be described later with reference to
In step S505, the display 112 displays mapping coordinates of the mapped prosodic pattern set.
In step S506, the coordinate selection unit 109 obtains coordinates of a region selected by the user as selected coordinates.
In step S507, the prosodic pattern restoring unit 110 restores the selected prosodic pattern, thus generating a restored prosodic pattern. The practical restoring processing will be described later.
In step S508, the prosodic pattern replacing unit 111 replaces the prosodic pattern of the selected phrase by the restored prosodic pattern. If the pattern is simply replaced, the prosody may not connect smoothly before and after the phrase, and the synthetic speech often becomes unnatural. In this case, a general method may be used to, for example, correct the fundamental frequency contour.
In step S509, the speech synthesis unit 101 executes speech synthesis using the restored prosodic pattern.
It is determined in step S510 whether or not the restored prosodic pattern is the prosodic pattern of synthetic speech desired by the user. If it is, the processing ends. This can be determined by checking, for example, whether the user selects an OK button displayed on the display 112. On the other hand, if it is determined that the restored prosodic pattern is not the desired prosodic pattern, the process returns to step S506, and the user selects another prosodic pattern from the mapping coordinates displayed on the display 112. In this manner, the operation of the prosody editing apparatus 100 according to this embodiment ends.
The normalization processing in the prosodic pattern normalization unit 107 will be described below with reference to
In general, fundamental frequencies have different average values, i.e., different voice pitches, from person to person. For this reason, the average value of the fundamental frequencies is adjusted to zero, and upon restoring a prosodic pattern the average value is readjusted using the fundamental frequencies of the target speaker. Also, since the data lengths of the fundamental frequencies differ among prosodic patterns, each data length is linearly compressed to an arbitrary fixed length set for each phoneme, so that the data lengths of all prosodic patterns match. Finally, the fundamental frequencies and the duration frames are normalized to an average of 0 and a standard deviation of 1. With these processes, the differing units of fundamental frequency and duration can be reconciled. Note that the original average and standard deviation data used in the normalization are retained so that the original values can be restored.
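The following is a simplified sketch of this normalization; for brevity it compresses the whole F0 contour to one global fixed length and uses a single mean/standard-deviation pair, whereas the embodiment fixes a length per phoneme and normalizes fundamental frequency and duration separately.

```python
import numpy as np

def normalize_pattern(f0, duration, target_len=64):
    """Normalize one prosodic pattern; returns the normalized vector and the
    statistics needed to restore Hz and frame values later."""
    f0 = np.asarray(f0, dtype=float)
    # Linearly compress/stretch the F0 contour to a fixed length so that all
    # patterns share the same dimensionality.
    grid_old = np.linspace(0.0, 1.0, len(f0))
    grid_new = np.linspace(0.0, 1.0, target_len)
    f0_fixed = np.interp(grid_new, grid_old, f0)
    f0_fixed -= f0_fixed.mean()                  # zero-mean F0 (voice pitch removed)

    vec = np.concatenate([f0_fixed, np.asarray(duration, dtype=float)])
    mean, std = vec.mean(), vec.std()
    return (vec - mean) / std, (mean, std)       # keep the stats for restoration
```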
The mapping processing of the prosodic pattern mapping unit 108 will be described below with reference to
This embodiment will exemplify mapping of the prosodic pattern set onto the low-dimensional space using principal component analysis. Note that it is desirable to map prosodic patterns onto a coordinate space of three dimensions or less as the low-dimensional space; however, the low-dimensional space is not limited to a two-dimensional coordinate plane, as long as the coordinate plane can display a prosodic pattern using coordinates fewer than the number of elements of the parameters.
As shown in
A matrix X 801 of the prosodic pattern set is defined by n rows×p columns, as simply shown in
Next, a variance-covariance matrix V 802 of the matrix X 801 is calculated using:
V=(1/n)XTX (1)
where XT means a transposed matrix of X. This variance-covariance matrix V 802 has a size of p rows×p columns. Next, eigenvalues and eigenvectors of the variance-covariance matrix V 802 are calculated to obtain p eigenvectors (column vectors) corresponding to the p eigenvalues. A coefficient matrix A 803 is generated by arranging the eigenvectors in descending order of eigenvalue, and a matrix A′ 804 is generated by extracting the first two columns (up to the second principal components) of the coefficient matrix A 803. That is, the matrix A′ 804 has a matrix size of p rows×2 columns.
Next, each prosodic pattern of the prosodic pattern set is converted into two-dimensional coordinates using:
Z=XA′ (2)
The matrix Z has a size of n rows×2 columns. That is, each row of the matrix Z is the data obtained by converting one prosodic pattern into two-dimensional coordinates, and these rows are used as the mapping coordinates.
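In numpy, equations (1) and (2) can be sketched as follows; `X` holds the normalized prosodic patterns row by row.

```python
import numpy as np

def pca_map(X):
    """Map n normalized prosodic patterns (rows of X, shape n x p) onto
    two-dimensional mapping coordinates via equations (1) and (2)."""
    n = X.shape[0]
    V = X.T @ X / n                       # eq. (1): variance-covariance matrix, p x p
    eigvals, eigvecs = np.linalg.eigh(V)  # eigh suits the symmetric matrix V
    order = np.argsort(eigvals)[::-1]     # descending order of eigenvalue
    A = eigvecs[:, order]                 # coefficient matrix A, p x p
    A2 = A[:, :2]                         # A': first two principal components, p x 2
    Z = X @ A2                            # eq. (2): mapping coordinates, n x 2
    return Z, A2
```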
An example of mapping coordinates displayed on the display 112 will be described below with reference to
The restored prosodic pattern generation processing in the prosodic pattern restoring unit 110 will be described below.
Assuming that the user selects coordinates z from the two-dimensional coordinate plane shown in
x=zA′T (3)
Note that since the restored prosodic pattern x is normalized, the final restored prosodic pattern is obtained by converting the fundamental frequencies back to units of Hz and the duration back to units of frames using the saved average and standard deviation data.
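A sketch of this restoration, reusing the `pca_map` output from the earlier sketch and the (mean, std) pair saved during normalization; `n_f0` is the fixed F0 length assumed in the normalization sketch.

```python
import numpy as np

def restore_pattern(z, A2, mean, std, n_f0):
    """Invert the two-dimensional mapping (eq. (3)) and undo normalization."""
    x = np.asarray(z) @ A2.T              # eq. (3): back to the p-dimensional space
    x = x * std + mean                    # back to Hz (F0) and frames (duration)
    f0, duration = x[:n_f0], x[n_f0:]
    # Durations must be whole, positive frame counts.
    return f0, np.maximum(1, np.round(duration)).astype(int)
```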
Note that the user can select not only coordinates at which a point is displayed but also arbitrary coordinates. For example, when the user selects a point 904 indicated by a wavy circle in
An example of the user interface displayed on the display 112 will be described below with reference to
The parameter graph shows contours 1003, 1004, and 1005 of prosodic patterns of the phrase “”. The contour 1003 of the prosodic pattern is displayed when a cursor is located at a position of coordinates 1006 on the two-dimensional coordinate plane 1002. Likewise, the contours 1004 and 1005 of the remaining prosodic patterns are displayed when the cursor is located respectively at positions of coordinates 1007 and 1008.
The user can recognize various changes of prosodic patterns in real time by moving the cursor on the two-dimensional coordinate plane 1002. Also, the user can reproduce synthetic speech to which a target prosodic pattern is applied by designating coordinates on the two-dimensional coordinate plane 1002 using a pointing device such as a mouse or touching coordinates on the screen with the finger or the like. Hence, the user can audibly confirm the selected prosodic pattern as desired.
Also, since the aforementioned mapping processing places similar prosodic patterns at close positions and dissimilar prosodic patterns at distant positions, the user can visually recognize differences between prosodic patterns and can easily try out different ones.
Note that the prosody editing apparatus may present only phrases, which are stored in the prosodic pattern DB 103 and can be edited, to the user first, and may prompt the user to select a phrase from the presented phrases, so as to obtain a selected phrase.
According to the first embodiment described above, prosodic patterns of phrases whose attribute information matches that of the phrase selected by the user are retrieved, and the plurality of prosodic patterns are mapped onto a low-dimensional space such as a two-dimensional coordinate plane. Thus, the user can easily obtain a desired prosodic pattern by designating only coordinates. Also, by limiting the prosodic patterns which the user can select to those on the two-dimensional coordinate plane, generation of prosodic patterns that would not normally be expected can be suppressed, thus allowing efficient editing of prosody.
(First Modification of the First Embodiment)
In the first embodiment, one matrix is generated by coupling the normalized fundamental frequencies and duration, and is mapped onto the two-dimensional coordinate plane using principal component analysis. In the first modification, by contrast, the matrix of fundamental frequencies and the matrix of duration are each mapped onto their own two-dimensional coordinate plane.
Mapping processing of the prosodic pattern mapping unit 108 according to the first modification will be described below with reference to
(a) of
As shown in (a) and (b) of
An example of an interface according to the first modification will be described below with reference to
As shown in
The user can edit a prosodic pattern by moving a cursor on the two-dimensional coordinate plane 1202 or 1203 by the same method as in the first embodiment.
According to the first modification described above, the number of parameters to be controlled is increased and the parameters are controlled independently, thus increasing the degree of freedom in prosody editing and allowing generation of more detailed prosodic patterns.
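Assuming the `pca_map` helper from the earlier mapping sketch is in scope, the first modification amounts to mapping the two matrices independently; the matrix shapes and random data below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
F0_matrix = rng.standard_normal((20, 64))   # 20 normalized F0 contours (illustrative)
dur_matrix = rng.standard_normal((20, 7))   # 20 normalized duration vectors (illustrative)

# Map each parameter type onto its own plane so the user can edit
# intonation (F0) and tempo (duration) independently.
Z_f0, A_f0 = pca_map(F0_matrix)     # coordinates for the F0 plane
Z_dur, A_dur = pca_map(dur_matrix)  # coordinates for the duration plane
```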
(Second Modification of the First Embodiment)
In the embodiments described above, prosodic patterns are displayed as points on the two-dimensional coordinate plane. However, as the number of prosodic patterns grows, the number of points increases, and it becomes difficult for the user to visually distinguish them. Hence, in the second modification, the points are clustered and a representative point is displayed for each cluster. Thus, the user can easily discriminate prosodic pattern groups from each other.
A display example of a two-dimensional coordinate plane after clustering according to the second modification will be described below with reference to
The prosodic pattern mapping unit 108 generates clusters, each combining one or more prosodic patterns, by clustering the prosodic patterns. Since the clustering can use a general method, a detailed description thereof will not be given. The representative point can be set as the central point of the cluster (that of a circle in
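As one such general method, k-means over the mapping coordinates could be sketched as follows; the representative point is taken as the cluster centroid, and the member count can drive the displayed point size. The number of clusters is an assumed parameter.

```python
import numpy as np
from sklearn.cluster import KMeans  # k-means as one general clustering method

def cluster_map(Z, n_clusters=3):
    """Cluster the 2-D mapping coordinates; return per-cluster centroids
    (representative points), member counts (for point size), and assignments."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(Z)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    return km.cluster_centers_, sizes, km.labels_
```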
According to the second modification described above, prosodic pattern groups can be easily discriminated from each other by clustering prosodic patterns.
(Third Modification of the First Embodiment)
In the third modification, in addition to the fundamental frequency 302 and duration 303, which are stored in the prosodic pattern DB 103, a label which expresses a prosodic feature of a prosodic pattern may be stored in association with them.
As shown in
A display example on a two-dimensional coordinate plane after clustering according to the third modification will be described below with reference to
When labels are stored in the prosodic pattern DB 103, the prosodic pattern mapping unit 108 tallies, after clustering the prosodic patterns, the label classes associated with the prosodic patterns in each cluster, and displays the most frequent class of each cluster as labels 1501, 1502, and 1503. In this manner, the user can recognize the prosodies without actually listening to the synthetic speech.
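A sketch of this tally, assuming one label string per prosodic pattern and the cluster assignments produced by the clustering sketch above.

```python
from collections import Counter

def cluster_labels(labels, assignments, n_clusters):
    """Per cluster, tally the labels of the member patterns and keep the
    most frequent class as the label to display."""
    winners = {}
    for c in range(n_clusters):
        members = [lab for lab, a in zip(labels, assignments) if a == c]
        winners[c] = Counter(members).most_common(1)[0][0] if members else None
    return winners

# e.g. cluster_labels(["question", "question", "flat"], [0, 0, 1], 2)
#      -> {0: "question", 1: "flat"}
```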
According to the third modification described above, since labels are assigned to groups obtained by clustering prosodic patterns, prosodies of classes of prosodic pattern groups can be easily distinguished from each other.
In the first embodiment, the prosodic pattern restoring unit restores a prosodic pattern from the coordinates selected by the user using equation (3). However, the processing that maps prosodic patterns onto a two-dimensional coordinate plane by principal component analysis is generally irreversible, and a prosodic pattern stored in the prosodic pattern DB cannot always be completely restored from coordinates on the two-dimensional coordinate plane.
Hence, in the second embodiment, a prosodic pattern stored in the prosodic pattern DB 103 is applied as-is, without executing the restoring processing given by equation (3).
A prosody editing apparatus according to the second embodiment will be described below with reference to the block diagram shown in
A prosody editing apparatus 1600 according to the second embodiment includes a speech synthesis unit 101, phrase selection unit 102, prosodic pattern DB 103, prosodic pattern search unit 104, prosodic model DB 105, prosodic pattern generation unit 106, prosodic pattern normalization unit 107, prosodic pattern mapping unit 108, coordinate selection unit 109, prosodic pattern restoring unit 1601, prosodic pattern replacing unit 111, and display 112. Since the units other than the prosodic pattern restoring unit 1601 are the same as those of the prosody editing apparatus 100 according to the first embodiment, a description thereof will not be repeated.
The prosodic pattern restoring unit 1601 receives the selected coordinates selected by the user from the coordinate selection unit 109, and the mapping coordinates from the prosodic pattern mapping unit 108. The prosodic pattern restoring unit 1601 determines whether or not the plurality of mapping coordinates include mapping coordinates whose distance from the selected coordinates is not more than a threshold. If such mapping coordinates are found, the fundamental frequencies and duration of the original prosodic pattern corresponding to the found mapping coordinates are acquired from the prosodic pattern DB 103 as the restored prosodic pattern.
Processing of the prosodic pattern restoring unit 1601 according to the second embodiment will be described below with reference to
The prosodic pattern restoring unit 1601 determines whether or not mapping coordinates are found within a threshold distance from the coordinates 1701. As this determination method, it is determined whether or not a prosodic pattern point is found within a circle 1702 having a constant radius centered on the coordinates 1701. In
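A sketch of this lookup; `Z` holds the mapping coordinates and `original_patterns` the corresponding unmodified DB entries. Falling back to the equation (3) restoration when no point is close enough is an assumption on our part, as the embodiment leaves that case open.

```python
import numpy as np

def pick_pattern(selected, Z, original_patterns, threshold):
    """If some mapping coordinate lies within `threshold` of the selected
    coordinates, return its original DB pattern (no reconstruction loss)."""
    dists = np.linalg.norm(Z - np.asarray(selected, dtype=float), axis=1)
    nearest = int(np.argmin(dists))
    if dists[nearest] <= threshold:
        return original_patterns[nearest]
    return None   # caller may fall back to the eq. (3) restoration instead
```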
According to the second embodiment described above, when a prosodic pattern point is found within a threshold distance of the selected coordinates, the corresponding prosodic pattern is acquired from the database, thus suppressing degradation of the prosodic pattern and allowing easy and efficient prosody editing.
Note that the prosody editing apparatus according to the aforementioned embodiments may be implemented by hardware.
The flowcharts of the embodiments illustrate methods and systems according to the embodiments. It will be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be loaded onto a computer or other programmable apparatus to produce a machine, such that the instructions which execute on the computer or other programmable apparatus create means for implementing the functions specified in the flowchart block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer programmable apparatus which provides steps for implementing the functions specified in the flowchart block or blocks.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.