This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2013-192359, filed on Sep. 17, 2013; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a prosody editing device and method and a computer program product.
Recent speech synthesis technologies that generate synthetic speech from text use a statistical prosody model, which has significantly improved the quality of the generated synthetic speech. Even if an elaborate prosody model is constructed from a large speech corpus, however, the average prosody generated from the model may be insufficient for colloquial expressions and word-ending expressions, such as greetings, that take various types of prosody. To address this, a device has been proposed that edits prosody generated from a prosody model in response to a user operation.
Such a device that edits prosody in response to a user operation needs to provide natural prosody desired by the user with an intuitive and simple operation to prevent deterioration in the quality of a synthetic speech caused by unnaturalness of edited prosody and to improve user operability in the editing work.
According to an embodiment, a prosody editing device includes an approximate contour generator, a setter, a display controller, an operation receiver, and an updater. The approximate contour generator approximates a contour representing a time series of prosody information with a parametric curve including a control point to generate an approximate contour. The setter sets, on the approximate contour, an operation point corresponding to the control point. The display controller displays, on a display device, an operation screen including the approximate contour on which the operation point is shown. The operation receiver receives an operation to move the operation point optionally selected on the operation screen. The updater calculates a position of the control point from a moving amount of the operation point and updates the approximate contour.
The speech synthesizer 101 receives a text from the outside to generate prosody and a synthetic speech. To generate prosody, a statistical prosody model is used, for example. As for a speech synthesis method, a desired method may be employed, including publicly known unit selection speech synthesis and Hidden Markov Model speech synthesis. The speech synthesizer 101 may also receive prosody edited by a user operation (an updated approximate contour, which will be described later), thereby generating a synthetic speech to which the edited prosody is applied. The synthetic speech generated by the speech synthesizer 101 is output from the speaker 110.
Examples of prosody information (parameters capable of being handled by a calculator) indicating prosody of a speech include a fundamental frequency (F0), and duration and power of a phoneme. A time series of F0 can be represented by a line, where an abscissa represents time and an ordinate represents the frequency. The time series of F0 represented by such a line is referred to as an F0 contour. Editing the F0 contour makes it possible to generate a synthetic speech having various types of intonation.
The following describes a case where an F0 contour generated by the speech synthesizer 101 is a target to be edited. However, the prosody information to be edited is not limited to an F0 contour. The prosody editing method according to the present embodiment is widely applicable to any time series of prosody information capable of being represented by a line (a contour). A time series of duration of a phoneme, for example, can be represented by a line (a contour), where an abscissa represents generation time of the phoneme and an ordinate represents the time length. A time series of power can be represented by a line (a contour), where an abscissa represents time and an ordinate represents the magnitude of the power. The present embodiment is also applicable to editing of the time series of duration of a phoneme and the time series of power.
The approximate contour generator 102 approximates the F0 contour generated by the speech synthesizer 101 with a parametric curve in a predetermined unit, thereby generating an approximate contour. Examples of the parametric curve include a spline curve, a B-spline curve, and a Bézier curve. The present embodiment uses a Bézier curve as the parametric curve to generate an approximate contour. The parametric curve used for approximation is not limited to a Bézier curve.
The Bézier curve is the (N−1)th order parametric curve defined by N control points. Because the Bézier curve can express a continuous curve with a small number of parameters, it is frequently used to draw smooth curves. The m-th order Bézier curve is expressed by the following equation (1):
q(t_i) = \sum_{k=0}^{m} \binom{m}{k} t_i^{k} (1 - t_i)^{m-k} P_k \quad (1)
where m represents an order of the Bézier curve, ti represents a parameter, i represents an index of the parameter, and Pk represents coordinates of the k-th control point on a two-dimensional coordinate plane. The parameter ti varies from 0 to 1, thereby constructing one Bézier curve.
The shape of the m-th order Bézier curve is uniquely determined by a set of m+1 control points (P0, P1, P2, . . . , Pm). The equation of a cubic Bézier curve, for example, is defined by the following equation (2):
q(t_i) = (1 - t_i)^3 P_0 + 3 t_i (1 - t_i)^2 P_1 + 3 t_i^2 (1 - t_i) P_2 + t_i^3 P_3 \quad (2)
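As an illustration only, the Bernstein form of equations (1) and (2) can be evaluated with a short Python sketch. The function name `bezier_point` and the representation of control points as (x, y) tuples are assumptions made for this example and are not part of the embodiment.

```python
from math import comb


def bezier_point(control_points, t):
    """Evaluate an m-th order Bezier curve at parameter t (0 <= t <= 1).

    control_points is a sequence of m+1 (x, y) tuples P_0 .. P_m
    (a hypothetical representation chosen for this sketch).
    """
    m = len(control_points) - 1
    x = y = 0.0
    for k, (px, py) in enumerate(control_points):
        # Bernstein weight of control point P_k
        w = comb(m, k) * t ** k * (1.0 - t) ** (m - k)
        x += w * px
        y += w * py
    return (x, y)
```

With four control points the function reduces to the cubic case of equation (2). At t=0 and t=1 it returns P0 and Pm, which reflects the fact that only the end control points are guaranteed to lie on the curve.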
The approximate contour generator 102 segments the F0 contour generated by the speech synthesizer 101 in a predetermined unit and approximates each segment with a Bézier curve, thereby generating an approximate contour. The present embodiment uses the least-squares method to calculate the control points of the Bézier curve with which each segment of the F0 contour is approximated. While the following explanation uses a cubic Bézier curve for simplicity, approximation with an m-th order Bézier curve of any other order can be generalized in a similar way.
The approximate contour generator 102 estimates the control points Pk that minimize the sum of square errors S defined by the following equation (3), where pi (i=1 to n) represents the coordinates of the points of a certain segment of the F0 contour on the two-dimensional coordinate plane, q(ti) represents the Bézier curve, and n represents the number of data points, each associated with a value ti of the parameter.
S = \sum_{i=1}^{n} \| p_i - q(t_i) \|^2 \quad (3)
With the least-squares method, the coordinate Pk of the control point is eventually calculated by the following equations (4) and (5). Because P0 and P3 correspond to the end points of the Bézier curve, the coordinates of these points are equal to those of p1 and pn serving as end points of the certain segment of the F0 contour. Constants in equations (4) and (5) are defined by the following equations (6) to (10).
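The least-squares estimation can be sketched as follows for the cubic case. This is a minimal sketch under stated assumptions, not the closed form of equations (4) to (10): the function name `fit_cubic_bezier` is hypothetical, and the parameter value of each data point is assumed to be its normalized position on the time axis, which the description above does not specify. The end points P0 and P3 are pinned to the first and last data points, and the 2x2 normal equations are solved for P1 and P2.

```python
def fit_cubic_bezier(points):
    """Least-squares cubic Bezier fit with end points pinned to the data.

    points: list of (time, f0) samples of one segment, ordered in time.
    Returns [P0, P1, P2, P3] as (x, y) tuples.
    Assumption: t_i is the normalized time of sample i.
    """
    if len(points) < 4:
        raise ValueError("need at least four samples")
    x0 = points[0][0]
    span = points[-1][0] - x0
    p0, p3 = points[0], points[-1]

    a11 = a12 = a22 = 0.0          # entries of the normal-equation matrix
    c1 = [0.0, 0.0]                # right-hand side for P1 (x and y)
    c2 = [0.0, 0.0]                # right-hand side for P2 (x and y)
    for px, py in points:
        t = (px - x0) / span
        b0 = (1 - t) ** 3
        b1 = 3 * t * (1 - t) ** 2
        b2 = 3 * t * t * (1 - t)
        b3 = t ** 3
        a11 += b1 * b1
        a12 += b1 * b2
        a22 += b2 * b2
        for d, p in enumerate((px, py)):
            # residual after removing the fixed end-point terms
            r = p - b0 * p0[d] - b3 * p3[d]
            c1[d] += b1 * r
            c2[d] += b2 * r

    det = a11 * a22 - a12 * a12
    p1 = tuple((a22 * c1[d] - a12 * c2[d]) / det for d in range(2))
    p2 = tuple((a11 * c2[d] - a12 * c1[d]) / det for d in range(2))
    return [p0, p1, p2, p3]
```

A different parameterization, such as chord length, would change only the line that computes `t`.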
In this way, the control points of the Bézier curve with which each segment of the F0 contour is approximated are calculated. A curve obtained by connecting the Bézier curves of the segments in chronological order corresponds to an approximate contour. The present embodiment performs editing considering the approximate contour as the F0 contour.
In the present embodiment, it is assumed that an input text is written in Japanese and that the predetermined unit in which the F0 contour is segmented is an accentual phrase unit. In other words, the F0 contour is approximated with the Bézier curve in each accentual phrase. In this case, the order of the Bézier curve with which a segment of the F0 contour is approximated is preferably set to a value equal to or larger than the number of morae included in the accentual phrase of the segment. This can reduce an approximation error of the approximate contour (Bézier curve) with respect to the F0 contour. The predetermined unit in which the F0 contour is segmented is not limited to an accentual phrase. Any desired unit that prevents the approximation error from increasing may be employed.
The setter 103 sets, on the approximate contour, operation points corresponding to the control points of the Bézier curve with which the F0 contour is approximated (that is, on the Bézier curve). The operation point is operated by the user on an operation screen, which will be described later, to edit the F0 contour using the approximate contour and is always present on the approximate contour. The control points of the Bézier curve and the operation points on the approximate contour make a pair and are in one-to-one correspondence. Setting the operation points means storing the coordinates of the operation points.
As described above, the control points other than the end points of the Bézier curve are not necessarily present on the Bézier curve. In the present embodiment, the operation points corresponding to the control points of the Bézier curve are set on the approximate contour. This enables the user to edit the F0 contour (approximate contour) by operating the operation points on the approximate contour. The user can operate the operation points present on the approximate contour more intuitively than the control points not present on the approximate contour. The control points serving as the end points of the Bézier curve may be set as the operation points.
In the example illustrated in
An assumption is made that the X-coordinates of the control points 402 coincide with those of the morae as illustrated in
The translation of the control points 402 slightly changes the shape of the Bézier curve. This may increase the error (approximation error) between the Bézier curve and the original F0 contour. In the case where the approximation error exceeds a threshold, the control points 402 may be projected vertically (in the Y-axis direction) onto the approximate contour 401 without being translated in parallel, thereby setting the operation points 403. As a more sophisticated alternative, a constrained least-squares method may be used to approximate the F0 contour with the Bézier curve. The constrained least-squares method imposes a constraint that causes the X-coordinates of the control points 402 to coincide with the X-coordinates of the morae while minimizing the approximation error. Alternatively, another operation point 403 may be added at a generation position of a mora on the approximate contour 401 using a function of adding an operation point in response to a user operation (which will be described later as a modification).
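The vertical projection of a control point onto the approximate contour can be sketched as below. The sketch assumes a cubic segment whose X-coordinate increases monotonically with the parameter (time always advances along the contour), so that bisection can find the parameter value at which the curve reaches the X-coordinate of the control point. The names `cubic_bezier` and `project_onto_contour` are hypothetical.

```python
def cubic_bezier(ctrl, t):
    """Point on the cubic Bezier curve of equation (2) at parameter t."""
    b = ((1 - t) ** 3, 3 * t * (1 - t) ** 2, 3 * t * t * (1 - t), t ** 3)
    x = sum(w * p[0] for w, p in zip(b, ctrl))
    y = sum(w * p[1] for w, p in zip(b, ctrl))
    return x, y


def project_onto_contour(ctrl, x_target, iterations=60):
    """Return (t, operation_point) where the curve crosses x = x_target.

    Assumption: x(t) is monotonically increasing on [0, 1] and
    x_target lies between the X-coordinates of the two end points.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if cubic_bezier(ctrl, mid)[0] < x_target:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return t, (x_target, cubic_bezier(ctrl, t)[1])
```

The returned parameter value can be stored together with the operation point, since it is needed later when the control point position is recalculated from a move of the operation point.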
The display controller 104 displays an operation screen including the approximate contour on which the operation points are shown on the display device 120.
Similarly to the example in
The user performs an operation to move a desired operation point 502 in the Y-axis direction on the operation screen 501 illustrated in
The format of the operation screen displayed on the display device 120 is not limited to that illustrated in
The operation receiver 105 receives the user operation to move the desired operation point on the operation screen displayed on the display device 120 and transmits the moving amount of the operation point to the updater 106.
The updater 106 calculates the position of a control point corresponding to the moved operation point from the moving amount of the operation point received from the operation receiver 105 and updates the approximate contour. The updated approximate contour corresponds to an edited F0 contour.
The operation points on the approximate contour are in one-to-one correspondence with the control points of the Bézier curve forming the approximate contour. As an operation point moves, the control point corresponding thereto also moves. Because the moving amount of the operation point is not equal to that of the control point, the position (coordinates) of the control point needs to be calculated from the moving amount of the operation point as described below.
To simplify the calculation, two assumptions are made. The first assumption is that the user is restricted to moving an operation point only in the vertical direction (Y-axis direction). The second assumption is that the coordinates of control points other than the control point corresponding to the operation point moved by the user are constant. Introduction of the two assumptions facilitates calculation of the moving amount of the control point corresponding to the operation point from the moving amount of the operation point on the approximate contour as follows.
P2 represents the control point corresponding to the moved operation point, for example. Given t represents a value of the parameter at the position of the operation point corresponding to the control point P2, Δq represents a moving amount of the operation point in the vertical direction, and ΔP represents a moving amount of the control point P2 in the vertical direction, the following equation (11) is satisfied:
q(t) + \Delta q = (1 - t)^3 P_0 + 3 t (1 - t)^2 P_1 + 3 t^2 (1 - t)(P_2 + \Delta P) + t^3 P_3 \quad (11)
By substituting q(t) of equation (2) given above into equation (11) and rearranging, all terms except those containing Δq and ΔP cancel, and the following equation (12) is obtained:
\Delta P = \frac{\Delta q}{3 t^2 (1 - t)} \quad (12)
With equation (12), it is possible to derive the moving amount ΔP of the control point from the known moving amount Δq of the operation point. Adding ΔP to the Y-coordinate of the control point P2 yields the coordinates of the new control point P2. By deriving the moving amount of a control point from that of a desired operation point in the same manner, the position of a new control point can be obtained.
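Under the two assumptions stated above (vertical moves only, all other control points fixed), the vertical shift of the control point is the shift of the operation point divided by the Bernstein weight of that control point at the parameter value of the operation point. For P2 of a cubic curve the weight is 3t²(1−t). The following sketch generalizes this to the k-th control point of an m-th order curve; the generalization and the function name `control_point_shift` are assumptions made for illustration.

```python
from math import comb


def control_point_shift(order, k, t, dq):
    """Vertical shift of control point P_k that moves the curve by dq at t.

    order: order m of the Bezier curve (3 for a cubic curve)
    k:     index of the control point paired with the moved operation point
    t:     parameter value at the position of the operation point
    dq:    vertical moving amount of the operation point
    """
    weight = comb(order, k) * t ** k * (1 - t) ** (order - k)
    if weight == 0:
        # The control point has no influence on the curve at this t
        # (for example, an interior control point at t = 0 or t = 1).
        raise ValueError("control point does not affect the curve at t")
    return dq / weight


# Example: moving the operation point of P2 by 3.0 at t = 0.5
shift = control_point_shift(3, 2, 0.5, 3.0)
```

Because the weight is always smaller than 1 for an interior control point, the control point moves farther than the operation point, and the ratio grows as the operation point approaches either end of the segment.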
The updater 106 obtains the position of the control point from the moving amount of the operation point by the calculation described above. The updater 106 redraws the Bézier curve using the new control point, thereby updating the approximate contour.
As illustrated in
After the updater 106 updates the approximate contour, the speech synthesizer 101 receives the updated approximate contour as another F0 contour and generates a synthetic speech using the F0 contour. The synthetic speech is then output from the speaker 110. The user listens to the synthetic speech output from the speaker 110, thereby checking the effects of the editing.
After the updater 106 updates the approximate contour, the setter 103 newly sets operation points on the updated approximate contour. The display controller 104 displays, on the display device 120, an operation screen including the updated approximate contour on which the newly set operation points are shown. Thus, the operation screen displayed on the display device 120 is updated. The user can perform the editing work further on the updated operation screen.
The following describes an operation of the prosody editing device 100 according to the present embodiment.
First, the speech synthesizer 101 uses a statistical prosody model created in advance, for example, to generate an F0 contour of an input text (Step S101).
Subsequently, the approximate contour generator 102 approximates the F0 contour generated at Step S101 with a Bézier curve in a predetermined unit such as an accentual phrase, thereby generating an approximate contour (Step S102).
Subsequently, the setter 103 sets, on the approximate contour generated at Step S102, operation points corresponding to control points of the Bézier curve with which the F0 contour is approximated (Step S103).
Subsequently, the display controller 104 displays an operation screen including the approximate contour on which the operation points set at Step S103 are shown on the display device 120 (Step S104). The user uses the operation screen displayed on the display device 120 to perform an editing work to edit the F0 contour.
The prosody editing device 100 according to the present embodiment inquires of the user whether to finish the editing work as needed (Step S105). If the user issues no instruction to finish the editing work (No at Step S105), editing at Step S106 is repeated. If the user issues an instruction to finish the editing work (Yes at Step S105), the series of processes is ended.
First, the user performs an operation to move a desired operation point on the operation screen displayed on the display device 120 with the input device 130. The operation receiver 105 receives the operation of the user and transmits the moving amount of the operation point to the updater 106 (Step S201).
Subsequently, the updater 106 calculates the position of a new control point corresponding to the moved operation point from the moving amount of the operation point with the method described above (Step S202). The updater 106 then uses the new control point derived at Step S202 to update the approximate contour (Step S203).
Subsequently, the display controller 104 displays another operation screen including the approximate contour updated at Step S203 on the display device 120, thereby updating the operation screen displayed on the display device 120 (Step S204). Displayed on the updated operation screen is the updated approximate contour on which new operation points are shown.
The approximate contour updated at Step S203 is transmitted to the speech synthesizer 101 as an edited F0 contour. The speech synthesizer 101 uses the edited F0 contour to generate a synthetic speech, and the synthetic speech is then output from the speaker 110 (Step S205). The user listens to the synthetic speech, thereby checking whether desired prosody is obtained. To further perform the editing work, the user performs an operation to move a desired operation point on the operation screen updated at Step S204. To finish the editing work, the user issues an instruction to finish the work.
As described in detail with the specific example, the prosody editing device 100 according to the present embodiment approximates a contour representing a time series of prosody information with a parametric curve, thereby generating an approximate contour. The prosody editing device 100 sets operation points corresponding to control points of the parametric curve on the approximate contour. The prosody editing device 100 displays, on the display device 120, an operation screen including the approximate contour on which the operation points are shown, and updates the approximate contour in response to a user operation to move an operation point. The prosody editing device 100 according to the present embodiment edits prosody in this manner and thus can provide natural prosody desired by the user with an intuitive and simple operation.
In other words, the prosody editing device 100 according to the present embodiment approximates a contour representing a time series of prosody information with a parametric curve, thereby generating an approximate contour. The prosody editing device 100 regards the approximate contour as a contour to be edited and updates the approximate contour in response to a user operation performed on an operation point, thereby performing editing. With an operation to move an operation point, the prosody editing device 100 can provide a contour in which a periphery of the operation point besides the position of the operation point is smoothly changed. Thus, the prosody editing device 100 can provide natural prosody desired by the user with a simple operation.
The prosody editing device 100 according to the present embodiment sets, on the approximate contour, the operation points to be operated to edit the contour. This enables the user to edit the contour with an intuitive operation as if the user directly transforms the contour to be edited.
While a method for transforming a curve by moving control points is widely known, the control points are not necessarily present on the curve. Simply applying the method to a technology for editing prosody prevents the user from performing an intuitive operation. There has also been developed a method for providing an interface used for operation separately from a contour to be edited and transforming the contour in response to an operation through the interface. In this case too, the user cannot perform an intuitive operation as if the user directly transforms the contour to be edited. By contrast, in the present embodiment, the approximate contour is updated in response to an operation performed on an operation point on the approximate contour, thereby editing the contour. This enables the user to edit the contour with an intuitive operation as if the user directly transforms the contour to be edited. To achieve this, the prosody editing device 100 according to the present embodiment sets operation points corresponding to control points on an approximate contour and calculates a position of a new control point from a moving amount of an operation point, thereby updating the contour.
Furthermore, in the prosody editing device 100 according to the present embodiment, the speech synthesizer 101 uses the updated approximate contour to generate a synthetic speech, and the synthetic speech is then output from the speaker 110. This enables the user to check the effects of the editing while listening to the synthetic speech.
Furthermore, the prosody editing device 100 according to the present embodiment uses a Bézier curve in particular as the parametric curve with which a contour representing a time series of prosody information is approximated. As a result, the prosody editing device 100 can increase the accuracy of approximation and provide natural prosody. In other words, among parametric curves, a Bézier curve can change in a manner similar to a contour representing a time series of prosody information. The prosody editing device 100 therefore generates the approximate contour using a Bézier curve, thereby providing natural prosody.
Furthermore, in the case where the positions (X-coordinates) of the control points 402 in the time-axis direction are different from the generation positions (X-coordinates) of phonemes or morae on the approximate contour 401 as illustrated in
Furthermore, as illustrated in
In the embodiment above, the operation receiver 105 receives a user operation to move an operation point already set on the approximate contour included in the operation screen. The operation receiver 105 may receive an operation to add an operation point at a desired position on the approximate contour besides the operation to move an operation point already set.
The user performs an operation to add an operation point at a desired position on the approximate contour included in the operation screen with the input device 130. In the case where a mouse is used as the input device 130, for example, the user makes a double-click or a right-click with a cursor positioned at a desired position on the approximate contour, thereby adding an operation point at the position of the cursor. In the case where a touch panel is used as the input device 130, the user performs a touch operation on a desired position on the approximate contour, thereby adding an operation point at the touch position.
The operation receiver 105 receives the user operation to add an operation point at a desired position on the approximate contour and transmits position information (coordinates) of the added operation point to the updater 106.
The updater 106 obtains the position of a control point corresponding to the operation point using the calculation below, based on the position information of the operation point added by the user operation, and updates the approximate contour.
Assuming that q represents the coordinates of the operation point added by the user operation, t represents a value of the parameter at the position, Pk represents the position of a control point corresponding to the added operation point, and the coordinates of control points other than the control point are constant, the following equation (13) is satisfied:
Equation (13) indicates that the term of the added control point Pk in the right side is equal to the change amount of the operation point in the left side. Thus, the coordinate Pk of the control point corresponding to the added operation point is calculated from the following equation (14):
The updater 106 redraws the Bézier curve using the new control point thus calculated in this manner as well as the existing control points, thereby updating the approximate contour. In the example illustrated in
After the approximate contour is updated, an operation screen including the updated approximate contour is displayed on the display device 120 similarly to the embodiment above. The user can edit the F0 contour in the same manner as in the embodiment above on the updated operation screen.
In this modification, an operation point can be added at a desired position on the approximate contour, thereby further improving user operability. In the case where the X-coordinates of the control points do not coincide with those of the phonemes or the morae on the approximate contour as described above, for example, operation points can be added at positions corresponding to the X-coordinates of the phonemes or the morae without making an adjustment to parallel translate the control points in the X-axis direction. This can reduce the approximation error.
The prosody editing device according to the present embodiment can be provided by using a general-purpose computer as basic hardware, for example.
Instructions on the processing described in the embodiment above are executed based on a computer program serving as software, for example. The instructions on the processing described in the embodiment above are recorded in a recording medium such as a magnetic disk (e.g., a flexible disk (FD) and a hard disk), an optical disc (e.g., a compact disc read only memory (CD-ROM), a compact disc recordable (CD-R), a compact disc rewritable (CD-RW), a digital versatile disc ROM (DVD-ROM), a DVD±R, a DVD±RW, and a Blu-ray (registered trademark) disc), a semiconductor memory, and the like as a computer-executable program. The recording medium may have any storage format as long as it is a computer-readable recording medium.
The computer reads the computer program from the recording medium and executes the instructions described in the computer program with the CPU 150 based on the computer program. Thus, the computer functions as the prosody editing device 100 according to the embodiment above. The computer may acquire or read the computer program via a network.
Based on the instructions of the computer program installed in the computer from the recording medium, an operating system (OS) operating on the computer and middleware (MW), such as database management software and network software, may perform a part of the processing to provide the present embodiment, for example.
The recording medium in the present embodiment is not limited to a medium independent of the computer and may be a recording medium that downloads and permanently or temporarily stores therein the computer program transmitted via a LAN, the Internet, or the like.
The recording medium is not limited to a single recording medium. A case where the processing in the present embodiment is executed from a plurality of media is also covered by the recording medium in the present embodiment. The recording media may have any configuration.
The computer program executed by the computer has a module configuration including the processing units constituting the prosody editing device 100 according to the present embodiment (the speech synthesizer 101, the approximate contour generator 102, the setter 103, the display controller 104, the operation receiver 105, and the updater 106). In an actual hardware configuration, the CPU 150 reads and executes the computer program from the memory 140 to load the processing units on the main memory, for example. Thus, the processing units are loaded and generated on the main memory.
The computer in the present embodiment performs the processing in the present embodiment based on the computer program stored in the recording medium. The computer may have any configuration, including a single device, such as a personal computer and a microcomputer, and a system in which a plurality of devices are connected via a network, for example. The computer in the present embodiment is not limited to a personal computer and may be an arithmetic processing unit included in an information processor and a microcomputer, for example. The computer collectively indicates equipment and devices capable of carrying out the functions in the present embodiment based on the computer program.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
2013-192359 | Sep 2013 | JP | national |