The present invention relates to a technique for emphasizing or depressing a prosody (e.g., modulation of a volume, pitch, etc.) of voice.
Heretofore, there have been proposed techniques for varying a prosody of voice. Japanese Patent Application Laid-open Publication No. 2004-252085, for example, discloses a technique for depressing a prosody by decreasing variation widths of a volume and pitch of a voice signal to predetermined ranges (hereinafter referred to as “reference ranges”). The reference ranges are fixedly set in accordance with standard variation widths of volumes and pitches of voice uttered or generated in a calm state.
However, with the technique disclosed in the No. 2004-252085 publication, where the fixedly-set reference ranges are used to depress a volume and pitch irrespective of characters of a voice signal to be actually processed, it is difficult to perform appropriate voice prosody control corresponding to the characters of the voice signal. For example, if the volume and pitch of a voice signal to be processed fall within the reference ranges, there would occur no change in prosody between before and after the processing.
In view of the foregoing, it is an object of the present invention to provide an improved voice processing apparatus and method which can appropriately control a prosody of voice in accordance with a character of a voice signal.
In order to accomplish the above-mentioned object, the present invention provides an improved voice processing apparatus, which comprises: a character extraction section that extracts character amounts, pertaining to a prosody of voice, from a voice signal sequentially in a time-serial manner; a difference calculation section that calculates a difference value between each of the character amounts extracted by the character extraction section sequentially in a time-serial manner and a reference value; a processing value generation section that generates processing values, corresponding to individual ones of the character amounts, in accordance with respective ones of the difference values; and a voice processing section that controls the individual character amounts of the voice signal in accordance with the processing values corresponding to the character amounts and thereby generates an output signal having a prosody changed from the prosody of the voice signal.
According to the voice processing apparatus of the present invention constructed in the aforementioned manner, an output signal having a prosody changed from the prosody of the voice signal is generated by use of the processing values corresponding to the difference values between the individual character amounts of the voice signal and the reference value. Thus, the voice processing apparatus of the present invention can appropriately control the prosody in accordance with the individual character amounts of the voice signal, as compared to the prior art technique disclosed in the No. 2004-252085 publication where the volume and pitch of a voice signal are restricted to within the respective fixed reference ranges.
In a preferred implementation, the processing value generation section calculates, as the processing value, a numerical value obtained by subtracting the difference value from a predetermined function value calculated using the difference value as an independent variable, and the voice processing section generates the output signal by changing the individual character amounts of the voice signal by the corresponding processing values. Such an arrangement can advantageously control increase/decrease of character amounts of the output signal on the basis of the reference value while accurately reflecting the character amounts of the voice signal in the output signal.
Preferably, when the prosody is to be emphasized, the processing value generation section calculates the processing value on the basis of the function value set such that the absolute value of the function value exceeds the absolute value of the difference value, but, when the prosody is to be depressed, the processing value generation section calculates the processing value on the basis of the function value set such that the absolute value of the function value falls below the absolute value of the difference value. Such an arrangement can achieve both emphasis and depression of the prosody.
In a preferred implementation, the processing value generation section calculates the processing value such that a rate of change, relative to the difference value, of the processing value increases as the absolute value of the difference value increases (see, for example, functions F2A and F2B in
In a preferred implementation, the processing value generation section calculates the processing value such that the rate of change, relative to the difference value, of the processing value decreases as the absolute value of the difference value increases (see, for example, functions F3A and F3B in
In a preferred implementation, the processing value generation section variably controls relationship between the difference values and the processing values. Such an arrangement can advantageously generate an output signal having a diversely changed prosody, as compared to a case where relationship between the difference values and the processing values is fixed. In this case, the processing value generation section may variably control the relationship between the difference values and the processing values in any desired manner. For example, there may be employed a scheme in which any one of different kinds of functions (e.g., functions F1A-F3A, F1B-F3B) defining relationship between the difference values and the processing values is selectively used, or where a coefficient of one kind of function defining relationship between the difference values and the processing values (e.g., slope of a function F1A or F1B in
Note that the reference value to be used by the difference calculation section may be set in any desired manner. For example, the reference value may be set at a predetermined value irrespective of the voice signal. However, with a view to restricting a discrepancy in characteristic between the output signal and the voice signal, it is preferable to set the reference value in accordance with a plurality of character amounts extracted by the character extraction section. For example, the maximum or minimum value of the plurality of character amounts may be set as the reference value, or an average value of the plurality of character amounts may be set as the reference value. With a view to effectively restricting a discrepancy in characteristic (e.g., volume feeling or pitch feeling) between the output signal and the voice signal, it is particularly advantageous to set an average value of the plurality of character amounts as the reference value.
The voice processing apparatus according to the aforementioned preferred implementations of the present invention may be implemented by hardware (electronic circuitry), such as a DSP (Digital Signal Processor) dedicated to the inventive voice processing, as well as by cooperation between a general-purpose arithmetic operation processing device, such as a CPU (Central Processing Unit), and a software program.
Further, the present invention may also be practiced as a method implemented by a computer for processing voice, or as a computer readable storage medium containing a group of instructions for causing a computer to perform a voice processing procedure. The method, storage medium or program can accomplish generally the same behavior and advantageous benefits as the aforementioned preferred implementations of the voice processing apparatus. The program of the present invention may not only be supplied to a user stored in a computer-readable storage medium and then installed in a computer of the user, but also be delivered from a server apparatus via a communication network and then installed in a computer of a user.
The following will describe embodiments of the present invention, but it should be appreciated that the present invention is not limited to the described embodiments and various modifications of the invention are possible without departing from the basic principles. The scope of the present invention is therefore to be determined solely by the appended claims.
For better understanding of the object and other features of the present invention, its preferred embodiments will be described hereinbelow in greater detail with reference to the accompanying drawings, in which:
The arithmetic operation processing device 10 functions as a prosody control section 20 and a voice processing section 30 by executing programs stored in the storage device 12. The voice processing section 30 changes (emphasizes or depresses) the prosody of the voice signal SO to thereby generate an output signal SOUT. The term “prosody” is used herein to mean modulation (intonation) or tone of voice (utterer's feeling) perceived by a listener by virtue of acoustic characters (typically, volume and pitch) of the voice. Voice with an emphasized prosody gives the listener an emotional or sentimental impression, while voice with a depressed prosody gives the listener an inorganic or intellectual impression. The voice processing section 30 in the instant embodiment generates an output signal SOUT by changing the volume and pitch of the voice signal SO. Thus, the instant embodiment can advantageously generate an output signal SOUT of a desired prosody even where a plurality of voice signals SO of different prosodies are not prepared in advance; accordingly, the instant embodiment can reduce the necessary capacity of the storage device 12 for storing such voice signals SO.
The prosody control section 20 of
An input device 14 and a sounding device 16 are connected to the arithmetic operation processing device 10. The input device 14 includes operating members (operators) operable by a human operator or user to give various instructions to the voice processing apparatus 100. By appropriately operating the input device 14, the user can give control parameter values (hereinafter sometimes referred to as “control values”) U, indicative for example of a direction of a prosody change (i.e., whether the prosody is to be emphasized or depressed) and a degree of the prosody change. The sounding device 16, comprising for example a speaker or headphone, radiates voice corresponding to an output signal SOUT generated by the arithmetic operation processing device 10.
The reference setting section 24 variably sets reference values R (RV, RP) in accordance with the character amounts F (FV, FP) extracted by the character extraction section 22. For example, for each of the character types, i.e. volume and pitch in this case, an average of a plurality of the character amounts F is set as the reference value R. Namely, the reference setting section 24 calculates an average value of volumes FV, extracted for all of the segments of the voice signal SO, as the reference value RV, and calculates an average value of pitches FP, extracted for all of the segments of the voice signal SO, as the reference value RP.
The difference calculation section 26 calculates a difference value D (DV, DP) between each of the character amounts F identified by the character extraction section 22 for each of the unit segments and the reference value R set by the reference setting section 24 on the basis of the character amounts F. More specifically, the difference calculation section 26 calculates a difference value DV by subtracting the reference value RV from the extracted volume FV for each of the unit segments (DV=FV−RV) and calculates a difference value DP by subtracting the reference value RP from the extracted pitch FP for each of the unit segments (DP=FP−RP). Namely, such difference values D (DV, DP) are calculated for each of the unit segments.
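The reference setting and difference calculation described above can be illustrated with a minimal sketch. The function names and the sample volume values are assumptions introduced only for illustration; they do not appear in the embodiment.

```python
from statistics import fmean


def reference_value(amounts):
    """Reference value R: the average of the character amounts F of all unit segments."""
    return fmean(amounts)


def difference_values(amounts, reference):
    """Difference value D = F - R, calculated for each unit segment."""
    return [f - reference for f in amounts]


# Hypothetical volumes FV of four unit segments (arbitrary units).
volumes = [60.0, 70.0, 50.0, 60.0]
rv = reference_value(volumes)           # reference value RV
dv = difference_values(volumes, rv)     # difference values DV
print(rv, dv)
```

The same two functions would apply unchanged to the pitches FP to obtain RP and DP.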
The variable determination section (processing value generation section) 28 generates, for each of the unit segments, processing values C (CV, CP), corresponding to the character amounts F, in accordance with the difference values D (DV, DP) calculated by the difference calculation section 26. More specifically, for each of the unit segments, the variable determination section 28 calculates a processing value CV corresponding to the difference value DV and a processing value CP corresponding to the difference value DP.
The slope of the function F1A (i.e., change rate of the function value f relative to the difference value D) is variably set, in accordance with the control parameter value U, within a range greater than “1”. Therefore, the absolute value of the function value f(D) of the function F1A exceeds the absolute value of the difference value D. The slope of the function F1B, on the other hand, is variably set, in accordance with the control parameter value U, within a positive value range smaller than “1”. Therefore, the absolute value of the function value f(D) of the function F1B falls below the absolute value of the difference value D. The control parameter value U may be variably generated in response to operation of a human operator, or variably automatically generated in accordance with some factor, such as an ambient environment.
The variable determination section 28 subtracts the difference value D from the function value f(D), corresponding to the difference value D, of the function F1 (F1A or F1B) and sets a value obtained by the subtraction as a processing value C (C=f(D)−D). Thus, the processing value C varies in accordance with (i.e., in proportion to) the difference value D; that is, as the absolute value of the difference value D increases, the absolute value of the processing value C increases. Further, in a case where the difference value D is a positive value, the processing value C when the prosody is to be emphasized (i.e., when the function F1A is to be used) is set at a positive value, while the processing value C when the prosody is to be depressed (i.e., when the function F1B is to be used) is set at a negative value. Furthermore, in a case where the difference value D is a negative value, the processing value C when the prosody is to be emphasized (i.e., when the function F1A is to be used) is set at a negative value, while the processing value C when the prosody is to be depressed (i.e., when the function F1B is to be used) is set at a positive value. Note that, where the control parameter value U is a neutral value, the processing value C is “0” irrespective of the difference value D.
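The sign behavior described above can be checked with a short sketch. It assumes that the function F1 is a straight line through the origin, f(D) = slope × D, and that the slope is passed in directly; how the control parameter value U maps to a slope is not specified here, and treating a slope of 1 as the neutral setting is likewise an assumption.

```python
def processing_value(d, slope):
    """Processing value C = f(D) - D for the assumed linear function f(D) = slope * D.

    slope > 1      corresponds to F1A (emphasis),
    0 < slope < 1  corresponds to F1B (depression),
    slope == 1     is taken here as the neutral setting (C is always 0).
    """
    return slope * d - d


for d in (10.0, -10.0):
    print(d, processing_value(d, 1.5), processing_value(d, 0.5), processing_value(d, 1.0))
```

With these sample values, emphasis yields a processing value of the same sign as D, depression yields the opposite sign, and the neutral slope yields 0, in agreement with the description.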
In accordance with the processing value C determined by the variable determination section 28 for each of the unit segments of the voice signal SO, the voice processing section 30 of
The volume change section 32 changes the volume FV of each of the unit segments of the voice signal SO in accordance with the processing value CV of the unit segment. Namely, the volume change section 32 changes the volume FV of each of the unit segments of the voice signal SO to the sum of the volume FV and the processing value CV. Similarly, the pitch change section 34 changes the pitch FP of each of the unit segments of the voice signal SO in accordance with the processing value CP of the unit segment. Namely, the pitch change section 34 changes the pitch FP of each of the unit segments of the voice signal SO to the sum of the pitch FP and the processing value CP. Through the conversion of the volume FV by the volume change section 32 and the conversion of the pitch FP by the pitch change section 34, an output signal SOUT is generated from the voice signal SO.
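As a rough illustration, the target character amounts of the output signal SOUT can be computed per unit segment as follows. The sketch assumes an average-based reference value and a linear function f(D) = slope × D; it only computes the target values, and the signal processing that actually alters the volume or pitch of the waveform is outside its scope. All names are hypothetical.

```python
def target_amounts(amounts, slope):
    """Return F + C for each unit segment, with C = f(D) - D and f(D) = slope * D."""
    reference = sum(amounts) / len(amounts)   # reference value R (average)
    targets = []
    for f in amounts:
        d = f - reference                     # difference value D
        c = slope * d - d                     # processing value C
        targets.append(f + c)                 # changed character amount
    return targets


volumes = [60.0, 70.0, 50.0, 60.0]            # hypothetical volumes FV
emphasized = target_amounts(volumes, 1.5)
depressed = target_amounts(volumes, 0.5)
print(emphasized, depressed)
```

In this example the spread around the average widens under emphasis and narrows under depression, while the average of the character amounts is unchanged.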
Because the character amount F of each of the unit segments of the voice signal SO corresponds to the sum of the reference value R and the difference value D (F=R+D), the sum of the character amount F of the voice signal SO and the processing value C (i.e., the character amount of the output signal SOUT) equals the sum of the reference value R and the function value f(D) as follows:
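The expression itself is not reproduced in this text. From the definitions F=R+D and C=f(D)−D given in the description, it can be reconstructed as the following sketch; the label (1) is assumed to correspond to the Mathematical Expression (1) referred to elsewhere in the description.

```latex
F + C = (R + D) + \bigl(f(D) - D\bigr) = R + f(D) \tag{1}
```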
As described above with reference to
In a case where depression of the prosody has been instructed, on the other hand, the processing value C is set at a negative value when the corresponding difference value D is a positive value, but set at a positive value when the corresponding difference value D is a negative value. Thus, as shown in
With the instant embodiment, as set forth above, the degree of depression of the prosody is variably controlled in accordance with the character amounts F of the voice signal SO. It is therefore possible to appropriately control the prosody in accordance with the character amounts F of the voice signal SO, as compared to the prior art technique disclosed in the No. 2004-252085 publication, where the volume and pitch of the voice signal SO are merely depressed to within the reference ranges. For example, even when the voice signal SO has a small volume, the instant embodiment can control the prosody reliably and finely. Further, because the rate of change (or slope) of the function F1 (F1A, F1B), which is to be used for calculating a processing value C from the difference value D, is variably controlled, the instant embodiment can also appropriately adjust the rate of change of the prosody in the output signal SOUT.
Further, with the prior art technique disclosed in the No. 2004-252085 publication, where the reference ranges are set independently of the voice signal, a problem arises: where, for example, the volume and pitch of the voice signal substantially deviate from middle values of their respective reference ranges, the voice characters would undesirably vary prominently between before and after depression of the prosody. By contrast, the instant embodiment of the invention is arranged to generate an output signal SOUT by changing the character amounts F of the voice signal SO by amounts corresponding to the processing values C each calculated by subtracting the difference value D from the function value f(D) of the function F1. Thus, as seen from Mathematical Expression (1) above and
The following describes a second embodiment of the present invention. Similar elements to those in the first embodiment are indicated by the same reference numerals and characters as used for the first embodiment and will not be described in detail here to avoid unnecessary duplication.
In the second embodiment, the variable determination section 28 retains three different kinds of functions F (F1-F3). The variable determination section (processing value generation section) 28 selectively uses any one of the three different kinds of functions F (F1-F3) to calculate a processing value C. Any one of the three different kinds of functions F (F1-F3) which is to be selected by the variable determination section 28 is designated by the user via the input device 14. The manner in which the variable determination section 28 calculates a processing value C from a difference value D using the function F2 or F3 is the same as in the aforementioned first embodiment in which a processing value C is calculated on the basis of the function F1.
For each of the functions F2A and F3B, as shown in
As understood from the foregoing, when the function F2 (F2A, F2B) is selected, the rate of change of the processing value C relative to the difference value D increases as the absolute value of the difference value D increases; namely, the absolute value of the processing value C increases exponentially in response to variation of the absolute value of the difference value D. Thus, in this case, an amount of variation (variation width) of the character amount F of the output signal SOUT relative to the character amount of the voice signal SO increases as compared to that in the case where the function F1 is used. Namely, in this case, it is possible to increase the degree of variation (emphasis or depression) of the prosody as compared to the case where the function F1 is used.
When the function F3 (F3A, F3B) is selected, the rate of change of the processing value C relative to the difference value D decreases as the absolute value of the difference value D increases. Thus, for a unit segment where the difference value D is great, an amount of variation (variation width) in the character amount of the output signal SOUT relative to the voice signal SO decreases as compared to that in the case where the function F1 is used. Namely, in this case, it is possible to decrease the degree of variation (emphasis or depression) of the prosody as compared to the case where the function F1 is used.
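The contrast between the functions F2 and F3 can be illustrated with a sketch. The concrete forms of F2 and F3 are not given in this text, so the two formulas used here (a quadratic term for F2, a square-root term for F3) and the coefficient k are assumptions chosen only to reproduce the stated behavior of the rate of change; a positive k is taken to correspond to emphasis.

```python
import math


def c_f2(d, k):
    """Assumed F2-type case: f(D) = D + k*D*|D|, so C = k*D*|D|.

    The rate of change of C relative to D (2*k*|D|) grows as |D| grows.
    """
    return k * d * abs(d)


def c_f3(d, k):
    """Assumed F3-type case: f(D) = D + k*sign(D)*sqrt(|D|), so C = k*sign(D)*sqrt(|D|).

    The rate of change of C relative to D shrinks as |D| grows.
    """
    return k * math.copysign(math.sqrt(abs(d)), d)


for d in (2.0, 4.0, 16.0):
    print(d, c_f2(d, 0.25), c_f3(d, 1.0))
```

Under these assumed forms, doubling D from 2 to 4 quadruples the F2-type processing value, whereas quadrupling D from 4 to 16 only doubles the F3-type processing value.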
In the above-described second embodiment, where any one of the plurality of kinds of functions F (F1-F3) is selectively used for calculation of the processing value C, it is possible to appropriately adjust a change of the prosody as necessary. Especially, the second embodiment, which allows the user to designate a desired function F to be used for calculation of the processing value C, can advantageously provide an output signal SOUT having a user-desired prosody.
A voice signal SO of voice related to use of the electric apparatus (hereinafter referred to as “guide voice”) is stored in the storage device 12. The guide voice includes, for example, voice presenting to the user how to use the electric apparatus, voice informing the user of an operating state of the electric apparatus, and voice giving the user a warning. The prosody control section 20 and voice processing section 30 generate an output signal SOUT by changing the prosody of the voice signal SO in generally the same manner as in the first embodiment.
The control section 40 variably controls the control value U in accordance with the current time t counted by the timer section 42. For example, if the current time t is in the morning time zone, the control section 40 generates and outputs, to the prosody control section 20, a control value U instructing emphasis of the prosody. If, on the other hand, the current time t is in the night time zone, the control section 40 generates and outputs, to the prosody control section 20, a control value U instructing depression of the prosody. Thus, guide voice with an emphasized prosody is reproduced in the morning time zone, while guide voice with a depressed prosody is reproduced in the night time zone. In this way, the instant embodiment can generate guide voice with a prosody suitable for the time zone when the electric apparatus is used. Further, because there is no need to store in the storage device 12 voice signals SO of different prosodies, the instant embodiment can reduce the necessary capacity of the storage device 12.
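A minimal sketch of such time-dependent control is given here. The boundaries of the morning and night time zones, the string form of the control value U, and the neutral value returned at other times are all assumptions made for illustration; the embodiment does not specify them.

```python
from datetime import time

# Assumed time-zone boundaries (not specified in the embodiment).
MORNING_START = time(5, 0)
MORNING_END = time(12, 0)
NIGHT_START = time(20, 0)


def control_value(now):
    """Return a hypothetical control value U for the current time t."""
    if MORNING_START <= now < MORNING_END:
        return "emphasize"      # morning time zone: emphasize the prosody
    if now >= NIGHT_START or now < MORNING_START:
        return "depress"        # night time zone: depress the prosody
    return "neutral"            # other times: leave the prosody unchanged (assumption)


print(control_value(time(8, 30)), control_value(time(23, 0)))
```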
<Modification>
The above-described embodiments may be modified variously, and the following are among specific examples of modifications. Note that two or more of the following modifications may be combined as desired.
(Modification 1)
Whereas the above-described embodiments have been constructed to calculate a processing value C (CV, CP) by the variable determination section 28 by performing arithmetic operations using the function F (F1-F3), there may be employed any other suitable way for determining a processing value C on the basis of the difference value D. For example, a data table having various difference values D and various processing values C stored in association with each other may be prepared in advance so that the variable determination section 28 can acquire, from the data table, a particular processing value C corresponding to the difference value D calculated by the difference calculation section 26 and then output the acquired processing value C to the voice processing section 30.
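One possible form of such a data table is sketched here. The table contents are hypothetical, and selecting the entry whose difference value D is nearest to the calculated one (with clamping at both ends of the table) is an assumption; the modification does not state how values falling between table entries are handled.

```python
from bisect import bisect_left

# Hypothetical table: difference values D (sorted) and associated processing values C.
TABLE_D = [-20.0, -10.0, 0.0, 10.0, 20.0]
TABLE_C = [-10.0, -5.0, 0.0, 5.0, 10.0]


def lookup_processing_value(d):
    """Return the processing value C stored for the table entry nearest to D."""
    i = bisect_left(TABLE_D, d)
    if i == 0:
        return TABLE_C[0]                 # below the table range: use the first entry
    if i == len(TABLE_D):
        return TABLE_C[-1]                # above the table range: use the last entry
    lower, upper = TABLE_D[i - 1], TABLE_D[i]
    # Choose the nearer neighbor; a tie goes to the lower entry.
    return TABLE_C[i] if upper - d < d - lower else TABLE_C[i - 1]


print(lookup_processing_value(10.0), lookup_processing_value(12.0), lookup_processing_value(-30.0))
```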
(Modification 2)
Whereas the above-described embodiments have been constructed to use an average of a plurality of character amounts F as the reference value R, there may be employed any other suitable way for calculating the reference value R. For example, the reference value R may be calculated on the basis of a plurality of character amounts F extracted by the character extraction section 22, or the maximum or minimum value of the plurality of character amounts F extracted by the character extraction section 22 may be used as the reference value R. Alternatively, the reference value R may be set irrespective of the voice signal SO.
Further, whereas the above-described embodiments have been constructed to use the same or common reference value R for calculation of a processing value C in every unit segment of the voice signal SO, the reference value R to be used for calculation of a processing value C may be made different for each of the unit segments of the voice signal SO. For example, the voice signal SO may be divided into a plurality of voice-present segments each containing voice and a plurality of voice-absent segments each containing no voice or containing only sound noise, in which case the reference setting section 24 calculates, individually for each of the voice-present segments, a reference value R corresponding to character amounts F of unit segments within the voice-present segment. Then, the difference calculation section 26 applies the reference value R, calculated for each of the voice-present segments, to calculation of a difference value D for each of the unit segments within the voice-present segment. Such arrangements can appropriately control the prosody of the voice signal SO even when an acoustic character has changed in the middle of the voice signal SO.
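The per-segment reference value can be sketched as follows. The sketch assumes that a voice-present/voice-absent flag is already available for each unit segment (the detection method is not described here), that each run of consecutive voice-present unit segments forms one voice-present segment, and that the average is used as the reference value R. Returning None for voice-absent unit segments is also an assumption.

```python
from itertools import groupby


def segment_differences(amounts, voiced):
    """Difference values D using a separate reference value R per voice-present segment."""
    result = []
    for is_voiced, run in groupby(zip(amounts, voiced), key=lambda pair: pair[1]):
        values = [f for f, _ in run]
        if not is_voiced:
            result.extend([None] * len(values))      # voice-absent: no difference value
            continue
        reference = sum(values) / len(values)        # reference value R of this segment
        result.extend(f - reference for f in values)
    return result


# Hypothetical pitches FP: two voice-present segments separated by a voice-absent gap.
pitches = [50.0, 70.0, 0.0, 0.0, 100.0, 120.0]
flags = [True, True, False, False, True, True]
diffs = segment_differences(pitches, flags)
print(diffs)
```

In this example both voice-present segments yield the same difference values even though their overall levels differ, which is the effect the per-segment reference value is meant to achieve.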
(Modification 3)
Whereas the control section 40 in the third embodiment has been described as generating a control value U in accordance with the current time t, it may generate a control value U in accordance with any other suitable condition or factor than the current time t. For example, a separate control value U may be registered in advance individually for each of a plurality of potential users so that the control section 40 selects, from among the registered control values U, a particular control value U corresponding to an actual user and outputs (or designates) the selected control value U to the prosody control section 20. Further, an ambient environment condition, such as sound noise, may be detected so that a control value U suited for the detected ambient environment condition is automatically generated.
(Modification 4)
The character amounts F to be used for control of a prosody should not be understood as limited to those of volume FV and pitch FP. For example, the character extraction section 22 may extract, as the character amount F, a slope of a straight line approximating a region higher in frequency than a peak having the greatest intensity in a frequency spectrum (power spectrum) of a voice signal SO and then the voice processing section 30 changes the prosody on the basis of the slope; this arrangement too can generate an output signal SOUT presenting a prosody changed from that of the voice signal SO. Further, only one of the volume FV and pitch FP may be extracted as the character amount F. As understood from the foregoing, any numerical value pertaining to (i.e., characterizing) a prosody of voice is suitable as the character amount F.
(Modification 5)
Whereas the preferred embodiments have been described above as emphasizing or depressing a prosody of a voice signal SO, they may be suitably applied to a case where only one of emphasis or depression of a prosody is to be performed. For example, where the voice processing apparatus 100 is dedicated only to emphasis of a prosody, the variable determination section 28 uses, for calculation of a processing value C, a function F (F1A, F2A, F3A) defining relationship such that the absolute value of the function value f exceeds the absolute value of the difference value D.
(Modification 6)
The supply source of a voice signal SO should not be understood as limited to the storage device 12. For example, the supply source may be a voice pickup device (microphone) that picks up ambient voice and generates a voice signal SO, or a reproduction device that reproduces a voice signal SO stored in a mobile or portable recording medium. Alternatively, there may be employed a construction where an output signal SOUT is generated from a voice signal SO synthesized through a conventionally-known voice synthesis technique.
(Modification 7)
The destination of an output signal SOUT generated by the voice processing section 30 should not be understood as limited to the sounding device 16. For example, there may be employed a construction where an output signal SOUT is retained in the storage device 12, or where an output signal SOUT is transmitted to another device via a communication network.
This application is based on, and claims priority to, JP PA 2008-191973 filed on 25 Jul. 2008. The disclosure of the priority application, in its entirety, including the drawings, claims, and the specification thereof, is incorporated herein by reference.
Number | Date | Country | Kind |
---|---|---|---|
2008-191973 | Jul 2008 | JP | national |