1. Technical Field of the Invention
The present invention relates to technology for processing a voice signal.
2. Description of the Related Art
A technology for converting voice characteristics is proposed, for example, in Japanese Patent Application Laid-Open Publication No. 2014-002338 (hereinafter referred to as “JP 2014-002338”). This reference discloses a technology for converting the voice characteristics of a voice signal that is a processing target (hereinafter referred to as “target signal”) into distinguishing (non-modal or non-harmonic) voice characteristics such as gruffness or hoarseness. In the technology disclosed in JP 2014-002338, a spectrum of a target voice signal that has been adjusted to a fundamental frequency of an object signal is divided into a plurality of band segments (hereinafter referred to as “unit bands”), with a harmonic frequency residing at the center of each of the unit bands, and each component of each of the unit bands is then reallocated along a frequency axis. Next, the amplitude and phase are adjusted for each of the unit bands such that the amplitude and phase at the harmonic frequency in each of the reallocated unit bands correspond to the amplitude and phase of the target signal.
In the technology disclosed in JP 2014-002338, the amplitude and phase of each unit band are adjusted after a plurality of unit bands has been defined such that an intermediary point between a harmonic frequency and a next adjacent harmonic frequency on the frequency axis constitutes a boundary. A drawback of this technique is that the amplitude and phase at the boundary of each unit band (i.e., at the intermediary point between adjacent harmonic frequencies) become discontinuous. Presuming generation of a voice in which harmonic components predominate over non-harmonic components, the intensity at the intermediary point between harmonic frequencies of the generated voice is sufficiently low, and any discontinuity in the amplitude and phase of the non-harmonic components there will hardly be perceived by a listener. However, where a subject voice has a predominance of non-harmonic components, as in the case of a gruff or hoarse voice, a discontinuity in the amplitude and phase at the intermediary point between harmonic frequencies becomes apparent, with the result that an acoustically unnatural voice may be perceived by the listener.
In view of the above-mentioned issues, an object of the present invention is to generate an acoustically natural voice from a voice type that has a predominance of non-harmonic components.
In one aspect, the present invention provides a voice processing method including: adjusting a fundamental frequency of a first voice signal of a voice having target voice characteristics to a fundamental frequency (hereinafter, “second fundamental frequency”) of a second voice signal of a voice having initial voice characteristics that differ from the target voice characteristics; allocating one of a plurality of unit band components to each one of a plurality of frequency bands, the plurality of unit band components being obtained by dividing into segments a spectrum of the first voice signal whose fundamental frequency has been adjusted to the second fundamental frequency, with a plurality of harmonic frequencies corresponding to the second fundamental frequency constituting boundaries, each frequency band being defined by two harmonic frequencies from among the plurality of harmonic frequencies corresponding to the second fundamental frequency, such that each allocated unit band component is disposed adjacent the corresponding unit band component in a spectrum of the first voice signal before adjustment of the fundamental frequency to the second fundamental frequency; and generating a converted spectrum by adjusting component values of each of the unit band components after allocation, in accordance with component values of a spectrum of the second voice signal, and by applying component values of the spectrum of the second voice signal to each of a plurality of specific bands of the spectrum constituted by the unit band components after allocation, with each specific band including one of the harmonic frequencies corresponding to the second fundamental frequency.
Preferably, the unit band components are allocated such that the band of a unit band component substantially matches a frequency band defined by two harmonic frequencies corresponding to the second fundamental frequency. The band of a unit band component may or may not entirely match the frequency band. However, even if the band of a unit band component after allocation does not exactly match a frequency band defined by two harmonic frequencies adjacent each other on the frequency axis and corresponding to the second fundamental frequency, so long as the difference between a pitch corresponding to the second voice signal and that corresponding to the converted spectrum is not perceivable by the listener, for all practical purposes a substantial match is attained. A typical example of two harmonic frequencies defining a frequency band is two harmonic frequencies that are adjacent each other along the frequency axis from among the plurality of harmonic frequencies corresponding to the second fundamental frequency.
In the above configuration, the spectrum of the first voice signal, after its fundamental frequency is adjusted to the second fundamental frequency, is segmented with a plurality of harmonic frequencies corresponding to the second fundamental frequency constituting boundaries, and component values are adjusted for each of the resulting unit band components. Accordingly, a discontinuity of component values in a non-harmonic component between harmonic frequencies is reduced. Therefore, in comparison with a configuration in which a plurality of unit band components is defined with a point between harmonic frequencies constituting the boundary, the present invention has an advantage of generating an acoustically natural voice even when the voice contains a predominance of non-harmonic components. However, since the plurality of unit band components is defined with the plurality of harmonic frequencies constituting boundaries, a discontinuity in component values at a harmonic frequency can be problematic. In the above aspect of the present invention, since the component values of the second voice signal are applied to a specific band including a harmonic frequency, the present invention has a further advantage of reducing the discontinuity in the component values at the harmonic frequency, so as to accurately reproduce the target voice characteristics.
The bandwidth of each specific band preferably is a predetermined value common to the plurality of specific bands, or it may be variable. In a case where the bandwidth of each specific band is variable and where the component values include amplitude components, a specific band corresponding to each harmonic frequency may be defined by two end points, namely, the two frequencies at which the amplitude component value takes a minimum on either side of the harmonic frequency. Alternatively, each specific band may be set so as to enclose one of a plurality of peaks in the spectrum of the first voice signal after allocation of the unit band components. Variable specific bands are advantageous in that the specific bands are set to have bandwidths suited to the characteristics of the spectrum after allocation of the unit band components.
In one aspect, the component values of each unit band component may be adjusted such that a component value at one of the harmonic frequencies corresponding to the second fundamental frequency, the component value being one of the component values of each unit band component after allocation, matches a component value at the same harmonic frequency in the spectrum of the second voice signal. This configuration is advantageous in that a voice signal is generated that accurately maintains phonemes of the second voice signal. This is because component values at the harmonic frequency, of the respective unit band components after allocation, are adjusted to correspond to the component values at the harmonic frequency of the spectrum of the second voice signal.
In one aspect, where the component values include phase components, adjusting the component values may include changing phase shift quantities for respective frequencies in each of the unit band components such that shifting quantities along the time axis of respective frequency components included in each of the unit band components after allocation remain unchanged. Since this configuration sets phase shift quantities that vary for respective frequencies in a unit band component such that shifting quantities along the time axis of the respective frequencies remain unchanged, a voice that accurately reflects the target characteristics can be generated. This configuration is described in the third embodiment of the present specification by way of a non-limiting example.
In one aspect, the voice processing method further segments the first voice signal into a plurality of unit periods along the time axis, so as to calculate a spectrum of the first voice signal for each of the unit periods, wherein the plurality of unit periods is segmented by use of an analysis window that has a predetermined positional relationship with respect to each of peaks in a time waveform of the first voice signal of the fundamental frequency after adjustment, in a fundamental period corresponding to the second fundamental frequency; and segments the second voice signal into a plurality of unit periods along the time axis, so as to calculate a spectrum of the second voice signal for each of the unit periods, with the plurality of unit periods being segmented by use of an analysis window having the predetermined positional relationship with respect to each of peaks in a time waveform of the second voice signal in the fundamental period corresponding to the second fundamental frequency. In this configuration, since the positional relationship of the analysis window to each peak in a time waveform of the first voice signal is the same as that of the analysis window with regard to each peak in a time waveform of the second voice signal, a voice that accurately reflects the target characteristics of the first voice signal can be generated.
Preferably, as a form of the predetermined positional relationship, the analysis window used for segmenting the first voice signal has its center at each peak of the time waveform of the first voice signal, and the analysis window used for segmenting the second voice signal has its center at each peak of the time waveform of the second voice signal, the analysis window being a function that takes its maximum value at its center, so that the maximum of the window coincides with each peak of the time waveform. In this way, it is possible to generate a spectrum in which each peak of the time waveform is accurately reproduced.
In some aspects, the present invention may be embodied as a voice processing apparatus that executes the voice processing method of each of the above aspects, or as a computer-readable recording medium having recorded thereon a computer program that causes a computer processor to execute the voice processing method of each of the aspects.
The voice processing apparatus 100 is a signal processing apparatus that generates a voice signal y(t) of the time domain that corresponds to a voice having particular characteristics (hereinafter referred to as “target voice characteristics”) that are different from the characteristics of the voice signal x(t) (hereinafter referred to as “initial voice characteristics”). The target voice characteristics according to the present embodiment are distinctive (non-modal or non-harmonic) compared to the initial voice characteristics. Specifically, the characteristics of a voice created by an action of the vocal cords that differs from that of normal voicing are suitable as the target voice characteristics. Examples of such target voice characteristics include distinguishing characteristics (gruffness, roughness, harshness, growl, or hoarseness) of a voice, such as a gruff voice (including a rough voice and a growling voice) or a hoarse voice. The target voice characteristics and the initial voice characteristics typically are those of different speakers. Alternatively, different voice characteristics of a single speaker may be used as the target voice characteristics and the initial voice characteristics. The voice signal y(t) generated by the voice processing apparatus 100 is supplied to a sound output device 14 (e.g., speakers or headphones) and output as sound waves.
The voice processing apparatus 100 includes a processing unit 22 and a storage unit 24.
The processing unit 22 implements a plurality of functions (functions of a frequency analyzer 32, a converter 34, and a waveform generator 36) for generating the voice signal y(t) from the voice signal x(t) by executing a computer program stored in the storage unit 24. The voice processing method of the present embodiment is thus implemented via cooperation between the processing unit 22 and the computer program.
For some aspects, the functions of the processing unit 22 may be distributed among a plurality of apparatuses. For some aspects, a part of the functions of the processing unit 22 may be implemented by electric circuitry specialized in voice processing. For some aspects, the processing unit 22 may process the voice signal x(t) of a synthetic voice, which has been generated by a known voice synthesizing process, or may process the voice signal x(t), which has been stored in the storage unit 24 in advance. In these cases, the external device 12 may be omitted.
The computer program of the present embodiment may be stored on a computer-readable recording medium, and may be installed in the voice processing apparatus 100 and stored in the storage unit 24. The recording medium is, for example, a non-transitory recording medium, a good example of which is an optical recording medium such as a CD-ROM, and may also be any type of publicly known recording medium such as a semiconductor recording medium or a magnetic recording medium. Alternatively, the computer program of the present embodiment may be distributed through a communication network, installed in the voice processing apparatus 100, and stored in the storage unit 24. An example of a recording medium in this case is a hard disk or the like of a distribution server having recorded thereon the computer program of the present embodiment.
The frequency analyzer 32 generates a spectrum (complex spectrum) X(k) of the voice signal x(t). Specifically, the frequency analyzer 32, by use of an analysis window (e.g., a Hanning window) represented by a predetermined window function, calculates the spectrum X(k) sequentially for each unit period (frame) obtained by segmenting the voice signal x(t) along the time axis. Here, the symbol k denotes a freely selected frequency from among a plurality of frequencies that are set on the frequency axis. The frequency analyzer 32 of the first embodiment also sequentially identifies a fundamental frequency (pitch) PX of the voice signal x(t) for each unit period. The present embodiment may employ a freely selected one of known pitch detection methods to identify the fundamental frequency PX.
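The per-frame analysis described above can be sketched as follows in Python (not part of the patent text). The frame length and hop size are illustrative assumptions, and `frame_spectra` is a hypothetical name.

```python
import numpy as np

def frame_spectra(x, frame_len=1024, hop=256):
    """Compute a complex spectrum X(k) for each unit period (frame).

    frame_len and hop are illustrative; the text only specifies that the
    signal is segmented along the time axis using a window function.
    """
    window = np.hanning(frame_len)                    # analysis window
    n_frames = 1 + (len(x) - frame_len) // hop
    spectra = []
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len]
        spectra.append(np.fft.rfft(frame * window))   # complex spectrum X(k)
    return spectra
```

Each returned array holds one complex spectrum; a pitch detector (not shown) would run on the same frames to obtain PX.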
The converter 34 converts the voice characteristics of the voice signal x(t) from the initial voice characteristics into the target voice characteristics while maintaining the pitch and phonemes of the voice signal x(t). Specifically, the converter 34 of the present embodiment sequentially generates, for each unit period, a spectrum (hereinafter referred to as “converted spectrum”) Y(k) of the voice signal y(t) having the target voice characteristics, through a converting process that uses the spectrum X(k) generated for each unit period by the frequency analyzer 32 and the target voice signal rA(t) stored in the storage unit 24. The process performed by the converter 34 will be described below in detail.
The waveform generator 36 generates the voice signal y(t) of the time domain from the converted spectrum Y(k) generated by the converter 34 for each unit period. It is preferable to use a short-time inverse Fourier transformation to generate the voice signal y(t). The voice signal y(t) generated by the waveform generator 36 is supplied to the sound output device 14 and output as sound waves. It is also possible to mix the voice signal x(t) and the voice signal y(t) in either the time domain or the frequency domain.
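A minimal overlap-add sketch of the short-time inverse Fourier transformation mentioned above (a hypothetical helper, not from the text; `frame_len` and `hop` must match the analysis stage):

```python
import numpy as np

def overlap_add(spectra, frame_len=1024, hop=256):
    """Rebuild a time-domain signal y(t) from per-frame converted spectra Y(k).

    Sketch of a short-time inverse Fourier transform with overlap-add;
    window compensation is omitted for brevity.
    """
    y = np.zeros(hop * (len(spectra) - 1) + frame_len)
    for i, spec in enumerate(spectra):
        # inverse transform of one frame, added at its original position
        y[i * hop : i * hop + frame_len] += np.fft.irfft(spec, n=frame_len)
    return y
```

A production implementation would divide by the summed window envelope to satisfy the overlap-add reconstruction condition.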
A detailed configuration and operation of the converter 34 will now be described.
The pitch adjuster 42 generates a target voice signal rB(t) of the time domain by adjusting a fundamental frequency (first fundamental frequency) PR of the target voice signal rA(t) stored in the storage unit 24 to the fundamental frequency (second fundamental frequency) PX of the voice signal x(t) identified by the frequency analyzer 32. Specifically, the pitch adjuster 42 generates the target voice signal rB(t) of the fundamental frequency PX by re-sampling the target voice signal rA(t) in the time domain. Accordingly, the phonemes of the target voice signal rB(t) are substantially the same as those of the target voice signal rA(t) prior to adjustment. The rate of re-sampling by the pitch adjuster 42 is set to a rate λ (λ=PX/PR) of the fundamental frequency PX to the fundamental frequency PR. The present embodiment may employ a freely selected one of known pitch detection methods to identify the fundamental frequency PR of the target voice signal rA(t). Alternatively, the fundamental frequency PR, along with the target voice signal rA(t), may be stored in advance in the storage unit 24 and used to calculate the rate λ.
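The rate-λ re-sampling performed by the pitch adjuster 42 can be sketched as below. This is an assumption-laden illustration: linear interpolation stands in for a proper band-limited resampler, and the function name is hypothetical.

```python
import numpy as np

def adjust_pitch(r_a, p_r, p_x):
    """Re-sample the target voice signal rA(t) so that its fundamental
    frequency PR becomes PX, per the rate λ = PX / PR."""
    lam = p_x / p_r                    # re-sampling rate λ
    n_out = int(len(r_a) / lam)
    # Read the input at positions spaced λ apart: the waveform is
    # time-compressed (λ > 1) or stretched (λ < 1), which raises or
    # lowers the fundamental frequency to PX.
    src = np.arange(n_out) * lam
    return np.interp(src, np.arange(len(r_a)), r_a)
```

Because only the reading rate changes, the spectral envelope scales with the pitch, which is why the phonemes of rB(t) stay substantially those of rA(t).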
The frequency analyzer 44 generates a spectrum (complex spectrum) R(k) of the target voice signal rB(t) sequentially for each unit period, in the same manner as the frequency analyzer 32.
The voice characteristic converter 46 generates the converted spectrum Y(k) from the spectrum R(k) of the target voice signal rB(t) and the spectrum X(k) of the voice signal x(t), and includes a component allocator 52 and a component adjuster 54.
The component allocator 52 segments the spectrum R(k) of the target voice signal rB(t) into a plurality of unit band components U(n), with each of the harmonic frequencies H(n) corresponding to the fundamental frequency PX constituting a boundary. The component allocator 52 then allocates one unit band component U(n) to each of a plurality of frequency bands defined by the harmonic frequencies H(n), such that each allocated unit band component lies adjacent the corresponding component in the spectrum of the target voice signal before the pitch adjustment. Specifically, when the fundamental frequency PX of the voice signal x(t) is less than the fundamental frequency PR of the target voice signal rA(t), one or more unit band components U(n) are repeatedly allocated to a plurality of frequency bands; conversely, when the fundamental frequency PX exceeds the fundamental frequency PR, some unit band components U(n) are selectively omitted.
In view of the repetition and selection of unit band components U(n) as mentioned above, in the following description, the number n of each unit band component U(n) after reallocation by the component allocator 52 is renumbered sequentially, starting from the low-frequency end, as a number (index) m. Specifically, the symbol m is represented by the following Equation (1).
In Equation (1), < > denotes the floor function; that is, <x+0.5> is an arithmetic operation that rounds a numerical value x to the nearest integer. As will be understood from the above description, the reallocated spectrum S(k), in which a plurality of unit band components U(m) is arranged along the frequency axis, is thereby generated. A unit band component U(m) of the reallocated spectrum S(k) is a band component spanning the harmonic frequency H(m) to the harmonic frequency H(m+1).
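Because the body of Equation (1) is not reproduced in this text, the following sketch uses one plausible reading of the allocation rule: band m receives the unit band component whose location before pitch adjustment was nearest, n = <mλ+0.5>, so components are repeated when λ < 1 and skipped when λ > 1. All names, and the assumption that harmonics fall exactly on FFT-bin multiples, are illustrative.

```python
import numpy as np

def reallocate_bands(R, p_x_bins, lam):
    """Build the reallocated spectrum S(k) from unit band components U(n).

    R        : complex spectrum of the pitch-adjusted target signal rB(t)
    p_x_bins : fundamental frequency PX expressed in FFT bins (assumed integer)
    lam      : re-sampling rate λ = PX / PR
    """
    S = np.zeros_like(R)
    n_bands = len(R) // p_x_bins
    for m in range(n_bands):
        n = int(m * lam + 0.5)                 # rounding via the <x+0.5> operation
        src = slice(n * p_x_bins, (n + 1) * p_x_bins)
        dst = slice(m * p_x_bins, (m + 1) * p_x_bins)
        seg = R[src]
        if len(seg) == p_x_bins:               # guard against the top edge
            S[dst] = seg
    return S
```

With λ < 1 successive values of m map to repeated values of n, matching the repetition described above; with λ > 1 some n are skipped, matching the selection.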
The component adjuster 54 generates an adjusted spectrum Y0(k) by adjusting the component values (amplitudes and phases) of each unit band component U(m) of the reallocated spectrum S(k) in accordance with the spectrum X(k) of the voice signal x(t), as represented by the following Equation (2).
Y0(k)=S(k)g(m)exp(jθ(m))  (2)
The variable g(m) of Equation (2) is a correction value (gain) for adjusting the amplitudes of each unit band component U(m) of the reallocated spectrum S(k) according to the amplitudes of the spectrum X(k) of the voice signal x(t), and it is represented by the following Equation (3):

g(m)=AX(m)/AH(m)  (3)
The symbol AH(m) of Equation (3) is the amplitude at the harmonic frequency H(m) of the unit band component U(m), and the symbol AX(m) is the amplitude at the harmonic frequency H(m) of the spectrum X(k) of the voice signal x(t). The common correction value g(m) is used for the amplitude correction of every frequency within any one unit band component U(m). By the above-mentioned correction value g(m), the amplitude AH(m) at the harmonic frequency H(m) of the unit band component U(m) is corrected to the amplitude AX(m) at the harmonic frequency H(m) of the voice signal x(t).
Meanwhile, the symbol θ(m) of Equation (2) is a correction value (phase shift quantity) for adjusting the phase of each unit band component U(m) of the reallocated spectrum S(k) according to the phase of the spectrum X(k) of the voice signal x(t), and it is represented by the following Equation (4):

θ(m)=ΦX(m)−ΦH(m)  (4)
The symbol ΦH(m) of Equation (4) is the phase at the harmonic frequency H(m) of the unit band component U(m), and the symbol ΦX(m) is the phase at the harmonic frequency H(m) of the spectrum X(k) of the voice signal x(t). The common correction value θ(m) is used for the phase correction of every frequency within any one unit band component U(m). By the above-mentioned correction value θ(m), the phase ΦH(m) at the harmonic frequency H(m) of the unit band component U(m) is corrected to the phase ΦX(m) at the harmonic frequency H(m) of the voice signal x(t).
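Equations (2) through (4) can be sketched as one loop over the unit band components. This is a hypothetical helper: the grid is assumed to place each harmonic H(m) exactly on bin m·p_x_bins, and the band below the first harmonic is left unadjusted.

```python
import numpy as np

def adjust_components(S, X, p_x_bins):
    """Adjust each unit band component U(m) of S(k) toward X(k) at H(m).

    Per Equations (2)-(4): Y0(k) = S(k)*g(m)*exp(j*θ(m)), with
    g(m) = AX(m)/AH(m) and θ(m) = ΦX(m) - ΦH(m), both read at the
    harmonic bin m*p_x_bins.
    """
    Y0 = np.zeros_like(S)
    for m in range(1, len(S) // p_x_bins):
        h = m * p_x_bins                               # bin of harmonic H(m)
        g = np.abs(X[h]) / max(np.abs(S[h]), 1e-12)    # amplitude gain g(m)
        theta = np.angle(X[h]) - np.angle(S[h])        # phase shift θ(m)
        band = slice(h, h + p_x_bins)
        Y0[band] = S[band] * g * np.exp(1j * theta)    # Equation (2)
    return Y0
```

By construction the adjusted component at each harmonic bin equals the corresponding value of X(k), which is the correction described above.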
As will be understood from the above description, in the first embodiment, because each unit band component U(m) is defined with the harmonic frequencies H(m) constituting the boundaries, the continuity of the component values of the non-harmonic component between a harmonic frequency H(m) and the next harmonic frequency H(m+1) is retained before and after the component values (amplitudes and phases) are adjusted by Equation (2). On the other hand, as a result of the reallocation of each unit band component U(m) by the component allocator 52 and the correction of the component values for each unit band component U(m) by the component adjuster 54, a discontinuity of the component values at each harmonic frequency H(m) may occur after the correction by Equation (2).
In order to reduce the above-mentioned discontinuity of the component values at each harmonic frequency H(m), the component adjuster 54 generates the converted spectrum Y(k) by applying the component values of the spectrum X(k) of the voice signal x(t) to each specific band B(m) that includes the harmonic frequency H(m).
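The specific-band substitution can be sketched as follows. The parameter `half_width` is an illustrative fixed bandwidth: the first embodiment only specifies that the bandwidth is a predetermined value common to the specific bands.

```python
import numpy as np

def apply_specific_bands(Y0, X, p_x_bins, half_width=2):
    """Overwrite each specific band B(m) around harmonic H(m) with X(k).

    Y0 is the adjusted spectrum after Equation (2); the result is the
    converted spectrum Y(k).  half_width is in bins and is hypothetical.
    """
    Y = Y0.copy()
    for m in range(1, len(Y0) // p_x_bins):
        h = m * p_x_bins                                  # harmonic bin H(m)
        lo = max(h - half_width, 0)
        hi = min(h + half_width + 1, len(Y0))
        Y[lo:hi] = X[lo:hi]                               # adopt X(k) inside B(m)
    return Y
```

Only the narrow bands around the harmonics are replaced, so the non-harmonic component between harmonics keeps the continuity established by the band-wise adjustment.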
As already mentioned, in a configuration in which the spectrum R(k) of the target voice signal rB(t) is segmented into a plurality of unit band components U(n) with a point between harmonic frequencies H(n) and H(n+1) that are adjacent each other along the frequency axis (e.g., the midpoint between the harmonic frequencies H(n) and H(n+1)) constituting the boundary, the component values of the non-harmonic component become discontinuous on the frequency axis. Presuming generation of a normal voice having a sufficiently low intensity in the non-harmonic component, the above discontinuity is hardly perceivable by the listener. However, because a distinguishing voice, such as a gruff or hoarse voice, contains a predominance of non-harmonic components, the discontinuity of the component values of the non-harmonic component becomes apparent, and such a voice may be perceived as acoustically unnatural. In contrast with the above configuration, in the first embodiment, because the spectrum R(k) of the target voice signal rB(t) is segmented into a plurality of unit band components U(n) with each harmonic frequency H(n) constituting a boundary, there is no discontinuity in the component values of the non-harmonic component after the correction of the component values for each unit band component U(n). Therefore, according to the first embodiment, a voice that contains a predominance of non-harmonic components and is acoustically natural can be generated.
On the other hand, in a configuration in which a plurality of unit band components U(n) is defined with each harmonic frequency H(n) constituting a boundary, the discontinuity of component values at the harmonic frequency H(n) may be problematic. Although each unit band component U(m) is defined in the first embodiment with each harmonic frequency H(m) constituting a boundary, it is possible to avoid the discontinuity of component values at the harmonic frequency H(m) because the component values of the spectrum X(k) of the voice signal x(t) are applied to the specific band B(m) including the harmonic frequency H(m).
Also, in the first embodiment it is possible to generate the voice signal y(t) that accurately maintains the phonemes of the voice signal x(t) because the component values of each unit band component U(m) are adjusted such that the component values (AH(m) and ΦH(m)) at the harmonic frequency H(m), among the respective unit band components U(m) that have been reallocated by the component allocator 52, correspond with the component values (AX(m) and ΦX(m)) at the harmonic frequency H(m) of the spectrum X(k) of the voice signal x(t).
A second embodiment of the present invention is now explained.
In each embodiment illustrated below, the same reference numerals and signs will be used for those elements for which actions and elements are the same as those of the first embodiment, and description thereof will be omitted where appropriate.
In the second embodiment, the frequency analyzer 32 segments the voice signal x(t) into a plurality of unit periods by use of an analysis window that has a predetermined positional relationship with respect to each peak in the time waveform of the voice signal x(t), and the frequency analyzer 44 segments the target voice signal rB(t) by use of an analysis window having the same positional relationship with respect to each peak in the time waveform of the target voice signal rB(t). For example, each analysis window is centered at a peak of the corresponding time waveform. Since the positional relationship of the analysis window to each peak of the voice signal x(t) is the same as that of the analysis window to each peak of the target voice signal rB(t), a voice that accurately reflects the target voice characteristics can be generated.
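The peak-aligned windowing of the second embodiment might be sketched as below. Peak detection itself is outside the scope of the sketch, and `frame_len` is illustrative.

```python
import numpy as np

def peak_centered_frames(x, peaks, frame_len=1024):
    """Extract analysis frames whose window center coincides with each
    peak of the time waveform.

    peaks: sample indices of waveform peaks, one per fundamental period
    (detected by some means not shown here).
    """
    window = np.hanning(frame_len)          # window with its maximum at center
    half = frame_len // 2
    frames = []
    for p in peaks:
        if half <= p <= len(x) - half:      # skip peaks too close to the edges
            frames.append(x[p - half : p + half] * window)
    return frames
```

Because the window maximum falls on the waveform peak for both signals, the two sets of spectra share the same phase reference, which is what lets the correction values of the converter be compared meaningfully.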
As will be understood from the above mentioned Equations (2) and (4), in the first embodiment, there is described an example configuration in which the phases of all frequencies of a freely selected one unit band component U(m) are changed by the same correction quantity (phase shift quantity) θ(m) (i.e., a configuration in which the phase spectrum of the unit band component U(m) is moved in a parallel direction along the phase axis). However, in this configuration, the time waveform of the target voice signal rB(t) may change because the shift along the time axis, made through the phase shift with the correction value θ(m), is different for each frequency of the unit band component U(m).
In view of the above circumstances, the component adjuster 54 of the third embodiment sets a different correction value θ(m,k) for each frequency within the unit band component U(m) such that the shifts along the time axis of the frequency components included in each unit band component U(m) after allocation by the component allocator 52 are the same. Specifically, the component adjuster 54 calculates the correction value θ(m,k) of a phase according to the following Equation (5):

θ(m,k)=δk·θ(m)  (5)
As will be understood from Equation (5), the correction value θ(m,k) of the third embodiment is a value obtained by multiplying the correction value θ(m) of the first embodiment by a coefficient δk that is frequency-dependent.
The symbol fk in Equation (5) denotes the k-th frequency on the frequency axis. The coefficient δk used to calculate the correction value θ(m,k) is defined as the ratio of each frequency fk within the unit band component U(m) to the harmonic frequency H(m) of the order of m (i.e., the frequency at the lower end of the band of the unit band component U(m)), namely δk=fk/H(m). In other words, the phase shift quantity θ(m,k) increases in proportion to the frequency fk, so that the shift along the time axis resulting from the phase shift is the same for every frequency within the unit band component U(m).
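The frequency-proportional phase correction of Equation (5) can be sketched as follows (a bin-based illustration; all names are hypothetical). A constant time shift Δt corresponds to a phase shift 2π·f·Δt that grows linearly with frequency, which is exactly what the coefficient δk provides.

```python
import numpy as np

def phase_corrections(theta_m, h_m_bin, band_bins):
    """Per-frequency phase shifts θ(m,k) = δk * θ(m), with δk = fk / H(m).

    theta_m  : the band-wide correction value θ(m) of Equation (4)
    h_m_bin  : bin index of the harmonic frequency H(m)
    band_bins: number of bins in the unit band component U(m)
    """
    fk = h_m_bin + np.arange(band_bins)     # frequencies inside U(m), in bins
    delta = fk / h_m_bin                    # δk = fk / H(m)
    return delta * theta_m
```

Dividing each returned θ(m,k) by its frequency gives a constant, confirming that every component in the band is shifted by the same amount along the time axis.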
The above-described embodiments can be modified in various manners. Detailed modifications are described below. Two or more modifications selected from the following can be combined as appropriate.
1. In the above mentioned embodiments, the target voice signal rB(t) of the fundamental frequency PX is generated by re-sampling the target voice signal rA(t) of the fundamental frequency PR in the time domain. However, it is also possible to generate the spectrum R(k) of the fundamental frequency PX by expanding or compressing the spectrum R0(k) of the target voice signal rA(t) along the frequency axis in the frequency domain.
2. In the above mentioned embodiments, both the amplitude and phase of the reallocated spectrum S(k) are corrected. However, it is also possible to correct one of either the amplitude or the phase. In other words, the component value that is the object of adjustment by the component adjuster 54 is at least one of either the amplitude or the phase. In a configuration in which only the amplitude is adjusted, it is possible to calculate an amplitude spectrum of the target voice signal rB(t) as the spectrum R(k). In a configuration in which only the phase is adjusted, it is possible to calculate a phase spectrum of the target voice signal rB(t) as the spectrum R(k).
3. In the above mentioned embodiments, the bandwidth of the specific band B(m) is set to a prescribed value that is common to a plurality of specific bands B(m). However, it is also possible to set the bandwidth of each of the plurality of specific bands B(m) to a variable value. Specifically, the bandwidth of each specific band B(m) may be set to a variable value according to the characteristics of the reallocated spectrum S(k). In order to suppress a discontinuity of amplitude in the converted spectrum Y(k) of the voice signal y(t), a preferable configuration is to set the specific band B(m) with its end points being the two frequencies at which the amplitudes of the reallocated spectrum S(k) are minimized on opposite sides of the harmonic frequency H(m). For example, a range is set as the specific band B(m), the lower limit of the range being the frequency with the minimum amplitude that is closest to the harmonic frequency H(m) within the lower region (H(m−1) to H(m)) of the harmonic frequency H(m), and the upper limit of the range being the frequency with the minimum amplitude that is closest to the harmonic frequency H(m) within the higher region (H(m) to H(m+1)) of the harmonic frequency H(m). Moreover, it is possible to make the bandwidth of the specific band B(m) variable according to the bandwidth of the unit band component U(m). In a configuration in which the bandwidth of each specific band B(m) is variable, as in the above examples, it is possible to set each specific band B(m) to a bandwidth suitable for the characteristics of the reallocated spectrum S(k).
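The variable-bandwidth rule, with end points at the amplitude minima closest to the harmonic on either side, might be sketched as a simple outward walk (an assumption of this sketch is that the minima are reached by monotone descent from the harmonic peak):

```python
import numpy as np

def variable_specific_band(amp, h):
    """Find a specific band B(m) bounded by the amplitude minima nearest
    to the harmonic bin h in the reallocated spectrum's amplitudes amp."""
    lo = h
    while lo > 0 and amp[lo - 1] < amp[lo]:
        lo -= 1                             # walk down to the lower minimum
    hi = h
    while hi < len(amp) - 1 and amp[hi + 1] < amp[hi]:
        hi += 1                             # walk down to the upper minimum
    return lo, hi
```

The band therefore widens or narrows with the shape of each harmonic peak, which is the stated advantage of the variable-bandwidth configuration.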
4. In the above mentioned embodiments, the voice signal x(t) supplied from the external device 12 is exemplified as the object of processing. However, the object of processing by the voice processing apparatus 100 is not limited to a signal output from the external device 12. Specifically, it is also possible for the voice processing apparatus 100 to process the voice signal x(t) generated by various voice synthesizing technologies. For example, the voice characteristics of the voice signal x(t) generated by a known voice synthesizing technology may be converted by the voice processing apparatus 100, examples of such technology being a piece-connecting voice synthesis that selectively connects a plurality of voice pieces recorded in advance, and a voice synthesis that uses a probability model such as the hidden Markov model.
5. It is also possible to implement the voice processing apparatus 100 in a server device (typically a web server) that communicates with terminal devices via a communication network such as a mobile communication network or the Internet. Specifically, the voice processing apparatus 100 generates, in the same manner as in the above mentioned embodiments, the voice signal y(t) from the voice signal x(t) received from a terminal device via the communication network, and transmits it to the terminal device. By the above configuration, it is possible to provide users of terminal devices with a cloud service that acts as an agent in converting the voice characteristics of the voice signal x(t). Meanwhile, in a configuration in which the spectrum X(k) of the voice signal x(t) is transmitted from terminal devices to the voice processing apparatus 100 (for example, a configuration in which a terminal device has the frequency analyzer 32), the frequency analyzer 32 is omitted in the voice processing apparatus 100. Also, in a configuration in which the converted spectrum Y(k) is transmitted from the voice processing apparatus 100 to terminal devices (e.g., a configuration in which the terminal device has the waveform generator 36), the waveform generator 36 is omitted from the voice processing apparatus 100.
Priority is claimed from Japanese Patent Application No. 2014-263512, filed December 2014 (JP, national).