Embodiments described herein relate generally to a speech synthesizer, an audio watermarking information detection apparatus, a speech synthesizing method, an audio watermarking information detection method, and a computer program product.
It is widely known that speech is synthesized by filtering a sound source signal, which indicates a vibration of a vocal cord, with a filter indicating a vocal tract characteristic. Further, as quality of synthesized speech improves, the synthesized speech may be used inappropriately. Thus, it is considered possible to prevent or control inappropriate use by inserting watermark information into synthesized speech.
However, when an audio watermarking is embedded into synthesized speech, sound quality may be deteriorated.
According to an embodiment, a speech synthesizer includes a sound source generator, a phase modulator, and a vocal tract filter unit. The sound source generator generates a sound source signal by using a fundamental frequency sequence and a pulse signal. The phase modulator modulates, with respect to the sound source signal generated by the sound source generator, a phase of the pulse signal at each pitch mark based on audio watermarking information. The vocal tract filter unit generates a speech signal by using a spectrum parameter sequence with respect to the sound source signal in which the phase of the pulse signal is modulated by the phase modulator.
Speech Synthesizer
In the following, with reference to the attached drawings, a speech synthesizer according to an embodiment will be described.
As illustrated in
The input unit 10 inputs a sequence (hereinafter, referred to as fundamental frequency sequence) indicating information of a fundamental frequency or a fundamental period, a sequence of a spectrum parameter, and a sequence of a feature parameter at least including audio watermarking information into the sound source unit 2a.
For example, the fundamental frequency sequence is a sequence of a value of a fundamental frequency (F0) in each frame of voiced sound and a value indicating each frame of unvoiced sound. Here, a frame of unvoiced sound is represented by a predetermined fixed value, for example, zero. Further, a frame of voiced sound may hold a value such as a pitch period or a logarithmic F0 for each frame of a periodic signal.
In the present embodiment, a frame indicates a section of a speech signal. When the speech synthesizer 1 performs an analysis at a fixed frame rate, a feature parameter is, for example, a value for every 5 ms.
The spectrum parameter indicates spectral information of speech as a parameter. When the speech synthesizer 1 performs an analysis at a fixed frame rate, as with the fundamental frequency sequence, the spectrum parameter becomes a value corresponding, for example, to a section of every 5 ms. Further, as a spectrum parameter, various parameters such as a cepstrum, a mel-cepstrum, a linear prediction coefficient, a spectrum envelope, and mel-LSP are used.
By using the fundamental frequency sequence input from the input unit 10, a pulse signal which will be described later, or the like, the sound source unit 2a generates a sound source signal (described in detail with reference to
The vocal tract filter unit 12 generates a speech signal by performing a convolution operation of the sound source signal, a phase of which is modulated by the sound source unit 2a, by using a spectrum parameter sequence received through the sound source unit 2a, for example. That is, the vocal tract filter unit 12 generates a speech waveform.
The output unit 14 outputs the speech signal generated by the vocal tract filter unit 12. For example, the output unit 14 displays the speech signal (speech waveform) as a waveform or outputs the speech signal as a speech file (such as a WAVE file).
The first storage unit 16 stores a plurality of kinds of pulse signals used for speech synthesizing and outputs any of the pulse signals to the sound source unit 2a according to an access from the sound source unit 2a.
For example, the sound source generator 20 determines a reference time and calculates a pitch period at the reference time from a value in a corresponding frame in the fundamental frequency sequence. Further, the sound source generator 20 creates pitch marks by repeatedly performing, with reference to the reference time, processing of assigning a mark at the time advanced by the calculated pitch period. The sound source generator 20 calculates the pitch period as the reciprocal of the fundamental frequency.
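As a minimal, non-limiting sketch, the pitch mark creation described above may be written as follows. The function name `pitch_marks`, the 16 kHz sampling rate, and the use of sample indices instead of times are assumptions made for illustration; only the 5 ms frame, the zero value for unvoiced frames, and the reciprocal relation between pitch period and fundamental frequency come from the description.

```python
def pitch_marks(f0_per_frame, sample_rate=16000, frame_shift=80):
    """Return pitch-mark positions (sample indices) for an F0 sequence.

    f0_per_frame: F0 in Hz for each frame; 0 marks an unvoiced frame.
    frame_shift:  frame length in samples (80 samples = 5 ms at 16 kHz).
    """
    total = len(f0_per_frame) * frame_shift
    marks = []
    pos = 0  # reference position; the first frame start is an assumption
    while pos < total:
        frame = pos // frame_shift
        f0 = f0_per_frame[frame]
        if f0 <= 0:
            # Unvoiced frame: place no mark and jump to the next frame start.
            pos = (frame + 1) * frame_shift
            continue
        marks.append(pos)
        # The pitch period is the reciprocal of F0, expressed here in samples.
        pos += max(1, round(sample_rate / f0))
    return marks


if __name__ == "__main__":
    print(pitch_marks([0, 0, 200.0, 200.0]))
```

Working in integer sample positions avoids accumulating floating-point error over long utterances; a time-based variant would be equally possible.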
The phase modulator 22 receives the (pulse) sound source signal generated by the sound source generator 20 and performs phase modulation. For example, the phase modulator 22 performs, with respect to the sound source signal generated by the sound source generator 20, modulation of a phase of a pulse signal at each pitch mark based on a phase modulation rule in which audio watermarking information included in the feature parameter is used. That is, the phase modulator 22 modulates a phase of a pulse signal and generates a phase modulation pulse train.
The phase modulation rule may be time-sequence modulation or frequency-sequence modulation. For example, as illustrated in the following equations (1) and (2), the phase modulator 22 modulates a phase in time series in each frequency bin or performs temporal modulation by using an all-pass filter which randomly modulates at least one of a time sequence and a frequency sequence.
For example, when the phase modulator 22 modulates a phase in time series, the input unit 10 may input in advance, into the phase modulator 22, a table indicating a group of phase modulation rules which vary in each time sequence (each predetermined period of time) as key information used for audio watermarking information. In this case, the phase modulator 22 changes the phase modulation rule in each predetermined period of time based on the key information used for the audio watermarking information. Further, by having the audio watermarking information detection apparatus (described later) use the same table that is used for changing the phase modulation rule when it detects audio watermarking information, confidentiality of the audio watermarking can be increased.
Note that a indicates phase modulation intensity (inclination), f indicates a frequency bin or band, t indicates time, and ph(t, f) indicates a phase of a frequency f at time t. The phase modulation intensity a is, for example, a value changed in such a manner that a ratio or a difference between two representative phase values, which are calculated from phase values of two bands including a plurality of frequency bins, becomes a predetermined value. Then, the speech synthesizer 1 uses the phase modulation intensity a as bit information of the audio watermarking information. Further, the speech synthesizer 1 may increase the number of bits of the bit information of the audio watermarking information by setting the phase modulation intensity a (inclination) to one of a plurality of values. Further, in the phase modulation rule, a median value, an average value, a weighted average value, or the like of a plurality of predetermined frequency bins may be used.
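As one hedged illustration of phase modulation at each pitch mark, the following sketch rotates the phase of every positive-frequency bin of a pulse by an offset that grows linearly with the pitch mark time, that is, by the hypothetical rule ph(t, f) = a * t. This rule, the function names, and the use of a plain DFT are assumptions for illustration and are not taken from the phase modulation rule of the embodiment; the sketch only shows that such a rotation acts as an all-pass operation that changes the phase and leaves the amplitude spectrum unchanged.

```python
import cmath
import math


def dft(x):
    """Plain O(n^2) discrete Fourier transform (adequate for short pulses)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * b * k / n) for k in range(n))
            for b in range(n)]


def idft(spectrum):
    """Inverse DFT returning the real part of each output sample."""
    n = len(spectrum)
    return [sum(spectrum[b] * cmath.exp(2j * math.pi * b * k / n)
                for b in range(n)).real / n
            for k in range(n)]


def modulate_pulse(pulse, phase):
    """Rotate all positive-frequency bins by `phase` radians (all-pass).

    Negative-frequency bins are rotated by -phase so the result stays real.
    DC and, for even lengths, the Nyquist bin are left untouched.
    """
    n = len(pulse)
    spectrum = dft(pulse)
    for b in range(1, (n + 1) // 2):
        spectrum[b] *= cmath.exp(1j * phase)
        spectrum[n - b] *= cmath.exp(-1j * phase)
    return idft(spectrum)


def phase_modulated_pulse_train(pulse, mark_times, a):
    """One modulated pulse per pitch mark, using the assumed rule ph = a * t."""
    return [modulate_pulse(pulse, a * t) for t in mark_times]


if __name__ == "__main__":
    train = phase_modulated_pulse_train([1.0] + [0.0] * 7, [0.0, 0.01], 50.0)
    print([round(v, 3) for v in train[1]])
```

In this sketch the inclination a is the quantity that would carry the bit information; a band-dependent rule would apply different offsets to different ranges of b.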
Next, processing performed by the speech synthesizer 1 illustrated in
In step S102, the phase modulator 22 performs, with respect to the sound source signal generated by the sound source generator 20, modulation of a phase of a pulse signal at each pitch mark based on a phase modulation rule using audio watermarking information included in the feature parameter. That is, the phase modulator 22 outputs a phase modulation pulse train.
In step S104, the vocal tract filter unit 12 generates a speech signal by performing a convolution operation of the sound source signal, a phase of which is modulated by the sound source unit 2a, by using a spectrum parameter sequence which is received through the sound source unit 2a. That is, the vocal tract filter unit 12 outputs a speech waveform.
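The flow of the steps above may be sketched, under simplifying assumptions, as placing one pulse at each pitch mark and convolving the resulting sound source signal with an impulse response. The helper names and the fixed impulse response are hypothetical; deriving a time-varying impulse response from the spectrum parameter sequence is outside the scope of this sketch.

```python
def place_pulses(pulses, marks, length):
    """Overlap-add one pulse at each pitch mark to build the source signal."""
    source = [0.0] * length
    for pulse, mark in zip(pulses, marks):
        for k, value in enumerate(pulse):
            if 0 <= mark + k < length:
                source[mark + k] += value
    return source


def convolve(source, impulse_response):
    """Direct-form convolution of the source with a vocal tract response."""
    out = [0.0] * (len(source) + len(impulse_response) - 1)
    for i, s in enumerate(source):
        if s == 0.0:
            continue  # skip silent samples; pulse trains are mostly zeros
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


if __name__ == "__main__":
    src = place_pulses([[1.0], [1.0]], [0, 3], 5)
    print(convolve(src, [1.0, 0.5]))
```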
First Modification Example of Sound Source Unit 2a: Sound Source Unit 2b
Next, a first modification example (sound source unit 2b) of the sound source unit 2a will be described.
The determination unit 24 determines whether a focused frame of the fundamental frequency sequence included in the feature parameter received from the input unit 10 is a frame of unvoiced sound or a frame of voiced sound. Further, the determination unit 24 outputs information related to the frame of unvoiced sound to the noise source generator 26 and outputs information related to the frame of voiced sound to the sound source generator 20. For example, when a value of the frame of unvoiced sound is zero in the fundamental frequency sequence, by determining whether a value of the frame is zero, the determination unit 24 determines whether the focused frame is a frame of unvoiced sound or a frame of voiced sound.
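The determination above amounts to a comparison against the value reserved for unvoiced frames. A minimal sketch, in which the function name and the returned index lists are assumptions for illustration, is:

```python
def split_frames(f0_sequence, unvoiced_value=0.0):
    """Return (voiced_indices, unvoiced_indices) for an F0 sequence."""
    voiced, unvoiced = [], []
    for index, f0 in enumerate(f0_sequence):
        # A frame whose value equals the reserved value is treated as unvoiced.
        if f0 == unvoiced_value:
            unvoiced.append(index)
        else:
            voiced.append(index)
    return voiced, unvoiced


if __name__ == "__main__":
    print(split_frames([0.0, 120.0, 118.0, 0.0]))
```

The voiced indices would be passed to the pulse path and the unvoiced indices to the noise path.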
Here, although the input unit 10 may input, into the sound source unit 2b, a feature parameter identical to a sequence of a feature parameter input into the sound source unit 2a (
All bands in the frame of unvoiced sound are assumed as noise components. Thus, a value of band noise intensity becomes one. On the other hand, band noise intensity of the frame of voiced sound becomes a value smaller than one. Generally, in a high band, a noise component becomes stronger. Further, in a high-band component of voiced fricative sound, band noise intensity becomes a value close to one. Note that the fundamental frequency sequence may be a logarithmic fundamental frequency and band noise intensity may be in a decibel unit.
Then, the sound source generator 20 of the sound source unit 2b sets a start point from the fundamental frequency sequence and calculates a pitch period from a fundamental frequency at a current position. Further, the sound source generator 20 creates pitch marks by repeatedly performing processing of setting, as a next pitch mark, the time one calculated pitch period after the current position.
Further, the sound source generator 20 may generate a pulse sound source signal divided into n bands by applying n bandpass filters to a pulse signal.
Similarly to the case in the sound source unit 2a, the phase modulator 22 of the sound source unit 2b modulates only a phase of a pulse signal.
By using the white or Gaussian noise signal stored in the second storage unit 18 and the sequence of the feature parameter received from the input unit 10, the noise source generator 26 generates a noise source signal with respect to a frame which the fundamental frequency sequence indicates as unvoiced.
Further, the noise source generator 26 may generate a noise source signal to which n bandpass filters are applied and which is divided into n bands.
The adder 28 generates a mixed sound source (sound source signal to which noise source signal is added) by controlling, into a determined ratio, amplitudes of the pulse signal (phase modulation pulse train) phase-modulated by the phase modulator 22 and the noise source signal generated by the noise source generator 26 and by performing superimposition.
Further, the adder 28 may generate a mixed sound source (sound source signal to which noise source signal is added) by adjusting amplitudes of the noise source signal and the pulse sound source signal in each band according to a band noise intensity sequence and by performing superimposition.
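As a hedged sketch of the band-wise mixing, the following code weights each band of the pulse sound source signal by (1 - bn) and each band of the noise source signal by bn, where bn is the band noise intensity of that band, and superimposes the results. The complementary weighting and the function name are assumptions for illustration; the description only states that amplitudes are adjusted according to the band noise intensity sequence.

```python
def mix_bands(pulse_bands, noise_bands, band_noise_intensity):
    """Superimpose band-split pulse and noise signals into one mixed source.

    pulse_bands, noise_bands: lists of equal-length sample lists, one per band.
    band_noise_intensity:     one value in [0, 1] per band (1 = all noise).
    """
    length = len(pulse_bands[0])
    mixed = [0.0] * length
    for pulse, noise, bn in zip(pulse_bands, noise_bands, band_noise_intensity):
        for k in range(length):
            # Assumed complementary weights: more noise means less pulse.
            mixed[k] += (1.0 - bn) * pulse[k] + bn * noise[k]
    return mixed


if __name__ == "__main__":
    print(mix_bands([[1.0, 0.0], [0.0, 1.0]],
                    [[0.5, 0.5], [0.2, 0.2]],
                    [0.0, 1.0]))
```

With a band noise intensity of one in every band, as in a frame of unvoiced sound, the output of this sketch reduces to the sum of the noise bands.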
Next, processing performed by a speech synthesizer 1 including the sound source unit 2b will be described.
In step S202, the phase modulator 22 performs, with respect to the sound source signal generated by the sound source generator 20, modulation of a phase of a pulse signal at each pitch mark based on a phase modulation rule using audio watermarking information included in the feature parameter. That is, the phase modulator 22 outputs a phase modulation pulse train.
In step S204, the adder 28 generates a sound source signal, to which the noise source signal (noise) is added, by controlling, into a determined ratio, amplitudes of the pulse signal (phase modulation pulse train) phase-modulated by the phase modulator 22 and the noise source signal generated by the noise source generator 26 and by performing superimposition.
In step S206, the vocal tract filter unit 12 generates a speech signal by performing a convolution operation of a sound source signal, in which a phase is modulated (noise is added) by the sound source unit 2b, by using a spectrum parameter sequence which is received through the sound source unit 2b. That is, the vocal tract filter unit 12 outputs a speech waveform.
Second Modification Example of Sound Source Unit 2a: Sound Source Unit 2c
Next, a second modification example (sound source unit 2c) of the sound source unit 2a will be described.
The filter unit 3a includes bandpass filters 30 and 32 which pass signals in different bands and control a band and intensity. For example, the filter unit 3a generates a sound source signal divided into two bands by applying the two bandpass filters 30 and 32 to a pulse signal of a sound source signal generated by the sound source generator 20. Further, the filter unit 3b includes bandpass filters 34 and 36 which pass signals in different bands and control a band and intensity. For example, the filter unit 3b generates a noise source signal divided into two bands by applying the two bandpass filters 34 and 36 to a noise source signal generated by the noise source generator 26. Accordingly, in the sound source unit 2c, the filter unit 3a is provided separately from the sound source generator 20 and the filter unit 3b is provided separately from the noise source generator 26.
Further, the adder 28 of the sound source unit 2c generates a mixed sound source (sound source signal to which noise source signal is added) by adjusting amplitudes of the noise source signal and the pulse sound source signal in each band according to a band noise intensity sequence and by performing superimposition.
Note that each of the above-described sound source unit 2b and sound source unit 2c may include a hardware circuit or software executed by a CPU. The second storage unit 18 includes, for example, an HDD or a memory. Further, software (program) executed by the CPU may be distributed by being stored in a recording medium such as a magnetic disk, an optical disk, or a semiconductor memory or distributed through a network.
In such a manner, in the speech synthesizer 1, the phase modulator 22 modulates only a phase of a pulse signal, that is, a voiced part, based on audio watermarking information. Thus, it is possible to insert an audio watermarking without deteriorating quality of a synthesized speech.
Audio Watermarking Information Detection Apparatus
Next, an audio watermarking information detection apparatus to detect audio watermarking information from a synthesized speech into which an audio watermarking is inserted will be described.
As illustrated in
The pitch mark estimator 40 estimates a pitch mark sequence of an input speech signal. More specifically, the pitch mark estimator 40 estimates a sequence of a pitch mark by estimating a periodic pulse from an input signal or a residual signal (estimated sound source signal) of the input signal, for example, by an LPC analysis and outputs the estimated sequence of the pitch mark to the phase extractor 42. That is, the pitch mark estimator 40 performs residual signal extraction (speech extraction).
For example, at each estimated pitch mark, the phase extractor 42 extracts, as a window length, a width which is twice as wide as the shorter one of the preceding and following pitch widths and extracts a phase at each pitch mark in each frequency bin. The phase extractor 42 outputs a sequence of the extracted phase to the representative phase calculator 44.
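A minimal sketch of this phase extraction is given below. The window is centered on the pitch mark and its length is twice the shorter of the distances to the neighboring pitch marks, as described above. The Hann window, the rotation that places the pitch mark at time zero, the plain DFT, and the function names are assumptions for illustration, and the sketch assumes at least two pitch marks.

```python
import cmath
import math


def window_half_width(marks, i):
    """Shorter of the distances from mark i to its previous and next marks."""
    widths = []
    if i > 0:
        widths.append(marks[i] - marks[i - 1])
    if i + 1 < len(marks):
        widths.append(marks[i + 1] - marks[i])
    return min(widths)


def phase_at_mark(signal, marks, i):
    """Return the phase (radians) of each frequency bin at pitch mark i."""
    half = window_half_width(marks, i)
    n = 2 * half
    start = marks[i] - half
    segment = []
    for k in range(n):
        idx = start + k
        sample = signal[idx] if 0 <= idx < len(signal) else 0.0
        hann = 0.5 - 0.5 * math.cos(2.0 * math.pi * k / n)  # assumed window
        segment.append(sample * hann)
    # Rotate so that the pitch mark becomes sample 0; phases are then
    # measured relative to the pitch mark instead of the window start.
    segment = segment[half:] + segment[:half]
    phases = []
    for b in range(n // 2 + 1):
        value = sum(segment[k] * cmath.exp(-2j * math.pi * b * k / n)
                    for k in range(n))
        phases.append(cmath.phase(value))
    return phases


if __name__ == "__main__":
    sig = [0.0] * 40
    sig[20] = 1.0
    print(phase_at_mark(sig, [10, 20, 32], 1))
```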
Based on the above-described phase modulation rule, the representative phase calculator 44 calculates a representative phase to be a representative of a plurality of frequency bins or the like from the phase extracted by the phase extractor 42 and outputs a sequence of the representative phase to the determination unit 46.
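As a short sketch, a representative phase of a plurality of frequency bins may be taken as an average or a median of the phases of those bins, which are among the statistics named for the phase modulation rule. The function name and the bin selection argument are assumptions for illustration, and phase wrapping around plus or minus pi is not handled here.

```python
import statistics


def representative_phase(phases, bins, method="mean"):
    """Summarize the phases of the selected frequency bins as one value."""
    selected = [phases[b] for b in bins]
    if method == "median":
        return statistics.median(selected)
    return statistics.mean(selected)


if __name__ == "__main__":
    print(representative_phase([0.0, 0.2, 0.4, 0.6], [1, 2, 3], "median"))
```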
Based on the representative phase value calculated at each pitch mark, the determination unit 46 determines whether there is audio watermarking information. Processing performed by the determination unit 46 will be described in detail with reference to
Then, the determination unit 46 determines whether there is audio watermarking information according to the inclination. More specifically, the determination unit 46 first creates a histogram of an inclination and sets the most frequent inclination as a representative inclination (mode inclination value). Next, as illustrated in
Next, an operation of the audio watermarking information detection apparatus 4 will be described.
In step S302, at each pitch mark, the phase extractor 42 extracts, as a window length, a width which is twice as wide as the shorter one of the preceding and following pitch widths, and extracts a phase.
In step S304, based on a phase modulation rule, the representative phase calculator 44 calculates a representative phase to be a representative of a plurality of frequency bins from the phase extracted by the phase extractor 42.
In step S306, the CPU determines whether all pitch marks in a frame are processed. When determining that all pitch marks in the frame are processed (S306: Yes), the CPU goes to processing in S308. When determining that not all of the pitch marks in the frame are processed (S306: No), the CPU goes to processing in S302.
In step S308, the determination unit 46 calculates an inclination of a straight line (inclination of representative phase) which is formed by a representative phase in each frame.
In step S310, the CPU determines whether all frames are processed. When determining that all frames are processed (S310: Yes), the CPU goes to processing in S312. Further, when determining that not all of the frames are processed (S310: No), the CPU goes to processing in S302.
In step S312, the determination unit 46 creates a histogram of the inclination calculated in the processing in S308.
In step S314, the determination unit 46 calculates a mode value (mode inclination value) of the histogram created in the processing in S312.
In step S316, based on the mode inclination value calculated in the processing in S314, the determination unit 46 determines whether there is audio watermarking information.
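The steps from S308 to S316 may be sketched as follows, under stated assumptions. The inclination in each frame is obtained here by a least-squares line fit, the histogram uses a fixed bin width, and the final decision checks whether the mode inclination value lies between a lower and an upper threshold. The fitting method, the bin width, the thresholds, and the function names are all assumptions for illustration, and phase unwrapping is omitted.

```python
from collections import Counter


def inclination(times, phases):
    """Least-squares slope of the line through (time, representative phase)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_p = sum(phases) / n
    num = sum((t - mean_t) * (p - mean_p) for t, p in zip(times, phases))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


def mode_inclination(inclinations, bin_width=0.1):
    """Histogram the inclinations and return the center of the fullest bin."""
    counts = Counter(round(s / bin_width) for s in inclinations)
    best_bin, _ = counts.most_common(1)[0]
    return best_bin * bin_width


def has_watermark(inclinations, lower, upper, bin_width=0.1):
    """Assumed decision rule: the mode inclination lies in [lower, upper]."""
    return lower <= mode_inclination(inclinations, bin_width) <= upper


if __name__ == "__main__":
    per_frame = [0.5, 0.52, 0.48, 0.0, 1.3]
    print(mode_inclination(per_frame), has_watermark(per_frame, 0.4, 0.6))
```

Using the mode instead of the mean makes the decision less sensitive to frames in which the pitch mark or the phase is estimated poorly.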
In such a manner, the audio watermarking information detection apparatus 4 extracts a phase at each pitch mark and determines whether there is audio watermarking information based on a frequency of an inclination of a straight line formed by a representative phase. Note that the determination unit 46 does not necessarily determine whether there is audio watermarking information by performing the processing illustrated in
Example of Different Processing Performed by Determination Unit 46
The determination unit 46 calculates a correlation coefficient with respect to a representative phase by shifting the reference straight line longitudinally in each analysis frame. As illustrated in
Further, the determination unit 46 may learn a model statistically with an inclination of a straight line, which is formed by a representative phase of synthetic sound including audio watermarking information, as a feature amount and may determine whether there is audio watermarking information with likelihood as a threshold. Further, the determination unit 46 may learn a model statistically with an inclination of a straight line, which is formed by a representative phase of each of synthetic sound including audio watermarking information and synthetic sound not including audio watermarking information, as a feature amount. Then, the determination unit 46 may determine whether there is audio watermarking information by comparing likelihood values.
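As a hedged sketch of the likelihood comparison, a single Gaussian may be fitted to inclinations observed in synthetic sound including audio watermarking information and another to inclinations observed in synthetic sound not including it, and a new inclination may be assigned to the model with the higher log-likelihood. The choice of a single Gaussian, the variance floor, and the function names are assumptions for illustration; the description does not specify the statistical model.

```python
import math


def fit_gaussian(values):
    """Return (mean, variance) of the training inclinations."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, max(variance, 1e-6)  # floor avoids division by zero


def log_likelihood(x, model):
    """Log-density of x under a Gaussian (mean, variance) model."""
    mean, variance = model
    return -0.5 * (math.log(2.0 * math.pi * variance)
                   + (x - mean) ** 2 / variance)


def is_watermarked(x, marked_model, clean_model):
    """Compare likelihood values of the two models for inclination x."""
    return log_likelihood(x, marked_model) > log_likelihood(x, clean_model)


if __name__ == "__main__":
    marked = fit_gaussian([0.48, 0.50, 0.52])
    clean = fit_gaussian([-0.02, 0.00, 0.02])
    print(is_watermarked(0.49, marked, clean))
```

The single-model variant described first would instead compare `log_likelihood(x, marked_model)` against a fixed threshold.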
A program executed in each of the speech synthesizer 1 and the audio watermarking information detection apparatus 4 of the present embodiment is provided by being recorded, as a file in a format which can be installed or executed, in a computer-readable recording medium such as a CD-ROM, a flexible disk (FD), a CD-R, or a digital versatile disk (DVD).
Further, each program of the present embodiment may be stored in a computer connected to a network such as the Internet and may be provided by being downloaded through the network.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
This application is a continuation of PCT international application Ser. No. PCT/JP2013/050990 filed on Jan. 18, 2013 which designates the United States; the entire contents of which are incorporated herein by reference.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/JP2013/050990 | Jan 2013 | US
Child | 14801152 | | US