This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. JP2003-149034, filed on May 27, 2003, the entire contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to a speech rate conversion apparatus for changing a speech rate of a speech signal.
2. Background Art
As a general technique for converting the rate of input speech, a waveform processing method that compresses and expands speech on the time axis by PICOLA (Pointer Interval Control OverLap and Add) is known (see, for example, “Compression and Expansion on Time Axis of Speech Using Pointer Interval Control OverLap and Add (PICOLA) Method and its Evaluation”, Naotaka Morita and Fumitada Itakura, Discourse Collected Papers of Acoustical Society of Japan, October, 1986, 1-4-14, p.149-150).
In this speech rate conversion, the input speech data is cut out in frames of a certain length, the pitch period within each frame is obtained using an autocorrelation function or the like, and compression and expansion processing is performed.
However, in this method, when near-random sound such as the babble of crowds or the sound of waves is present as background sound in addition to the speech, the expansion processing generates an additional unpleasant parasitic sound (probably a kind of musical noise) whose period corresponds to the period of waveform insertion.
On the other hand, as a method that does not produce the unpleasant parasitic sound described above, a method of randomizing and superimposing phases is known (see, for example, Japan Patent Application KOKAI No. 5-108095, (Paragraph 0015,
However, this method also requires complicated processing in which phases are randomized and the resulting randomized-phase speech segment waveforms are added or superimposed while being shifted. Because the processing load is large, it is difficult to implement this method in a system that requires real-time processing.
As described above, the conventional art of speech rate conversion has the problem that an additional unpleasant sound, with a period corresponding to the period of waveform insertion, is generated when near-random sound is present as background sound.
Also, the known solution to this problem, in which phases are randomized and the resulting randomized-phase speech segment waveforms are added or superimposed while being shifted, has the problem that complicated processing is required and the processing load is large, so that it is difficult to implement in a system that requires real-time processing.
The invention has therefore been made in view of the problems described above, and an object of the invention is to provide a speech rate conversion apparatus that achieves good sound quality by relatively simple processing, without generating unpleasant parasitic sound even when near-random sound is present as background sound.
In order to achieve the object, the invention is characterized by including a pitch period calculation unit configured to calculate a pitch period from an input speech signal, and an expansion processing unit configured to perform expansion processing by cutting a speech waveform out of the speech signal by the pitch period and inserting into the speech signal an inverted waveform obtained by time axis inversion of the speech waveform.
As a result, speech rate conversion with good sound quality, free of unpleasant parasitic sound, can be implemented relatively simply.
The present invention may be more readily described with reference to the accompanying drawings:
An embodiment of the invention will be described below using the drawings.
The speech rate conversion apparatus 100 includes a speech waveform frame extraction part 1, a pitch period calculation part 2 and a time axis expansion part 3. The speech waveform frame extraction part 1 cuts a speech waveform of a predetermined frame length out of an input speech signal in order to obtain a pitch period. The pitch period calculation part 2 calculates a pitch period Tp from a speech signal cut out in the speech waveform frame extraction part 1, and inputs this pitch period Tp to the time axis expansion part 3.
Here, a method using an autocorrelation function will be described as the method for calculating the pitch period. In this method, the autocorrelation is obtained on the assumption that the input speech signal has a finite time length, being present only within an interval of the frame length Tc (corresponding to the frame length described above) and always zero outside that interval. Such a short-time autocorrelation value Rn(k) is given by mathematical formula 1.
Here, Tc is the time interval in which the input speech signal is assumed to be present, and k is the delay applied to the speech waveform when the short-time autocorrelation value Rn(k) is calculated, where Tc>>k. The value of k that maximizes the short-time autocorrelation value Rn(k) in mathematical formula 1 is the pitch period. The pitch period Tp thus obtained is sent to the time axis expansion part 3, which performs expansion processing as described below.
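The pitch search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the apparatus itself: the function names `short_time_autocorr` and `estimate_pitch_period` are hypothetical, the sum R(k) = Σ x[m]·x[m+k] over the frame is the usual rectangular-window form assumed here for formula 1, and the search bounds `k_min` and `k_max` are an added assumption (the text only requires Tc>>k). A lower bound is needed in practice because the autocorrelation is trivially largest at k = 0.

```python
import math


def short_time_autocorr(frame, k):
    """Short-time autocorrelation at lag k.

    The signal is treated as zero outside the frame, so the sum runs
    only over index pairs (m, m + k) that both fall inside the frame.
    """
    return sum(frame[m] * frame[m + k] for m in range(len(frame) - k))


def estimate_pitch_period(frame, k_min, k_max):
    """Return the lag in [k_min, k_max] with the largest autocorrelation.

    k_min and k_max are assumed search bounds, not values from the text.
    """
    if not 0 < k_min <= k_max < len(frame):
        raise ValueError("need 0 < k_min <= k_max < len(frame)")
    return max(range(k_min, k_max + 1),
               key=lambda k: short_time_autocorr(frame, k))


# Synthetic test frame: a sinusoid with a period of 50 samples.
frame = [math.sin(2 * math.pi * n / 50) for n in range(800)]
tp = estimate_pitch_period(frame, k_min=20, k_max=80)
```

For real speech, the bounds would typically be chosen from the expected range of fundamental frequencies at the sampling rate in use.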
In the expansion processing, as shown in
As shown in
The created speech waveforms of the waveform C1 and the waveform C2 and the speech waveforms of the waveform D1 and the waveform D2 are respectively added and speech waveforms of a waveform C and a waveform D are created (
Finally, the waveform A″ is inserted between the waveform A and the waveform B, so that a waveform of length Tc=Tp/(R−1) is expanded into a waveform of length Tc+Tp=RTp/(R−1), which satisfies the expansion coefficient R (
With the configuration described above, the unpleasant parasitic sound that would otherwise be generated with a period corresponding to each frame cut out of the input speech signal is not generated, because the inserted speech waveform has been converted by time axis inversion. Also, because the initial end and terminal end portions of the inserted speech waveform are formed from waveforms multiplied by weighting coefficients changing linearly from 0 to 1 or from 1 to 0, the inserted waveform A″ joins the waveform A and the waveform B smoothly, so that a speech waveform with small distortion is obtained even when expansion processing is performed. Further, the inserted speech waveform can be produced by the relatively simple processing of time axis inversion.
Here, the embodiment in which expansion processing is performed by inserting the waveform A″ into which the speech waveform of the waveform A is converted has been described, but it can similarly be applied to the case of converting the speech waveform of the waveform B.
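The lengths quoted for the expansion coefficient R can be checked with a short derivation. As an assumption for this sketch, Tc is read as the length of the unexpanded waveform and exactly one waveform A″ of length Tp is inserted per interval:

```latex
\begin{aligned}
T_c + T_p &= \frac{T_p}{R-1} + T_p
           = \frac{T_p + (R-1)\,T_p}{R-1}
           = \frac{R\,T_p}{R-1}, \\
\frac{T_c + T_p}{T_c} &= \frac{R\,T_p/(R-1)}{T_p/(R-1)} = R .
\end{aligned}
```

For instance, with R = 1.5 and Tp = 100 samples, Tc = 200 samples and the expanded length is 300 samples, a ratio of 1.5.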
A flow of expansion processing in the embodiment of the invention will be described below using a flowchart of
The portion of the waveform A extending a length Lp from its end adjoining the waveform B is multiplied by a weighting coefficient changing linearly from 0 to 1, and a waveform D1 is created. Similarly, the portion of the waveform B extending a length Lp from its end adjoining the waveform A is multiplied by a weighting coefficient changing linearly from 1 to 0, and a waveform C1 is created. Further, the portions of length Lp at the initial end and the terminal end of the waveform A′ are multiplied by weighting coefficients changing linearly from 0 to 1 and from 1 to 0, respectively, and a waveform C2 and a waveform D2 are created (S5).
The waveform C1 and the waveform C2 are added to create a waveform C (S6A). Similarly, the waveform D1 and the waveform D2 are added to create a waveform D (S6B).
Then, the portions of length Lp at the initial end and the terminal end of the waveform A′ are cut out and replaced with the waveform C and the waveform D, respectively, to form a waveform A″ (S7). The waveform A″ is then inserted between the waveform A and the waveform B (S8), whereby the speech waveform is expanded. Steps S1 to S8 are repeated for the next frame, and when no further input speech signal to be expanded is supplied, the expansion processing ends (S9).
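Steps S5 to S8 can be illustrated with a minimal Python sketch. It is written under stated assumptions rather than as the embodiment itself: the waveform A is taken to be a segment immediately followed by the waveform B, the waveform A′ is taken to be the time-reversed waveform A, and the function names `linear_ramp`, `build_insert` and `expand_once` are hypothetical.

```python
def linear_ramp(n, rising):
    """n weights changing linearly from 0 to 1 (rising) or from 1 to 0."""
    up = [i / (n - 1) for i in range(n)]
    return up if rising else up[::-1]


def build_insert(a, b, lp):
    """Create waveform A'' from waveform A and the following waveform B."""
    if lp < 2 or 2 * lp > len(a) or lp > len(b):
        raise ValueError("need 2 <= lp, 2*lp <= len(a) and lp <= len(b)")
    a_rev = a[::-1]                      # A': time axis inversion of A (assumed)
    up, down = linear_ramp(lp, True), linear_ramp(lp, False)

    # S5: weighted end portions of length Lp.
    d1 = [x * w for x, w in zip(a[-lp:], up)]        # tail of A, 0 -> 1
    c1 = [x * w for x, w in zip(b[:lp], down)]       # head of B, 1 -> 0
    c2 = [x * w for x, w in zip(a_rev[:lp], up)]     # head of A', 0 -> 1
    d2 = [x * w for x, w in zip(a_rev[-lp:], down)]  # tail of A', 1 -> 0

    # S6A / S6B: add the weighted pairs.
    c = [p + q for p, q in zip(c1, c2)]
    d = [p + q for p, q in zip(d1, d2)]

    # S7: replace both Lp-long ends of A' with C and D.
    return c + a_rev[lp:len(a_rev) - lp] + d


def expand_once(a, b, lp):
    """S8: insert A'' between waveform A and waveform B."""
    return a + build_insert(a, b, lp) + b


a = [1, 2, 3, 4, 5, 6]
b = [7, 8, 9, 10, 11, 12]
inserted = build_insert(a, b, lp=2)
expanded = expand_once(a, b, lp=2)
```

In this toy example the waveform A″ begins with the first sample of B and ends with the last sample of A, so both joins continue the original signal; the time-reversed samples appear only in the interior of the inserted segment.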
Here, the expansion processing implemented in the speech rate conversion apparatus configured in
As described above, according to the invention, speech rate conversion with good sound quality, free of unpleasant parasitic sound, can be implemented by relatively simple processing.
Number | Date | Country | Kind |
---|---|---|---|
P2003-149034 | May 2003 | JP | national |