The invention relates to a method and an apparatus for dealing with errors in the transmission of speech.
In order to transmit speech signals via cable-based or wire-free networks, it is known for a speech signal to be transmitted on the basis of speech signal frames, wherein, after reception of the speech signal frames, a receiver uses these speech signal frames to produce a speech signal to be output. In this case, the speech signal frames are preferably transmitted as data in the form of so-called packets via networks, for example a GSM network, a network based on the Internet Protocol, or a network based on the WLAN protocol, in which case a speech signal frame may be lost because data is transmitted with errors. It is likewise possible, when data is transmitted in packet-switched form, for an excessively long time delay to occur in the transmission of a speech signal frame, as a result of which this speech signal frame cannot be considered in the course of a continuous output of a speech signal, because the speech signal frame that has been transmitted with a delay, or else lost, is not available for outputting the speech signal. If no signal at all is inserted at the appropriate point in the speech signal to be output instead of the speech signal frame which has not been received, the speech signal to be output simply fails at the corresponding point, degrading the acoustic quality of the speech signal. For this reason, it is necessary to use a substitute speech signal frame, instead of a speech signal frame which has not been received, in order to achieve so-called error concealment.
The fundamental principle for transmission of a speech signal on the basis of speech signal frames and for production of the speech signal on the basis of these speech signal frames is illustrated in
According to the exemplary embodiment in
In this case, only those values are used for the fundamental frequency which are plausible for human speech signals. In the situation where a speech signal without voice is present, which has a noise-like character and therefore does not have a clear fundamental frequency, the fundamental frequency 54 is set to a minimum value, in order to reduce artefacts in the high-frequency range which result from unnatural periodicities in the signal to be determined.
An estimated remaining signal 55 is determined by means of an estimation unit 65, on the basis of the remaining signal 52 and the fundamental frequency 54. The estimated remaining signal 55 is passed to a linear prediction synthesis filter 66, which uses the previously determined linear prediction coefficients 51 to subject the estimated remaining signal 55 to synthesis filtering, as a result of which the speech signal for the substitute speech signal frame 100 is obtained. In this way, the spectral envelope of the speech signal is extrapolated, while the periodic structure of the signal is maintained at the same time.
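A minimal sketch of this synthesis step, assuming numpy/scipy and illustrative names (lpc_coeffs for the linear prediction coefficients 51, est_residual for the estimated remaining signal 55); the text above does not prescribe an implementation:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synthesis(est_residual: np.ndarray, lpc_coeffs: np.ndarray) -> np.ndarray:
    """Pass the estimated remaining (residual) signal through an all-pole LP
    synthesis filter 1/A(z), with A(z) = 1 + a1*z^-1 + ... + ap*z^-p built
    from the previously determined prediction coefficients (illustrative
    sign convention)."""
    a = np.concatenate(([1.0], lpc_coeffs))
    return lfilter([1.0], a, est_residual)
```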
As shown in
For the situation in which a further, third substitute speech signal frame must be produced, the fundamental frequency 54 is once again varied in order to produce the further, third substitute speech signal frame, by obtaining the fundamental frequency 54 on the basis of that speech signal frame which was received two positions before the most recently received, first speech signal frame 1 in the time sequence. In the situation where further substitute speech signal frames must be produced after three substitute speech signal frames have already been determined, the fundamental frequency is not modified any further. Instead of this, all the further substitute speech signal frames are produced by means of that fundamental frequency 54 which was used to produce the third substitute speech signal frame. This fundamental frequency 54 for production of the third substitute speech signal frame is used until the end of the reception interference.
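The selection of the fundamental frequency over consecutive substitute frames can be summarized in a short sketch; the indexing, and the assumption that the first substitute frame uses the pitch of the most recently received frame, are expressed here in illustrative form only:

```python
def pitch_for_substitute(received_pitches: list, n_lost: int) -> float:
    """Fundamental frequency for the n_lost-th consecutive substitute frame
    (n_lost = 1, 2, 3, ...). received_pitches holds the fundamental
    frequencies of the correctly received frames in time order, so
    received_pitches[-1] belongs to the most recently received frame."""
    # Go back one further received frame per substitute frame, but freeze
    # the value from the third substitute frame onwards.
    steps_back = min(n_lost - 1, 2)
    return received_pitches[-1 - steps_back]
```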
Substitute speech signal frames produced in this way are used instead of the speech signal frames which have not been received. A smooth transition between frames is preferably used when producing the speech signal 11 to be output.
The method according to the invention, in contrast, has the advantage that, when a speech signal is estimated for a substitute speech signal frame, a better signal quality is achieved in those situations in which the speech signal in the substitute speech signal frame is produced on the basis of a received speech signal frame containing a speech signal without voice. This is achieved in that, when a received speech signal frame has a speech signal without voice, the speech signal of the at least one substitute speech signal frame is produced by means of a noise signal. In this case, noise signals are signals which have no clear fundamental frequency. A random signal with a uniform distribution within a specific value range is preferably used as the noise signal.
According to a further embodiment of the invention, in the situation in which the at least one previously received speech signal frame has a speech signal with voice, the speech signal of the at least one substitute speech signal frame is produced by means of a fundamental frequency signal. This has the advantage that, as a result of the distinction as to whether a speech signal does or does not have voice, and the corresponding use of a noise signal or of a fundamental frequency signal to produce the speech signal for the substitute speech signal frame, greater flexibility exists for the production of this speech signal.
According to a further embodiment of the invention, a uniformly distributed noise signal multiplied by a scaling factor is used as the noise signal. This has the advantage that scaling of the noise signal allows the amplitude or the signal energy of the noise signal to be adapted, and thus the amplitude or the energy of the speech signal estimated from it in the substitute speech signal frame. This adaptation therefore results in a speech signal in the substitute speech signal frame which is as similar as possible to the speech signal in the previously received speech signal frame.
According to a further embodiment of the invention, the scaling factor is determined as a function of the signal energy in the filtered speech signal which results from filtering the speech signal of the previously received speech signal frame by means of a linear prediction filter. This has the advantage that a scaling factor determined in this way produces, by multiplication with the noise signal, an estimated noise signal whose signal energy is as similar as possible to the signal energy of the speech signal previously obtained by linear prediction, specifically because the estimated noise signal is subsequently filtered again by a linear synthesis filter using the linear prediction coefficients of the preceding analysis filter, in order to obtain the signal for the substitute speech signal frame.
According to a further embodiment of the invention, after filtering by an analysis filter for linear prediction, the filtered speech signal is subdivided into respective partial frames within the respective speech signal frames, wherein the respective signal energy of the partial speech signal is determined for each partial frame. The scaling factor is determined as a function of that signal energy which has the lowest value among the respective signal energies. This results in scaling factors, and therefore estimated remaining signals, which lead to speech signals for a substitute speech signal frame that provide a high perceptive quality, from the acoustic point of view, for a listener when the speech signal to be output is produced.
According to a further embodiment of the invention, a decision is made as to whether a previously received speech signal frame has a speech signal with or without voice, as a function of a normalized autocorrelation function of the speech signal of the received speech signal frame and as a function of a zero crossing rate of the speech signal of the received speech signal frame. This has the advantage that such linking of a normalized autocorrelation function and a zero crossing rate makes it possible to make a more reliable decision than in the prior art as to whether the speech signal does or does not have voice.
According to another independent claim, a controller is claimed for outputting a speech signal. The controller has a first interface via which the controller receives speech signal frames. Furthermore, the controller has a computation unit, which uses the received speech signal frames in a predetermined sequence to produce the speech signal to be output. The controller according to the invention uses a second interface to output the speech signal to be output. In the situation in which at least one speech signal frame to be received has not been received, the computation unit uses a substitute speech signal frame instead of the at least one speech signal frame which has not been received, with the computation unit producing the substitute speech signal frame as a function of at least one previously received speech signal frame. The controller according to the invention is characterized in that, in the situation in which the previously received speech signal frame has a speech signal without voice, the computation unit produces the speech signal of the at least one substitute speech signal frame by means of a noise signal. This has the advantage that the use of a noise signal to produce the speech signal for the substitute speech signal frame results in better perceptive quality, from the acoustic point of view, for a listener than in the case of methods according to the prior art, in which a fundamental frequency signal is always used to produce the substitute speech signal frame.
According to another independent claim, a controller is claimed in which, in the situation in which the previously received speech signal frame has a speech signal with voice, the computation unit produces the speech signal of the substitute speech signal frame by means of a fundamental frequency signal. This has the advantage that the use of the fundamental frequency signal or of a noise signal to produce the speech signal for the substitute speech signal frame makes it possible to produce a speech signal that matches the voiced or unvoiced character of the speech signal in the previously received speech signal frame.
According to a further independent claim, a controller is claimed which furthermore has a memory unit, which provides the noise signal and/or the fundamental frequency signal. This has the advantage that the noise signal and/or the fundamental frequency signal need not be produced by the computation unit itself, for example by means of a shift register, but can instead be called up in a simple manner from the memory unit.
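Purely as an illustration of this structure, the following is a minimal sketch in Python, assuming hypothetical names (Controller, receive, conceal, output) and a fixed frame length; none of these details are prescribed by the claims:

```python
import numpy as np

class Controller:
    """Illustrative controller: a first interface receives speech signal
    frames, a computation unit produces the output signal, a second
    interface emits it, and a memory unit provides precomputed noise and
    fundamental-frequency (pitch) excitation signals."""

    def __init__(self, frame_len: int = 160):
        self.frame_len = frame_len
        self.last_frame = None
        # Memory unit: excitation signals are stored rather than generated
        # on the fly by the computation unit.
        self.memory = {
            "noise": np.random.uniform(-1.0, 1.0, frame_len),
            "pitch_pulse": np.zeros(frame_len),
        }

    def receive(self, frame: np.ndarray) -> None:       # first interface
        self.last_frame = frame

    def conceal(self) -> np.ndarray:
        """Substitute frame for a missing frame (unvoiced case only in this
        sketch): noise from the memory unit, scaled to the last frame."""
        if self.last_frame is None:
            return np.zeros(self.frame_len)
        gain = np.sqrt(np.mean(self.last_frame ** 2))
        return gain * self.memory["noise"]

    def output(self, frame: np.ndarray) -> np.ndarray:  # second interface
        return frame
```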
Exemplary embodiments of the invention are illustrated in the drawing and will be explained in more detail in the following description.
Furthermore,
A second switching unit 89 is likewise switched as a function of the modified decision 73 in order to tap off the modified estimated remaining signal 75, such that either the remaining signal produced by a modified fundamental frequency or the remaining signal produced by a noise signal is tapped off depending on whether the speech signal in the received speech signal frame 50 does or does not have voice. This modified estimated remaining signal 75 is passed to a synthesis filter for linear prediction, which uses the linear prediction coefficients 51 obtained for synthesis. The speech signal for the substitute speech signal frame 100 is therefore produced at the output of the synthesis filter of the linear prediction means 66.
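A short sketch of this switching step, assuming numpy/scipy and illustrative names (the voiced flag standing for the modified decision 73, the two residual arguments for the pitch-based and noise-based excitations); this is not taken verbatim from the embodiment:

```python
import numpy as np
from scipy.signal import lfilter

def build_substitute_frame(voiced: bool,
                           periodic_residual: np.ndarray,
                           noise_residual: np.ndarray,
                           lpc_coeffs: np.ndarray) -> np.ndarray:
    """Select the excitation according to the voiced/unvoiced decision
    (second switching unit) and pass it through the LP synthesis filter
    built from the previously obtained prediction coefficients."""
    excitation = periodic_residual if voiced else noise_residual
    a = np.concatenate(([1.0], lpc_coeffs))  # all-pole filter 1/A(z)
    return lfilter([1.0], a, excitation)
```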
The decision as to whether the speech signal in the received speech signal frame 50 does or does not have voice is preferably made in the modified decision unit 83 as a function of a normalized autocorrelation function of the speech signal and of a zero crossing rate of the speech signal. For a preferably digital speech signal x(n) of length N, with the index n=0, . . . , N−1 and a previously determined period length P0 of a fundamental frequency, the normalized autocorrelation function ζ(x(n)) is preferably determined using the calculation rule:
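The calculation rule itself is not reproduced above. A common form of the normalized autocorrelation at the previously determined period length P0, given here only as a reconstruction consistent with the definitions in the text, is:

$$
\zeta\bigl(x(n)\bigr)=\frac{\displaystyle\sum_{n=P_0}^{N-1} x(n)\,x(n-P_0)}{\sqrt{\displaystyle\sum_{n=P_0}^{N-1} x^{2}(n)\;\sum_{n=P_0}^{N-1} x^{2}(n-P_0)}}
$$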
Furthermore, the zero crossing rate zcr(x(n)) for the speech signal x(n) is preferably determined by means of the calculation rule:
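Again, the rule is not reproduced above; a typical definition, given as a reconstruction, is:

$$
\mathrm{zcr}\bigl(x(n)\bigr)=\frac{1}{2(N-1)}\sum_{n=1}^{N-1}\bigl|\operatorname{sign}\bigl(x(n)\bigr)-\operatorname{sign}\bigl(x(n-1)\bigr)\bigr|
$$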
where the expression SIGN represents the sign function, that is to say the mathematical sign function. According to the embodiment of the invention, a decision is then made that the signal x(n) has voice when
The first threshold value thr1 is preferably chosen to be the value 0.5. A person skilled in the art would choose the second threshold value thr2 from analysis of empirical data of zero crossing rates zcr(x(n)) of speech signals with and without voice.
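The exact combination of the two criteria is not reproduced above. The following Python sketch assumes the usual linking, namely that a frame is judged to have voice when the normalized autocorrelation exceeds thr1 and the zero crossing rate stays below thr2; thr2 = 0.3 and the AND-combination are assumptions, only thr1 = 0.5 is fixed by the text:

```python
import numpy as np

def is_voiced(x: np.ndarray, p0: int, thr1: float = 0.5, thr2: float = 0.3) -> bool:
    """Voiced/unvoiced decision from the normalized autocorrelation at lag p0
    and the zero crossing rate. The AND-combination and thr2 are assumptions;
    the text above only fixes thr1 = 0.5."""
    num = np.sum(x[p0:] * x[:-p0])
    den = np.sqrt(np.sum(x[p0:] ** 2) * np.sum(x[:-p0] ** 2))
    zeta = num / den if den > 0 else 0.0
    zcr = np.mean(np.abs(np.sign(x[1:]) - np.sign(x[:-1]))) / 2.0
    return bool(zeta > thr1 and zcr < thr2)
```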
According to a further embodiment of the invention, a uniformly distributed noise signal is used as the noise signal 76, with the modified estimated remaining signal being obtained by multiplication of the noise signal by a scaling factor or a gain factor 77. The scaling factor 77 is in this case preferably determined as a function of the signal energy in the filtered speech signal 52. According to one particular embodiment in this case, as shown in
If the minimum E = min{E1, E2, E3, E4} of the signal energies that are present in the partial frames 201 to 204 is now determined in accordance with the exemplary embodiment, the noise signal 76, r(n), is preferably scaled such that √E is chosen as the scaling factor or gain factor 77. The estimated remaining signal 75 when the speech signal in the received speech signal frame 50 does not have voice is therefore preferably determined to be: r̂(n) = √E · r(n).
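A minimal sketch of this scaling step, assuming numpy, four equal-length partial frames, and illustrative names; whether the "signal energy" per partial frame is the sum or the mean of the squared samples is not stated above, so the mean square is used here:

```python
import numpy as np

def scaled_noise_excitation(residual: np.ndarray, n_partial: int = 4) -> np.ndarray:
    """Scale a uniformly distributed noise signal r(n) by sqrt(E), where E is
    the smallest of the partial-frame signal energies E1..E4 of the filtered
    (residual) speech signal of the previously received frame."""
    parts = np.array_split(residual, n_partial)
    # Signal energy per partial frame (mean square used as an assumption).
    energies = [float(np.mean(p ** 2)) for p in parts]
    gain = np.sqrt(min(energies))
    noise = np.random.uniform(-1.0, 1.0, len(residual))  # uniform noise r(n)
    return gain * noise                                   # r_hat(n) = sqrt(E) * r(n)
```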
In the situation in which the previously received speech signal frame has a speech signal with voice, the computation unit 1003 preferably produces the speech signal of the substitute speech signal frame by means of a fundamental frequency signal.
This controller 1000 preferably has a memory unit 1005, which provides a fundamental frequency signal and/or a noise signal.
Foreign Application Priority Data

Number | Date | Country | Kind
---|---|---|---
10 2008 042 579 | Oct 2008 | DE | national

PCT Filing Data

Filing Document | Filing Date | Country | Kind | 371(c) Date
---|---|---|---|---
PCT/EP2009/062527 | 9/28/2009 | WO | 00 | 5/26/2011

PCT Publication Data

Publishing Document | Publishing Date | Country | Kind
---|---|---|---
WO2010/037713 | 4/8/2010 | WO | A

References Cited: U.S. Patent Documents

Number | Name | Date | Kind
---|---|---|---
4589131 | Horvath et al. | May 1986 | A
5909663 | Iijima et al. | Jun 1999 | A
5953697 | Lin et al. | Sep 1999 | A
7411985 | Lee et al. | Aug 2008 | B2
7590531 | Khalil et al. | Sep 2009 | B2
7693710 | Jelinek et al. | Apr 2010 | B2
7930176 | Chen | Apr 2011 | B2
8121835 | Archibald | Feb 2012 | B2
8255207 | Vaillancourt et al. | Aug 2012 | B2
20040184443 | Lee et al. | Sep 2004 | A1
20060271359 | Khalil et al. | Nov 2006 | A1

References Cited: Foreign Patent Documents

Number | Date | Country
---|---|---
9281996 | Oct 1997 | JP
2001022367 | Jan 2001 | JP

References Cited: Other Publications

J. Paulus, Codierung breitbandiger Sprachsignale bei niedriger Datenrate. Dissertation, IND, RWTH Aachen, Templergraben 55, 52056 Aachen, 1997.
P. Vary, U. Heute, W. Hess, Digitale Sprachsignalverarbeitung, B.G. Teubner Verlag, Stuttgart, 1998, ISBN 3-519-06165-1.
PCT/EP2009/062527 International Search Report.
Xiaoli, Wang et al., "Reconstruction of Missing Speech Packet Using Trend-Considered Excitation", Signal Processing, 2002 6th International Conference on, Aug. 26-30, 2002, vol. 2, pp. 1680-1683, Piscataway, NJ.
Gündüzhan, Emre et al., "A Linear Prediction Based Packet Loss Concealment Algorithm for PCM Coded Speech", IEEE Transactions on Speech and Audio Processing, New York, NY, vol. 9, No. 8, pp. 778-785, Nov. 2001.

Prior Publication Data

Number | Date | Country
---|---|---
20110218801 A1 | Sep 2011 | US