The present invention relates to encoding and decoding of broadband signals, in particular audio signals such as speech signals. The invention relates both to an encoder and a decoder, and to an audio bit stream encoded in accordance with the invention and a data storage medium on which such an audio bit stream has been stored.
When transmitting broadband signals, e.g. audio signals sampled at 32 kHz or higher (which includes speech signals), compression or encoding techniques are used to reduce the bit rate of the signal, whereby the bandwidth needed for transmission is reduced correspondingly.
Linear predictive coding (LPC) is a technique often used in speech encoding. The main idea of LPC is to pass the input signal through a prediction filter (analysis) whose output signal is a spectrally flattened signal. The spectrally flattened signal can be encoded using fewer bits. The bit rate reduction is achieved by retaining an important part of the signal structure in the prediction filter parameters, which vary slowly over time. The spectrally flattened signal coming out of the prediction filter is usually referred to as the residual. The terms residual and flattened signal are thus synonyms that are used interchangeably.
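The flattening step can be illustrated with a minimal sketch. It assumes the textbook autocorrelation method and a plain tapped-delay-line predictor; the function names `lpc_coefficients` and `lpc_residual` are hypothetical and are not taken from the description.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Prediction coefficients a[0..order-1] from the autocorrelation normal equations."""
    # r[k] is the autocorrelation of x at lag k, for k = 0..order.
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    # Toeplitz autocorrelation matrix R[i, j] = r[|i - j|].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])

def lpc_residual(x, a):
    """Residual (spectrally flattened signal): x minus its linear prediction."""
    pred = np.zeros_like(x)
    for k, ak in enumerate(a, start=1):
        pred[k:] += ak * x[:-k]   # contribution of the sample k steps in the past
    return x - pred
```

For a strongly correlated input the residual carries much less energy than the input, which is what makes it cheaper to encode; the slowly varying coefficients `a` carry the removed structure.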
In order to further reduce the required bit rate, a modelling process is applied to the flattened signal to derive a new signal called an excitation signal. This procedure is referred to as residual modelling. The excitation signal is computed in such a way that, when passed through the prediction synthesis filter, it produces a close approximation (according to an appropriate criterion) of the output produced when the spectrally flattened signal is used in the synthesis. This process is called analysis-by-synthesis. Certain constraints imposed on the form of the excitation signal make its representation very efficient from a bit rate point of view.
Three popular methods of computing the excitation signal are the regular pulse excitation (RPE) [1], the multi-pulse excitation (MPE) [2] and CELP-like methods [10]. They basically differ in the constraints imposed on the excitation signal. In RPE the excitation is constrained to consist of equally spaced non-zero values with zeros in between. For narrowband speech (e.g. 8 kHz sampling), decimation factors of 2, 4 and 8 are common. In MPE, on the other hand, very few pulses are used (typically 3-4 for every 5 ms of narrowband speech), but they are not subject to any grid and can be placed anywhere. Usually, the error introduced by the quantisation is also taken into account when computing the excitation. Both methods, RPE and MPE, have been shown to deliver similar performance for the same bit rate. In CELP, a sparse codebook can be used to attain a high compression factor.
Linear predictive coding removes the short-term correlation among input samples, but due to the short length of the analysis filter LPC can do little to remove long-term correlations. Long-term correlations are often present in the flattened signal and they are mainly caused by (quasi) periodicities, which in the case of speech correspond to the voiced utterances. These periodicities become clearly apparent in the residual signal in the form of pulse trains (see
Although the waveform is not exactly periodic, these deviations from ideal periodicity do not greatly affect the LTP performance in the case of narrowband signals (8 kHz sampling) because the time span covered by a single delay is sufficient to absorb the drift in the waveform period. Moreover, LTPs with 2 or 3 prediction coefficients make the system more robust to these fluctuations. LTPs with more than three prediction coefficients are not practical as the longer the filters are, the more prone to instability they become and the more involved the stabilization procedure is [4]. LTPs are successfully used in most current speech encoders.
The application of LPC and pulse excitation to the encoding of broadband (44.1 kHz sampling) speech and audio signals has also been tested, with limited success, some years ago [5, 6]. However, recent developments in the area of linear prediction [7] have renewed the interest in these techniques and some novel work on linear prediction broadband encoding has recently been published [8, 9].
The use of long-term prediction in broadband speech and audio encoding presents several difficulties, which are not encountered in narrowband speech and are caused by the high sampling rate employed (32 kHz or higher). First, and unlike the narrowband situation, a large number of prediction coefficients are required in the LTP to successfully track the fluctuations in the residual periodicities. As already mentioned, LTPs involving more than a few prediction coefficients are impractical due to instability problems [4]. Short LTPs (1, 2 or 3 prediction coefficients) can be used, but the gain achieved by them is minimal. An additional problem is the high computational complexity of the search for the optimum delay. This is due to the fact that signal segments contain a much larger number of samples in comparison to narrowband signals.
Both reasons make the use of LTP unsuitable in broadband (44.1 kHz sampling) audio or speech encoding. Nevertheless, quasi-periodic pulse trains are present in the residual signal and may cause serious problems to the subsequent pulse modelling stage. As an example,
The final signal quality achieved by a conventional pulse encoder is mainly determined by two parameters, namely, the number of pulses per frame and the number of levels used to quantise the resulting pulses. The higher the number of pulses and the number of quantisation levels, the more accurate the representation of the coded signal becomes. On the other hand, in order to achieve a high degree of compression, the number of pulses and quantisation levels must be minimized.
Independently of the number of pulses per frame used, very coarse quantisation of a signal is problematic whenever the signal exhibits a large dynamic range, as some parts of the signal will not be properly represented. This is the situation encountered in residuals that contain occasional large signal amplitudes in a quasi-periodic way (pulse-train like periodicities). The problem is exacerbated when some of the samples are forced to be zero, as is done in RPE or MPE, and also when sparse codebooks are used, as is done in CELP coders.
The inventors appreciate that the different analysis-by-synthesis techniques currently used in speech coding, like RPE, MPE or CELP (or variants thereof), are insufficient for modelling the residual in broadband coding due to the lack of a properly functioning LTP mechanism for this situation. The combination of either RPE and a few extra pulses or CELP and a few extra pulses mitigates this problem because the extra pulses can be effectively used to model the quasi-periodic spikes typically appearing in residual signals exhibiting long-term correlation.
The invention relates to a method of encoding a digital audio signal, wherein for each time segment of the signal the following steps are performed:
spectrally flattening the signal to obtain a spectrally flattened signal,
modelling the spectrally flattened signal by an excitation signal comprising first and second partial excitation signals,
the first partial excitation signal conforming to an excitation signal generated by an RPE or CELP pulse modelling technique,
the second partial excitation signal being a set of extra pulses modelling spikes in the spectrally flattened signal, the extra pulses having arbitrary positions and amplitudes,
and
generating an audio bit stream comprising the first and second partial excitation signals.
The invention also relates to an audio encoder adapted to encode time segments of a digital audio signal, the encoder comprising
a spectral flattening unit for spectrally flattening the signal to output a spectrally flattened signal,
a calculating unit adapted to calculate an excitation signal comprising first and second partial excitation signals,
the first partial excitation signal conforming to an excitation signal generated by an RPE or CELP technique,
the second partial excitation signal being a set of extra pulses modelling spikes in the spectrally flattened signal, the extra pulses having arbitrary positions and amplitudes,
and
an audio bit stream generator for generating an audio bit stream comprising the first and second partial excitation signals.
Further, the invention relates to a method of decoding a received audio bit stream, where the audio bit stream comprises, for each of a plurality of segments of an audio signal:
a first partial excitation signal conforming to an excitation signal generated by an RPE or CELP pulse modelling technique,
a second partial excitation signal being a set of extra pulses modelling spikes in the spectrally flattened signal, the extra pulses having arbitrary positions and amplitudes,
the method comprising synthesising an output signal on the basis of the combined first and second excitation signals and the spectral flattening parameters.
Correspondingly, the invention relates to an audio player for receiving and decoding an audio bit stream, where the audio bit stream comprises for each of a plurality of segments of an audio signal:
a first partial excitation signal conforming to an excitation signal generated by an RPE or CELP technique,
a second partial excitation signal being a set of extra pulses modelling spikes in the spectrally flattened signal, the extra pulses having arbitrary positions and amplitudes,
the audio player comprising means for synthesising an output signal from the combined partial excitation signals and spectral flattening parameters.
Finally, the invention relates to an audio bit stream comprising for each of a plurality of segments of an audio signal:
a first partial excitation signal conforming to an excitation signal generated by an RPE or CELP technique,
a second partial excitation signal being a set of extra pulses modelling spikes in the spectrally flattened signal, the extra pulses having arbitrary positions and amplitudes;
and to a storage medium having such an audio bit stream stored thereon.
Embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
In
In accordance with the invention the problem of encoding quasi-periodicities in the spectrally flattened signal, in particular pulse-like trains, is solved by extending the pulse model, whereby a conventional RPE signal is supplemented by additional pulses with free gains and positions, i.e. the positions in time of the added pulses are not necessarily dictated by the RPE time grid, nor are the gains of the extra pulses dictated by the quantisation grid of the conventional RPE signal. The objective of these extra pulses is to model the residual spikes that would otherwise not be modelled. Hereby more freedom is given to the RPE signal to model the rest of the signal. This procedure can be interpreted as the non-obvious fusion of RPE and MPE, where the MPE pulses model the signal spikes and the RPE pulses model the rest of the residual. The procedure is non-obvious since RPE and MPE have until now been considered competing techniques, but in the absence of an LTP they can be made to act in a complementary manner.
Although the number of extra pulses, K, can be set arbitrarily, it will in practice be limited to 1 or 2 per frame. The reason for this is that the pitch in human speech is within the range 50-400 Hz, and processing usually takes place in 5 ms segments; consequently there are only one or two cycles, i.e. one or two large peaks, in any given segment.
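The limit of one or two peaks per segment follows from simple arithmetic, sketched below. The function name `pitch_peaks_per_frame` is hypothetical; the 50-400 Hz range and the 5 ms segment length are the values stated above.

```python
def pitch_peaks_per_frame(pitch_hz: float, frame_ms: float = 5.0) -> float:
    """Average number of pitch periods, and hence large residual peaks, per frame."""
    return pitch_hz * frame_ms / 1000.0

# Sweep the stated pitch range for 5 ms frames.
for f0 in (50.0, 100.0, 200.0, 400.0):
    print(f"{f0:5.0f} Hz -> {pitch_peaks_per_frame(f0):.2f} peaks per 5 ms frame")
```

At the upper end of the range (400 Hz) a 5 ms frame holds two pitch periods, and at lower pitches it holds fewer, so K=1 or K=2 covers the peaks that can occur in one frame.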
In a preferred embodiment of the method of the invention the number of quantisation levels has been fixed to 3 (1, 0, −1). The decimation factor can be arbitrarily set, although decimations 2 and 8 are preferred for obtaining excellent and good quality, respectively. The very coarse quantisation of the pulses determines to a large extent the performance of the whole RPE scheme even with a decimation factor of 2.
According to the invention the joint RPE/extra pulses optimisation is performed for each frame and works as follows. First, a normal un-quantised RPE signal is computed [1], and the positions of the K largest-magnitude pulses (K being the number of extra pulses) are selected as the extra pulse locations. The RPE signal is then quantised (3 levels) and a joint optimum computation of the gains for the RPE signal and each of the extra pulses is performed. This procedure is repeated for each possible RPE offset, and the solution producing the lowest norm of the reconstruction error is selected. The excitation signal x therefore consists of two partial excitations: a conventional RPE excitation signal xRPE and a second partial excitation signal consisting of a sum of delta functions gkδk for k=1, . . . , K, where the delta function is defined as a signal of all zeros with an amplitude equal to 1 at one specific time instant only and gk is its associated gain.
In
In
An efficient algorithm for calculating the two partial excitation signals in accordance with the block 11 ‘Residual Modelling’ from
For every offset j do
    Compute optimum un-quantised RPE amplitudes => A(j)
    Select positions of the K largest-magnitude pulses
    Generate K partial excitation signals => δk(j), k=1, . . . , K
    Quantise A(j) => Aq(j)
    Generate partial excitation signal from Aq(j) => xRPE(j)
    Compute optimum gains => gx(j), g1(j), . . . , gK(j)
    Compose total excitation => x(j) = gx(j)xRPE(j) + g1(j)δ1(j) + . . . + gK(j)δK(j)
    Compute norm of reconstruction error for current offset j => e(j)
end
Select x(j) with minimum norm => xopt
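The loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of [1]: the synthesis filter is represented by a truncated impulse response `h` with no memory from earlier frames, no perceptual weighting is applied, the un-quantised amplitudes and the gains are both obtained by ordinary least squares, and the 3-level quantiser (scale by the peak, round to -1/0/1) is an assumed rule. The names `conv_matrix` and `encode_frame` are hypothetical.

```python
import numpy as np

def conv_matrix(h, n):
    """Lower-triangular Toeplitz matrix with H @ x == np.convolve(h, x)[:n]."""
    H = np.zeros((n, n))
    for i, hi in enumerate(h[:n]):
        H += hi * np.eye(n, k=-i)
    return H

def encode_frame(target, h, decimation=2, num_extra=2):
    """Return (excitation, error_norm, offset) for one frame.

    target: signal that the synthesised excitation should approximate.
    h:      impulse response of the synthesis filter (assumed, truncated).
    """
    n = len(target)
    H = conv_matrix(h, n)
    best = None
    for j in range(decimation):                      # every RPE offset j
        grid = np.arange(j, n, decimation)
        # Un-quantised RPE amplitudes A(j): least squares over the grid columns.
        A, *_ = np.linalg.lstsq(H[:, grid], target, rcond=None)
        # The K largest-magnitude pulses give the extra-pulse positions.
        top = grid[np.argsort(-np.abs(A))[:num_extra]]
        # Assumed 3-level quantiser: scale by the peak, round to -1, 0 or 1.
        peak = np.max(np.abs(A))
        Aq = np.round(A / peak) if peak > 0 else np.zeros_like(A)
        x_rpe = np.zeros(n)
        x_rpe[grid] = Aq
        # One unit delta per extra pulse, as columns of an n-by-K matrix.
        deltas = np.eye(n)[:, top]
        # Joint optimum gains gx, g1..gK on the synthesised components.
        basis = np.column_stack([H @ x_rpe, H @ deltas])
        gains, *_ = np.linalg.lstsq(basis, target, rcond=None)
        x = gains[0] * x_rpe + deltas @ gains[1:]
        err = np.linalg.norm(target - H @ x)         # e(j)
        if best is None or err < best[1]:
            best = (x, err, j)
    return best
```

Because the gains are fitted jointly, adding extra pulses can only lower or preserve the error norm for a given offset, which is the property the combined excitation relies on.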
The computation of the optimum RPE un-quantised amplitudes is done according to [1]. The calculation of the optimum gains is performed by solving the following linear equation system:
where sx(j) denotes the synthesised signal approximation component due to the RPE excitation (i.e. the convolution of x(j) with the impulse response of the synthesis filter), sδ
Notice that this procedure still conducts a joint, albeit sub-optimal, optimisation of the location and amplitude of the RPE signal and the extra pulses.
In order to design the optimum combined RPE/extra pulses signal, an exhaustive calculation, e.g. as above, is required. The very high complexity of this procedure motivates the need for simpler strategies to compute the joint RPE/extra pulses excitation.
Thus, in a preferred embodiment of the invention the extra pulses are restricted to be on the RPE grid, i.e. to be coincident with the RPE pulses. This means that the extra RPE pulses are not necessarily strictly coincident with the residual pulses that they model but are offset to the next or nearest RPE pulse grid position. This approach has two important advantages: The complexity of the encoder is drastically reduced, and the bit rate is reduced because the number of bits spent in encoding the positions of the extra pulses is reduced.
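The grid restriction and its effect on position coding can be sketched as follows. The helper names `snap_to_grid` and `position_bits` are hypothetical, the frame length of 220 samples (roughly 5 ms at 44.1 kHz) is an assumption, and fixed-length coding of the position index is assumed purely for illustration.

```python
import math

def snap_to_grid(position, offset, decimation, frame_len):
    """Nearest RPE grid index (offset + m * decimation) to a spike position."""
    m = round((position - offset) / decimation)
    m_max = (frame_len - 1 - offset) // decimation   # last grid slot in the frame
    m = min(max(m, 0), m_max)                        # clamp to the frame
    return offset + m * decimation

def position_bits(num_positions):
    """Bits for a fixed-length index into num_positions candidate positions."""
    return math.ceil(math.log2(num_positions))

frame_len = 220  # assumed: about 5 ms at 44.1 kHz
for d in (1, 2, 8):
    slots = math.ceil(frame_len / d)
    print(f"decimation {d}: {slots} positions, {position_bits(slots)} bits each")
```

With these assumptions a free position costs 8 bits, while a position on a decimation-2 or decimation-8 grid costs 7 or 5 bits respectively.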
A consequence of the addition of extra pulses to a conventional RPE or CELP signal is an increase in bit rate. However, the increase in bit rate is rather modest when compared to the total bit rate. As an example, the encoding of a 44,100 samples/s flattened signal using RPE with decimation 2 and 3-level quantisation (1.6 bit/pulse) results in a bit rate of around 40 kb/s. Assuming a 5 ms frame length, the addition of two extra pulses using the described technique raises the rate to around 43.6 kb/s.
It will be seen that in the provided algorithm there is no need for an elaborate search for the positions of the extra pulses. Yet, the results indicate that the extra pulses obtained in this way and being restricted to the RPE grid are effective in removing pulse-like periodicities from the residuals.
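The figures in this example can be checked with a few lines of arithmetic. The function names are hypothetical. The pulse amplitudes alone account for about 35.3 kb/s; the remainder up to "around 40 kb/s" is presumably side information such as gains and offsets, which is an assumption since the text does not itemise it.

```python
def rpe_pulse_bit_rate(sample_rate, decimation, bits_per_pulse):
    """Bit rate (bit/s) spent on the quantised RPE pulse amplitudes alone."""
    return sample_rate / decimation * bits_per_pulse

def bits_per_extra_pulse(rate_without, rate_with, frame_s, pulses_per_frame):
    """Average bits implied per extra pulse by the stated increase in bit rate."""
    frames_per_s = 1.0 / frame_s
    return (rate_with - rate_without) / (frames_per_s * pulses_per_frame)

pulses_only = rpe_pulse_bit_rate(44_100, 2, 1.6)
implied = bits_per_extra_pulse(40_000, 43_600, 0.005, 2)
print(f"RPE pulses alone: {pulses_only / 1000:.1f} kb/s")
print(f"implied cost per extra pulse: {implied:.1f} bits")
```

The stated increase of 3.6 kb/s, spread over two extra pulses in each of 200 frames per second, corresponds to about 9 bits per extra pulse for its position and gain together, i.e. an increase of roughly 9% over the 40 kb/s baseline.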
a-c illustrate the performance of the method according to the invention.
FIGS. 7,8 and 9 and the corresponding description reflect the disclosure in a document with the applicant's internal reference PHNL031414EPP suitably adapted to the present invention.
In
A waveform is generated by block TSS (Transient and Sinusoidal Synthesiser) using the transient and sinusoidal parameters (CT and CS) generated by block TSA and modified by the block BRC. This signal is subtracted from input signal s, resulting in signal r1. In general, signal r1 does not contain substantial sinusoids and transient components.
From signal r1, the spectral envelope is estimated and removed in the block (SE) using a Linear Prediction filter, e.g. based on a tapped-delay-line or a Laguerre filter. The prediction coefficients Ps of the chosen filter are written to an audio bit stream AS for transmittal to a decoder as part of the conventional type noise codes CN. Then the temporal envelope is removed in the block (TE) generating, for example, Line Spectral Pairs (LSP) or Line Spectral Frequencies (LSF) coefficients together with a gain, again as described in the prior art. In any case, the resulting coefficients Pt from the temporal flattening are written to the audio bit stream AS for transmittal to the decoder as part of the conventional type noise codes CN. Typically, the coefficients Ps and Pt require a bit rate budget of 4-5 kbit/s.
Because pulse train coders employ a first spectral flattening stage, the residual modelling stage 11 from
Experiments have shown that residual modelling sometimes results in a loss of brightness in the reconstructed signal when few pulses are used, e.g. RPE with high decimation factors (e.g. D=8) or CELP with sparse codebooks. Adding some low-level noise to the excitation mitigates this problem. In order to determine the level of the noise, a gain (g) is calculated on the basis of, for example, the energy/power difference between a signal generated from the excitation and the residual signal r2/r3. This gain is also transmitted to the decoder as part of the layer L0 information.
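One way to derive such a gain from a power difference is sketched below. It assumes that the added noise has unit variance and is uncorrelated with the excitation, so that the noise power simply adds to the excitation power; the function name `noise_gain` is hypothetical and the rule is an illustration, not the calculation prescribed by the description.

```python
import numpy as np

def noise_gain(residual, excitation_signal):
    """Gain g for unit-variance noise that makes up the power deficit."""
    deficit = np.mean(residual ** 2) - np.mean(excitation_signal ** 2)
    # No noise is added when the excitation already has enough power.
    return float(np.sqrt(max(deficit, 0.0)))
```

Under these assumptions the excitation plus `g` times the noise has, on average, the same power as the residual.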
In the applicant's internal reference PHNL031414EPP
In
The excitation signal r2′ is then fed to a spectral envelope generator (SEG) which according to the codes Ps produces a synthesized noise signal r1′. This signal is added to the synthesized signals produced by the conventional transient and sinusoidal synthesizers to produce the output signal {circumflex over (x)}.
In an alternative embodiment, parameters generated by the excitation generator are used (indicated by the hashed line) in combination with the noise code Pt to shape the temporal envelope of the signal outputted by WNG to create a temporally shaped noise signal.
In
The temporal envelope coefficients (Pt) are then imposed on the excitation signal r3′ by the block TEG to provide the synthesized signal r2′ which is processed as before. As mentioned above, this is advantageous because the excitation signal typically gives rise to some loss in brightness, which, with a properly weighted additional noise sequence, can be counteracted. The weighting can comprise simple amplitude or spectral shaping each based on the gain factor g and CN.
As before, the signal is filtered by, for example, a linear prediction synthesis filter in block SEG (Spectral Envelope Generator), which adds a spectral envelope to the signal. The resulting signal is then added to the synthesized sinusoidal and transient signal as before.
It will be seen that in either
It should be noted that in the embodiment of
The hybrid method described above can operate at a wide variety of bit rates, and at every bit rate it offers a quality comparable to that of state-of-the-art encoders. In that method the base layer, which is made up of the data supplied by the parametric (sinusoidal) encoder, contains the main or basic features of the input signal, and a medium to high quality audio signal is obtained at a very low bit rate.
Similarly to the change in the encoder of
Number | Date | Country | Kind |
---|---|---|---|
04102880.4 | Jun 2004 | EP | regional |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IB05/51972 | 6/15/2005 | WO | 00 | 12/13/2006 |