The invention relates to a method and arrangements for audio signal encoding. In particular the invention relates to a method and an audio signal decoder for forming an audio signal as well as to an audio signal encoder.
In many contemporary communication systems, and especially in mobile communication systems, only limited transmission bandwidth is available for real time audio transmissions, such as speech or music transmissions for example. In order to transmit as many audio channels as possible over a transmission link with restricted bandwidth, such as a radio network for example, there is therefore frequently provision for compressing the audio signals to be transmitted by using real time or quasi real time audio encoding methods and for decompressing them again after transmission. In this document the term audio is also understood to include speech in particular.
With these types of audio encoding method the aim is generally to reduce the volume of data to be transmitted, and thereby the transmission rate, as much as possible without adversely affecting the subjective listening impression or, in the case of voice transmissions, without adversely affecting comprehensibility.
An efficient compression of audio signals is also a significant factor in connection with storage or archiving of audio signals.
Encoding methods have proved to be especially efficient in which an audio signal synthesized by an audio synthesis filter is compared frame by frame over time with the audio signal to be transmitted and the filter parameters are optimized on the basis of this comparison. Such a method of operation is frequently referred to as analysis-by-synthesis. The audio synthesis filter is in this case excited by an excitation signal that is preferably likewise optimized. The filtering is frequently also referred to as formant synthesis. So-called LPC coefficients (LPC: Linear Predictive Coding) and/or parameters that specify a spectral and/or temporal enveloping of the audio signal can be used as filter parameters, for example. The optimized filter parameters as well as the parameters specifying the excitation signal are then transmitted in time frames to the receiver in order to form there, by means of an audio signal decoder provided on the receive side, a synthetic audio signal which is as similar as possible to the original audio signal in respect of the subjective audio impression.
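By way of illustration, the following minimal sketch shows how an excitation signal can drive an all-pole LPC synthesis filter; the filter order, the coefficient values and the frame length are assumptions chosen purely for this example and are not taken from the method described here.

```python
# Minimal sketch of formant synthesis with an all-pole LPC filter (illustrative
# only; coefficients and frame length are assumed, not specified by the text).
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(lpc_coeffs, excitation):
    # Synthesis filter 1/A(z) with A(z) = 1 + a1*z^-1 + ... + ap*z^-p
    a = np.concatenate(([1.0], lpc_coeffs))
    return lfilter([1.0], a, excitation)

# Hypothetical example: a white-noise excitation through a 2nd-order filter
rng = np.random.default_rng(0)
frame = synthesize_frame(np.array([-1.5, 0.7]), rng.standard_normal(80))
```

In an analysis-by-synthesis encoder, such a synthesized frame would be compared with the corresponding frame of the original signal and the parameters adjusted so as to minimize the difference.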
Such an audio encoding method is known from ITU-T Recommendation G.729. By means of the audio encoding method described therein a real time audio signal with a bandwidth of 4 kHz can be reduced to a transmission rate of 8 kbit/s.
In addition efforts are currently being made to synthesize an audio signal to be transmitted using a higher bandwidth in order to improve the audio impression. In the expansion G.729EV of the G.729 recommendation currently under discussion an attempt is being made to expand the audio bandwidth from 4 kHz to 8 kHz.
The transmission bandwidth and audio synthesis quality able to be achieved largely depend on the creation of a suitable excitation signal.
In the case of a bandwidth expansion for which an excitation signal unb(k) already exists in a low subband, e.g. in the frequency range of 50 Hz to 3.4 kHz, a bandwidth-expanding excitation signal can be formed in a high subband, e.g. in the frequency range of 3.4 kHz to 7 kHz, as a spectral copy of the narrowband excitation signal unb(k). (The index k is to be understood here and below as an index of the sampling values of the excitation signal or of other signals.) The copy can be formed in such cases by spectral translation or by spectral mirroring of the narrowband excitation signal unb(k). By such spectral translation or mirroring, however, the spectrum of the excitation signal is anharmonically distorted and/or a significant phase error is introduced into the spectrum, which leads to an audible loss of quality of the audio signal.
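As a minimal sketch of the spectral copy mentioned above (an illustration of this known approach, not of the method proposed below), spectral mirroring can for instance be realized by modulating the narrowband excitation with the factors (−1)^k, which inverts the spectrum within the Nyquist band:

```python
# Spectral mirroring of a narrowband excitation by (-1)^k modulation
# (illustrative sketch of the known spectral-copy approach; frequency components
# are mapped f -> fs/2 - f, so the copied harmonic structure generally lands at
# anharmonic positions, as described above).
import numpy as np

def mirrored_copy(u_nb):
    k = np.arange(len(u_nb))
    return u_nb * (-1.0) ** k
```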
The object of the present invention is to specify a method for forming an audio signal which allows an improvement of the audible quality, with the transmission bandwidth not being increased or only being increased slightly. Another object of the invention is to specify an audio signal decoder for executing the method as well as an audio signal encoder.
This object is achieved by a method, by an audio signal decoder as well as by an audio signal encoder with the features of the claims.
In the inventive method for forming an audio signal, frequency components of the audio signal allotted to a first subband are formed by means of a subband decoder on the basis of fundamental period values each specifying a fundamental period of the audio signal. Frequency components of the audio signal allotted to a second subband are formed by exciting an audio synthesis filter by means of a specific excitation signal specified for the second subband. For creating the specific excitation signal for the second subband, a fundamental period parameter is derived from the fundamental period values by an excitation signal generator. On the basis of the fundamental period parameter, pulses with a pulse shape dependent on the fundamental period parameter are formed by the excitation signal generator at an interval specified by the fundamental period parameter and are mixed with a noise signal.
In this way frequency components of the audio signal occurring in the further, second subband can be synthesized on the basis of fundamental period values which are already provided by a subband decoder specific to the first subband. Since no additional audio parameters are generally required for the creation of the noise signal either, the creation of the excitation signal in general does not require any additional transmission bandwidth. The insertion of the frequency components of the further, second subband enables the audio quality of the audio signal to be significantly improved, especially since a harmonic content determined by the fundamental period values can be reproduced in the second subband.
Advantageous embodiments and developments of the invention are specified in the dependent claims.
In accordance with an advantageous embodiment of the invention the fundamental period parameter can specify the fundamental period of the audio signal to within a fraction of a first sampling distance assigned to the subband decoder. With a fundamental period parameter specified to within a fraction, preferably 1/N with integer N, of the first sampling distance, the pulses can be spaced with a higher accuracy than that of the subband decoder, which allows a harmonic spectrum of the audio signal to be modeled more precisely in the second subband.
Furthermore the pulse shape of the respective pulse can be selected, as a function of a non-integer proportion of the fundamental period parameter in units of the first sampling distance, from different pulse shapes stored in a lookup table. Quite different pulse shapes can be selected from the lookup table by simple retrieval in real time with little outlay in circuitry, processing or computing effort. The pulse shapes to be stored can be optimized in advance in respect of an audio reproduction that is as natural as possible. In particular, the accumulated effect or the accumulated impulse response of a number of filters, decimators and/or modulators can be computed in advance and stored in each case as the appropriately shaped pulse in the lookup table. A decimator is to be understood in this connection as a converter which multiplies a sampling distance of a signal by a decimation factor m by discarding all sampling values except every mth sampling value. A modulator is to be understood as a filter which multiplies individual sampling values of a signal by predetermined individual factors and outputs the respective product.
Furthermore the pulse interval can be determined by an integer proportion of the fundamental period parameter in units of the first sampling distance.
In accordance with a further advantageous embodiment of the invention the pulses can be formed from a predetermined pulse shape, e.g. a square-wave pulse, by pulse values which have a second sampling distance which is smaller by a bandwidth expansion factor than the first sampling distance. The time interval between the pulses can then be determined in units of the second sampling distance by the fundamental period parameter multiplied by the bandwidth expansion factor. The inverse N of that fraction 1/N which corresponds to the accuracy of the fundamental period parameter in units of the first sampling distance can preferably be selected as the bandwidth expansion factor.
Preferably the pulses can be shaped by a pulse-shaping filter with filter coefficients predetermined at the second sampling distance.
Furthermore the pulses can be filtered, before or after the mixing-in of the noise signal, by at least one highpass, lowpass and/or bandpass filter and/or be decimated by at least one decimator.
In accordance with a further advantageous embodiment of the invention the fundamental period parameter can be derived for each time frame from one or more fundamental period values.
In particular the fundamental period parameter can be derived in such cases by a fluctuation-compensating, preferably non-linear linking of fundamental period values of a number of time frames. This prevents fluctuations or jumps of the fundamental period values, which can result for example from incorrect measurements of a basic audio frequency caused by interference noise, from having a disadvantageous effect on the fundamental period parameter.
In this context a relative deviation of a current fundamental period value from an earlier fundamental period value or from a variable derived therefrom can be determined and attenuated within the framework of the derivation of the fundamental period parameter.
In accordance with a further advantageous embodiment of the invention a mixing ratio between the pulses and the noise signal is determined by at least one mixing parameter. This can be derived on a time frame basis from a signal level relationship existing in a subband decoder between a tonal and an atonal audio signal proportion of the first subband. In this way level parameters present in the subband decoder relating to a harmonics-to-noise ratio in the first subband can be used for forming the audio signal components in the second subband.
Furthermore, within the framework of deriving the mixing parameter, the signal level ratio can be converted such that for a predominance of the atonal audio signal proportion the tonal audio signal proportion is reduced further. Since with natural audio sources an atonal audio signal proportion increasingly predominates in higher frequency bands, especially above 6 kHz, the reproduction quality can generally be improved by such a reduction.
Advantageous exemplary embodiments of the invention are explained in greater detail below on the basis of the drawing.
The figures show the following schematic diagrams:
a) filter coefficients of a pulse-shaping filter,
b) a power spectral density of these filter coefficients.
In the low subband the supplied audio data AD is decoded by a lowband decoder LBD specific to the low subband, i.e. a decoder with a bandwidth essentially comprising only the low subband. For this purpose, subsidiary information specific to the low subband contained in the audio data AD, namely atonal mixing parameters gFIX, tonal mixing parameters gLTP as well as fundamental period values λLTP, is evaluated in particular. In this case the lowband decoder, e.g. a speech codec in accordance with ITU-T Recommendation G.729, creates a narrowband audio signal NAS in the frequency range f=0-4 kHz with a sampling rate fs=8 kHz.
In the high subband a synthetic excitation signal u(k) is formed by a highband excitation signal generator HBG on the basis of the subsidiary information gFIX, gLTP and λLTP extracted for each time frame by the lowband decoder LBD. The variable k refers here and below to an index by which digital sampling values of the excitation signal and other signals are indexed. The excitation signal u(k) is fed from the excitation signal generator to an audio synthesis filter ASYN which is excited by this signal to generate a synthetic highband audio signal HAS in the frequency range f=4-8 kHz. The highband audio signal HAS is combined with the narrowband audio signal NAS in order finally to create and output the broadband synthetic audio signal SAS in the frequency range f=0-8 kHz.
An audio signal encoder can also be realized in a simple manner by means of the audio signal decoder. For this purpose the synthesized audio signal SAS is to be directed to a comparison device (not shown) which compares the synthesized audio signal SAS with an audio signal to be encoded. By variation of the audio data AD and especially of subsidiary information gFIX, gLTP and λLTP, the synthesized audio signal SAS is then matched to the audio signal to be encoded.
The invention can advantageously be used for general audio encoding and for subband audio synthesis and also for artificial bandwidth expansion of audio signals. The latter can in this case be interpreted as a special case of a subband audio synthesis in which the information about a specific subband is used to reconstruct or to estimate missing frequency components of another subband.
The application options given here are based on a suitably formed excitation signal u(k). The excitation signal u(k), which represents a spectral fine structure of an audio signal, can be further processed by the audio synthesis filter ASYN in different ways, e.g. by shaping of its time and/or frequency characteristic.
So that a synthetically formed excitation signal u(k) matches an original excitation signal (not shown) used by a (subband) audio signal encoder, the synthetic excitation signal u(k) should preferably have the following characteristics:
the synthetic excitation signal u(k) should in general exhibit a flat spectrum. With atonal, i.e. unvoiced sounds, the synthetic excitation signal u(k) can be formed for this purpose from white noise.
for tonal, i.e. voiced sounds, the synthetic excitation signal u(k) should have harmonic signal components, i.e. spectral peaks in integer multiples of a basic audio frequency F0.
In practice purely tonal or purely atonal audio signals hardly ever occur. Instead real audio signals as a rule contain a mixture of tonal and atonal components. The synthetic excitation signal u(k) is preferably to be created such that a harmonics-to-noise ratio, i.e. an energy or intensity ratio of the tonal and atonal components of the original audio signal is reproduced as accurately as possible.
During tonal sounds a wideband noise component is generally added to the harmonics of the basic audio frequency F0. This noise component is frequently dominant, especially at higher frequencies above 6 kHz.
The formation of an excitation signal u(k) suitable for audio encoding, for subband-audio synthesis as well as for artificial bandwidth expansion of audio signals is explained in greater detail below.
The excitation signal u(k) is created as a subband signal sampled at a predetermined sampling rate of e.g. 16 kHz or 8 kHz. This subband signal u(k) represents the frequency components of the high subband of 4-8 kHz, through which the bandwidth of the narrowband audio signal NAS is to be expanded. The narrowband audio signal NAS extends over a frequency range of 0-4 kHz and is sampled at a sampling rate of 8 kHz.
The excitation signal u(k) formed excites the audio synthesis filter ASYN and is shaped by this into the highband audio signal HAS. The synthetic, wideband audio signal SAS is finally created by a combination of the shaped highband audio signal HAS and the narrowband audio signal NAS with a higher sampling rate of 16 kHz for example.
The formation of the excitation signal u(k) is based on an audio creation model in which tonal, i.e. voiced sounds are excited by a sequence of pulses and atonal, i.e. unvoiced sounds are excited preferably by white noise. Various modifications are provided to allow mixed forms of excitation, through which an improved audible impression can be achieved.
The creation of the tonal components of the excitation signal u(k) is based on two audio parameters of the audio creation model, namely the basic audio frequency F0 and the energy or intensity ratio γ between the tonal and the atonal audio components in the low subband. The latter is frequently also referred to as the “harmonics-to-noise ratio”, abbreviated to HNR. The basic audio frequency F0 is also referred to in technical parlance as the “fundamental speech frequency”.
The two audio parameters F0 and γ can be extracted on reception of a transmitted audio signal, preferably (e.g. in the case of a bandwidth expansion) directly from the low frequency band of the audio signal or (e.g. in the case of a subband audio synthesis) from the lowband decoder of an underlying lowband audio codec, in which such audio parameters are available as a rule.
The fundamental speech frequency F0 is frequently represented by a fundamental period value which is given by the sampling rate divided by the fundamental speech frequency F0. The fundamental period value is frequently also referred to as the "pitch lag". The fundamental period value is an audio parameter which is generally transferred by standard audio codecs, such as one in accordance with the G.729 Recommendation for example, for the purposes of a so-called "long-term prediction", abbreviated to LTP. If such a standard audio codec is used for the low subband, the fundamental speech frequency F0 can be determined or estimated on the basis of the LTP audio parameters provided by this audio codec.
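As an illustrative numerical example (the values are assumed here for illustration only): at a sampling rate of fs = 8 kHz a fundamental speech frequency of F0 = 200 Hz corresponds to a fundamental period value of

$$\lambda = \frac{f_s}{F_0} = \frac{8000\ \mathrm{Hz}}{200\ \mathrm{Hz}} = 40$$

sampling intervals.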
With many standard audio codecs, such as one in accordance with the G.729 Recommendation for example, an LTP fundamental period value is transferred with a temporal resolution, i.e. accuracy, which amounts to a fraction 1/N of the sampling distance used by this audio codec. With an audio codec in accordance with the G.729 Recommendation the LTP fundamental period value is provided with an accuracy of ⅓ of the sampling distance. In units of this sampling distance the fundamental period value can thus also assume non-integer values. Such accuracy can be achieved by the relevant audio encoder, for example by a sequence of "open-loop" and "closed-loop" searches. The audio encoder attempts in this case to find that fundamental period value for which the intensity or energy of an LTP residual signal is minimized. An LTP fundamental period value determined in this way can however deviate, especially with loud ambient noises, from the fundamental period value corresponding to the actual fundamental speech frequency F0 of the tonal audio components and can thus adversely affect an exact reproduction of these tonal audio components. Period doubling errors and period halving errors occur as typical deviations. This means that the frequency corresponding to the deviating LTP fundamental period value is half or double the actual fundamental speech frequency F0 of the tonal audio components.
When such LTP fundamental period values are used for synthesis of the tonal audio components in the high subband, these types of large frequency deviations should be avoided. To minimize the effects of typical period doubling and period halving errors, the post-processing technique explained below can be used within the framework of the invention:
Let an LTP fundamental period value currently extracted from the lowband decoder LBD be referred to as λLTP(μ), with μ representing an index of a respectively processed time frame or subframe. The fundamental period value λLTP(μ) is given in units of the sampling distance of the lowband decoder LBD and can also assume non-integer values.
From the ratio between the current fundamental period value λLTP(μ) and a filtered fundamental period value λpost(μ−1) of the previous frame, an integer factor f is initially calculated by rounding.
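A plausible form of this calculation, assumed here since the surrounding text describes it only verbally, is

$$f = \operatorname{round}\!\left(\frac{\lambda_{\mathrm{LTP}}(\mu)}{\lambda_{\mathrm{post}}(\mu-1)}\right).$$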
The round function in this case maps its argument to the closest integer.
A decision as to whether the current fundamental period value λLTP(μ) is to be modified is made as a function of a relative error.
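One plausible definition of this relative error, assumed here for illustration, measures the deviation of the current value from the nearest integer multiple of the previous filtered value:

$$e(\mu) = \frac{\bigl|\lambda_{\mathrm{LTP}}(\mu) - f\,\lambda_{\mathrm{post}}(\mu-1)\bigr|}{f\,\lambda_{\mathrm{post}}(\mu-1)}.$$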
If the relative error lies below a predetermined threshold value, of 1/10 for example, it is assumed that the current fundamental period value λLTP(μ) is the result of an incipient phase of period doubling or period halving errors. In such a case the current fundamental period value λLTP(μ) is corrected or filtered by division by the factor f in such a way that the filtered fundamental period values λpost(μ) behave essentially consistently over a number of time frames μ. It proves advantageous to determine the filtered fundamental period value λpost(μ) from the quotient λLTP(μ)/f by a rounding operation.
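A form of this rounding operation that is consistent with the following sentence, and is assumed here for illustration, is

$$\lambda_{\mathrm{post}}(\mu) = \frac{1}{N}\,\operatorname{round}\!\left(N\,\frac{\lambda_{\mathrm{LTP}}(\mu)}{f}\right).$$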
By multiplication with the factor N, e.g. N=3, in the argument of the round function, the resulting fundamental period value λpost(μ) is again exact to within the fraction 1/N of the sampling distance of the lowband decoder LBD.
Finally a moving average of the fundamental period values λpost(μ) is formed for further smoothing. The moving average corresponds to a type of lowpass filtering. With a moving average of, for example, two consecutive fundamental period values λpost(μ), a fundamental period parameter λp(μ) is produced, on the basis of which the excitation signal u(k) is derived for the high subband. On the basis of the averaging of two values the fundamental period parameter λp(μ) has a resolution that is higher by the factor two, which corresponds to a fraction 1/(2N) of the sampling distance of the lowband decoder LBD.
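Written out, the averaging over two consecutive values corresponds to

$$\lambda_p(\mu) = \tfrac{1}{2}\bigl(\lambda_{\mathrm{post}}(\mu) + \lambda_{\mathrm{post}}(\mu-1)\bigr),$$

an explicit form reconstructed here from the verbal description above.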
The non-linear filtering procedure explained above enables most period doubling errors, or more generally period multiplying errors, to be avoided. This results in a significant improvement in the reproduction quality.
An explanation is given below as to how tonal mixing parameters gv(μ) and atonal mixing parameters guv(μ) are derived for mixing corresponding tonal and atonal components of the excitation signal u(k) in the high subband for each time frame from mixing parameters gLTP(μ) and gFIX(μ) of the lowband decoder LBD specific for the low subband. It is assumed in this case that the lowband decoder LBD is a so-called CELP (CELP: Codebook Excited Linear Prediction) decoder, which features a so-called adaptive or LTP codebook and a so-called fixed codebook.
In real audio signals tonal sounds hardly ever occur without the contribution of atonal signal components. To estimate an energy or intensity ratio between tonal and atonal signal components it is assumed for the purposes of a model that the adaptive codebook only contributes tonal components in the low subband and that the fixed codebook only contributes atonal components in the low subband. It is further assumed that these two contributions are orthogonal to each other.
On the basis of these assumptions the intensity ratio between tonal and atonal signal components can be reconstructed from the mixing parameters gLTP and gFIX of the lowband decoder LBD. Both mixing parameters gLTP, gFIX can be extracted for each time frame from the lowband decoder LBD. For each time frame or subframe (indexed by μ) an instantaneous intensity ratio between the contributions of the adaptive and of the fixed code book, i.e. the harmonics-to-noise ratio γ can be determined by dividing the energy contributions of the adaptive and fixed codebook.
While the mixing parameter gLTP(μ) specifies a gain factor for the signals of the adaptive codebook, the mixing parameter gFIX(μ) specifies a gain factor for the signals of the fixed codebook. If the codebook vectors output from the adaptive codebook are designated with xLTP(μ) and the codebook vectors output from the fixed codebook with xFIX(μ), the harmonics-to-noise ratio can be expressed as the ratio of the energies of the scaled adaptive and fixed codebook contributions.
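Written out with the quantities introduced above, this energy ratio reads

$$\gamma(\mu) = \frac{g_{\mathrm{LTP}}^2(\mu)\,\lVert x_{\mathrm{LTP}}(\mu)\rVert^2}{g_{\mathrm{FIX}}^2(\mu)\,\lVert x_{\mathrm{FIX}}(\mu)\rVert^2},$$

where the squared norms denote the energies of the codebook vectors within the respective time frame; the explicit formula is a reconstruction from the verbal description.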
For improved modeling of the atonal audio components in the high subband, the harmonics-to-noise ratio γ derived from the low subband is converted by a type of Wiener filtering.
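A Wiener-like weighting with the behavior described in the next sentence (a small γ is strongly attenuated, a large γ is almost unchanged) is assumed here by way of example:

$$\gamma_{\mathrm{post}} = \gamma\cdot\frac{\gamma}{1+\gamma} = \frac{\gamma^2}{1+\gamma}.$$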
Through this "Wiener" filtering a small γ (atonal audio segment) is reduced further, while large values of γ (tonally dominated audio segment) are hardly changed. Natural audio signals are better approximated by such a reduction.
Finally, from the filtered harmonics-to-noise ratio γpost, gain factors, i.e. mixing parameters gv and guv for the tonal and atonal components of the excitation signal u(k) in the high subband, can be determined.
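Given the normalization described below, namely that the sum of the squares of gv and guv is essentially constant (set to one here as an assumption), a consistent choice would be

$$g_v = \sqrt{\frac{\gamma_{\mathrm{post}}}{1+\gamma_{\mathrm{post}}}}, \qquad g_{uv} = \sqrt{\frac{1}{1+\gamma_{\mathrm{post}}}},$$

so that $g_v^2 + g_{uv}^2 = 1$ and $g_v^2/g_{uv}^2 = \gamma_{\mathrm{post}}$.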
Since in practice purely tonal or purely atonal audio signals hardly ever occur, the two mixing parameters gv(μ) and guv(μ) as a rule both simultaneously have a non-vanishing value. The calculation specifications given above ensure that the sum of the squares of the mixing parameters gv and guv, i.e. a total energy of the mixed excitation signal u(k), is essentially constant.
The creation of the excitation signal u(k) on the basis of the audio parameters gv, guv and λp derived from the lowband decoder LBD is explained in greater detail below using the example of two embodiment variants of the excitation signal generator HBG. It is assumed here for reasons of clarity that the accuracy of the fundamental period values is given in units of the sampling distance of the lowband decoder LBD by 1/N with N=3. The remarks below are naturally able to be easily generalized to apply to any given value of N.
A first embodiment variant of the excitation signal generator HBG is shown schematically in the drawing.
The audio parameters gv, guv and λp are derived and adapted for each time frame in a continuous sequence from audio parameters of the lowband decoder LBD or by means of a suitable audio parameter extraction block. The filter operations are designed for a fractional fundamental period parameter λp with an accuracy of 1/(2N), here equal to ⅙, in units of the sampling distance of the lowband decoder LBD and for a target bandwidth which corresponds to the bandwidth of the lowband decoder LBD.
Since the lowband decoder LBD, in accordance with its bandwidth of 0-4 kHz, uses a sampling rate of 8 kHz, and since audio components of 4-8 kHz, i.e. with a bandwidth of 4 kHz, are to be created by means of the excitation signal u(k), a sampling rate of at least 8 kHz is to be provided for the pulse generator PG1. In accordance with the temporal resolution of the fundamental period parameter λp, which in the present exemplary embodiment is higher by the factor 2N=6, a sampling rate of fs=2*N*8 kHz=6*8 kHz=48 kHz is however to be provided both for the pulse generator PG1 and for the noise generator NOISE.
For creating the tonal proportion of the excitation signal the fundamental period parameter λp is multiplied by the factor 2N=6 and the product 6*λp is fed to the square-wave pulse generator SPG. The square-wave pulse generator SPG consequently creates individual square-wave pulses at an interval given by 6*λp in units of the sampling distance 1/48000 s of the square-wave pulse generator SPG. The individual square-wave pulses have an amplitude of √(6*λp), so that the average energy of a long pulse sequence is essentially constantly equal to 1.
The square-wave pulses created by the square-wave pulse generator SPG are multiplied by the "tonal" mixing parameter gv and fed to the pulse-shaping filter SF. In the pulse-shaping filter SF the square-wave pulses are "smudged" in time to a certain extent by convolution or correlation with the filter coefficients p(k). This filtering enables the so-called crest factor, i.e. a ratio of peak to average sample values, to be significantly reduced and the audible quality of the synthesized audio signal SAS to be significantly improved. In addition the square-wave pulses can be spectrally shaped by the pulse-shaping filter SF in an advantageous manner. Preferably the pulse-shaping filter SF can exhibit a bandpass characteristic for this purpose with a transition region around 4 kHz and an essentially even gain increase in the direction of higher and lower frequencies. In this way it can be achieved that higher frequencies of the excitation signal u(k) exhibit fewer harmonic components and thus that the noise proportion increases as the frequency increases.
A typical choice of the filter coefficients p(k) is shown schematically in the drawing, together with the associated power spectral density.
As illustrated in the drawing, the noise signal created by the noise generator NOISE and multiplied by the "atonal" mixing parameter guv is added to the shaped pulses; the result forms a summation signal.
Up to this method step an increased sampling rate of fs=48 kHz has been used. The remaining processing blocks shown in the drawing serve to convert the summation signal to the target sampling rate of the excitation signal u(k).
For this purpose the summation signal is first filtered by the lowpass LP and the filtered signal is then converted by the decimator D3 from a 48 kHz sampling rate to a sampling rate of fs=16 kHz. The converted signal is subsequently fed to the highpass HP which feeds the highpass-filtered signal to the decimator D2, which finally creates from the signal supplied at the 16 kHz sampling rate the excitation signal u(k) with the target sampling rate of fs=8 kHz.
The created excitation signal u(k) contains the frequency components required for the bandwidth extension. These are present however as a spectrum mirrored around the frequency of 4 kHz. To invert the spectrum, the excitation signal u(k) can be modulated with modulation factors (−1)k.
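The processing chain of this first embodiment variant can be summarized in the following sketch. The filter lengths, cut-off frequencies, frame length and noise seed are assumptions chosen for illustration; in particular the placeholder filters merely stand in for the pulse-shaping filter SF, the lowpass LP and the highpass HP, and the carrying of the pulse phase across frame boundaries is omitted for brevity.

```python
# Illustrative sketch of the first embodiment variant (assumed parameter values;
# placeholder filters stand in for SF, LP and HP described in the text).
import numpy as np
from scipy.signal import firwin, lfilter

FS48 = 48000  # internal sampling rate 2*N*8 kHz with N = 3

def hb_excitation(lam_p, g_v, g_uv, n_out=80, seed=0):
    """One frame of the high-band excitation u(k) at the target rate of 8 kHz."""
    n48 = 6 * n_out
    # square-wave pulses at spacing 6*lam_p with amplitude sqrt(6*lam_p)
    spacing = 6.0 * lam_p
    pulses = np.zeros(n48)
    pos = 0.0
    while int(round(pos)) < n48:
        pulses[int(round(pos))] = np.sqrt(spacing)
        pos += spacing
    # tonal gain and pulse shaping (placeholder band-pass around 4 kHz)
    p = firwin(129, [3000, 21000], fs=FS48, pass_zero=False)
    tonal = lfilter(p, [1.0], g_v * pulses)
    # mix in white noise weighted with the atonal gain -> summation signal
    rng = np.random.default_rng(seed)
    summation = tonal + g_uv * rng.standard_normal(n48)
    # lowpass LP, decimation 48 kHz -> 16 kHz
    x16 = lfilter(firwin(65, 7000, fs=FS48), [1.0], summation)[::3]
    # highpass HP (keep 4-8 kHz), decimation 16 kHz -> 8 kHz
    u = lfilter(firwin(65, 4000, fs=16000, pass_zero=False), [1.0], x16)[::2]
    # undo the spectral mirroring around 4 kHz
    return u * (-1.0) ** np.arange(len(u))
```

A call such as hb_excitation(40.5, 0.8, 0.6) would, under these assumptions, return 80 samples of u(k) for one hypothetical 10 ms frame.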
Since the components of the audio signal decoder in accordance with the first embodiment variant operate internally at the increased sampling rate of fs=48 kHz, a considerable processing or computing effort is required for the filtering and decimation operations. This effort can be largely avoided if the accumulated effect of these operations is computed in advance and stored.
A second embodiment variant of the excitation signal generator HBG designed in this way is shown schematically in the drawing.
The excitation signal generator is supplied with the audio parameters gv, guv and λp for each time frame in a continuous sequence. The derivation of the audio parameters gv, guv and λp has already been explained above. Let the fractional fundamental period parameter λp, as above, be specified with an accuracy of 1/(2N), here equal to ⅙, in units of the sampling distance of the lowband decoder LBD.
For the tonal components of the excitation signal u(k) the impulse response of all filtering, decimation and modulation operations of the first embodiment variant is computed in advance for each possible fractional proportion of the fundamental period parameter λp and is stored in each case as a pulse shape vj(k) in a lookup table LOOKUP.
For operation of the pulse generator PG2 the lookup table LOOKUP is supplied with the fractional proportion λp−└λp┘ of the respective fundamental period parameter λp. The brackets └ ┘ in this case designate the integer proportion of a rational or real number. On the basis of the supplied fractional proportion λp−└λp┘ a pulse shape is selected from the stored pulse shapes vj(k) and a correspondingly shaped pulse is output from the lookup table LOOKUP. In the present exemplary embodiment λp−└λp┘ can assume the values 0, ⅙, 2/6, 3/6, 4/6 and ⅚. Preferably those pulse shapes vj(k) are selected of which the index j corresponds to the numerator of the relevant fraction.
Each of the stored pulse shapes vj(k) corresponds to an impulse response of the filtering, decimation and modulation chain of the first embodiment variant for the relevant fractional offset.
As illustrated in the drawing, the correspondingly shaped pulses output from the lookup table LOOKUP are multiplied by the "tonal" mixing parameter gv and are positioned by a pulse positioning device PP at a time interval which is given by the integer proportion └λp┘ of the fundamental period parameter λp.
Finally the noise signal of the noise generator NOISE multiplied by the “atonal” mixing parameter guv is added to the pulse output by the pulse positioning device PP, in order to obtain the excitation signal u(k).
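A sketch of this table-based second variant is given below. The construction of the table from a single assumed overall impulse response, the amplitude normalization and the frame handling are simplifications introduced for illustration and do not reproduce the actual pre-computed pulse shapes vj(k).

```python
# Illustrative sketch of the second embodiment variant with a lookup table of
# pre-computed pulse shapes (table construction and normalization are assumed).
import numpy as np

def build_pulse_table(h48, n2=6):
    """Pre-compute one 8 kHz pulse shape v_j(k) per fractional offset j/n2.
    h48: assumed overall impulse response of the 48 kHz filtering, decimation
    and modulation chain of the first variant, given as a single sequence."""
    table = []
    for j in range(n2):
        shifted = np.concatenate((np.zeros(j), h48))  # offset by j/6 sampling distances
        table.append(shifted[::6].copy())             # decimate to the 8 kHz target rate
    return table

def hb_excitation_lookup(lam_p, g_v, g_uv, table, n_out=80, seed=0):
    """One frame of u(k) at 8 kHz using the pre-computed pulse shapes."""
    frac = lam_p - np.floor(lam_p)                     # fractional proportion of lambda_p
    j = int(round(frac * len(table))) % len(table)     # select pulse shape v_j(k)
    v = table[j]
    u = np.zeros(n_out)
    pos, step = 0, max(1, int(np.floor(lam_p)))        # integer pulse spacing
    while pos < n_out:
        seg = v[: n_out - pos]
        u[pos:pos + len(seg)] += np.sqrt(lam_p) * seg  # assumed amplitude normalization
        pos += step
    rng = np.random.default_rng(seed)
    return g_v * u + g_uv * rng.standard_normal(n_out)
```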
The embodiment variant shown in the drawing manages without filtering, decimation and modulation operations at an increased sampling rate and can therefore be realized with considerably less processing and computing effort.
This application is the US National Stage of International Application No. PCT/EP2006/000812, filed Jan. 31, 2006, and claims the benefit thereof, which is incorporated by reference herein in its entirety.
Filing Document: PCT/EP2006/000812
Filing Date: Jan. 31, 2006
Country: WO
Kind: 00
371(c) Date: Jul. 29, 2008