The present invention relates to source coding and particularly to audio source coding, in which an audio signal is processed by two different audio coders having different coding algorithms.
In the context of low bitrate audio and speech coding technology, several different coding techniques have traditionally been employed in order to achieve low bitrate coding of such signals with best possible subjective quality at a given bitrate. Coders for general music/sound signals aim at optimizing the subjective quality by shaping a spectral (and temporal) shape of the quantization error according to a masking threshold curve which is estimated from the input signal by means of a perceptual model (“perceptual audio coding”). On the other hand, coding of speech at very low bitrates has been shown to work very efficiently when it is based on a production model of human speech, i.e. employing Linear Predictive Coding (LPC) to model the resonant effects of the human vocal tract together with an efficient coding of the residual excitation signal.
As a consequence of these two different approaches, general audio coders, like MPEG-1 Layer 3 (MPEG=Moving Picture Experts Group) or MPEG-2/4 Advanced Audio Coding (AAC), usually do not perform as well for speech signals at very low data rates as dedicated LPC-based speech coders, because they do not exploit a speech source model. Conversely, LPC-based speech coders usually do not achieve convincing results when applied to general music signals because of their inability to flexibly shape the spectral envelope of the coding distortion according to a masking threshold curve. In the following, concepts are described which combine the advantages of both LPC-based coding and perceptual audio coding into a single framework and thus describe unified audio coding that is efficient for both general audio and speech signals.
Traditionally, perceptual audio coders use a filterbank-based approach to efficiently code audio signals and shape the quantization distortion according to an estimate of the masking curve.
a shows the basic block diagram of a monophonic perceptual coding system. An analysis filterbank 1600 is used to map the time domain samples into subsampled spectral components. Dependent on the number of spectral components, the system is also referred to as a subband coder (small number of subbands, e.g. 32) or a transform coder (large number of frequency lines, e.g. 512). A perceptual (“psychoacoustic”) model 1602 is used to estimate the actual time dependent masking threshold. The spectral (“subband” or “frequency domain”) components are quantized and coded 1604 in such a way that the quantization noise is hidden under the actual transmitted signal, and is not perceptible after decoding. This is achieved by varying the granularity of quantization of the spectral values over time and frequency.
The quantized and entropy-encoded spectral coefficients or subband values are, together with side information, input into a bitstream formatter 1606, which provides an encoded audio signal which is suitable for being transmitted or stored. The output bitstream of block 1606 can be transmitted via the Internet or can be stored on any machine readable data carrier.
On the decoder-side, a decoder input interface 1610 receives the encoded bitstream. Block 1610 separates entropy-encoded and quantized spectral/subband values from side information. The encoded spectral values are input into an entropy-decoder such as a Huffman decoder, which is positioned between 1610 and 1620. The outputs of this entropy decoder are quantized spectral values. These quantized spectral values are input into a requantizer, which performs an “inverse” quantization as indicated at 1620 in
Traditionally, efficient speech coding has been based on Linear Predictive Coding (LPC) to model the resonant effects of the human vocal tract together with an efficient coding of the residual excitation signal. Both LPC and excitation parameters are transmitted from the encoder to the decoder. This principle is illustrated in
a indicates the encoder-side of an encoding/decoding system based on linear predictive coding. The speech input is input into an LPC analyzer 1701, which provides, at its output, LPC filter coefficients. Based on these LPC filter coefficients, an LPC filter 1703 is adjusted. The LPC filter outputs a spectrally whitened audio signal, which is also termed “prediction error signal”. This spectrally whitened audio signal is input into a residual/excitation coder 1705, which generates excitation parameters. Thus, the speech input is encoded into excitation parameters on the one hand, and LPC coefficients on the other hand.
On the decoder-side illustrated in
Over time, many methods have been proposed with respect to an efficient and perceptually convincing representation of the residual (excitation) signal, such as Multi-Pulse Excitation (MPE), Regular Pulse Excitation (RPE), and Code-Excited Linear Prediction (CELP).
Linear Predictive Coding attempts to produce an estimate of the current sample value of a sequence based on the observation of a certain number of past values as a linear combination of the past observations. In order to reduce redundancy in the input signal, the encoder LPC filter “whitens” the input signal in its spectral envelope, i.e. it is a model of the inverse of the signal's spectral envelope. Conversely, the decoder LPC synthesis filter is a model of the signal's spectral envelope. Specifically, the well-known auto-regressive (AR) linear predictive analysis is known to model the signal's spectral envelope by means of an all-pole approximation.
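The linear prediction described above can be illustrated by a minimal sketch, assuming the autocorrelation method with the Levinson-Durbin recursion. The function names, the sign convention of the predictor and the test signal are assumptions chosen for illustration and are not taken from any particular codec.

```python
def autocorr(x, order):
    """Autocorrelation values r[0..order] of a finite sequence."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for the predictor coefficients.

    Assumed convention: x_hat[n] = sum_k a[k] * x[n - 1 - k].
    Returns (a, err), where err is the remaining prediction error energy.
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

def residual(x, a):
    """Prediction error (spectrally whitened signal) of the analysis filter."""
    return [x[n] - sum(a[k] * x[n - 1 - k]
                       for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(x))]
```

For a strongly correlated input, the energy of the residual is considerably smaller than the energy of the input, which is the redundancy reduction referred to above.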
Typically, narrow band speech coders (i.e. speech coders with a sampling rate of 8 kHz) employ an LPC filter with an order between 8 and 12. Due to the nature of the LPC filter, a uniform frequency resolution is effective across the full frequency range. This does not correspond to a perceptual frequency scale.
In order to combine the strengths of traditional LPC/CELP-based coding (best quality for speech signals) and the traditional filterbank-based perceptual audio coding approach (best for music), a combined coding between these architectures has been proposed. In the AMR-WB+ (AMR-WB=Adaptive Multi-Rate WideBand) coder (B. Bessette, R. Lefebvre, R. Salami, “UNIVERSAL SPEECH/AUDIO CODING USING HYBRID ACELP/TCX TECHNIQUES,” Proc. IEEE ICASSP 2005, pp. 301-304, 2005), two alternate coding kernels operate on an LPC residual signal. One is based on ACELP (ACELP=Algebraic Code Excited Linear Prediction) and thus is extremely efficient for coding of speech signals. The other coding kernel is based on TCX (TCX=Transform Coded Excitation), i.e. a filterbank based coding approach resembling the traditional audio coding techniques in order to achieve good quality for music signals. Depending on the characteristics of the input signal, one of the two coding modes is selected for a short period of time to transmit the LPC residual signal. In this way, frames of 80 ms duration can be split into subframes of 40 ms or 20 ms in which a decision between the two coding modes is made.
The AMR-WB+ (AMR-WB+=extended Adaptive Multi-Rate WideBand codec), cf. 3GPP (3GPP=Third Generation Partnership Project) technical specification number 26.290, version 6.3.0, June 2005, can switch between the two essentially different modes ACELP and TCX. In the ACELP mode a time domain signal is coded by algebraic code excitation. In the TCX mode a fast Fourier transform (FFT=fast Fourier transform) is used and the spectral values of the LPC weighted signal (from which the excitation signal is derived at the decoder) are coded based on vector quantization.
The decision as to which mode to use can be taken by encoding and decoding both options and comparing the resulting signal-to-noise ratios (SNR=Signal-to-Noise Ratio).
This case is also called the closed loop decision, as there is a closed control loop which evaluates both coding performances and/or efficiencies, respectively, and then chooses the one with the better SNR and discards the other.
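The closed loop decision can be sketched as follows. This is only an illustration: the two toy quantizers stand in for the real ACELP and TCX branches, and the names snr_db, closed_loop_decision and make_quantizer are hypothetical.

```python
import math

def snr_db(original, decoded):
    """Signal-to-noise ratio in dB between a frame and its decoded version."""
    signal = sum(s * s for s in original)
    noise = sum((s - d) ** 2 for s, d in zip(original, decoded))
    if noise == 0.0:
        return math.inf
    if signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal / noise)

def closed_loop_decision(frame, codecs):
    """Encode and decode the frame with every mode and keep the best SNR.

    codecs maps a mode name to an (encode, decode) pair of callables.
    Returns (mode_name, snr_in_db).
    """
    best = None
    for name, (encode, decode) in codecs.items():
        score = snr_db(frame, decode(encode(frame)))
        if best is None or score > best[1]:
            best = (name, score)
    return best

def make_quantizer(step):
    """Toy stand-in for a coding mode: uniform quantization with a step size."""
    encode = lambda frame: [round(v / step) for v in frame]
    decode = lambda indices: [i * step for i in indices]
    return encode, decode
```

With codecs = {"coarse": make_quantizer(0.5), "fine": make_quantizer(0.1)}, the call closed_loop_decision(frame, codecs) tries both modes on the frame and discards the one with the lower SNR.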
It is well-known that for audio and speech coding applications a block transform without windowing is not feasible. Therefore, for the TCX mode the signal is windowed with a low overlap window with an overlap of ⅛th. This overlapping region is necessary in order to fade out a prior block or frame while fading in the next, for example to suppress artifacts due to uncorrelated quantization noise in consecutive audio frames. This way the overhead compared to critical sampling is kept reasonably low and the decoding necessary for the closed-loop decision reconstructs at least ⅞th of the samples of the current frame.
The AMR-WB+ introduces ⅛th of overhead in a TCX mode, i.e. the number of spectral values to be coded is ⅛th higher than the number of input samples. This provides the disadvantage of an increased data overhead. Moreover, the frequency response of the corresponding band pass filters is disadvantageous, due to the steep overlap region of ⅛th of consecutive frames.
In order to elaborate more on the code overhead and overlap of consecutive frames,
From the sequence of windows the varying overlapping regions can be seen, which overlap by exactly ⅛th of the center part M. The table at the bottom of
When switching from TCX to ACELP the window samples are discarded from the FFT-TCX frame in the overlapping region, as for example indicated at the top of
From the table at the bottom of
It is a significant disadvantage of the AMR-WB+ that an overhead of ⅛th is introduced.
According to an embodiment, an audio encoder adapted for encoding frames of a sampled audio signal to obtain encoded frames, wherein a frame includes a number of time domain audio samples, may have: a predictive coding analysis stage for determining information on coefficients of a synthesis filter and a prediction domain frame based on a frame of audio samples; a time-aliasing introducing transformer for transforming overlapping prediction domain frames to the frequency domain to obtain prediction domain frame spectra, wherein the time-aliasing introducing transformer is adapted for transforming the overlapping prediction domain frames in a critically-sampled way; and a redundancy reducing encoder for encoding the prediction domain frame spectra to obtain the encoded frames based on the coefficients and the encoded prediction domain frame spectra.
According to another embodiment, a method for encoding frames of a sampled audio signal to obtain encoded frames, wherein a frame includes a number of time domain audio samples, may have the steps of: determining information on coefficients for a synthesis filter based on a frame of audio samples; determining a prediction domain frame based on the frame of audio samples; transforming overlapping prediction domain frames to the frequency domain to obtain prediction domain frame spectra in a critically-sampled way introducing time aliasing; and encoding the prediction domain frame spectra to obtain the encoded frames based on the coefficients and the encoded prediction domain frame spectra.
Another embodiment may have a computer program having a program code for performing the above method, when the program code runs on a computer or processor.
According to another embodiment, an audio decoder for decoding encoded frames to obtain frames of a sampled audio signal, wherein a frame includes a number of time domain audio samples, may have: a redundancy retrieving decoder for decoding the encoded frames to obtain an information on coefficients for a synthesis filter and prediction domain frame spectra; an inverse time-aliasing introducing transformer for transforming the prediction domain frame spectra to the time domain to obtain overlapping prediction domain frames, wherein the inverse time-aliasing introducing transformer is adapted for determining overlapping prediction domain frames from consecutive prediction domain frame spectra; an overlap/add combiner for combining overlapping prediction domain frames to obtain a prediction domain frame in a critically-sampled way; and a predictive synthesis stage for determining the frames of audio samples based on the coefficients and the prediction domain frame.
According to another embodiment, a method for decoding encoded frames to obtain frames of a sampled audio signal, wherein a frame includes a number of time domain audio samples, may have the steps of: decoding the encoded frames to obtain an information on coefficients for a synthesis filter and prediction domain frame spectra; transforming the prediction domain frame spectra to the time domain to obtain overlapping prediction domain frames from consecutive prediction domain frame spectra; combining overlapping prediction domain frames to obtain a prediction domain frame in a critically sampled way; and determining the frame based on the coefficients and the prediction domain frame.
Another embodiment may have a computer program product for performing the above method, when the computer program runs on a computer or processor.
Embodiments of the present invention are based on the finding that a more efficient coding can be carried out, if time-aliasing introducing transforms are used, for example, for TCX encoding. Time aliasing introducing transforms can allow achieving critical sampling while still being able to cross-fade between adjacent frames. For example in one embodiment the modified discrete cosine transform (MDCT=Modified Discrete Cosine Transform) is used for transforming overlapping time domain frames to the frequency domain. Since this particular transform produces only N frequency domain samples for 2N time domain samples, critical sampling can be maintained even though the time domain frames may overlap by 50%. At the decoder or the inverse time-aliasing introducing transform an overlap and add stage may be adapted for combining the time aliased overlapping and back transformed time domain samples in a way, that time domain aliasing cancellation (TDAC=Time Domain Aliasing Cancellation) can be carried out.
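The difference between the low overlap windowed FFT and a critically sampled transform can be made concrete with a short piece of arithmetic. The sketch below assumes the ⅛th overlap described above and uses the TCX frame lengths of 256, 512 and 1024 samples mentioned later in this description; the function names are hypothetical.

```python
from fractions import Fraction

def spectral_values_low_overlap(frame_len, overlap=Fraction(1, 8)):
    """Values to be coded when every windowed sample yields one spectral value."""
    return int(frame_len * (1 + overlap))

def spectral_values_mdct(frame_len):
    """An MDCT block spans 2 * frame_len samples but yields frame_len values."""
    return frame_len

def overhead(frame_len, coded_values):
    """Relative overhead of the coded values with respect to new input samples."""
    return Fraction(coded_values - frame_len, frame_len)
```

For a frame of 256 new samples, the low overlap scheme codes 288 values (an overhead of ⅛th), whereas the MDCT codes 256 values although consecutive blocks overlap by 50%.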
Embodiments may be used in the context of a switched frequency domain and time domain coding with low overlap windows, such as for example the AMR-WB+. Embodiments may use an MDCT instead of a non-critically sampled filterbank. In this way the overhead due to non-critical sampling may be advantageously reduced based on the critical sampling property of, for example, the MDCT. Additionally, longer overlaps are possible without introducing additional overhead. Embodiments can provide the advantage that, based on the longer overlaps, cross-fading can be carried out more smoothly; in other words, sound quality may be increased at the decoder.
In one detailed embodiment the FFT in the AMR-WB+ TCX-mode may be replaced by an MDCT while keeping functionalities of AMR-WB+, especially the switching between the ACELP mode and the TCX mode based on a closed or open loop decision. Embodiments may use the MDCT in a non-critically sampled fashion for the first TCX frame after an ACELP frame and subsequently use the MDCT in a critically sampled fashion for all subsequent TCX frames. Embodiments may retain the feature of closed loop decision, using the MDCT with low overlap windows similar to the unmodified AMR-WB+, but with longer overlaps. This may provide the advantage of a better frequency response compared to the unmodified TCX windows.
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
a-2j show equations for an embodiment of a time domain aliasing introducing transform;
a shows another embodiment of an audio encoder;
b shows another embodiment of an audio encoder;
c shows yet another embodiment of an audio encoder;
d shows yet another embodiment of an audio encoder;
a shows a sample of a time domain speech signal for voiced speech;
b illustrates a spectrum of a voiced speech signal sample;
a illustrates a time domain signal of a sample of an unvoiced speech signal;
b shows a spectrum of a sample of an unvoiced speech signal;
a shows an embodiment of an audio decoder;
b shows another embodiment of an audio decoder;
c shows another embodiment of an audio decoder;
a shows a sequence of window functions and a corresponding table of window parameters;
b shows possible transitions for an MDCT-based embodiment;
a shows a table of possible transitions in an embodiment;
b illustrates a transition window from ACELP to TCX80 according to one embodiment;
c shows an embodiment of a transition window from a TCXx frame to a TCX20 frame to a TCXx frame according to one embodiment;
d illustrates an embodiment of a transition window from ACELP to TCX20 according to one embodiment;
e shows an embodiment of a transition window from ACELP to TCX40 according to one embodiment;
f illustrates an embodiment of the transition window for a transition from a TCXx frame to a TCX80 frame to a TCXx frame according to one embodiment;
a,b illustrates LPC encoding and decoding;
In the following, embodiments of the present invention will be described in detail. It is to be noted that the following embodiments shall not limit the scope of the invention; rather, they shall be taken as possible realizations or implementations among many different embodiments.
The audio encoder 10 further comprises a time-aliasing introducing transformer 14 for transforming overlapping prediction domain frames to the frequency domain to obtain prediction domain frame spectra, wherein the time-aliasing introducing transformer 14 is adapted for transforming the overlapping prediction domain frames in a critically sampled way. The audio encoder 10 further comprises a redundancy reducing encoder 16 for encoding the prediction domain frame spectra to obtain the encoded frames based on the coefficients and the encoded prediction domain frame spectra.
The redundancy reducing encoder 16 may be adapted for using Huffman coding or entropy coding in order to encode the prediction domain frame spectra and/or the information on the coefficients.
In embodiments the time-aliasing introducing transformer 14 can be adapted for transforming overlapping prediction domain frames such that an average number of samples of a prediction domain frame spectrum equals an average number of samples in a prediction domain frame, thereby achieving the critically sampled transform. Furthermore, the time-aliasing introducing transformer 14 can be adapted for transforming overlapping prediction domain frames according to a modified discrete cosine transformation (MDCT=Modified Discrete Cosine Transform).
In the following, the MDCT will be explained in further detail with the help of the equations illustrated in
The MDCT was proposed by Princen, Johnson, and Bradley in 1987, following earlier (1986) work by Princen and Bradley to develop the MDCT's underlying principle of time-domain aliasing cancellation (TDAC), further described below. There also exists an analogous transform, the MDST, based on the discrete sine transform, as well as other, rarely used, forms of the MDCT based on different types of DCT or DCT/DST (DST=Discrete Sine Transform) combinations, which can also be used in embodiments by the time domain aliasing introducing transform 14.
In MP3, the MDCT is not applied to the audio signal directly, but rather to the output of a 32-band polyphase quadrature filter (PQF=Polyphase Quadrature Filter) bank. The output of this MDCT is postprocessed by an alias reduction formula to reduce the typical aliasing of the PQF filter bank. Such a combination of a filter bank with an MDCT is called a hybrid filter bank or a subband MDCT. AAC, on the other hand, normally uses a pure MDCT; only the (rarely used) MPEG-4 AAC-SSR variant (by Sony) uses a four-band PQF bank followed by an MDCT. ATRAC (ATRAC=Adaptive TRansform Audio Coding) uses stacked quadrature mirror filters (QMF) followed by an MDCT.
As a lapped transform, the MDCT is a bit unusual compared to other Fourier-related transforms in that it has half as many outputs as inputs (instead of the same number). In particular, it is a linear function F: R^(2N)→R^N, where R denotes the set of real numbers. The 2N real numbers x_0, . . . , x_(2N−1) are transformed into the N real numbers X_0, . . . , X_(N−1) according to the formula in
The normalization coefficient in front of this transform, here unity, is an arbitrary convention and differs between treatments. Only the product of the normalizations of the MDCT and the IMDCT, below, is constrained.
The inverse MDCT is known as the IMDCT. Because there are different numbers of inputs and outputs, at first glance it might seem that the MDCT should not be invertible. However, perfect invertibility is achieved by adding the overlapped IMDCTs of subsequent overlapping blocks, causing the errors to cancel and the original data to be retrieved; this technique is known as time-domain aliasing cancellation (TDAC).
The IMDCT transforms N real numbers X_0, . . . , X_(N−1) into 2N real numbers y_0, . . . , y_(2N−1) according to the formula in
In the case of a windowed MDCT with the usual window normalization (see below), the normalization coefficient in front of the IMDCT should be multiplied by 2 i.e., becoming 2/N.
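The unwindowed transform pair and the aliasing cancellation can be checked numerically with a minimal sketch. It assumes the commonly used definition X_k = Σ x_n cos[(π/N)(n + ½ + N/2)(k + ½)] with unit normalization for the MDCT and a factor 1/N for the IMDCT; since the formulas themselves are given in the drawings, this definition and the function names are assumptions. The direct O(N^2) evaluation is used for clarity only.

```python
import math

def mdct(x):
    """Direct MDCT: 2N real inputs -> N real outputs (unit normalization)."""
    N = len(x) // 2
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(X):
    """Direct IMDCT: N real inputs -> 2N real outputs (normalization 1/N)."""
    N = len(X)
    return [sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for k in range(N)) / N
            for n in range(2 * N)]

def overlap_add(signal, N):
    """Transform 50% overlapped blocks of length 2N and add the IMDCT outputs."""
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * N + 1, N):
        y = imdct(mdct(signal[start:start + 2 * N]))
        for n in range(2 * N):
            out[start + n] += y[n]
    return out

signal = [float((7 * n) % 5 - 2) for n in range(16)]
single = imdct(mdct(signal[:8]))    # one block alone: time-aliased output
restored = overlap_add(signal, 4)   # samples 4..11 are covered by two blocks
```

A single block is not recovered by the IMDCT alone, whereas all samples covered by two overlapping blocks are recovered after the addition.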
Although the direct application of the MDCT formula would necessitate O(N^2) operations, it is possible to compute the same thing with only O(N log N) complexity by recursively factorizing the computation, as in the fast Fourier transform (FFT). One can also compute MDCTs via other transforms, typically a DFT (FFT) or a DCT, combined with O(N) pre- and post-processing steps. Also, as described below, any algorithm for the DCT-IV immediately provides a method to compute the MDCT and IMDCT of even size.
In typical signal-compression applications, the transform properties are further improved by using a window function w_n (n=0, . . . , 2N−1) that is multiplied with x_n and y_n in the MDCT and IMDCT formulas, above, in order to avoid discontinuities at the n=0 and 2N boundaries by making the function go smoothly to zero at those points. That is, the data is windowed before the MDCT and after the IMDCT. In principle, x and y could have different window functions, and the window function could also change from one block to the next, especially for the case where data blocks of different sizes are combined, but for simplicity the common case of identical window functions for equal-sized blocks is considered first.
The transform remains invertible, i.e. TDAC works, for a symmetric window w_n=w_(2N−1−n) as long as w satisfies the Princen-Bradley condition according to
Various different window functions are common, an example is given in
Note that windows applied to the MDCT are different from windows used for other types of signal analysis, since they have to fulfill the Princen-Bradley condition. One of the reasons for this difference is that MDCT windows are applied twice, for both the MDCT (analysis filter) and the IMDCT (synthesis filter).
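As a hedged illustration of the window requirements, the sketch below uses a sine window, which is one common choice, and assumes the Princen-Bradley condition in its usual form w_n^2 + w_(n+N)^2 = 1. The transform definition, the IMDCT normalization of 2/N for the windowed case and all function names are assumptions made for this example.

```python
import math

def sine_window(N):
    """Symmetric sine window of length 2N."""
    return [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]

def princen_bradley_error(w):
    """Largest deviation of w_n^2 + w_(n+N)^2 from 1."""
    N = len(w) // 2
    return max(abs(w[n] ** 2 + w[n + N] ** 2 - 1.0) for n in range(N))

def mdct(x):
    N = len(x) // 2
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(X):
    """IMDCT with the normalization 2/N used for the windowed case."""
    N = len(X)
    return [2.0 / N * sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                          for k in range(N))
            for n in range(2 * N)]

def windowed_overlap_add(signal, N):
    """Window before the MDCT and after the IMDCT, then overlap and add."""
    w = sine_window(N)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * N + 1, N):
        block = [signal[start + n] * w[n] for n in range(2 * N)]
        y = imdct(mdct(block))
        for n in range(2 * N):
            out[start + n] += w[n] * y[n]
    return out
```

Because the window is applied twice, once at analysis and once at synthesis, the squared window values of the two overlapping blocks have to add up to one for the overlapped region to be reconstructed.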
As can be seen by inspection of the definitions, for even N the MDCT is essentially equivalent to a DCT-IV, where the input is shifted by N/2 and two N-blocks of data are transformed at once. By examining this equivalence more carefully, important properties like TDAC can be easily derived.
In order to define the precise relationship to the DCT-IV, one has to realize that the DCT-IV corresponds to alternating even/odd boundary conditions: it is even at its left boundary (around n=−½), odd at its right boundary (around n=N−½), and so on (instead of periodic boundaries as for a DFT). This follows from the identities given in
Consider an MDCT with 2N inputs and N outputs, where the inputs can be divided into four blocks (a, b, c, d) each of size N/2. If these are shifted by N/2 (from the +N/2 term in the MDCT definition), then (b, c, d) extend past the end of the N DCT-IV inputs, so they have to be “folded” back according to the boundary conditions described above.
Thus, the MDCT of 2N inputs (a, b, c, d) is exactly equivalent to a DCT-IV of the N inputs: (−cR−d, a−bR), where R denotes reversal as above. In this way, any algorithm to compute the DCT-IV can be trivially applied to the MDCT.
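This equivalence can be verified numerically. The sketch below assumes the usual unnormalized DCT-IV and the MDCT definition with the N/2 shift; the function names fold, dct_iv and mdct are chosen for illustration.

```python
import math

def dct_iv(u):
    """Unnormalized DCT-IV of N real inputs."""
    N = len(u)
    return [sum(u[n] * math.cos(math.pi / N * (n + 0.5) * (k + 0.5))
                for n in range(N))
            for k in range(N)]

def fold(x):
    """Fold 2N inputs (a, b, c, d) into the N values (-cR - d, a - bR)."""
    N = len(x) // 2
    h = N // 2
    a, b, c, d = x[:h], x[h:N], x[N:N + h], x[N + h:]
    first = [-cr - dv for cr, dv in zip(reversed(c), d)]
    second = [av - br for av, br in zip(a, reversed(b))]
    return first + second

def mdct(x):
    """Direct MDCT used as the reference for the comparison."""
    N = len(x) // 2
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]
```

For an input of even half-length N, dct_iv(fold(x)) and mdct(x) agree up to rounding, so a fast DCT-IV routine can replace the direct summation.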
Similarly, the IMDCT formula as mentioned above is precisely ½ of the DCT-IV (which is its own inverse), where the output is shifted by N/2 and extended (via the boundary conditions) to a length 2N. The inverse DCT-IV would simply give back the inputs (−cR−d, a−bR) from above. When this is shifted and extended via the boundary conditions, one obtains the result displayed in
One can now understand how TDAC works. Suppose that one computes the MDCT of the subsequent, 50% overlapped, 2N block (c, d, e, f). The IMDCT will then yield, analogous to the above: (c−dR, d−cR, e+fR, eR+f)/2. When this is added with the previous IMDCT result in the overlapping half, the reversed terms cancel and one obtains simply (c, d), recovering the original data.
The origin of the term “time-domain aliasing cancellation” is now clear. The use of input data that extend beyond the boundaries of the logical DCT-IV causes the data to be aliased in exactly the same way that frequencies beyond the Nyquist frequency are aliased to lower frequencies, except that this aliasing occurs in the time domain instead of the frequency domain. Hence the combinations c−dR and so on, which have precisely the right signs for the combinations to cancel when they are added.
For odd N (which are rarely used in practice), N/2 is not an integer so the MDCT is not simply a shift permutation of a DCT-IV. In this case, the additional shift by half a sample means that the MDCT/IMDCT becomes equivalent to the DCT-III/II, and the analysis is analogous to the above.
Above, the TDAC property was proved for the ordinary MDCT, showing that adding IMDCTs of subsequent blocks in their overlapping half recovers the original data. The derivation of this inverse property for the windowed MDCT is only slightly more complicated.
Recall from above that when (a,b,c,d) and (c,d,e,f) are MDCTed, IMDCTed, and added in their overlapping half, we obtain (c+dR,cR+d)/2+(c−dR,d−cR)/2=(c,d), the original data.
Now suppose that both the MDCT inputs and the IMDCT outputs are multiplied by a window function of length 2N. As above, a symmetric window function is assumed, which is therefore of the form (w, z, zR, wR), where w and z are length-N/2 vectors and R denotes reversal as before. Then the Princen-Bradley condition can be written
w^2 + (zR)^2 = (1, 1, . . . )
with the multiplications and additions performed elementwise, or equivalently
(wR)^2 + z^2 = (1, 1, . . . )
reversing w and z.
Therefore, instead of MDCTing (a,b,c,d), the windowed block (wa,zb,zRc,wRd) is MDCTed, with all multiplications performed elementwise. When this is IMDCTed and multiplied again (elementwise) by the window function, the last-N half results as displayed in
Note that the multiplication by ½ is no longer present, because the IMDCT normalization differs by a factor of 2 in the windowed case. Similarly, the windowed MDCT and IMDCT of (c,d,e,f) yields, in its first-N half according to
a depicts another embodiment of the audio coder 10. In the embodiment depicted in
Another embodiment of an audio encoder 10 is depicted in
c shows another embodiment of an audio encoder 10. In the embodiment depicted in
In other words, the decider 15 may be adapted for reverting the encoding procedure, i.e. carrying out full decoding for both branches. Having fully decoded frames the decider 15 may be adapted for comparing the decoded samples to the original samples, which is indicated by the dotted arrow in
d shows another embodiment of the audio encoder 10. In the embodiment shown in
The configuration of the embodiment shown in
In the following, the decision process of the decider 15 is explained. Generally, an impulse-like portion of an audio signal can be distinguished from a stationary portion by applying a signal processing operation in which both the impulse-like characteristic and the stationary-like characteristic are measured. Such measurements can, for example, be done by analyzing the waveform of the audio signal. To this end, any transform-based processing, LPC processing or any other processing can be performed. An intuitive way of determining whether a portion is impulse-like is, for example, to look at the time domain waveform and to determine whether it has peaks at regular or irregular intervals; peaks at regular intervals are even more suited for a speech-like coder, i.e. for the codebook encoder. Note that even within speech, voiced and unvoiced parts can be distinguished. The codebook encoder 13 may be more efficient for voiced signal parts or voiced frames, whereas the transform-based branch comprising the time-aliasing introducing transformer 14 and the redundancy reducing encoder 16 may be more suitable for unvoiced frames. Generally, the transform based coding may also be more suitable for stationary signals other than voice signals.
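One conceivable way of measuring whether a waveform has peaks at regular intervals is a normalized autocorrelation peak. The sketch below is only an illustration of this idea and is not the decision rule of the decider 15; the lag range, the threshold of 0.5 and the function names are assumptions.

```python
def normalized_autocorr_peak(x, min_lag, max_lag):
    """Largest autocorrelation value in the lag range, relative to the energy."""
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        best = max(best, r / energy)
    return best

def has_regular_peaks(x, min_lag=10, max_lag=100, threshold=0.5):
    """True if the waveform repeats itself at some lag within the range."""
    return normalized_autocorr_peak(x, min_lag, max_lag) >= threshold
```

A pulse train with a constant spacing yields a value close to one, whereas pulses at irregular positions yield a small value.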
Exemplarily, reference is made to
Speech can generally be classified as voiced, unvoiced or mixed. Time-and-frequency domain plots for sampled voiced and unvoiced segments are shown in
The spectral envelope is characterized by a set of peaks, which are called formants. The formants are the resonant modes of the vocal tract. For the average vocal tract there are 3 to 5 formants below 5 kHz. The amplitudes and locations of the first three formants, usually occurring below 3 kHz, are quite important both in speech synthesis and in perception. Higher formants are also important for wideband and unvoiced speech representations. The properties of speech are related to physical speech production systems as follows. Exciting the vocal tract with quasi-periodic glottal air pulses generated by the vibrating vocal cords produces voiced speech. The frequency of the periodic pulse is referred to as the fundamental frequency or pitch. Forcing air through a constriction in the vocal tract produces unvoiced speech. Nasal sounds are due to the acoustic coupling of the nasal tract to the vocal tract, and plosive sounds are produced by abruptly releasing the air pressure which was built up behind the closure in the tract.
Thus, a stationary portion of the audio signal can be a stationary portion in the time domain as illustrated in
Furthermore, impulse-like portions and stationary portions can occur at different times, i.e. one portion of the audio signal in time is stationary and another portion of the audio signal in time is impulse-like. Alternatively or additionally, the characteristics of a signal can be different in different frequency bands. Thus, the determination of whether the audio signal is stationary or impulse-like can also be performed frequency-selectively, so that a certain frequency band or several certain frequency bands are considered to be stationary and other frequency bands are considered to be impulse-like. In this case, a certain time portion of the audio signal might include an impulse-like portion or a stationary portion.
Coming back to the embodiment shown in
Subsequently, an analysis-by-synthesis CELP encoder will be discussed with respect to
Generally, the short-term prediction A(z) is calculated by an LPC analysis stage which will be further discussed below. Depending on this information, the long-term prediction A_L(z) includes the long-term prediction gain b and delay T (also known as pitch gain and pitch delay). The CELP algorithm encodes the excitation or prediction domain frames using a codebook of, for example, Gaussian sequences. The ACELP algorithm, where the “A” stands for “algebraic”, has a specific algebraically designed codebook.
The codebook may contain more or fewer vectors, where each vector has a length according to a number of samples. A gain factor g scales the excitation vector and the excitation samples are filtered by the long-term synthesis filter and a short-term synthesis filter. The “optimum” vector is selected such that the perceptually weighted mean square error is minimized. The search process in CELP is evident from the analysis-by-synthesis scheme illustrated in
In CELP, the long-term predictor is often implemented as an adaptive codebook containing the previous excitation signal. The long-term prediction delay and gain are represented by an adaptive codebook index and gain, which are also selected by minimizing the mean square weighted error. In this case the excitation signal consists of the addition of two gain-scaled vectors, one from an adaptive codebook and one from a fixed codebook. The perceptual weighting filter in AMR-WB+ is based on the LPC filter, thus the perceptually weighted signal is a form of an LPC domain signal. In the transform domain coder used in AMR-WB+, the transform is applied to the weighted signal. At the decoder, the excitation signal is obtained by filtering the decoded weighted signal through a filter consisting of the inverse of synthesis and weighting filters.
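The analysis-by-synthesis search described above can be illustrated with a minimal sketch. It is not the codec's actual search: it assumes the convention A(z) = 1 + a1·z^-1 + ..., uses a plain (unweighted) squared error in place of the perceptually weighted error, ignores the adaptive codebook contribution, and the names `synthesis_filter` and `search_fixed_codebook` are hypothetical.

```python
def synthesis_filter(excitation, lpc):
    """Zero-state all-pole filtering through 1/A(z).

    Assumption: lpc = [1.0, a1, a2, ...] with A(z) = 1 + a1*z^-1 + a2*z^-2 + ...
    """
    output = []
    for n, sample in enumerate(excitation):
        acc = sample
        for i in range(1, len(lpc)):
            if n - i >= 0:
                acc -= lpc[i] * output[n - i]
        output.append(acc)
    return output


def search_fixed_codebook(target, codebook, lpc):
    """Return (index, gain, error) of the vector whose gain-scaled,
    synthesized version is closest to the target in the squared-error sense."""
    best = None
    for index, vector in enumerate(codebook):
        synthesized = synthesis_filter(vector, lpc)
        energy = sum(v * v for v in synthesized)
        if energy == 0.0:
            continue  # an all-zero vector cannot match anything
        # Optimal gain for this vector in the least-squares sense.
        gain = sum(t * v for t, v in zip(target, synthesized)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, synthesized))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


# Hypothetical toy data: three short codebook vectors and a first-order filter.
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, -1.0], [1.0, 1.0, 1.0, 1.0]]
lpc = [1.0, -0.5]
target = [2.5 * v for v in synthesis_filter(codebook[1], lpc)]
index, gain, error = search_fixed_codebook(target, codebook, lpc)
```

In a fuller sketch the adaptive codebook vector would be searched first and its gain-scaled contribution removed from the target before the fixed codebook search.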
A reconstructed TCX target x(n) may be filtered through a zero-state inverse weighted synthesis filter
to find the excitation signal which can be applied to the synthesis filter. Note that the interpolated LP filter per subframe or frame is used in the filtering. Once the excitation is determined, the signal can be reconstructed by filtering the excitation through the synthesis filter 1/Â(z) and then de-emphasizing, for example by filtering through the filter 1/(1 − 0.68z^−1). Note that the excitation may also be used to update the ACELP adaptive codebook, which allows switching from TCX to ACELP in a subsequent frame. Note also that the length of the TCX synthesis can be given by the TCX frame length (without the overlap): 256, 512 or 1024 samples for the mod [ ] of 1, 2 or 3 respectively.
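As a rough illustration of the reconstruction step, the following sketch filters an excitation through 1/Â(z) and then through the de-emphasis filter 1/(1 − 0.68z^−1). It is only a sketch under stated assumptions: the coefficient convention Â(z) = 1 + a1·z^-1 + ..., a zero initial filter state, and a single non-interpolated coefficient set per call are simplifications, and the function names are hypothetical.

```python
def all_pole_filter(denominator, samples):
    """Filter samples through 1/D(z), D(z) = 1 + d1*z^-1 + ... (zero initial state).

    Assumption: denominator[0] is 1.0.
    """
    output = []
    for n, sample in enumerate(samples):
        acc = sample
        for i in range(1, len(denominator)):
            if n - i >= 0:
                acc -= denominator[i] * output[n - i]
        output.append(acc)
    return output


def reconstruct_from_excitation(excitation, lpc, deemphasis=0.68):
    """Synthesis filtering followed by de-emphasis 1/(1 - deemphasis*z^-1)."""
    synthesized = all_pole_filter(lpc, excitation)
    return all_pole_filter([1.0, -deemphasis], synthesized)


# With a trivial synthesis filter (A(z) = 1) only the de-emphasis acts, so a
# unit impulse decays geometrically with ratio 0.68.
impulse_response = reconstruct_from_excitation([1.0, 0.0, 0.0, 0.0], [1.0])
```

A real decoder would carry the filter memories across subframes and switch to the interpolated coefficients at each subframe boundary.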
The functionality of an embodiment of the predictive coding analysis stage 12 will be discussed subsequently according to the embodiment shown in
An embodiment of an audio decoder 80 for decoding encoded frames to obtain frames of a sampled audio signal, wherein a frame comprises a number of time domain samples, is shown in
The overlap and add combiner 86 can be adapted for combining overlapping prediction domain frames such that an average number of samples in a prediction domain frame equals an average number of samples of the prediction domain frame spectrum. In embodiments, the inverse time-aliasing introducing transformer 84 can be adapted for transforming the prediction domain frame spectra to the time domain according to an IMDCT, according to the above details.
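The interplay of the time-aliasing introducing transform, its inverse and the overlap and add combination can be sketched as follows. This is a textbook MDCT/IMDCT pair with a sine window, not the embodiment's exact transform: the window choice, the 2/N scaling of the inverse, the zero padding at the signal edges and all function names are assumptions made for illustration. Each block of 2N windowed samples yields only N coefficients (critical sampling), and the aliasing of neighboring blocks cancels in the overlap-add.

```python
import math


def sine_window(half_length):
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1."""
    length = 2 * half_length
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


def mdct(block):
    """2N time samples -> N coefficients (time-domain aliasing is introduced)."""
    n_half = len(block) // 2
    return [
        sum(block[n] * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n in range(2 * n_half))
        for k in range(n_half)
    ]


def imdct(spectrum):
    """N coefficients -> 2N aliased time samples (2/N scaling for windowed use)."""
    n_half = len(spectrum)
    return [
        2.0 / n_half * sum(
            spectrum[k] * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for k in range(n_half))
        for n in range(2 * n_half)
    ]


def analyze_and_resynthesize(signal, n_half):
    """Window, MDCT, IMDCT, window again and overlap-add with hop size N.

    Assumption: len(signal) is a multiple of n_half.
    """
    window = sine_window(n_half)
    padded = [0.0] * n_half + list(signal) + [0.0] * n_half
    output = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_half + 1, n_half):
        block = [padded[start + n] * window[n] for n in range(2 * n_half)]
        aliased = imdct(mdct(block))
        for n in range(2 * n_half):
            output[start + n] += aliased[n] * window[n]
    return output[n_half:n_half + len(signal)]


signal = [math.sin(0.3 * i) + 0.1 * i for i in range(32)]
reconstructed = analyze_and_resynthesize(signal, 8)
```

Comparing `reconstructed` with `signal` sample by sample is a quick way to check that the aliasing terms cancel.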
Generally in block 86, after “overlap/add combiner” there may in embodiments optionally be an “excitation recovery”, which is indicated in brackets in
Moreover, in embodiments, the predictive synthesis stage 88 can be adapted for determining the frame based on linear prediction, i.e. LPC. Another embodiment of an audio decoder 80 is depicted in
c shows another embodiment of an audio decoder 80 having similar components as in the embodiment depicted in
In embodiments of audio encoders 10 and audio decoders 80, the respective windowing filters 17 and 84 can be adapted for applying windowing functions according to window sequence information.
In other words, the MDCT-based TCX may request from the arithmetic decoder a number of quantized spectral coefficients, lg, which is determined by the mod [ ] and last_lpd_mode values of the last mode. These two values may also define the window length and shape which will be applied in the inverse MDCT. The window may be composed of three parts, a left side overlap of L samples, a middle part of ones of M samples and a right overlap part of R samples. To obtain an MDCT window of length 2*lg, ZL zeros can be added on the left and ZR zeros on the right side.
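The composition of the window from its five parts can be sketched as below. The sketch only mirrors the structure named in the text (ZL zeros, a left overlap of L samples, M ones, a right overlap of R samples, ZR zeros); the sine-shaped slopes, the function name and the example part lengths are assumptions for illustration and are not taken from the table or the window equation of the embodiment.

```python
import math


def compose_mdct_window(zl, l, m, r, zr):
    """Concatenate ZL zeros, a rising slope (L), ones (M), a falling slope (R), ZR zeros.

    Assumption: the slopes are quarter-period sine and cosine segments.
    """
    rising = [math.sin(math.pi * (n + 0.5) / (2 * l)) for n in range(l)]
    falling = [math.cos(math.pi * (n + 0.5) / (2 * r)) for n in range(r)]
    return [0.0] * zl + rising + [1.0] * m + falling + [0.0] * zr


# Illustrative part lengths (hypothetical): the total length is 2*lg.
window = compose_mdct_window(zl=64, l=128, m=128, r=128, zr=64)
lg = len(window) // 2
```

With these example values the window has 512 samples, so the inverse MDCT would operate on lg = 256 spectral coefficients.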
The following table shall illustrate the number of spectral coefficients as a function of last_lpd_mode and mod [ ] for some embodiments:
The MDCT window is given by
Embodiments may provide the advantage that a systematic coding delay of the MDCT, or of the IMDCT respectively, may be lowered when compared to the original MDCT through application of different window functions. In order to provide more details on this advantage,
In other words, the decoder already knows that there is a third zero part and therefore decoding, or respectively encoding, can be started earlier. Therefore, the systematic delay can be reduced by 2Δt as is shown at the bottom of
In the following an embodiment will be considered, where the MDCT is used in the AMR-WB+ codec, replacing the FFT. Therefore, the windows will be detailed, according to
T=L+M+R
as compared to the transform coder length for MDCT
T=L/2+M+R/2.
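The two length expressions can be compared numerically with a small sketch. The values of L, M and R used here are purely illustrative assumptions, and the function names are hypothetical; the point is only that the MDCT counts each overlap region once, shared between neighboring frames, rather than once per frame.

```python
def full_overlap_length(l, m, r):
    """T = L + M + R: every windowed sample is transformed."""
    return l + m + r


def mdct_length(l, m, r):
    """T = L/2 + M + R/2: each overlap is shared with the neighboring frame."""
    if l % 2 or r % 2:
        raise ValueError("overlap lengths are assumed to be even")
    return l // 2 + m + r // 2


# Illustrative values only (hypothetical, not taken from the figures).
l, m, r = 128, 128, 128
t_full = full_overlap_length(l, m, r)
t_mdct = mdct_length(l, m, r)
saved = t_full - t_mdct
```

For these example values the first expression gives 384 and the MDCT expression gives 256, a saving of 128 values per frame.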
a illustrates at the top a view graph of an example sequence of window functions for AMR-WB+. From the left to the right the view graph at the top of
At the bottom of
In the table shown at the bottom of
From the table at the bottom of
b illustrates the transition from ACELP to a TCX80 window in more detail. The view chart in
c illustrates the transition from any TCX frame, denoted by “TCXx”, to a TCX20 frame and back to any TCXx frame.
In summary, the
When transiting from TCX to ACELP or from ACELP to TCX80, multiple options are possible. In one embodiment the windowed samples from the MDCT TCX frame may be discarded in the overlapping region. In another embodiment the windowed samples may be used for a cross-fade and for canceling a time domain aliasing in the MDCT TCX samples based on the aliased ACELP samples in the overlapping region. In yet another embodiment, cross-over fading may be carried out without canceling the time domain aliasing. In the ACELP to TCX transition the zero-input response (ZIR) can be removed at the encoder for windowing and added at the decoder for recovering. In the figures this is indicated by dotted lines within the TCX windows following an ACELP window. In the present embodiment, when transiting from TCX to TCX, the windowed samples can be used for cross-fade.
When transiting from ACELP to TCX80, the frame length is longer and the frame may overlap with the ACELP frame; in this case the time domain aliasing cancellation or the discard method may be used.
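A cross-fade over an overlapping region can be sketched as follows. The linear fade shape and the function name are assumptions for illustration; the embodiments may use the slopes of the actual window functions instead, and this sketch does not perform any time domain aliasing cancellation.

```python
def cross_fade(previous_tail, next_head):
    """Linearly fade from the end of the previous frame into the start of the next.

    Assumption: both inputs cover the same overlap region and have equal length.
    """
    if len(previous_tail) != len(next_head):
        raise ValueError("overlap regions must have the same length")
    length = len(previous_tail)
    faded = []
    for n in range(length):
        fade_in = (n + 0.5) / length  # weight of the new frame, rising from 0 to 1
        faded.append((1.0 - fade_in) * previous_tail[n] + fade_in * next_head[n])
    return faded


# The two weights always sum to one, so identical inputs pass through unchanged.
unchanged = cross_fade([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0])
fade_out_only = cross_fade([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0])
```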
When transiting from ACELP to TCX80 the previous ACELP frame may introduce a ringing. The ringing may be recognized as a spreading of error coming from the previous frame due to the usage of LPC filtering. The ZIR method used for TCX40 and TCX20 may account for the ringing. A variant for the TCX80 in embodiments is to use the ZIR method with a transform length of 1088, i.e. without overlap with the ACELP frame. In another embodiment the same transform length of 1152 may be kept and zeroing of the overlap area just before the ZIR may be utilized, as shown in
Summarizing, embodiments of the present invention provide the advantage that critical sampling can be carried out for all TCX frames when a TCX frame precedes. As compared to the conventional approach, an overhead reduction of ⅛ can be achieved. Moreover, embodiments provide the advantage that the transitional or overlapping area between consecutive frames may be 128 samples, i.e. longer than for the conventional AMR-WB+. The improved overlap areas also provide an improved frequency response and a smoother cross-fade. As a result, a better signal quality can be achieved with the overall encoding and decoding process.
Depending on certain implementation requirements of the inventive methods, the inventive methods can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, in particular a disc, a DVD, a flash memory or a CD having electronically readable control signals stored thereon, which cooperate with a programmable computer system such that the inventive methods are performed. Generally, the present invention is therefore a computer program product with a program code stored on a machine-readable carrier, the program code being operated for performing the inventive methods when the computer program product runs on a computer. In other words, the inventive methods are, therefore, a computer program having a program code for performing at least one of the inventive methods when the computer program runs on a computer.
While this invention has been described in terms of several advantageous embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
08017661.3 | Oct 2008 | EP | regional |
This application is a continuation of copending International Application No. PCT/EP2009/004015, filed Jun. 4, 2009, which is incorporated herein by reference in its entirety, and claims priority to U.S. Patent Application No. 61/079,862 filed Jul. 11, 2008 and U.S. Patent Application No. 61/103,825 filed Oct. 8, 2008, and additionally claims priority from European Application No. 08017661.3, filed Oct. 8, 2008, which are all incorporated herein by reference in their entirety.
Number | Date | Country | |
---|---|---|---|
61079862 | Jul 2008 | US | |
61103825 | Oct 2008 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/EP2009/004015 | Jun 2009 | US |
Child | 13004475 | | US |