The invention relates to a method and to an apparatus for encoding and decoding excitation patterns from which the masking levels for an audio signal transform codec are determined.
For the quantisation of spectral data in an audio transform encoder, psycho-acoustic information is required, i.e. an approximation of the true masking threshold. In a corresponding audio transform decoder the same approximation is used for reconstructing the quantised data. At the encoder side, overlapping sections of the source signal are windowed using window functions. At the decoder side, overlap+add is carried out for the decoded signal windows.
In order to limit the amount of side information data to be transmitted, known transform codecs like mp3 and AAC use scale factors for critical bands (also denoted ‘scale factor bands’) as masking information, which means that the same scale factor is applied to a group of neighbouring frequency bins or coefficients prior to the quantisation process. Cf. K. Brandenburg, M. Bosi: “ISO/IEC MPEG-2 Advanced Audio Coding: Overview and Applications”, 103rd AES Convention, 26-29 Sep. 1997, New York, preprint No. 4641.
However, the scale factors represent only a coarse (step-wise) approximation of the masking threshold. The accuracy of this representation is very limited because groups of frequency bins with (slightly) different amplitudes get the same scale factor, and therefore the applied masking threshold is not optimal for a significant number of frequency bins.
For improving the encoding/decoding quality, the masking level can be computed as shown in:
An audio codec applying such excitation patterns for masking purposes is described in O. Niemeyer, B. Edler: “Efficient Coding of Excitation Patterns Combined with a Transform Audio Coder”, 118th AES Convention, 28-31 May 2005, Barcelona, Paper 6466. For each spectral audio data block to be encoded an excitation pattern is computed, wherein the excitation patterns represent the (true) frequency-dependent psycho-acoustic properties of the human ear.
For avoiding a significant increase of the resulting data rate in comparison with scale factor based masking, in each case 16 successive excitation patterns are combined in order to efficiently encode these excitation patterns. The excitation pattern matrix values are SPECK (Set Partitioning Embedded bloCK) encoded as described for image coding applications in W. A. Pearlman, A. Islam, N. Nagaraj, A. Said: “Efficient, Low-Complexity Image Coding With a Set-Partitioning Embedded Block Coder”, IEEE Transactions on Circuits and Systems for Video Technology, November 2004, vol. 14, no. 11, pp. 1219-1235.
The actual excitation pattern coding is performed by first arranging the excitation pattern values in a 2-dimensional matrix over frequency and time, and then applying a 2-dimensional DCT transform to the logarithmic-scale matrix values. The resulting transform coefficients are quantised and entropy encoded in bit planes, starting with the most significant one, whereby the SPECK-coded locations and the signs of the coefficients are transferred to the audio decoder as bit stream side information.
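The transform stage described above can be sketched as follows. This is a minimal illustration only: the orthonormal DCT-II, the dB scale (10·log10) and the `floor` parameter are assumptions that are not taken from the cited publication, and all function names are hypothetical.

```python
import math

def dct_1d(values):
    """Orthonormal DCT-II of a sequence of floats (assumed DCT variant)."""
    n = len(values)
    out = []
    for k in range(n):
        acc = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                  for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * acc)
    return out

def dct_2d(matrix):
    """2-dimensional DCT as two cascaded 1-dimensional DCTs (rows, then columns)."""
    rows_done = [dct_1d(row) for row in matrix]
    # zip(*...) transposes, so each column can be transformed like a row.
    cols_done = [dct_1d(list(col)) for col in zip(*rows_done)]
    return [list(row) for row in zip(*cols_done)]

def log_dct_2d(patterns, floor=1e-12):
    """Convert excitation values to a log scale (dB assumed), then transform."""
    log_vals = [[10.0 * math.log10(max(v, floor)) for v in row]
                for row in patterns]
    return dct_2d(log_vals)
```

For a matrix whose entries are all equal, only the first (DC) coefficient of the result is non-zero, which illustrates why smooth excitation patterns yield few significant coefficients.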
At encoder and at decoder side, the encoded excitation patterns are correspondingly decoded for calculating the masking thresholds to be applied in the audio signal encoding and decoding, so that the calculated masking thresholds are identical in both the encoder and the decoder. The audio signal quantisation is controlled by the resulting improved masking threshold.
Different window/transform lengths are used for the audio signal coding, and a fixed length is used for the excitation patterns.
A disadvantage of such excitation pattern audio encoding processing is the processing delay caused in the encoder by coding together the excitation patterns for a number of blocks. On the other hand, a more accurate representation of the masking threshold for the coding of the spectral data can be achieved, and thereby an increased encoding/decoding quality, while the combined excitation pattern coding of multiple blocks causes only a small increase of side information data.
In the above-mentioned Niemeyer/Edler processing, the masking thresholds derived from the excitation patterns are independent of the window and transform length selected in the audio signal coding. Instead, the excitation patterns are derived from fixed-length sections of the audio signal. However, a short window and transform length represents a higher time resolution, and for optimum coding/decoding quality the level of the related masking threshold should be adapted correspondingly.
A problem to be solved by the invention is to further increase the quality of the audio signal encoding/decoding by improving the masking threshold calculation, without causing an increase of the side information data rate.
According to the invention, for each spectrum to be quantised in the coding of the audio signal, an excitation pattern is computed and coded, i.e. for every shorter window/transform its own excitation pattern is calculated and thereby the time resolution of the excitation patterns is variable. The excitation patterns for long windows/transforms and for shorter windows/transforms are grouped together in corresponding matrices or blocks. The amount of excitation pattern data is the same for both long and shorter window/transform lengths, i.e. for non-transient and for transient source signal sections. The excitation pattern matrix can therefore have a different number of rows in each frame.
Regarding the excitation pattern coding, following an optional conversion of the matrix values to a logarithmic scale, a pre-determined scan or sorting order is applied to the two-dimensionally transformed excitation pattern data matrix values. By that re-ordering a quadratic matrix can be formed, and the SPECK encoding is applied directly to the bit planes of that matrix. Only a fixed number of values along the scan path are coded.
In principle, the inventive encoding method is suited for encoding excitation patterns from which the masking levels for an audio signal encoding are determined following a corresponding excitation pattern decoding, wherein for said audio signal encoding said audio signal is processed successively using different window and spectral transform lengths and a section of the audio signal representing a given multiple of the longest transform length is denoted a frame, and wherein said excitation patterns are related to a spectral representation of successive sections of said audio signal, said method including the steps:
In the SPECK encoding, bit planes of the matrix PTq are processed and a successive partitioning is used for locating and coding the positions of the corresponding coefficient bits in said bit planes.
In principle the inventive encoding apparatus is an audio signal encoder in which excitation patterns are encoded from which following a corresponding excitation pattern decoding the masking levels for an encoding of said audio signal are determined, wherein for encoding said audio signal it is processed successively using different window and spectral transform lengths and a section of the audio signal representing a given multiple of the longest transform length is denoted a frame, and wherein said excitation patterns are related to a spectral representation of successive sections of said audio signal, said apparatus including:
In principle, the inventive decoding method is suited for decoding excitation patterns that were encoded according to the above encoding method, from which excitation patterns the masking levels for an encoded audio signal decoding are determined, wherein for said audio signal decoding said audio signal is processed successively using different window and spectral inverse transform lengths and a section of the audio signal representing a given multiple of the longest transform length is denoted a frame, and wherein said excitation patterns are related to a spectral representation of successive sections of said audio signal, said method including the steps:
In principle the inventive decoding apparatus is an audio signal decoder in which excitation patterns encoded according to the above encoding method are decoded and used for determining the masking levels for the decoding of the encoded audio signal, wherein for decoding said audio signal it is processed successively using different window and spectral inverse transform lengths and a section of the audio signal representing a given multiple of the longest transform length is denoted a frame, and wherein said excitation patterns are related to a spectral representation of successive sections of said audio signal, said apparatus including:
Advantageous additional embodiments of the invention are disclosed in the respective dependent claims.
Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in:
In the block diagram for the inventive audio transform encoder in
As mentioned above, the power spectrum is required for the computation of the excitation patterns in section 14. For getting the power spectrum, the current windowed signal block is also transformed in step/stage 12 using an MDST (modified discrete sine transform). Both frequency representations, of types MLT and MDST, are fed to a buffer 13 that stores up to L blocks, wherein L is e.g. ‘8’ or ‘16’. The current window type code is also fed to buffer 13, via a delay 111 corresponding to one block transform period. The output of each transform contains K frequency bins for one signal block. In case a transient is detected in step/stage 11, the time domain input signal is windowed by an integer number of LS short windows (i.e. blocks) instead of a single long window of length N=2K, wherein LS is e.g. ‘3’ or ‘8’ and wherein the total number of frequency bins for all short windows of one long signal block is K.
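The role of the two transforms in obtaining the power spectrum can be illustrated by a short sketch. It assumes that the power of each frequency bin is estimated as the sum of the squared MLT and MDST coefficients, a common way of combining a cosine-based and a sine-based transform; the text above does not spell out the formula, and the function name is hypothetical.

```python
def power_spectrum(mlt_bins, mdst_bins):
    """Per-bin power estimate from K MLT and K MDST coefficients.

    Assumption: power = MLT^2 + MDST^2 for each frequency bin.
    """
    if len(mlt_bins) != len(mdst_bins):
        raise ValueError("both transforms must deliver the same number of bins")
    return [c * c + s * s for c, s in zip(mlt_bins, mdst_bins)]
```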
A number of L signal blocks form a data group, denoted ‘frame’. The excitation pattern coding is applied to the excitation patterns of a frame in step/stage 141. For each spectrum to be quantised later on, one excitation pattern is computed. This feature is different to the audio coding described in the Brandenburg and the Niemeyer/Edler publications mentioned above and to the corresponding feature in the following standards, where a fixed time resolution of the excitation patterns is used:
The amount of data per excitation pattern is the same for both long and short transform lengths. As a consequence, for a signal block containing short windows more excitation pattern data have to be encoded than for a signal block containing a long window.
The excitation patterns to be encoded are preferably arranged within a matrix P that has a non-quadratic shape. Each row of the matrix contains one excitation pattern corresponding to one spectrum to be quantised. Thus, the row and column indices correspond to the time and frequency axes, respectively. The number of rows in matrix P is at least L, but in contrast to the processing described in the Niemeyer/Edler publication, the matrix P can have a different number of rows in each frame because that number will depend on the number of short windows in the corresponding frame.
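The dependency of the row count on the window types can be sketched as follows, assuming that a block coded with a long window contributes one row and a block coded with short windows contributes LS rows. The function and parameter names are hypothetical.

```python
def excitation_matrix_rows(block_is_short, short_windows_per_block):
    """Number of rows of matrix P for one frame.

    block_is_short: one boolean per signal block of the frame (L entries).
    short_windows_per_block: LS, the number of short windows per short block.
    """
    return sum(short_windows_per_block if is_short else 1
               for is_short in block_is_short)

# Frame of L = 8 blocks, one of them coded with LS = 8 short windows.
frame = [False, False, True, False, False, False, False, False]
rows = excitation_matrix_rows(frame, short_windows_per_block=8)
```

A frame without any transient yields exactly L rows, which is the minimum stated above.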
As an alternative, rows and columns of matrix P can be exchanged.
For applying a 2-dimensional transform (e.g. by using two cascaded 1-dimensional DCTs), the last row (or even more rows) of the matrix can be duplicated in order to get a number of rows (e.g. an even number) that the transform can handle.
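A minimal sketch of this row duplication is given below. The default multiple of 4 and the function name are illustrative assumptions; the matrix is represented as a list of rows.

```python
def pad_rows(matrix, multiple=4):
    """Duplicate the last row until the row count is a multiple of `multiple`."""
    if not matrix:
        raise ValueError("matrix must contain at least one row")
    padded = [list(row) for row in matrix]  # copy, leave the input untouched
    while len(padded) % multiple != 0:
        padded.append(list(padded[-1]))
    return padded

# Example: 11 excitation patterns with 3 values each become 12 rows.
example = [[float(r)] * 3 for r in range(11)]
padded_example = pad_rows(example)
```

At the decoder side the duplicated rows would simply be discarded after the inverse transform, since the true row count follows from the window types.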
Table 1 shows an example for a frame with one block using short windows, which would result in 11 rows. Because the 2-dimensional transform can handle input sizes that are a multiple of ‘4’, the last row is duplicated:
Similar to section 3.2 in the Niemeyer/Edler publication mentioned above, the actual coding of the excitation pattern matrix P is performed as follows (see also
When compared to the Niemeyer/Edler publication, the excitation pattern encoding processing differs in the steps c), d) and e) listed above. Step c) is performed additionally in the inventive processing. Regarding step d), a re-ordering of the matrix PT coefficients is carried out, which re-ordering is different for different matrix sizes.
Regarding step e), the re-ordering or scanning has two advantages over the Niemeyer/Edler processing:
In step d), a sorting or scanning order for matrix PT for each possible matrix P size has to be provided, e.g. by determining a sorting index under which a corresponding scanning path is stored in a memory of the audio encoder and in a memory of the audio decoder.
In a training phase carried out once for all types of audio signals, statistics for all matrix elements are collected. For that purpose, for example for multiple test matrices for different types of audio signals, the squared values for each matrix entry are calculated and are averaged over the test matrices for each value position within the matrix. The matrix positions are then ordered by decreasing averaged squared amplitude, and this order represents the order of sorting. This kind of processing is carried out for all possible matrix sizes, and a corresponding sorting index is assigned to the sorting order for each matrix size. These sorting indices are used for (automatically) selecting a scan or sorting order in the excitation pattern matrix encoding and decoding process.
As stated in step e) above, the number of values to be encoded is further reduced. From the statistics determined in the training phase, a fixed number of values to be coded is derived: following sorting, only as many values are used as are needed for their energies to add up to a given fraction of the total energy, for example 0.999.
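The training of a scan order for one matrix size can be sketched as follows. The sketch assumes that all test matrices have the same size, that positions are sorted by descending mean squared value, and that ties keep their row-major order; the function name is hypothetical.

```python
def train_scan_order(test_matrices):
    """Return (scan_order, sorted_energies) for one matrix size.

    scan_order: list of (row, column) positions, highest mean energy first.
    sorted_energies: the mean squared values in the same order.
    """
    rows = len(test_matrices[0])
    cols = len(test_matrices[0][0])
    count = len(test_matrices)
    mean_energy = {}
    for r in range(rows):
        for c in range(cols):
            total = sum(m[r][c] ** 2 for m in test_matrices)
            mean_energy[(r, c)] = total / count
    # sorted() is stable, so equal energies keep their row-major order.
    scan_order = sorted(mean_energy, key=lambda pos: -mean_energy[pos])
    return scan_order, [mean_energy[pos] for pos in scan_order]
```

The resulting position list would be stored once per matrix size in both encoder and decoder and selected through the sorting index.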
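How the fixed number of coded values could be derived from the training statistics is sketched below. Here `sorted_energies` is a hypothetical list of averaged squared values that is already in scan order (largest first), and the function name is an assumption.

```python
def count_values_for_energy(sorted_energies, fraction=0.999):
    """Smallest number of leading values whose energy reaches `fraction` of the total."""
    total = sum(sorted_energies)
    if total <= 0.0:
        return 0
    running = 0.0
    for n, energy in enumerate(sorted_energies, start=1):
        running += energy
        if running >= fraction * total:
            return n
    return len(sorted_energies)
```

Because the count depends only on the training statistics, it is fixed per matrix size and does not need to be transmitted.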
In the audio signal encoder, the excitation data matrix code EPM can include the sorting index information. As an alternative which saves overall data rate, at decoder side the matrix size and thereby the sorting index is automatically determined from the number of short windows (signalled by the window type code WT) per frame. The excitation patterns encoded in step/stage 141 are decoded as described below in an excitation pattern decoder step or stage 142. From the decoded excitation patterns for the L blocks the corresponding masking thresholds are calculated in a masking threshold calculator step/stage 143, the output of which is intermediately stored in a buffer 144 that supplies the quantisation and entropy coding stage/step 15 with the current masking threshold for each transform coefficient received from step/stage 12 and buffer 13. The quantisation and entropy coding stage/step 15 supplies bitstream multiplexer 16 with the coded frequency bins CFB.
In the inventive decoder shown in
The excitation pattern data matrix code EPM is decoded in an excitation pattern decoder 242, whereby a correspondingly inverse SPECK processing provides a copy of matrix PTq, a correspondingly inverse scanning provides a copy of transformed-matrix PT, and a correspondingly inverse transform provides reconstructed matrix P for a current block. The excitation patterns of reconstructed matrix P are used in a masking threshold calculation step/stage 243 for reconstructing the masking thresholds for the current block, which are intermediately stored in a buffer 244 and are supplied to stage/step 25.
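The inverse scanning can be illustrated with a small sketch. It assumes that the coded values are available as a flat list in scan order and that positions beyond the fixed number of coded values are set to zero; the SPECK stage and the inverse transform are omitted, and all names are hypothetical.

```python
def scan(matrix, scan_order, num_values):
    """Encoder side: read the first `num_values` entries along the scan path."""
    return [matrix[r][c] for r, c in scan_order[:num_values]]

def inverse_scan(coded_values, scan_order, rows, cols):
    """Decoder side: put coded values back; uncoded positions become 0 (assumption)."""
    matrix = [[0.0] * cols for _ in range(rows)]
    for value, (r, c) in zip(coded_values, scan_order):
        matrix[r][c] = value
    return matrix

# Round trip on a 2x2 example where only 3 of 4 values are coded.
path = [(0, 0), (1, 1), (1, 0), (0, 1)]
original = [[5.0, 0.1], [2.0, 3.0]]
restored = inverse_scan(scan(original, path, 3), path, 2, 2)
```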
The following steps are performed in excitation pattern decoder 242 for reconstructing the excitation patterns (see also
When processing stereo input signals or, more generally, multi-channel signals the correlation between the channels can be exploited in the excitation pattern coding. For example, a synchronised transient detection can be used where all channel signals are processed with the same window type. I.e., for each channel nch an excitation pattern matrix P(nch) of the same size is obtained. The individual matrices can be coded in different multi-channel coding modes k (where in the stereo case L and R denote the data corresponding to the left and right channel):
In the encoder, all three coding modes k can be carried out and the excitation patterns are decoded from the candidate or temporary bit streams resulting in matrices P′ (nch, k). For each multi-channel coding mode k, the distortion d(k) of the applied coding is computed:
From these temporary bit streams the required data amounts s(k) are evaluated in the encoder. Preferably, the coding mode actually used is the one where the minimum of the product d(k)*s(k) is achieved. The corresponding bit stream data of this coding mode are transmitted to the decoder. As further side information, the multi-channel coding mode index k is also transmitted to the decoder.
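The mode decision described above can be sketched as follows, assuming that the distortions d(k) and the data amounts s(k) have already been evaluated for every candidate mode. The dictionary layout and the function name are hypothetical.

```python
def select_coding_mode(distortions, sizes):
    """Return the mode index k that minimises the product d(k) * s(k)."""
    costs = {k: distortions[k] * sizes[k] for k in distortions}
    return min(costs, key=costs.get)

# Three candidate modes with illustrative (made-up) distortions and sizes in bits.
d = {0: 2.0, 1: 1.0, 2: 1.5}
s = {0: 100, 1: 250, 2: 120}
best_mode = select_coding_mode(d, s)
```

In this example the products are 200, 250 and 180, so mode 2 would be transmitted together with its index.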
Number | Date | Country | Kind |
---|---|---|---|
10305295.7 | Mar 2010 | EP | regional |