The present invention is related to audio processing and, particularly, to audio processing operating in a spectral domain using scale parameters for spectral bands. In examples, the invention refers to an apparatus and method for harmonicity-dependent tilt control of scale parameters in an audio encoder. E.g. the invention refers to an encoder or a method for encoding.
In one of the most widely used state-of-the-art perceptual audio codecs, Advanced Audio Coding (AAC) [1-2], spectral noise shaping is performed with the help of so-called scale factors.
In this approach, the MDCT spectrum is partitioned into a number of non-uniform scale factor bands. For example, at 48 kHz, the MDCT has 1024 coefficients and is partitioned into 49 scale factor bands. In each band, a scale factor is used to scale the MDCT coefficients of that band. A scalar quantizer with constant step size is then employed to quantize the scaled MDCT coefficients. At the decoder-side, inverse scaling is performed in each band, shaping the quantization noise introduced by the scalar quantizer.
The 49 scale factors are encoded into the bitstream as side-information. Encoding the scale factors usually requires a significant number of bits, due to the relatively high number of scale factors and the high precision required. This can become a problem at low bitrate and/or at low delay.
In MDCT-based TCX, a transform-based audio codec used in the MPEG-D USAC [3] and 3GPP EVS [4] standards, spectral noise shaping is performed with the help of an LPC-based perceptual filter, the same perceptual filter as used in recent ACELP-based speech codecs (e.g. AMR-WB).
In this approach, a set of 16 LPCs is first estimated on a pre-emphasized input signal. The LPCs are then weighted and quantized. The frequency response of the weighted and quantized LPCs is then computed in 64 uniformly spaced bands. The MDCT coefficients are then scaled in each band using the computed frequency response. The scaled MDCT coefficients are then quantized using a scalar quantizer with a step size controlled by a global gain. At the decoder, inverse scaling is performed in each of the 64 bands, shaping the quantization noise introduced by the scalar quantizer.
This approach has a clear advantage over the AAC approach: it requires the encoding of only 16 (LPC)+1 (global-gain) parameters as side-information (as opposed to the 49 parameters in AAC). Moreover, 16 LPCs can be efficiently encoded with a small number of bits by employing an LSF representation and a vector quantizer. Consequently, the approach of prior art 2 requires fewer side-information bits than the approach of prior art 1, which can make a significant difference at low bitrate and/or low delay.
However, this approach also has some drawbacks. The first drawback is that the frequency scale of the noise shaping is restricted to be linear (i.e. using uniformly spaced bands) because the LPCs are estimated in the time domain. This is disadvantageous because the human ear is more sensitive at low frequencies than at high frequencies. The second drawback is the high complexity required by this approach. The LPC estimation (autocorrelation, Levinson-Durbin), LPC quantization (LPC<->LSF conversion, vector quantization) and LPC frequency response computation are all costly operations. The third drawback is that this approach is not very flexible because the LPC-based perceptual filter cannot be easily modified, and this prevents some specific tunings that would be required for critical audio items.
Some recent work has addressed the first drawback and partly the second drawback of prior art 2. It was published in U.S. Pat. No. 9,595,262 B2 and EP2676266 B1. In this new approach, the autocorrelation (for estimating the LPCs) is no longer performed in the time domain but is instead computed in the MDCT domain using an inverse transform of the MDCT coefficient energies. This allows using a non-uniform frequency scale by simply grouping the MDCT coefficients into 64 non-uniform bands and computing the energy of each band. It also reduces the complexity required to compute the autocorrelation.
WO2019091573 discloses an encoding apparatus in which the scale parameters used for spectral noise shaping require little side information and have low computational complexity. See also the standards [5] and [6]. The scale factors are obtained from the band energies, where the band energies are obtained from the MDCT coefficients and normalized using the band widths. In one aspect of WO2019091573, a pre-emphasis with fixed tilt is applied to the band energies to reduce quantization noise in the low frequencies. A different tilt is optimal for different types of signals, and the fixed tilt of the pre-emphasis is not well suited for all types of signals, requiring a compromise in choosing it; this is not addressed in WO2019091573, [5], [6]. E.g., the optimal pre-emphasis tilt is different for harmonic and for transient-like signals. Further, the MDCT spectrum has temporal variation even for constant-amplitude signals, which leads to temporal variation of the SNS scale factors for such signals; this is also not addressed in WO2019091573, [5], [6]. This variation leads to temporally varying quantization of constant-amplitude signals and thus to quality degradation. The normalization of the band energies by the band widths, as used in WO2019091573, [5], [6], can be avoided by adapting the pre-emphasis tilt, thus achieving a reduction in computational complexity.
WO2019091573 discloses an encoding apparatus in which downsampling of scale parameters is carried out. See also the standards [5] and [6]. However, there is no control of pre-emphasis tilt based on harmonicity.
According to an aspect, there is provided an apparatus for encoding an audio signal, comprising:
The apparatus may be configured to obtain the harmonicity measure of the audio signal as an autocorrelation of the audio signal.
The apparatus may be configured to obtain the harmonicity measure of the audio signal using a normalized autocorrelation measurement of the audio signal.
The apparatus may be such that the harmonicity measure of the audio signal is a value between 0 and a value different from zero, so that a lower harmonicity is closer to 0 than a higher harmonicity.
The scale parameter calculator may be configured so that a comparatively higher value of the harmonicity measure of the audio signal causes a higher tilt value, or a higher pre-emphasis factor, than a comparatively lower value of the harmonicity measure.
The apparatus may be such that the scale parameter calculator is configured to process the amplitude-related measures to verify
is the tilt value, which is an exponent applied to d, where h is fixed, g′≥0 is, or is derived from, the harmonicity measure, gtilt is pre-defined, b is an index indicating the band out of nb+1 bands in such a way that a higher frequency band has a higher index than a lower frequency band, Es(b) is the set of amplitude-related measures and Ep(b) is a pre-emphasized energy per band.
The apparatus may be such that the scale parameter calculator is configured to process the amplitude-related measures by applying the tilt value to be proportional with at least the harmonicity measure, or the pre-emphasis factor to be obtained by raising a constant number with an exponent proportional with at least the harmonicity measure.
The apparatus may be such that the scale parameter calculator is configured to process the amplitude-related measures by applying the tilt value to be proportional with at least an index which increases with higher bands, or the pre-emphasis factor obtained by raising a constant number with an exponent proportional with at least an index which increases with higher bands.
The apparatus may be such that the scale parameter calculator is configured to process the amplitude-related measures by applying the tilt value, or the pre-emphasis factor, to be dependent on the bandwidth of the spectral representation.
The apparatus may be such that the scale parameter calculator is configured to process the amplitude-related measures by applying the tilt value to be linear with the harmonicity measure, or pre-emphasis factor to be obtained by raising a constant number with an exponent which is linear with the harmonicity measure.
The apparatus may be such that the scale parameter calculator is configured to obtain the amplitude-related measures from a squared magnitude spectrum of the audio signal.
The apparatus may be such that each amplitude-related measure is obtained as an integral, or a sum, of squared magnitude values of the spectrum of the audio signal.
The apparatus may be such that the integral, or sum, of the squared magnitude values of the spectrum of the audio signal is not normalized using the width of each band.
The apparatus may be such that the integral, or sum, of the squared magnitude values of the spectrum of the audio signal is obtained, for each index, by a sum of a squared modified discrete cosine transform, MDCT, coefficient and a squared modified discrete cosine transform, MDST, coefficient.
The apparatus may be such that the converter is configured to perform an MDCT transformation and an MDST transformation, to provide MDCT coefficients and MDST coefficients, wherein the amplitude-related measure for each band is obtained as a sum of magnitudes of the MDCT coefficients, or squared versions thereof, and magnitudes of the MDST coefficients, or squared versions thereof.
The apparatus may be such that the scale parameter calculator is configured to control the tilt value or the pre-emphasis factor based on a long term predictor, LTP, parameter or on a long term post-filter, LTPF, parameter as the harmonicity measure of the audio signal.
The apparatus may be such that the scale parameter calculator is configured to control the tilt value or the pre-emphasis factor based on a spectral flatness measure as the harmonicity measure of the audio signal.
The apparatus may be configured to quantize the harmonicity measure of the audio signal, so as to control the tilt value or the pre-emphasis factor based on a quantized version of the harmonicity measure of the audio signal.
The apparatus may further comprise a downsampler to downsample the set of scale parameters, which is a first set of scale parameters, to obtain a second set of scale parameters, which is, or is comprised in, the modified version of the first set of scale parameters, the second set of scale parameters having a second number of scale parameters which is lower than a first number of scale parameters of the first set of scale parameters, wherein the scale parameter encoder is configured to generate the encoded representation of the second set of scale parameters as the encoded representation of the modified version of the scale parameters.
The apparatus may further comprise a downsampler to downsample the set of scale parameters, which is a first set of scale parameters, to obtain a second set of scale parameters, the second set of scale parameters having a second number of scale parameters which is lower than a first number of scale parameters of the first set of scale parameters,
The apparatus may be such that the spectral processor is configured to determine this third set of scale parameters so that the third number is equal to the first number.
The apparatus may be such that the scale parameter calculator is configured to
The apparatus may be such that the spectral processor is configured to use the first set of scale parameters in the linear domain for processing the spectral representation or to interpolate the second set of scale parameters in the log-like domain to obtain interpolated log-like domain scale factors and to transform the log-like domain scale factors into a linear domain to obtain the third set of scale parameters.
The apparatus may be such that the scale parameter calculator is configured to calculate the first set of scale parameters for non-uniform bands, and
The apparatus may be such that the first group of frequency adjacent scale parameters of the first set and the second group of frequency adjacent scale parameters of the first set have at least one scale parameter of the first set in common, so that the first group and the second group overlap with each other.
The apparatus may be such that the downsampler is configured to use an average operation among a group of first scale parameters, the group having two or more members.
The apparatus may be such that the average operation is a weighted average operation configured to weight a scale parameter in a middle of the group stronger than a scale parameter at an edge of the group.
The apparatus may be such that the downsampler is configured to perform a mean value removal so that the second set of scale parameters is mean free.
The apparatus may be such that the downsampler is configured to perform a scaling operation using a scaling factor lower than 1.0 and greater than 0.0 in a log-like domain.
The apparatus may be configured to provide a second set of quantized scale factors associated with the encoded representation, and
The apparatus may be configured to quantize and encode the second set using a vector quantizer, wherein the encoded representation comprises one or more indices for one or more vector quantizer codebooks.
The apparatus may be such that the spectral processor is configured to determine an interpolated scale factor based on a quantized scale factor and a difference between the quantized scale factor and a next quantized scale factor in an ascending sequence of quantized scale factors with respect to frequency.
The apparatus may be such that the spectral processor is configured to determine, from the quantized scale factor and the difference, at least two interpolated scale factors, wherein for each of the two interpolated scale factors, a different weighting factor is used.
The apparatus may be such that the weighting factors increase with increasing frequencies associated with the interpolated scale factors.
The apparatus may be such that the spectral processor is configured to perform an interpolation operation in a log-like domain, and
The apparatus may be such that the scale parameter calculator is configured to calculate an amplitude-related measure for each band to obtain a set of amplitude-related measures, and
The apparatus may be such that the scale parameter calculator is configured to calculate an amplitude-related measure for each band to obtain a set of amplitude-related measures, and
The apparatus may be such that the scale parameter calculator is configured to calculate an amplitude-related measure for each band to obtain the set of amplitude-related measures, and to perform a noise-floor addition operation, wherein a noise floor is calculated from an amplitude-related measure derived as a mean value from two or more frequency bands of the spectral representation.
The apparatus may be such that the scale factor calculator is configured to perform at least one of a group of operations, the group of operations comprising performing a smoothing operation, performing a pre-emphasis operation, performing a noise-floor addition operation, and performing a log-like domain conversion operation to obtain the first set of scale parameters.
The apparatus may be such that the spectral processor is configured to scale spectral values in the spectral representation using the set of scale factors, or modified version thereof, to obtain a scaled spectral representation and to apply a temporal noise shaping (TNS) operation onto the scaled spectral representation, and
The apparatus may be such that the converter uses an analysis window to generate a sequence of blocks of windowed audio samples, and a time-spectrum converter for converting the blocks of windowed audio samples into a sequence of spectral representations, a spectral representation being a spectral frame or a spectrum of the audio signal.
The apparatus may be such that the converter is configured to apply a modified discrete cosine transform, MDCT, operation to obtain an MDCT spectrum from a block of time domain samples, or
In accordance with an aspect, the scale parameter calculator may obtain the set of amplitude-related measures as a set of energies per band.
In accordance with an aspect, the converter may perform a first conversion to obtain a first part of the spectral representation and a second conversion to obtain a second part of the spectral representation, or the converter may be configured to perform a single conversion to obtain the spectral representation that has a first part of the spectral representation and a second part of the spectral representation, wherein the first part of the spectral representation is provided to the spectral processor, and the second part of the spectral representation is not processed by the spectral processor, and both the first part of the spectral representation and the second part of the spectral representation are provided to the scale parameter calculator to calculate the set of scale parameters based on both the first part of the spectral representation and the second part of the spectral representation.
In accordance with an aspect, the first part of the spectral representation is formed by MDCT coefficients and the second part of the spectral representation is formed by MDST coefficients, or the first part of the spectral representation is formed by MDST coefficients and the second part of the spectral representation is formed by MDCT coefficients.
In accordance with an aspect, the scale parameter calculator may be configured to obtain the amplitude-related measures from the first part of the spectral representation, squared, summed with the second part of the spectral representation, squared.
In accordance with an aspect, the scale parameter calculator is configured to obtain the amplitude-related measures from the spectral representation, or at least part thereof.
In accordance with an aspect, there is provided a method for encoding an audio signal, comprising:
In accordance with an aspect, there is provided a non-transitory storage unit storing instructions which, when executed by a processor, cause the processor to control or perform the method above.
Preferred embodiments of the present invention are subsequently described in more detail with respect to the accompanying drawings, in which:
It is anticipated that, with the present invention, a pre-emphasis is performed which is dependent on the harmonicity of the audio signal.
Throughout the specification, the term “scale factor” or “scale parameter” (171) is used in order to refer to the same parameter or value, i.e., a value or parameter that is, subsequent to some processing, used for weighting some kind of spectral values (mathematically, it will be expressed as gSNS, for example). This weighting, when performed in the linear domain, is actually a multiplication with a scaling factor. However, when the weighting is performed in a logarithmic domain, the weighting operation with a scale factor is done by an actual addition or subtraction operation. Thus, in terms of the present application, scaling does not only mean multiplying or dividing but also means, depending on the domain, addition or subtraction, or generally means each operation by which the spectral value, for example, is weighted or modified using the scale factor or scale parameter.
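By way of a small numeric illustration only (the function names are illustrative and not taken from the specification): weighting a spectral value in the linear domain is a multiplication, while the same weighting performed in a log2 domain is an addition.

```python
import math

def weight_linear(spectral_value: float, g_sns: float) -> float:
    """Weight a spectral value in the linear domain (multiplication)."""
    return spectral_value * g_sns

def weight_log2(log2_spectral_value: float, log2_g_sns: float) -> float:
    """Apply the same weighting in the log2 domain (addition)."""
    return log2_spectral_value + log2_g_sns

x, g = 8.0, 2.0
linear = weight_linear(x, g)
via_log = 2.0 ** weight_log2(math.log2(x), math.log2(g))
assert abs(linear - via_log) < 1e-9  # both routes give the same weighted value
```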
The downsampler 130 (optional) of the apparatus 300a (or 300c or 300e) of
Furthermore, the spectral processor 120 of the apparatus 300a (or 300c or 300e) of
Thus, the encoded representation 140″ of the second set of scale factors 172 that is output by block 140 either comprises a codebook index for a preferably used scale parameter codebook or a set of corresponding codebook indices. In other embodiments, the encoded representation comprises the quantized scale parameters or quantized scale factors that are obtained when the codebook index, the set of codebook indices or, generally, the encoded representation is input into a decoder-side vector decoder or any other decoder. Preferably, the spectral processor 120 uses the same set of scale factors that is also available at the decoder-side, i.e., uses the quantized second set of scale parameters together with an interpolation operation to finally obtain the third set of scale factors.
In a preferred, optional embodiment, the third number of scale factors in the third set of scale factors 173 is equal to the first number of scale factors. However, a smaller number of scale factors is also useful. For example, one could derive 64 scale factors (171) in block 110, and the downsampler (130) could then downsample the 64 scale factors (171) to 16 scale factors (140′) for transmission. Then, an interpolation could be performed, not necessarily to 64 scale factors, but to 32 scale factors (e.g. the third number of scale factors) in the spectral processor 120. Alternatively, one could perform an interpolation to an even higher number (e.g. a third number, which may be equal to or greater than the first number), such as more than 64 scale factors as the case may be, as long as the number of scale factors transmitted in the encoded output signal (170) is smaller than the number of scale factors calculated in block 110 or calculated and used in block 120 of
In general terms, the scale factors may be provided to the spectral processor 120 either in their native form (e.g. 171) or in a modified version (e.g., 172 or 173). In
Summarizing, in the examples of
Preferably, the scale factor calculator 110 is configured to perform several operations illustrated in
A further operation performed by the scale factor calculator can be an optional inter-band smoothing 112. This inter-band smoothing is preferably used to smooth out possible instabilities that can appear in the vector of amplitude-related measures obtained in step 111. Without this smoothing, these instabilities would be amplified when converted to a log-domain later, as illustrated at 115, especially for spectral values where the energy is close to 0. However, in other embodiments, inter-band smoothing is not performed.
An operation according to the invention, performed by the scale factor calculator 110, is the pre-emphasis operation 113. This pre-emphasis operation has a similar purpose to a pre-emphasis operation used, e.g., in an LPC-based perceptual filter of the MDCT-based TCX processing as discussed before with respect to the prior art. This procedure increases the amplitude of the shaped spectrum in the low frequencies, which results in reduced quantization noise in the low frequencies.
The pre-emphasis operation 113 makes use of a harmonicity measure 163 as acquired. The scale parameter calculator 110 calculates an amplitude-related measure for each band (e.g. the energy per band) to obtain a set of amplitude-related measures. The scale parameter calculator performs the pre-emphasis operation (113) on the set of amplitude-related measures, so that low-frequency amplitudes are emphasized with respect to high-frequency amplitudes according to a pre-emphasis factor controlled by a tilt value, which in turn is based on a harmonicity measure (163) of the audio signal 160. According to a first option, the harmonicity measure 163 may be obtained (e.g. in the frequency domain) downstream of the converter 100, e.g. as a spectral flatness measure, e.g. calculated from the spectral representation 100′. According to a second option, the harmonicity measure 163 may be obtained from the input audio signal 160 (e.g. in the time domain), upstream of the converter 100.
A further optional processing operation is the noise-floor addition processing 114. This procedure improves the quality of signals containing very high spectral dynamics, such as, for example, Glockenspiel, by limiting the amplitude amplification of the shaped spectrum in the valleys. This has the indirect effect of reducing the quantization noise in the peaks, at the cost of an increase of quantization noise in the valleys, where the quantization noise is anyway not perceptible due to masking properties of the human ear such as the absolute listening threshold, the pre-masking, the post-masking or the general masking threshold. These indicate that, typically, a rather low-volume tone relatively close in frequency to a high-volume tone is not perceptible at all, i.e., is fully masked, or is only roughly perceived by the human hearing mechanism, so that this spectral contribution can be quantized quite coarsely. The noise-floor addition operation 114, however, does not necessarily have to be performed.
Furthermore, block 115 indicates a log-like domain conversion. Preferably, a transformation of an output of one of blocks 111, 112, 113, 114 in
As illustrated in
Furthermore, the optional downsampler 130 additionally performs a mean value removal 133 and an additional scaling step 134. However, the low-pass filtering operation 131, the mean value removal step 133 and the scaling step 134 are only optional steps. Thus, the downsampler 130 illustrated in
Thus, it is made sure that the second set of scale factors is the same quantized second set of scale factors that is also available on the decoder-side, i.e., in the decoder that only receives the encoded audio signal that has the one or more indices per frame as output by block 141 via line 146.
Finally, the spectral processor 120 has a scalar quantizer/encoder 125 that is configured for receiving a single global gain for the whole spectral representation, i.e., for a whole frame. Preferably, the global gain is derived depending on certain bitrate considerations. Thus, the global gain is set so that the encoded representation of the spectral representation generated by block 125 fulfils certain requirements such as a bitrate requirement, a quality requirement or both. The global gain can be calculated iteratively or in a feed-forward manner, as the case may be. Generally, the global gain is used together with a quantizer, and a high global gain typically results in a coarser quantization whereas a low global gain results in a finer quantization. Thus, in other words, a high global gain results in a higher quantization step size while a low global gain results in a smaller quantization step size when a fixed quantizer is used. However, other quantizers can be used as well together with the global gain functionality, such as a quantizer that has some kind of compression functionality for high values, i.e., some kind of non-linear compression functionality so that, for example, the higher values are more compressed than lower values. The above dependency between the global gain and the quantization coarseness is valid when the global gain is multiplied with the values before the quantization in the linear domain, corresponding to an addition in the log domain. If, however, the global gain is applied by a division in the linear domain, or by a subtraction in the log domain, the dependency is the other way round. The same is true when the “global gain” represents an inverse value. Alternatively, the spectral processor 120 may use a vector quantizer instead of the scalar quantizer.
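A minimal sketch of a scalar quantizer whose step size is controlled by a global gain is given below. The convention that the global gain directly plays the role of the quantization step size is an illustrative assumption, not necessarily the codec's exact mapping; the function names are likewise illustrative.

```python
def quantize_spectrum(spectrum, global_gain):
    """Scalar quantization with a step size controlled by the global gain.

    Illustrative convention: dividing by the global gain before rounding
    means a larger gain yields a larger effective step size (coarser
    quantization), a smaller gain a finer one.
    """
    return [round(x / global_gain) for x in spectrum]

def dequantize_spectrum(quantized, global_gain):
    """Decoder-side inverse: multiply the integer levels back by the gain."""
    return [q * global_gain for q in quantized]

spectrum = [0.3, -1.2, 2.7, 0.05]
coarse = quantize_spectrum(spectrum, 1.0)   # step size 1.0
fine = quantize_spectrum(spectrum, 0.25)    # step size 0.25
```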
Subsequently, preferred implementations of the individual procedures described with respect to
Here below, non-limiting examples are provided.
The signal 160 may be converted into multiple bands (e.g. NB bands, e.g. NB=64, but a different number can be used). For each band, a plurality of spectral lines (indexes, bins) may be defined. An example is shown in
The mapping from b to Ind(b) is shown in
The energies per band EB(b) are computed as follows:
where X(k) are the MDCT (or MDST) coefficients (or other coefficients, e.g. coefficients of another real-valued transform), NB=64 is the number of bands, k (with k between Ind(b) and Ind(b+1)−1) are the indexes of the spectral lines of the b-th band, and Ind(b+1)−Ind(b) is the width (in number of coefficients) of the b-th band (the energy per band is therefore normalized by the width of each band). The bands may be non-uniform and follow the perceptually-relevant Bark scale (smaller in low frequencies, larger in high frequencies). The lowest band is b=0, the highest b=63=NB−1 (as explained above, NB may be different from 64).
It is to be noted that a version without dividing by the number of the indexes may also be performed. Hence, the formula
may be substituted by:
(Notably, the energy per band is not normalized by the width of the band).
In addition or in the alternative, it is possible to compute the energies per band EB(b) as:
where X(k) are the MDCT coefficients, Y(k) are the MDST coefficients at the same k-th index, NB=64 is the number of bands, k (with k between Ind(b) and Ind(b+1)−1) are the indexes of the spectral lines of the b-th band, and Ind(b+1)−Ind(b) is the width (in number of coefficients) of the b-th band (the energy per band is therefore normalized by the width of the band). The bands may be non-uniform and follow the perceptually-relevant Bark scale (smaller in low frequencies, larger in high frequencies). The lowest band is b=0, the highest b=63=NB−1 (as explained above, NB may be different from 64).
As above, the formula
may be substituted by
It is to be noted that the formula EB(b) = Σ_{k=Ind(b)}^{Ind(b+1)−1} [X(k)² + Y(k)²], for b = 0 … NB−1, can be understood, in pseudocode, as:
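By way of non-limiting illustration, the unnormalized band-energy computation can be sketched in Python as follows (the function name, the toy band limits and the coefficient values are illustrative assumptions, not taken from the specification):

```python
def energies_per_band(X, Y, Ind):
    """Energy per band from MDCT (X) and MDST (Y) coefficients.

    Ind[b] .. Ind[b+1]-1 are the spectral line indexes of band b; with
    an MDCT and an MDST coefficient at the same index, no normalization
    by the band width is applied. Ind must have NB+1 entries for NB bands.
    """
    NB = len(Ind) - 1
    EB = [0.0] * NB
    for b in range(NB):
        for k in range(Ind[b], Ind[b + 1]):
            EB[b] += X[k] ** 2 + Y[k] ** 2
    return EB

# Toy example with NB=2 bands over 4 spectral lines (band limits assumed):
X = [1.0, 2.0, 0.0, 1.0]
Y = [0.0, 1.0, 2.0, 0.0]
Ind = [0, 2, 4]
print(energies_per_band(X, Y, Ind))  # [6.0, 5.0]
```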
The energy per band EB(b) is smoothed using
Remark: this step is mainly used to smooth the possible instabilities that can appear in the vector EB(b). If not smoothed, these instabilities are amplified when converted to log-domain (see step 5), especially in the valleys where the energy is close to 0.
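As a non-limiting illustration, such an inter-band smoothing can be sketched as follows; the specific symmetric 3-tap kernel (0.25, 0.5, 0.25) and the clamping at the edge bands are assumptions for illustration, not the formula of the specification:

```python
def smooth_band_energies(EB, w_side=0.25, w_center=0.5):
    """Inter-band smoothing sketch: each band energy is averaged with its
    neighbors; the edge bands are clamped (illustrative assumption)."""
    NB = len(EB)
    ES = [0.0] * NB
    for b in range(NB):
        left = EB[max(b - 1, 0)]
        right = EB[min(b + 1, NB - 1)]
        ES[b] = w_side * left + w_center * EB[b] + w_side * right
    return ES
```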
The smoothed energy per band ES(b) is then pre-emphasized using
where gtilt (which in some examples can be avoided) controls the pre-emphasis tilt and depends on the bandwidth of the audio signal 160 (the bandwidth of the spectral representation, i.e. the sampling frequency corresponding to the spectral representation). gtilt is, for example, 15.75 for 8 kHz bandwidth of the spectral representation, 18.375 for 12.8 kHz bandwidth of the spectral representation, and 19.5 for 16 kHz bandwidth of the spectral representation. g′ is the harmonicity measure 163.
A more general definition may be
where the
is an exponent applied to d>0; h is fixed (e.g. equal to 10), g′≥0 is, or is derived from, the harmonicity measure 163, gtilt is pre-defined (e.g. dependent on the spectral representation bandwidth), b is an index indicating the band out of NB bands (e.g. NB=64 or a different number).
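A possible sketch of such a harmonicity-dependent pre-emphasis is given below. The exact exponent used here is an assumption for illustration (linear in the band index b and in the harmonicity measure g′, scaled by the pre-defined gtilt and the fixed constant h); it is not asserted to be the formula of the specification. A higher g′ yields a higher tilt, i.e. a stronger pre-emphasis.

```python
def pre_emphasis(ES, g_prime, g_tilt, d=10.0, h=10.0):
    """Harmonicity-dependent pre-emphasis sketch.

    ES      : smoothed energies per band
    g_prime : harmonicity measure (>= 0), controls the tilt
    g_tilt  : pre-defined tilt constant (bandwidth-dependent)
    d, h    : base and fixed constant (assumed values for illustration)
    """
    NB = len(ES)
    EP = [0.0] * NB
    for b in range(NB):
        tilt = (b * g_prime * g_tilt) / (h * (NB - 1))  # assumed exponent form
        EP[b] = ES[b] * d ** tilt
    return EP
```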
It is here distinguished between:
The pre-emphasis used in this step may have an analogous purpose to the pre-emphasis used in the LPC-based perceptual filter of prior art 2: it increases the amplitude of the shaped spectrum in the low frequencies, resulting in reduced quantization noise in the low frequencies.
A noise floor at −40 dB is added to EP(b) using
EP(b) = max(EP(b), noiseFloor) for b = 0 … 63
with the noise floor being calculated by
This step improves the quality of signals containing very high spectral dynamics, such as, e.g., glockenspiel, by limiting the amplitude amplification of the shaped spectrum in the valleys, which has the indirect effect of reducing the quantization noise in the peaks, at the cost of an increase of quantization noise in the valleys, where it is anyway not perceptible.
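As a non-limiting sketch, such a noise-floor addition can be illustrated as follows; deriving the floor from the mean band energy is an assumption made here for illustration, only the −40 dB figure is taken from the text:

```python
def add_noise_floor(EP, floor_db=-40.0):
    """Noise-floor addition sketch: clamp each pre-emphasized band energy
    to a floor derived from the mean energy (illustrative assumption)."""
    mean_energy = sum(EP) / len(EP)
    noise_floor = mean_energy * 10.0 ** (floor_db / 10.0)
    return [max(e, noise_floor) for e in EP]
```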
A transformation into the logarithm domain is then performed using
The vector EL(b) is then downsampled by a factor of 4 using
This step applies a low-pass filter (w(k)) on the vector EL(b) before decimation. This low-pass filter has a similar effect as the spreading function used in psychoacoustic models: it reduces the quantization noise at the peaks, at the cost of an increase of quantization noise around the peaks where it is anyway perceptually masked.
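By way of illustration, the log-like conversion and the downsampling by a factor of 4 can be sketched together as follows; the log scaling factor, the low-pass kernel and the edge clamping are illustrative assumptions, not the formulas of the specification:

```python
import math

def log2_domain(EP, eps=1e-31):
    """Log-like domain conversion sketch (base-2 logarithm of the band
    energies; the factor 0.5 is an assumption for illustration)."""
    return [0.5 * math.log2(max(e, eps)) for e in EP]

def downsample_by_4(EL, w):
    """Low-pass filter EL with kernel w, then keep every 4th value
    (e.g. 64 bands -> 16 downsampled scale parameters).
    The kernel w and the edge handling are assumptions."""
    NB = len(EL)
    half = len(w) // 2
    filtered = []
    for b in range(NB):
        acc = 0.0
        for i, wk in enumerate(w):
            idx = min(max(b + i - half, 0), NB - 1)  # clamp at the edges
            acc += wk * EL[idx]
        filtered.append(acc)
    return filtered[::4]
```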
The final scale factors are obtained after mean removal and scaling by a factor of 0.85
Since the codec has an additional global gain, the mean can be removed without any loss of information. Removing the mean also allows for more efficient vector quantization.
The scaling by 0.85 slightly compresses the amplitude of the noise shaping curve. It has a perceptual effect similar to that of the spreading function mentioned in Step 6: reduced quantization noise at the peaks and increased quantization noise in the valleys.
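The mean removal and fixed scaling can be sketched directly from the description:

```python
import numpy as np

def finalize_scale_factors(scf, compression=0.85):
    """Remove the mean and compress the noise-shaping curve.

    The mean can be removed without loss because the codec carries a
    separate global gain; the factor 0.85 slightly compresses the
    amplitude of the curve.
    """
    scf = np.asarray(scf, dtype=float)
    return compression * (scf - scf.mean())
```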
The scale factors are quantized using vector quantization, producing indices which are then packed into the bitstream and sent to the decoder, and quantized scale factors scfQ(n).
The quantized scale factors scfQ(n) are interpolated using
and transformed back into linear domain using
gSNS(b) = 2^scfQint(b) for b = 0 . . . 63
Interpolation is used to get a smooth noise shaping curve and thus to avoid any big amplitude jumps between adjacent bands.
The SNS scale factors gSNS(b) are applied on the MDCT frequency lines for each band separately in order to generate the shaped spectrum Xs(k)
This expression can be read, in pseudocode, as:
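The referenced pseudocode is not reproduced in this excerpt. A hypothetical sketch of the per-band scaling follows, assuming an array `ind` of band boundaries such that band b covers the lines Ind(b) . . . Ind(b+1)−1, and using division at the encoder and multiplication at the decoder (as stated for SNSE/SNSD later in this text):

```python
import numpy as np

def sns_shape(X, g_sns, ind, encode=True):
    """Apply the SNS scale factors per band.

    X      : MDCT spectrum
    g_sns  : one scale factor per band, g_sns[b]
    ind    : band boundaries; band b covers k = ind[b] .. ind[b+1]-1
    encode : True -> divide (SNSE, encoder side);
             False -> multiply (SNSD, decoder side)
    """
    Xs = np.array(X, dtype=float)
    for b in range(len(g_sns)):
        sl = slice(ind[b], ind[b + 1])
        Xs[sl] = Xs[sl] / g_sns[b] if encode else Xs[sl] * g_sns[b]
    return Xs
```

Applying the encoder scaling followed by the decoder scaling recovers the original spectrum, which is the round-trip property the SNSE/SNSD pair relies on.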
The scale parameters gSNS(b) may be outputted by block 122 and used by block 123 in
Preferably, the scale factor decoder 220 is configured to operate in substantially the same manner as has been discussed with respect to the spectral processor 120 of
Furthermore, the spectrum decoder 210 illustrated in
Further procedures of preferred embodiments of the decoder are discussed subsequently.
The vector quantizer indices produced in encoder step 8 are read from the bitstream and used to decode the quantized scale factors scfQ(n).
Same as Encoder Step 9.
The SNS scale factors gSNS(b) are applied on the quantized MDCT frequency lines for each band separately in order to generate the decoded spectrum {circumflex over (X)}(k) as outlined by the following code.
{circumflex over (X)}(k)=XQ(k)·gSNS(b) for k=Ind(b) . . . Ind(b+1)−1, for b=0 . . . 63
Preferably the additional tool TNS between Spectral Noise Shaping (SNS) and quantization/coding (see block diagram below) is used. TNS (Temporal Noise Shaping) also shapes the quantization noise but does a time-domain shaping (as opposed to the frequency-domain shaping of SNS) as well. TNS is useful for signals containing sharp attacks and for speech signals.
TNS is usually applied (in AAC, for example) between the transform and SNS. Here, however, it is preferred to apply TNS to the shaped spectrum. This avoids some artifacts that were produced by the TNS decoder when operating the codec at low bitrates.
Along the x-axis, the index for the bands 0 to 63 is given. Particularly, there are 64 bands going from 0 to 63.
The 16 downsample points corresponding to scfQ(i) are illustrated as vertical lines 1100. Particularly,
Correspondingly, the second block of four bands is (4, 5, 6, 7), and the middle point of the second block is 5.5.
The windows 1110 correspond to the windows w(k) discussed with respect to the step 6 downsampling described before. It can be seen that these windows are centered at the downsampled points and there is the overlap of one block to each side as discussed before.
The interpolation step 222 of
The position of the second band is calculated as a function of the two vertical lines around it (1.5 and 5.5): 2 = 1.5 + (⅛)·(5.5 − 1.5).
Correspondingly, the position of the third band is calculated as a function of the two vertical lines 1100 around it (1.5 and 5.5): 3 = 1.5 + (⅜)·(5.5 − 1.5).
A specific procedure is performed for the first two bands and the last two bands. For these bands, an interpolation cannot be performed, because vertical lines 1100 (or values corresponding to vertical lines 1100) do not exist outside the range going from 0 to 63. Thus, in order to address this issue, an extrapolation is performed, as described with respect to step 9 (interpolation) above, for the two bands 0 and 1 on the one hand and the two bands 62 and 63 on the other hand.
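The interpolation geometry described above (control points at the block centers 1.5 + 4·i, weights ⅛, ⅜, ⅝, ⅞ within each interval, and linear extrapolation for the outermost bands) can be sketched as follows; the exact edge handling of the codec is an assumption beyond what the text states:

```python
import numpy as np

def interpolate_scf(scf_q, n_bands=64):
    """Interpolate 16 quantized scale factors to n_bands values.

    Control points sit at the centers 1.5 + 4*i of the blocks of four
    bands; bands outside the outermost centers (0, 1 and 62, 63) are
    linearly extrapolated.
    """
    scf_q = np.asarray(scf_q, dtype=float)
    centers = 1.5 + 4.0 * np.arange(len(scf_q))
    b = np.arange(n_bands, dtype=float)
    # Index of the left control point; clamping makes the edges
    # extrapolate instead of interpolate.
    i = np.clip(((b - centers[0]) // 4).astype(int), 0, len(scf_q) - 2)
    t = (b - centers[i]) / 4.0        # may be < 0 or > 1 at the edges
    return scf_q[i] + t * (scf_q[i + 1] - scf_q[i])
```

For band 2 this reproduces the weight ⅛ between the points at 1.5 and 5.5, matching the worked example in the text.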
Subsequently, a preferred implementation of the converter 100 of
Particularly,
The converter 100 on the encoder-side is preferably implemented to perform a framing with overlapping frames such as a 50% overlap so that frame 2 overlaps with frame 1 and frame 3 overlaps with frame 2 and frame 4. However, other overlaps or a non-overlapping processing can be performed as well, but it is preferred to perform a 50% overlap together with an MDCT algorithm. To this end, the converter 100 comprises an analysis window 101 and a subsequently-connected spectral converter 102 for performing an FFT processing, an MDCT processing or any other kind of time-to-spectrum conversion processing to obtain a sequence of frames corresponding to a sequence of spectral representations as input in
Correspondingly, the scaled spectral representation(s) are input into the converter 240 of
Here, the pre-emphasis as controlled by the tilt (and in turn by the harmonicity measure 163 of the signal 160), e.g. at block 113, is discussed. While reference is often made to the "energy per band", this concept can be generalized to an "amplitude-related measure".
The pre-emphasis (113) has the purpose of increasing the amplitude of the spectrum (e.g. as outputted by the converter 100, or a processed version thereof) in the low frequencies, resulting in reduced quantization noise in the low frequencies. By using the harmonicity measure 163 to control the pre-emphasis spectral tilt, the increase of the amplitude of the spectrum in the low frequencies is controlled as a function of the harmonicity 163 of the audio signal 160. So, in general terms, if the audio signal 160 is highly harmonic (e.g. is mostly speech, such as normally voiced speech), the amplitude of the spectrum is increased at the low frequencies with respect to the high frequencies (e.g. mostly noise), whose amplitude is decreased.
If the audio signal 160 is weakly harmonic (e.g. is mostly noise), then:
Due to the present techniques, the spectral tilt increases or decreases based on the harmonicity measure 163, so that a lower harmonicity (163) implies a lower tilt and a higher harmonicity (163) implies a higher tilt.
An example of a pre-emphasis using a spectral tilt is provided by the above shown formula
with g′ being a representation of the harmonicity measure 163. That formula may be generalized into
where the tilt value is an exponent applied to d, h is fixed, g′≥0 is, or is derived from, the harmonicity measure, gtilt≠0 is pre-defined, and b is an index indicating the band out of NB bands. In the pre-emphasis factor (or, more frequently, in its exponent), the tilt value is an exponent applied e.g. to d (e.g. d=10); h (e.g. h=10) is fixed; g′≥0 (e.g. 0≤g′≤1) is, or is derived from, the harmonicity; gtilt≠0 is pre-defined and may be, in general, dependent on the bandwidth of the spectral representation 100′ (e.g. gtilt may be higher for higher bandwidths); and b is an index indicating the band out of nb+1 bands. A more common notation is the same as before, but with the exponentiation having base 10. The pre-emphasis is then applied to the spectral energy ES(b), so as to obtain pre-emphasized energy information
The bands are in general indexed by an index b, which may range from a lower index (e.g. 0), indicating a low frequency (e.g. DC in the case of 0), up to a maximum index (e.g. equal to nb, which may be, for example, 63 in the case that the signal is subdivided into NB=64 bands, although different numbers of bands are possible).
In examples, the higher the harmonicity, the higher g′ (which may be understood as the harmonicity measure 163), the higher the spectral tilt, and the more precisely the lower frequencies of the shaped spectrum are quantized with respect to the higher frequencies.
Often the control of the spectral tilt reduces the spectral tilt with respect to a fixed value, since a weight between 0 and 1 may be applied (e.g., 0 for a completely noisy or transient signal, and 1, or another value greater than zero, for a totally harmonic signal). This may be obtained, for example, by weighting the tilt using a normalized value, such as a normalized harmonicity. Therefore, g′ may be a normalized version of the harmonicity 163 and may take a value between 0 and 1.
The harmonicity measure 163 may be obtained by autocorrelating the input signal 160, e.g. along one frame. It may be, for example, a normalized autocorrelation.
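A normalized autocorrelation at a candidate pitch lag is one way to obtain such a measure; the sketch below assumes this simple formulation, whereas the codec may derive g′ differently (e.g. from an LTPF parameter):

```python
import numpy as np

def harmonicity_measure(x, lag):
    """Normalized autocorrelation of a frame at a given pitch lag.

    Returns a value in [-1, 1]; values near 1 indicate high
    harmonicity. A sketch only -- the codec's actual derivation of
    g' is not given in this excerpt.
    """
    x = np.asarray(x, dtype=float)
    a, b = x[lag:], x[:-lag]
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0
```

A strictly periodic signal evaluated at its own period yields a value near 1; white noise yields a value near 0, so the measure can directly serve as a weight between 0 and 1 for the tilt.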
It is to be noted that the scaling of the spectral (pre-emphasis) tilt might slightly increase the computational effort, since a scaling by g′ (the harmonicity or, more generally, a harmonicity-related measurement) is carried out, for example. However, when calculating
for b=0 . . . nb, it is possible to calculate the exponent
by:
Therefore, the increase in computational effort is not overwhelming. Indeed, gtilt and g′ are the same for all the bands in the whole frame.
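Since gtilt and g′ are constant over the frame, the per-band exponents can be obtained with a single multiplication (or addition) per band. The sketch assumes the exponent form b·g′·gtilt/(nb·h) discussed above, which is itself an assumed formulation:

```python
import numpy as np

def tilt_exponents(n_bands, g_prime, g_tilt, h=10.0):
    """Per-band tilt exponents with one addition per band.

    The per-frame constant c = g' * g_tilt / (nb * h) is computed
    once; the exponent for band b is then accumulated as b * c.
    """
    nb = n_bands - 1
    c = g_prime * g_tilt / (nb * h)   # frame constant
    exps = np.empty(n_bands)
    acc = 0.0
    for b in range(n_bands):
        exps[b] = acc                 # exponent for band b equals b*c
        acc += c
    return exps
```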
Reference can now be made to Step 10 (spectral shaping). The calculation of the scale factors gSNS(b) to be used for shaping the signal is therefore controlled by the spectral tilt value. It is noted that:
The pre-emphasis applied by the processor 110 may therefore increase the amplitude of the shaped spectrum in the low frequencies, resulting in reduced quantization noise in the low frequencies. Using e.g. the harmonicity measure 163 to control the spectral tilt permits adapting the strength of this effect to the features of the audio signal 160. So, if the audio signal 160 is highly harmonic, the effect is an increase of the amplitude of the shaped spectrum at the low frequencies, so that there is reduced quantization noise; and if the mono audio signal 602 is non-harmonic, a less strong spectral tilt is applied to the shaped energies (the lower-frequency part of the spectrum is not amplified too much, or not at all, compared to the higher frequencies), hence permitting quantization that is more even over the whole spectrum.
Several techniques may be used for acquiring the harmonicity measure 163. For example, the harmonicity measure 163 may be an LTP (long term predictor) parameter or an LTPF (long term post-filter) parameter, so that the scale parameter calculator 110 controls the tilt based on an LTP parameter or an LTPF parameter as the harmonicity measure of the audio signal 160. In addition or as an alternative, the harmonicity measure 163 may be a spectral flatness measure of the audio signal 160, so that the scale parameter calculator 110 controls the tilt based on the spectral flatness measure. Notably, the harmonicity measure 163 may be encoded anyway in the output encoded audio signal 170, e.g. for other reasons (e.g., as an LTP or LTPF parameter and/or for the purpose of performing packet loss concealment at the decoder). For this reason, the harmonicity measure 163 may be quantized, and its quantized version may be used as g′ for controlling the tilt. In addition or as an alternative, the harmonicity measure 163 may be normalized (e.g. between 0 and 1, as discussed above), so as to control the tilt based on a normalized version of the harmonicity measure 163 of the audio signal 160.
As explained above, the obtained set of amplitude-related measures may be preferably a set of energies per bands Ep(b), e.g. obtained through Ep(b)=ES(b).
where the tilt value is an exponent applied to d (e.g. d>1), h (e.g. h>0) is fixed, g′≥0 is the harmonicity measure, gtilt>0 is pre-defined, and b is an index indicating the band out of NB bands, in such a way that a higher frequency band has a higher index than a lower frequency band.
In general terms, the tilt value
may be proportional (or at least linear) with the harmonicity measure (g′, 163 in the figures), where the proportionality coefficient may be
Further, the tilt value may be proportional (or, more generally, linear) with the coefficient b/nb, which is 0 for b=0 and 1 for the highest band, and which increases with the frequency of the generic b-th band (the increase is linear in the index b, but not necessarily in the frequency).
Moreover, the tilt value may be proportional (or, more generally, linear) with gtilt (which in some examples can be avoided), which is controlled by the sampling rate (e.g. gtilt may be higher for higher sampling rates, as in the examples above).
Therefore, the tilt value may be proportional to (or at least linear with) at least one of:
An example of how to perform the pre-emphasis (e.g. at 113) is shown in
The codec IVA is built upon the MDCT-based TCX from EVS [8, 9] with adaptations of some technologies from LC3 [5,6]. It has been proven in EVS and LC3 that an MDCT-based codec with HPF can code well a wide range of mono signals, including clean speech, at bitrates of 48 kbps and bandwidth of at least 16 kHz [5,6,8,9]. This codec structure can be extended by adding new or modifying existing technologies, so that coding of a wide range of signals may be possible even at lower bitrates.
Some examples and further aspects are proposed here below.
The MDCT-based codec processes its input signal in frames of e.g. 20 ms, the length in samples being the frame length HM. The other components follow the same framing and have an update rate of 50 frames per second for the coded parameters.
The input signal x1 (160), which is the original audio signal e.g. processed with a 20 Hz high-pass/DC rejection filter, is input to the pitch analysis and to the pulse extraction. In the pitch analysis, denoted as "Get pitch contour", a pitch contour is obtained. The pitch contour steers many parts of the codec, including the pulse extraction and the HPF, and tracks the pitch change over time. Pitch is here used as a term for the fundamental frequency or the pitch lag, where the pitch lag is also known as the fundamental period and is the inverse of the fundamental frequency. Viewing pitch as a frequency or as a period is equivalent when describing the concepts of the implemented algorithms. When pitch is used for filtering in the TD, it is the corresponding pitch lag, expressed in number of samples, that is used. Whenever the exact representation is important, pitch will be denoted as the pitch lag, the fundamental frequency or the frequency bin index.
The pulse extraction extracts pulses from x1 and codes them, where a pulse may be a glottal pulse or any other kind of transient. The extracted pulses xP
The decoded output xo of the MDCT-based codec is filtered via an HPF. The HPF produces xH, in which noise between harmonics is suppressed.
A waveform xP
The encoder (e.g.
The pitch contour is obtained and coded for frames with high harmonicity. For frames with low harmonicity, it is signaled with one bit that the pitch contour is not present. The pitch analysis also outputs the high frequency tonality flag ϕH and a flag indicating if constant parameters should be used for the HPF across the whole frame.
The pulse extraction obtains a short-time Fourier transform (STFT) of the input audio signal, and uses a non-linear magnitude spectrogram and the phase spectrogram of the STFT to find and extract pulses. Each pulse has a waveform with high-pass characteristics and the energy concentrated in the temporal center. The pulse residual signal xM is obtained by subtracting a signal consisting of pulses xP
The pulse residual signal xM is windowed and transformed via the MDCT to produce the spectrum XM. A spectrum consists of frequency coefficients, also named frequency bins or lines. The length of XM, and of all other spectra derived from XM, is equal to the frame length HM. The MDCT window is chosen among three windows as in EVS, with the difference that all three windows are symmetrical low overlap windows and that the overlap is longer. The low overlap window is equal to the square root of the Tukey window. The three windows used have lengths of 30 ms, 25 ms and 21.25 ms, with the corresponding overlap regions having lengths of 10 ms, 5 ms and 1.25 ms. The longer window allows better energy compaction for tonal signals compared to EVS. The decision as to which window to use is basically the same as in EVS, with the difference that there is only an overlap change and no splitting of the frame into sub-frames. This means that the overlap choice is made so that an increase of energy occurs in the non-overlapping part of the MDCT window, thus avoiding problems with the MDCT unfolding and TNS. In the initial versions of IVA, sub-frames of 5 ms and 10 ms, as in EVS, were used. With the introduction of the pulse extraction and by using TNS, the need for splitting the frame is reduced. Avoiding the sub-frame splitting makes the implementation of new technologies and the maintenance easier. It could be investigated in the future whether there are signals for which the sub-frame splitting could be beneficial.
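A low overlap window of this kind (square root of a Tukey window, flat in the middle with a raised-cosine taper over the overlap region) can be sketched as follows. The sample counts in the usage line assume 48 kHz sampling, and the exact phase of the cosine taper is an assumption:

```python
import numpy as np

def low_overlap_window(length, overlap):
    """Square root of a Tukey window with `overlap` tapered samples
    at each end and a flat middle, in the spirit of the 30/25/21.25 ms
    MDCT windows with 10/5/1.25 ms overlap regions.
    """
    w = np.ones(length)
    n = np.arange(overlap)
    # Raised-cosine taper over the overlap region (Tukey definition;
    # the half-sample phase offset is an assumption).
    taper = 0.5 * (1.0 - np.cos(np.pi * (n + 0.5) / overlap))
    w[:overlap] = taper
    w[-overlap:] = taper[::-1]
    return np.sqrt(w)

# e.g. the 30 ms window with a 10 ms overlap at 48 kHz:
w = low_overlap_window(1440, 480)
```

With this taper, the squared window satisfies w²(n) + w²(R−1−n) = 1 over the overlap region of length R, i.e. the overlap-add condition needed for MDCT reconstruction.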
The spectral envelope of XM is perceptually flattened using SNS obtaining
TNS provides temporal shaping of the quantization noise by filtering across frequencies in the MDCT spectrum [7]. TNS filter coefficients are obtained from the autocorrelation of the MDCT spectrum and TNS is activated if the filter provides coding gain above a threshold. Two independent filters are used, one for the range from 800 to 4500 Hz and another one above 4500 Hz. If it is active, the TNS filters
The perceptually flattened predicted MDCT spectrum XF
Therefore, as explained above, the spectral representation (e.g. as outputted by the converter 100) is divided into two parts: the first part 100′ (e.g. obtained from MDCT coefficients) may be partially sent to the spectral processor 120 to obtain the encoded version of the spectral representation, while the second part 100″ (e.g. obtained from the MDST representation) is not used for that purpose; both the first part 100′ and the second part 100″ may, however, be provided to the scale parameter calculator 110 to obtain the scale factors (or a processed version thereof).
The pitch contour is determined by {dot over (d)}F
The pitch values estimated by the pitch search fall between the minimum pitch lag ďF
The values of {dot over (d)}F
The core of the process is the pitch search F
The pitch search F
Fractional delays are obtained by interpolating ρH. The interpolation is done using filters from a predefined list of filters, each filter having a distinct fractional delay between 0 and 1. It will be further considered that ρH [m] also includes values for fractional delays m, that is, ρH [m] is also defined when m is a rational number. Besides d{hacek over (F)}
The value of the initial pitch candidate d{tilde over (F)}
The harmonicity level ρ{umlaut over (F)}
The location of the absolute maximum in ρH is the first candidate dF
The second candidate dF
The value of d{umlaut over (F)}
The values of τF
In each of the intervals depicted in
The values of {grave over (d)}D,
First, the pitch {grave over (d)}D and the associated {grave over (ρ)}D are obtained via F
The value of {dot over (d)}F
The frame is considered to have high harmonicity if max({grave over (ρ)}D,
For frames with high harmonicity, {acute over (d)}F
If {acute over (ρ)}F
The pitch contour dV is obtained by an interpolation of the decoded values of {dot over (d)}F
If
and there is no significant amplitude modulation in the interval where {acute over (d)}F
The average pitch lag
It is determined for each MDCT frequency bin between 4.8 kHz and 16 kHz, as in 5.3.3.2.5 in the EVS algorithmic description [8], if the frequency bin belongs to a tone or contains mostly noise. The bin tonality is used, as in EVS, for the adaptive changing of the dead-zone in the MDCT spectrum quantizer.
The total number of tonal frequency bins {dot over (N)}T is calculated in the current frame and smoothed over time: NT=0.5NT+{dot over (N)}T.
Normalized correlation ρH
The high frequency tonality flag ϕH is set to 1 if TNS is inactive, the pitch contour is present and there is a tonality in high frequencies, where the tonality exists in high frequencies if ρH
The high frequency tonality flag ϕH is used in the decoder to decide how to fill MDCT coefficients quantized to zero.
The SNS scale factors are obtained from smoothed and spectrally tilted energies in e.g. NB (e.g. 64, or a different number) frequency bands having increasing widths. The band energies (or more in general amplitude-related measures) are optionally transformed to a logarithmic domain (e.g. 115, step 5) after optionally adding a noise floor (e.g. 114, step 4). The band energies (or more in general amplitude-related measures) are then e.g. downsampled (e.g. by the downsampler 130) to 16 values in the logarithmic domain, the arithmetic mean e.g. is removed and a fixed scaling is applied. The 16 values are then quantized and coded, the coded values being denoted as “sns”. The 16 quantized values are e.g. interpolated and transformed back to the linear domain. The interpolated scale factors are applied on the MDCT spectrum, where in SNSE the MDCT spectrum is divided by the scale factors and in SNSD multiplied. This process is the same as in LC3 [5, 6].
Different to the SNS implementation in LC3 is that the NB (e.g. 64, or a different number) band energies are obtained from a squared magnitude spectrum. Using the magnitude instead of the MDCT avoids the problem of the MDCT temporal variation for constant amplitude signals.
Another difference is that the band energies are not normalized by the band width (see step 1 above, where it is explained that the width-normalized band energy may be substituted by EB(b)=Σk=Ind(b)...Ind(b+1)−1 X(k)², and, when a second spectrum Y(k) is also available (e.g. the MDST spectrum), by EB(b)=Σk=Ind(b)...Ind(b+1)−1 [X(k)²+Y(k)²]). By not normalizing the band energies, a similar effect is obtained as with the fixed spectral tilt in LC3 SNS. This difference is advantageous because it reduces computational complexity by avoiding the divisions used in the normalization of the band energies. Instead of LC3's strong fixed spectral tilt, a smaller adaptive tilt is used, proportional to the harmonicity of the input signal. Via the stronger tilt for harmonic signals, harmonic components at low frequencies are given more importance, and vice versa. The weaker tilt for applause signals achieves a more even distribution of quantization steps across the whole bandwidth and allows perceptually efficient coding of wideband transients.
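The unnormalized band-energy computation can be sketched as follows, with the second spectrum treated as optional (its identification with the MDST is taken from the earlier discussion of the two spectral parts):

```python
import numpy as np

def band_energies(X, ind, Y=None):
    """Band energies EB(b) as sums of squared magnitudes per band,
    deliberately NOT divided by the band width.

    X   : MDCT spectrum; Y : optional second spectrum (e.g. MDST),
          so that X(k)^2 + Y(k)^2 forms a squared magnitude.
    ind : band boundaries; band b covers k = ind[b] .. ind[b+1]-1
    """
    X = np.asarray(X, dtype=float)
    power = X ** 2 if Y is None else X ** 2 + np.asarray(Y, float) ** 2
    return np.array([power[ind[b]:ind[b + 1]].sum()
                     for b in range(len(ind) - 1)])
```

Because wider (higher-frequency) bands sum more coefficients, skipping the width normalization builds a tilt-like emphasis into the energies without any division.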
An additional difference is that, in some examples, the SNS scale factors are coded using a two-stage full-search vector quantizer without a split, having 9 bits in the first stage and 6 bits in the second stage. This quantization scheme requires fewer bits than the scheme with the pyramid vector quantizer in LC3, thus leaving more bits for coding the flattened MDCT spectrum, which was shown to be beneficial.
The borders of the NB (e.g. 64, or a different number) bands used in IVA are shown as horizontal lines in
The quantized values are interpolated to 128 scale factors, thus producing a smoother envelope than the NB (e.g. 64, or a different number) interpolated scale factors in LC3. The 128 bands are obtained by splitting each of the 64 bands into two halves.
Some discussion on the decoder is here provided.
The decoder (e.g. 800) processes the output of the encoder frame by frame.
The coded “spect” is decoded to obtain the quantized MDCT spectrum XQ. The energies EN
The perceptually flattened predicted MDCT spectrum XF
If TNS is active, the temporal envelope is restored in “TNSD” producing
The reshaped spectrum
LTP generates a time domain prediction from xD depending on the pitch contour and the LTP activation parameter, windows the time domain prediction and converts the windowed signal to the predicted MDCT spectrum
As already mentioned, xH is produced by suppressing noise between harmonics of xD via HPF, driven by the pitch contour. Finally xP
An inventively encoded audio signal can be stored on a digital storage medium or a non-transitory storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier or a non-transitory storage medium.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer. A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are preferably performed by any hardware apparatus.
The above described embodiments are merely illustrative for the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
LC3 is referred to in: