The present invention relates in general to encoding/decoding of audio signals, and in particular to methods and devices for efficient low bit-rate audio encoding/decoding.
When audio signals are to be transmitted and/or stored, a standard approach today is to code the audio signals into a digital representation according to different schemes. In order to save storage and/or transmission capacity, it is a general wish to reduce the size of the digital representation needed to allow reconstruction of the audio signals with sufficient quality. The trade-off between size of the coded signal and signal quality depends on the actual application.
There is a large variety of different coding principles. Transform based audio coders compress audio signals by quantizing the transform coefficients. Such coding thus operates in a transformed frequency domain. Transform based audio coders are efficient concerning moderate and high-bitrate coding of general audio but are not very efficient concerning low-bitrate coding of speech.
Code-Excited Linear Prediction (CELP) codecs, e.g. Algebraic Code-Excited Linear Prediction (ACELP) codecs, are very efficient at low bit-rate speech coding. The CELP speech synthesis model uses analysis-by-synthesis coding of the speech signal of interest. The ACELP codec can achieve high quality at 8-12 kbit/s. However, signal features having high-frequency components are generally not modeled equally well.
One approach used for reducing the required bit-rate is to use BandWidth Extension (BWE). The main idea behind BWE is that part of an audio signal is not transmitted, but reconstructed (estimated) at the decoder from the received signal components. A combination of a CELP coding of a signal sampled at a low sampling rate and BWE is one solution that is discussed.
On the other hand, BWE is more efficiently performed in a transformed domain, e.g. a Modified Discrete Cosine Transform (MDCT) domain. The reason for this is that the perceptually important signal features in the BWE region are more efficiently modeled in a frequency domain representation.
A problem with prior art codec systems is thus to find BWE encoding schemes that are efficient for all types of audio signals.
A general object of the present invention is to provide methods and encoder and decoder arrangements that allow for an efficient low bit-rate encoding/decoding for most types of audio signals.
This object is achieved by methods and arrangements according to the enclosed independent claims. Preferred embodiments are defined in the dependent claims.
In general words, in a first aspect, a method for encoding of an audio signal comprises obtaining of a low band synthesis signal of an encoding of the audio signal. A first energy measure of a first reference band within a low band in the low band synthesis signal is obtained. A transform of the audio signal into a transform domain is performed. An energy offset is selected from a set of at least two predetermined energy offsets for each of a plurality of first subbands of a first high band of the audio signal in the transform domain. The first high band is situated at higher frequencies than the low band. The first high band is encoded. The encoding comprises providing of a first set of quantization indices representing a respective scalar quantization of a spectrum envelope in the plurality of first subbands of the first high band relative to the first energy measure. The quantization indices of the first set of quantization indices are given with a respective selected energy offset. The encoding of the first high band also comprises providing of a parameter defining the used energy offset. A second energy measure of a second reference band within the low band in the low band synthesis signal is obtained. A second high band of the audio signal in the transform domain is encoded. The second high band is situated in frequency between the low band and the first high band. The encoding of the second high band comprises providing of a second set of quantization indices representing a respective scalar quantization of a spectrum envelope in a plurality of second subbands of the second high band relative to the second energy measure.
In a second aspect, a method for decoding of an audio signal comprises receiving of an encoding of the audio signal. The encoding represents a first set of quantization indices of a spectrum envelope in a plurality of first subbands of a first high band of the audio signal. The first set of quantization indices represents energies relative to a first energy measure. A low band synthesis signal of an encoding of the audio signal is obtained. The first energy measure is obtained as an energy measure of a first reference band within a low band in the low band synthesis signal. The first high band is situated at higher frequencies than the low band. The encoding further represents a parameter defining a used energy offset. An energy offset is selected from a set of at least two predetermined energy offsets for each of the first subbands. This selection is based on the parameter defining the used energy offset. A signal in a transform domain is reconstructed by determining a spectrum envelope in the first high band from the first set of quantization indices corresponding to the first subbands, by use of the so selected energy offset and the first energy measure, for each of the first subbands of the first high band. An inverse transform is performed into the audio signal, based on at least the reconstructed signal in the transform domain. The encoding further represents a second set of quantization indices of a spectrum envelope in a plurality of second subbands of a second high band. The second high band is situated in frequency between the low band and the first high band. The second set of quantization indices represents energies relative to a second energy measure. The second energy measure is obtained as an energy measure of a second reference band within the low band in the low band synthesis signal. 
The reconstructing of the signal in the transform domain further comprises determining of a spectrum envelope in the second high band from the second set of quantization indices corresponding to the second subbands by use of the second energy measure for each of the second subbands of the second high band.
In a third aspect, an encoder apparatus for encoding of an audio signal comprises a transform encoder, a selector, a synthesizer, an energy reference block and an encoder block. The transform encoder is configured for performing a transform of the audio signal into a transform domain. The selector is configured for selecting an energy offset from a set of at least two predetermined energy offsets for each of a plurality of first subbands of a first high band of the audio signal in the transform domain. The synthesizer is configured for obtaining a low band synthesis signal of an encoding of the audio signal. The energy reference block is connected to the synthesizer and configured for obtaining a first energy measure of a first reference band within a low band in the low band synthesis signal. The first high band is situated at higher frequencies than the low band. The encoder block is connected to the selector and the energy reference block. The encoder block is configured for encoding the first high band. The encoding of the first high band comprises providing of a first set of quantization indices representing a respective scalar quantization of a spectrum envelope in the plurality of first subbands of the first high band relative to the first energy measure. The quantization indices of the first set of quantization indices are given with a respective selected energy offset. The encoding of the first high band further comprises providing of a parameter defining the used energy offset. The energy reference block is further configured for obtaining a second energy measure of a second reference band within the low band of the low band synthesis signal. The encoder block is further configured for encoding a second high band of the audio signal in the transform domain. The second high band is situated in frequency between the low band and the first high band. 
The encoding of the second high band comprises providing of a second set of quantization indices representing a respective scalar quantization of a spectrum envelope in a plurality of second subbands of the second high band relative to the second energy measure.
In a fourth aspect, an audio encoder comprises an encoder apparatus according to the third aspect.
In a fifth aspect, a network node comprises an audio encoder according to the fourth aspect.
In a sixth aspect, a decoder apparatus for decoding of an audio signal comprises an input block, a synthesizer, an energy reference block, a selector, a reconstruction block and an inverse transform decoder. The input block is configured for receiving an encoding of the audio signal. The encoding represents a first set of quantization indices of a spectrum envelope in a plurality of first subbands of a first high band of the audio signal. The first set of quantization indices represents energies relative to a first energy measure. The synthesizer is configured for obtaining a low band synthesis signal of an encoding of the audio signal. The energy reference block is connected to the synthesizer and configured for obtaining the first energy measure as an energy measure of a first reference band within a low band in the low band synthesis signal. The first high band is situated at higher frequencies than the low band. The encoding further represents a parameter defining a used energy offset. The selector is connected to the input block. The selector is configured for selecting an energy offset from a set of at least two predetermined energy offsets for each of the first subbands based on the parameter defining the used energy offset. The reconstruction block is connected to the input block, the selector and the energy reference block. The reconstruction block is configured for reconstructing a signal in a transform domain by determining a spectrum envelope in the first high band from the first set of quantization indices corresponding to the first subbands by use of the selected energy offset and the first energy measure, for each of the first subbands of the first high band. The inverse transform decoder is connected to the reconstruction block. The inverse transform decoder is configured for performing an inverse transform into the audio signal based on at least the reconstructed signal in the transform domain. 
The encoding further represents a second set of quantization indices of a spectrum envelope in a plurality of second subbands of a second high band. The second high band is situated in frequency between the low band and the first high band. The second set of quantization indices represents energies relative to a second energy measure. The energy reference block is further configured for obtaining the second energy measure as an energy measure of a second reference band within the low band of the low band synthesis signal. The reconstruction block is further configured for determining of a spectrum envelope in the second high band from the second set of quantization indices corresponding to the second subbands by use of the second energy measure for each of the second subbands of the second high band.
In a seventh aspect, an audio decoder comprises a decoder apparatus according to the sixth aspect.
In an eighth aspect, a network node comprises an audio decoder according to the seventh aspect.
One advantage with the present invention is that the quality, measured in subjective listening tests, is increased compared to e.g. a pure ACELP encoding, with very low required additional bit-rate for BWE information. Further advantages are discussed in connection to the different embodiments described below.
The invention, together with further objects and advantages thereof, may best be understood by making reference to the following description taken together with the accompanying drawings, in which:
Throughout the drawings, the same reference numbers are used for similar or corresponding elements.
The description will start with a description of the overall system, then describe examples presenting a part of the final solution before the final solution is presented.
An example of a general audio system with a codec system is schematically illustrated in
In many real-time applications, the time delay between the production of the original audio signal 16 and the produced audio output 36 is typically not allowed to exceed a certain time. If the transmission resources at the same time are limited, the available bit-rate is also typically low.
The ACELP encoding is complemented by a low-bitrate BWE for high bands. The transform encoder 52 is in this particular embodiment a Modified Discrete Cosine Transform (MDCT) encoder 52. However, in alternative embodiments, the transform encoder 52 could also be based on other transforms. Non-exclusive examples of such transforms are Fourier Transforms, different types of Sine or Cosine Transforms, Karhunen-Loeve-transform, or different types of filterbanks. The operation, as such, of such transforms, is well known within the art of codecs, and will not be discussed more in detail. The encoder arrangement 56 is arranged for providing BWE information concerning at least a high band. The high band, as the name suggests, is situated at higher frequencies than the ACELP encoded low band. In the present embodiment, an encoder combiner 61 is connected to the ACELP encoder 41 and the encoder apparatus 50 based on the MDCT transform and is arranged for providing a suitable joint encoding of all the information about the audio signal. Such representation of the audio signal is provided as a binary flux 22.
In a particular embodiment, the input and output signals are sampled at 32 kHz, which gives the basis for the MDCT BWE. The signal for the ACELP core encoding is resampled to 12.8 kHz.
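As a non-limiting illustration of the two sampling rates mentioned above, the following sketch computes the rational resampling ratio and the resulting core bandwidth. The function names are hypothetical and introduced only for this example. The Nyquist frequency of the 12.8 kHz core signal, 6.4 kHz, coincides with the upper limit of the low band used in the embodiments described further below.

```python
from fractions import Fraction


def resampling_ratio(fs_in_hz, fs_out_hz):
    """Return (up, down) so that fs_out = fs_in * up / down, in lowest terms."""
    ratio = Fraction(fs_out_hz, fs_in_hz)
    return ratio.numerator, ratio.denominator


def nyquist_hz(fs_hz):
    """Highest frequency representable at the sampling rate fs_hz."""
    return fs_hz / 2


up, down = resampling_ratio(32000, 12800)   # 32 kHz input, 12.8 kHz core
core_bandwidth_hz = nyquist_hz(12800)       # bandwidth available to the core
```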
The ACELP decoding is in analogy with the encoding side complemented by a low-bitrate BWE for high bands. The inverse transform decoder 86 is in this particular embodiment an Inverse Modified Discrete Cosine Transform (IMDCT) decoder 85. However, in alternative embodiments, the transform decoder 86 could also be based on other transforms. Non-exclusive examples of such transforms are Fourier Transforms, different types of Sine or Cosine Transforms, Karhunen-Loeve-transform, or different types of filterbanks.
An important part of the present approach is the encoder apparatus handling the BWE.
The encoder arrangement 56 comprises a selector 58, in this embodiment comprising a power distribution analyzer 57. This power distribution analyzer 57 is configured for obtaining a power distribution of the audio signal in the transform domain. As will be discussed further below, different types of audio signals can have very differing behavior in the transform domain. Such behaviors may, however, be utilized for encoding purposes. In one embodiment of a power distribution analyzer 57 a classification of the audio signal into two or more classes is performed. Such a power distribution analyzer 57 can in different embodiments receive spectral information 42 from a synthesizer 29. The synthesizer 29 obtains a low band synthesis signal of an encoding of the audio signal. The synthesis signal may be based on signals of external sources, e.g. from the core encoder 40 via an MDCT transformer 54. The synthesizer 29 may comprise only the MDCT transformer 54 or both the MDCT transformer 54 and an encoder. The spectral information can alternatively be derived directly 42B by a synthesizer 29 directly based on properties of the audio signal in the transform domain. Examples of such analysis or classification will be further discussed below. The selector 58 is configured for providing an energy offset intended for finding suitable quantization indices. The provision of the energy offset is performed by selecting an energy offset 92 from a set of predetermined energy offsets. The set of predetermined energy offsets comprises at least two predetermined energy offsets. This set of predetermined energy offsets is known by both the encoder and decoder and is typically provided in a memory 53, connected to the selector 58. A predetermined energy offset 92 is selected for each of the subbands that are going to be encoded. The selection is furthermore based on the analysis of the audio signal.
In a particular embodiment, the selecting is based on an open loop approach. In this embodiment, a parameter is determined characterizing a power distribution of the audio signal in the transform domain. The actual selection is then performed based on the determined parameter. This means that for one type of signal, one energy offset 92 is used for encoding each individual subband.
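A minimal sketch of such an open loop selection is given here for illustration only. It assumes that the characterizing parameter is the ratio, in dB, between the mean energy above and below a split frequency, and that two predetermined energy offsets are stored. The offset values, the threshold and all names are hypothetical and are not taken from the present disclosure.

```python
import math

# Hypothetical predetermined energy offsets (dB), known to encoder and decoder.
OFFSETS_DB = {"voiced": -24.0, "unvoiced": -6.0}


def tilt_db(coeffs, split_bin):
    """Mean energy above split_bin relative to mean energy below it, in dB."""
    low = coeffs[:split_bin]
    high = coeffs[split_bin:]
    e_low = sum(c * c for c in low) / max(len(low), 1)
    e_high = sum(c * c for c in high) / max(len(high), 1)
    # A small floor avoids log10(0) and division by zero for silent bands.
    return 10.0 * math.log10(max(e_high, 1e-12) / max(e_low, 1e-12))


def open_loop_select(coeffs, split_bin, threshold_db=-10.0):
    """Classify the frame from its power distribution, then pick the offset."""
    label = "unvoiced" if tilt_db(coeffs, split_bin) > threshold_db else "voiced"
    return label, OFFSETS_DB[label]
```

In this sketch a frame whose energy is concentrated at low frequencies is treated as voiced-type and receives the lower offset, while a spectrally flat frame is treated as unvoiced-type.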
The encoder arrangement 56 further comprises an energy reference block 59. The energy reference block is configured for obtaining an energy measure 93 to be used as an energy reference. The energy measure 93 is an energy measure of a first reference band within a low band in the transform domain of the audio signal. A low band signal 43 with the first reference band can be obtained e.g. from the core encoder 40, via the MDCT transformer 54. Alternatively, a low band signal 43B could be obtained from the transform domain version 90 of the audio signal. The energy measure is typically a mean energy of the first reference band. In alternative embodiments, the energy measure could instead be any other characteristic statistical measure of the energies of the first reference band, such as a median value, a mean square value or a weighted average value. This reference energy measure is used as a starting point of a relative quantization of the MDCT envelope. The band in which the first reference band is selected is situated at lower frequencies than the band that the encoder apparatus 50 is supposed to handle. In other words, the high band is situated at higher frequencies than the low band of the audio signal, just as the notation indicates.
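The mean energy of a reference band may, as a sketch, be computed as follows. Expressing the measure in dB and the function and parameter names are assumptions made for this example. The band is given as a half-open range of transform coefficient indices.

```python
import math


def band_energy_db(coeffs, lo_bin, hi_bin):
    """Mean energy of the coefficients coeffs[lo_bin:hi_bin], expressed in dB."""
    band = coeffs[lo_bin:hi_bin]
    if not band:
        raise ValueError("reference band is empty")
    mean_energy = sum(c * c for c in band) / len(band)
    # A small floor keeps the logarithm defined for an all-zero band.
    return 10.0 * math.log10(max(mean_energy, 1e-12))
```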
An encoder block 55 is connected to the selector 58, the transform encoder 52 and the energy reference block 59 for receiving the selection of the energy offset range 92, the transform domain version 90 of the audio signal and the energy measure 93. The encoder block 55 is configured for encoding said high band by providing a set of quantization indices representing a respective scalar quantization of a spectrum envelope relative to the energy measure 93 of the first reference band and by use of the selected energy offset 92. The encoder block 55 thereby outputs a set of parameters 95 representing the relative energies. The encoder block 55 is further configured for providing a parameter defining the used predetermined energy offset. These outputs are then in particular embodiments combined with the core encoding and other BWE encodings and transmitted to the receiver.
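The relative scalar quantization performed by the encoder block can be sketched as below. The sketch assumes a uniform quantizer in the dB domain whose lowest level lies at the reference energy plus the selected offset. The step size, the number of bits and the names are hypothetical.

```python
import math


def quantize_envelope(subband_db, ref_db, offset_db, step_db=3.0, bits=3):
    """Quantize subband energies (dB) relative to ref_db + offset_db.

    Index i represents the level ref_db + offset_db + i * step_db.
    """
    levels = 1 << bits
    indices = []
    for energy_db in subband_db:
        relative_db = energy_db - ref_db - offset_db
        index = math.floor(relative_db / step_db + 0.5)  # round to nearest level
        indices.append(max(0, min(levels - 1, index)))   # clamp to quantizer range
    return indices
```

With a reference energy of 0 dB and an offset of -12 dB, subband energies of -10 dB and -4 dB map to the indices 1 and 3, i.e. to the levels -9 dB and -3 dB.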
In this embodiment, the selector 58 is configured for receiving the quantization indices for all predetermined energy offsets. The selector 58 here comprises a calculation block 64 and a selection block 65. The calculation block 64 is configured for calculating a quantization error for each of the sets of quantization indices. To this end, the calculation block also has access to the original transformed audio signal 90. The selection block 65 is then configured for selecting the set of quantization indices giving the smallest quantization error. These quantization indices are used as the output set of parameters 95 together with the parameter defining the used energy offset.
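The closed loop selection described above may be sketched as follows, under the assumptions that the quantizer is uniform in the dB domain, that the quantization error is a sum of squared dB differences, and that one offset is chosen per set of subbands. The offset values and names are hypothetical.

```python
import math

OFFSETS_DB = (-24.0, -12.0)  # hypothetical predetermined energy offsets (dB)


def quantize(energy_db, ref_db, offset_db, step_db=3.0, levels=8):
    """Nearest quantizer index for one subband energy, clamped to the range."""
    index = math.floor((energy_db - ref_db - offset_db) / step_db + 0.5)
    return max(0, min(levels - 1, index))


def dequantize(index, ref_db, offset_db, step_db=3.0):
    """Energy level (dB) represented by a quantizer index."""
    return ref_db + offset_db + index * step_db


def closed_loop_select(subband_db, ref_db, offsets=OFFSETS_DB):
    """Try every offset and keep the one with the smallest quantization error.

    Returns (offset parameter, quantization indices, squared error).
    """
    best = None
    for param, offset_db in enumerate(offsets):
        indices = [quantize(e, ref_db, offset_db) for e in subband_db]
        error = sum((e - dequantize(i, ref_db, offset_db)) ** 2
                    for e, i in zip(subband_db, indices))
        if best is None or error < best[2]:
            best = (param, indices, error)
    return best
```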
The frequency ranges for the low band and high band can be selected depending on the total available bit rate, the used encoding techniques, the required level of audio quality etc. In a particular embodiment, typically intended for wireless communication, the low band ranges from essentially 0 to 6.4 kHz. The first reference band ranges from 0 to 5.9 kHz; however, in an alternative embodiment the entire low band is comprised in the first reference band. The upper limit of the high band is 11.6 kHz in the present embodiment. The reason for limiting envelope quantization to 11.6 kHz is the decreased resolution of the human auditory system at these frequencies and the low energy of speech signals there. Optionally, a very high band (VHB) above the high band upper limit can be encoded by further BWE methods, e.g. in that the envelope in the very high band region above 11.6 kHz is predicted. However, such aspects are not within the main scope of the present disclosure. The number of subbands can also be selected in different manners. Numerous subbands give a better prediction but require higher bit-rates. In this particular embodiment, 8 subbands are used. The low band region is ACELP coded, and the high band is reconstructed in the MDCT domain.
Audio signals may look very different depending on the type of sound they represent. Voice activity detection may e.g. be used for switching to alternative encoding schemes.
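To illustrate how such frequency ranges map onto transform coefficients, the following sketch divides a band into subbands and returns the coefficient index at each subband edge. The 32 kHz sampling rate is taken from the embodiment above. The number of coefficients per frame (640) and the equal subband widths are assumptions made only for this example, since the actual subband edges are not specified here.

```python
def subband_bin_edges(f_lo_hz, f_hi_hz, n_subbands, fs_hz=32000, n_coeffs=640):
    """Coefficient indices of the n_subbands + 1 edges of equally wide subbands."""
    bin_hz = (fs_hz / 2) / n_coeffs              # width of one coefficient in Hz
    width_hz = (f_hi_hz - f_lo_hz) / n_subbands  # width of one subband in Hz
    return [round((f_lo_hz + k * width_hz) / bin_hz)
            for k in range(n_subbands + 1)]


# High band of 6.4-11.6 kHz divided into 8 subbands.
edges = subband_bin_edges(6400, 11600, 8)
```

Under these assumptions each coefficient spans 25 Hz, and each of the 8 subbands spans 26 coefficients.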
In
Real examples of voiced and unvoiced speech are presented in the
By making use of an analysis of the power distribution between different bands of the audio signal, a suitable energy offset can be selected, that is narrower than for general audio signals. By determining a parameter that characterizes important aspects of a power distribution of the audio signal in the frequency domain, such a parameter can be utilized for making a selection of a useful energy offset. If the energy offset used for each case by such actions is reduced to half compared to the total energy offset range, one bit can be saved in the encoding of each subband. If, as in the embodiments of
The concept of selecting a proper energy offset depending on an analysis of the power distribution of the audio signal can be further generalized. In
In the open loop approach of the embodiment of
In this particular embodiment, the step of selecting 216 an energy offset is dependent on a power distribution of the audio signal in a frequency domain. To this end, the step of selecting 216 a predetermined energy offset range is based on an open loop procedure, comprising the step 215 of determining a parameter characterizing a power distribution of said audio signal in a frequency domain. The actual selecting is then based on the determined parameter.
In one particular embodiment, the transform encoding is a Modified Discrete Cosine Transform. Also in one particular embodiment, the classification comprises classification between a class of voiced audio signals and a class of unvoiced audio signals. Furthermore, in one particular embodiment, the low band is encoded by a CELP encoder.
The synthesizer 27 is configured for obtaining a low band synthesis signal of an encoding of the audio signal. The synthesis signal may be based on signals of external sources, e.g. from the signal provided to a core decoder 70 via an MDCT transformer 87.
The energy reference block 89 is configured for receiving the energy measure 72 of the first reference band within the low band in a transform domain of the audio signal. The energy measure, i.e. the energy reference 93, is provided to the reconstruction block 81.
The parameter defining a used energy offset is provided to the selector 88. The selector 88 is configured for selecting an energy offset from a set of predetermined energy offsets for each of the first subbands based on the parameter. The reconstruction block 81 is connected to the input block 82, the selector 88 and the energy reference block 89. The reconstruction block 81 is configured for reconstructing a signal in a transform domain by determining a spectrum envelope in the high band from the set of quantization indices 96 by use of the selected energy offset 92 and the energy measure 93 of the reference band.
The inverse transform decoder 85 is connected to the reconstruction block 81 and configured for performing an inverse transform based on at least the reconstructed energy offsets into at least a part 98 of the audio signal.
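The determination of the spectrum envelope in the reconstruction block can be sketched as below. The sketch assumes a uniform quantizer in the dB domain and a received parameter that indexes a table of predetermined offsets shared with the encoder. The table values, the step size and the names are hypothetical.

```python
def reconstruct_envelope(indices, ref_db, offset_param,
                         offsets_db=(-24.0, -12.0), step_db=3.0):
    """Decode subband energies (dB) from quantization indices.

    offset_param selects one of the predetermined offsets; each index i
    is mapped to the level ref_db + offset + i * step_db.
    """
    offset_db = offsets_db[offset_param]
    return [ref_db + offset_db + index * step_db for index in indices]
```

For example, the indices 1, 2 and 3 decoded with the first offset and a reference energy of 0 dB give the envelope values -21 dB, -18 dB and -15 dB.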
The encoding further represents a parameter defining a used energy offset range. An energy offset is in step 266 selected from a set of at least two predetermined energy offsets. This is performed for each of the first subbands and is based on the parameter defining a used energy offset. A signal in a transform domain is in step 268 reconstructed by determining a spectrum envelope in the high band from the set of quantization indices corresponding to the first subbands by use of the selected energy offset and the energy measure of the first reference band for each of said first subbands of said first high band. In step 270, an inverse transform is performed based on at least the reconstructed signal in said transform domain into at least a part of the audio signal.
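Once the envelope of a subband has been determined, transform coefficients for that subband may be scaled to match it before the inverse transform. The following sketch shows one way of doing this. How the unscaled coefficients are generated is not addressed here; the sketch simply assumes that some coefficients are available, and the function name is hypothetical.

```python
import math


def shape_subband(coeffs, target_db):
    """Scale coeffs so that their mean energy equals target_db (in dB)."""
    mean_energy = sum(c * c for c in coeffs) / len(coeffs)
    if mean_energy <= 0.0:
        return list(coeffs)  # an all-zero subband cannot be scaled
    gain = math.sqrt(10.0 ** (target_db / 10.0) / mean_energy)
    return [gain * c for c in coeffs]
```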
In one particular embodiment, the transform encoding is a Modified Discrete Cosine Transform. Also in one particular embodiment, the classification comprises classification between a class of voiced audio signals and a class of unvoiced audio signals. Furthermore, in one particular embodiment, the low band is encoded by a CELP encoder.
In particular embodiments, the above possible step in energy is preferably restricted. This is achieved by constraining the encoded energy in the subbands closest to the low band not to differ too much from an energy level in the high end of the low band. This is achieved by providing ranges of encoded energies that are restricted not to support encoding of too large positive energy changes. The encoder is constrained not to allow any rapid energy increase, even if this creates mismatch with the original signal energy in these closest subbands. The reference energy for such an increase constraint is derived from a second reference band within the low band. In a particular embodiment, this second reference band is situated at the high end of the low band. In the example given further above, it could e.g. be suitable to select a band of 5.9-6.4 kHz for establishing this second reference energy.
In other words, the high band is divided into two parts. A first high band, situated at the high frequency end of the high band, is encoded according to the principles described further above. A second high band comprises frequencies between the first high band and the low band. In this second high band, the encoded energies, i.e. the quantization indices, are restricted in the direction of increased energy. In other words, the encoded energies are not allowed to increase too fast as compared to the high frequency end of the low band. This is achieved by providing allowed ranges of quantization indices, which do not allow more than a limited positive energy change. The further away from the low band a subband of the second high band is situated, the less restricted are the used quantization indices. In other words, the energy restriction of the encoded energies is reduced with increasing frequency of the second subbands.
In a particular embodiment, the first high band comprises 5 first subbands and covers the range of 8-11.6 kHz. The second high band comprises 3 subbands and ranges between 6.4 and 8 kHz. The MDCT BWE is realized as high-frequency envelope quantization at 1.55 kbit/s. The signal in band 0-6.4 kHz is fully quantized by the ACELP codec. The second reference band ranges between 5.9 and 6.4 kHz. The energy restriction for the first subband in the second high band is an energy difference from the energy reference of maximum +3 dB. The energy restriction for the second subband in the second high band is an energy difference of maximum +6 dB. The energy restriction for the third subband in the second high band is an energy difference of maximum +9 dB. The scalar quantizers of the different subbands are summarized in Table 1 and Table 2 for the second and first high band, respectively. The “Range 1” corresponds to audio samples having a voiced-type energy distribution, while “Range 2” corresponds to audio samples having an unvoiced-type energy distribution. All scalar quantizers have an offset from the corresponding low-frequency reference energy.
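The effect of the energy restrictions in the second high band can be sketched as a per-subband upper limit relative to the second reference energy. The limits of +3 dB, +6 dB and +9 dB are those of the embodiment above. Modeling the restriction as a clamp applied to the energy is a simplification made for this example, since in the embodiment the restriction follows from the allowed quantizer ranges. The names are hypothetical.

```python
# Maximum allowed increase (dB) over the second reference energy for the
# first, second and third subband of the second high band.
MAX_INCREASE_DB = (3.0, 6.0, 9.0)


def constrain_second_high_band(subband_db, ref2_db, caps_db=MAX_INCREASE_DB):
    """Limit each subband energy to at most ref2_db plus its allowed increase."""
    return [min(energy_db, ref2_db + cap_db)
            for energy_db, cap_db in zip(subband_db, caps_db)]
```

With a second reference energy of 0 dB, subband energies of 10 dB, 2 dB and 20 dB are thus limited to 3 dB, 2 dB and 9 dB, even though this creates a mismatch with the original signal energy in the first and third subbands.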
The different blocks of the encoder and decoder apparatuses are typically implemented in a processing unit, typically a Digital Signal Processor. The processing unit can be a single unit or a plurality of units to perform different steps of procedures described herein. The processing unit may also be the same processing unit that e.g. performs the low band encoding. The “receiving” of data from e.g. the core encoder may then be implemented as enabling an access to a memory position in which the actual data is stored. In an embodiment of an encoder or decoder apparatus, the apparatus comprises at least one computer program product in the form of a non-volatile memory, e.g. an EEPROM, a flash memory and/or a disk drive. The computer program product comprises a computer program comprising code means which, when run on the processing unit, cause the encoder or decoder apparatus, respectively, to perform the steps of the procedures described further above. The code means in the computer program may comprise a module corresponding to each illustrated block. The modules essentially perform the steps of the procedures described further above. In other words, when the different modules are run on the processing unit they correspond to the corresponding blocks in e.g.
Although the code means in the embodiment disclosed above are implemented as computer program modules which, when run on the processing unit, cause the blocks to perform steps of the procedures described further above, at least one of the blocks may in alternative embodiments be implemented at least partly as hardware circuits.
As an implementation example,
As an implementation example,
Some or all of the software components described above may be carried on a computer-readable medium, for example a CD, DVD or hard disk, and loaded into the memory for execution by the processor.
The embodiments described above are to be understood as a few illustrative examples of the present invention. It will be understood by those skilled in the art that various modifications, combinations and changes may be made to the embodiments without departing from the scope of the present invention. In particular, different part solutions in the different embodiments can be combined in other configurations, where technically possible. The scope of the present invention is, however, defined by the appended claims.
ACELP—Algebraic Code Excited Linear Prediction
BWE—Bandwidth Extension
CELP—Code-Excited Linear Prediction
MDCT—Modified Discrete Cosine Transform
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/SE11/50146 | 2/9/2011 | WO | 00 | 7/30/2013 |