Embodiments according to the invention are related to an audio decoder, an audio encoder, a method for decoding an audio signal, a method for encoding an audio signal and a corresponding computer program. Some embodiments are related to an audio signal.
Some embodiments according to the invention are related to an audio encoding/decoding concept, in which a side information is used for resetting a context of an entropy encoding/decoding.
Some embodiments are related to the control of the reset of an arithmetic coder.
Traditional audio coding concepts include an entropy coding scheme (for example for encoding spectral coefficients of a frequency domain signal representation) in order to reduce redundancy. Typically, entropy coding is applied to quantized spectral coefficients for frequency domain based coding schemes or quantized time domain samples for time domain based coding schemes. These entropy coding schemes typically transmit a code word in combination with a corresponding code book index, which enables a decoder to look up a certain code book page for decoding an encoded information word corresponding to the transmitted code word on said code book page.
For details regarding such an audio coding concept, reference is made, for example, to international standard ISO/IEC 14496-3:2005(E), part 3: audio, part 4: general audio coding (GA)—AAC, Twin VQ, BSAC, in which the so called concept for “entropy/coding” is described.
However, it has been found that a significant overhead in bitrate is produced by the need for a regular transmission of a detailed code book selection information (e.g. sect_cb).
According to an embodiment, an audio decoder for providing a decoded audio information on the basis of an entropy encoded audio information may have: a context-based entropy decoder configured to decode the entropy-encoded audio information in dependence on a context, which context is based on a previously-decoded audio information in a non-reset state-of-operation; wherein the context-based entropy decoder is configured to select a mapping information, for deriving the decoded audio information from the encoded audio information, in dependence on the context; and wherein the context-based entropy decoder includes a context resetter configured to reset the context for selecting the mapping information to a default context, which default context is independent from the previously-decoded audio information, in response to a side information of the encoded audio information.
According to another embodiment, a method for providing a decoded audio information on the basis of an encoded audio information may have the steps of: decoding the entropy-encoded audio information taking into account a context, which is based on a previously-decoded audio information in a non-reset state of operation, wherein decoding the entropy-encoded audio information includes selecting a mapping information for deriving the decoded audio information from the encoded audio information, in dependence on the context, and using the selected mapping information for deriving a first portion of the decoded audio information; and wherein decoding the entropy-encoded audio information also includes resetting the context for selecting the mapping information to a default context, which is independent from the previously-decoded audio information, in response to a side information, and using the mapping information, which is based on the default context, for decoding a second portion of the decoded audio information.
According to another embodiment, an audio encoder for providing an encoded audio information on the basis of an input audio information may have: a context-based entropy encoder configured to encode a given audio information of the input audio information in dependence on a context, which context is based on an adjacent audio information, temporally or spectrally adjacent to the given audio information, in a non-reset state of operation; wherein the context-based entropy encoder is configured to select a mapping information for deriving the encoded audio information from the input audio information, in dependence on the context; and wherein the context-based entropy encoder includes a context resetter configured to reset the context for selecting the mapping information to a default context within a contiguous piece of input audio information, in response to the occurrence of a context reset condition; and wherein the audio encoder is configured to provide a side information of the encoded audio information indicating the presence of a context reset condition.
According to another embodiment, a method for providing an encoded audio information on the basis of an input audio information may have the steps of: encoding a given audio information of the input audio information in dependence on a context, which context is based on an adjacent audio information, temporally or spectrally adjacent to the given audio information, in a non-reset state of operation, wherein encoding the given audio information in dependence on the context includes selecting a mapping information, for deriving the encoded audio information from the input audio information, in dependence on the context, resetting the context for selecting the mapping information to a default context within a contiguous piece of input audio information in response to the occurrence of a context reset condition; and providing a side information of the encoded audio information indicating the presence of the context reset condition.
Another embodiment may have a computer program for performing the method for providing a decoded audio information on the basis of an encoded audio information, which method may have the steps of: decoding the entropy-encoded audio information taking into account a context, which is based on a previously-decoded audio information in a non-reset state of operation, wherein decoding the entropy-encoded audio information includes selecting a mapping information for deriving the decoded audio information from the encoded audio information, in dependence on the context, and using the selected mapping information for deriving a first portion of the decoded audio information; and wherein decoding the entropy-encoded audio information also includes resetting the context for selecting the mapping information to a default context, which is independent from the previously-decoded audio information, in response to a side information, and using the mapping information, which is based on the default context, for decoding a second portion of the decoded audio information, when the computer program runs on a computer.
Another embodiment may have a computer program for performing the method for providing an encoded audio information on the basis of an input audio information, which method may have the steps of: encoding a given audio information of the input audio information in dependence on a context, which context is based on an adjacent audio information, temporally or spectrally adjacent to the given audio information, in a non-reset state of operation, wherein encoding the given audio information in dependence on the context includes selecting a mapping information, for deriving the encoded audio information from the input audio information, in dependence on the context, resetting the context for selecting the mapping information to a default context within a contiguous piece of input audio information in response to the occurrence of a context reset condition; and providing a side information of the encoded audio information indicating the presence of the context reset condition, when the computer program runs on a computer.
According to another embodiment, an encoded audio signal may have: an encoded representation of a plurality of sets of spectral values, wherein a plurality of the sets of spectral values are encoded in dependence on a non-reset context, which is dependent on a respective preceding set of spectral values; wherein a plurality of the sets of spectral values are encoded in dependence on a default context, which is independent from a respective preceding set of spectral values; and wherein the encoded audio signal includes a side information signaling if a set of spectral coefficients is encoded in dependence on a non-reset context or in dependence on the default context.
An embodiment according to the invention creates an audio decoder for providing a decoded audio information on the basis of an encoded audio information. The audio decoder comprises a context-based entropy decoder configured to decode the entropy-encoded audio information in dependence on a context, which context is based on a previously decoded audio information in a non-reset state of operation. The entropy decoder is configured to select a mapping information (e.g. a cumulative frequencies table, or a Huffman codebook) for deriving the decoded audio information from the encoded audio information in dependence on the context. In addition, the context-based entropy decoder also comprises a context resetter configured to reset the context for selecting the mapping information to a default context, which is independent from the previously decoded audio information, in response to a side information of the encoded audio information.
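The decoder structure described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the decoding algorithm of the embodiments: the class name `ContextDecoder`, the two toy code-word tables in `MAPPING_TABLES`, and the rule that derives the next context from the last decoded value are all hypothetical.

```python
# Hypothetical sketch: all names, tables and rules are illustrative assumptions.
DEFAULT_CONTEXT = 0  # default context, independent of previously decoded data

# Toy "mapping information": one code-word -> value table per context state.
MAPPING_TABLES = {
    0: {"0": 0, "10": 1, "11": 2},  # selected for the default / low-energy context
    1: {"0": 2, "10": 1, "11": 0},  # selected after a high-energy value
}


class ContextDecoder:
    def __init__(self):
        self.context = DEFAULT_CONTEXT

    def reset(self):
        """Context resetter: fall back to the default context."""
        self.context = DEFAULT_CONTEXT

    def decode(self, codeword, reset_flag=False):
        """Decode one code word; reset_flag models the reset side information."""
        if reset_flag:
            self.reset()
        table = MAPPING_TABLES[self.context]   # select mapping by context
        value = table[codeword]
        self.context = 1 if value >= 2 else 0  # non-reset state: context follows data
        return value
```

In this sketch the same code word "0" decodes to different values depending on whether the context was carried over or reset, which is why the reset has to be signaled to the decoder.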
This embodiment is based on the finding that in many cases it is bitrate-efficient to derive the context, which determines the mapping of an entropy-encoded audio information onto a decoded audio information (for example by examining a code book, or by determining a probability distribution), from previously decoded audio information items, because correlations within the entropy-encoded audio information can thereby be exploited. For example, if a certain spectral bin comprises a large intensity in the first audio frame, then there is a high probability that the same spectral bin again comprises a large intensity in the next audio frame following the first audio frame. Thus, a selection of the mapping information on the basis of the context allows for a reduction of the bitrate when compared to a case in which a detailed information for the selection of a mapping information for deriving the decoded audio information from the encoded audio information is transmitted.
However, it has also been found that a derivation of the context from the previously decoded audio information sometimes results in situations in which a mapping information (for deriving the decoded audio information from the encoded audio information) is chosen which is significantly inappropriate and thus results in an unnecessarily high bit demand for encoding the audio information. This situation would occur if, for example, the spectral energy distribution of subsequent audio frames differs significantly, such that the new spectral energy distribution within the subsequent audio frame deviates strongly from the distribution which would be expected on the basis of a knowledge of the spectral distribution within the previous audio frame.
According to a key idea of the invention, in such cases, in which the bitrate would be significantly degraded by the choice of an inappropriate mapping information (for deriving the decoded audio information from the encoded audio information), the context is reset in response to a side information of the encoded audio information, thereby achieving the selection of a default mapping information (being associated with the default context) which in turn results in a moderate bit consumption for an encoding/decoding of the audio information.
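The bitrate effect of an inappropriate mapping information can be illustrated with ideal code lengths, where a symbol of modeled probability p costs about -log2(p) bits. The following sketch is not taken from the embodiments; the two probability models and the symbol names "lo" and "hi" are invented for illustration.

```python
import math


def ideal_bits(symbols, model):
    """Sum of -log2(p) over the symbols: the ideal entropy-coded size in bits."""
    return sum(-math.log2(model[s]) for s in symbols)


# Assumed toy models over two symbols ("lo" = small value, "hi" = large value).
context_model = {"lo": 0.9, "hi": 0.1}  # context expects a quiet spectral bin
default_model = {"lo": 0.5, "hi": 0.5}  # default context: no prior expectation

matching_frame = ["lo"] * 8   # content fits the expectation of the context
deviating_frame = ["hi"] * 8  # content contradicts the expectation

for name, frame in (("matching", matching_frame), ("deviating", deviating_frame)):
    print(name,
          round(ideal_bits(frame, context_model), 2),
          round(ideal_bits(frame, default_model), 2))
```

Under these assumed numbers the context-derived model is much cheaper for the matching frame, while the default model is much cheaper for the deviating frame, which is the situation in which a reset pays off.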
To summarize the above, it is the key idea of the present invention that a bitrate-efficient encoding of an audio information can be achieved by combining a context-based entropy decoder, which normally (in a non-reset state of operation) uses a previously decoded audio information for deriving a context and for selecting a corresponding mapping information, with a side-information-based reset mechanism for resetting the context. Such a concept involves a minimum effort for maintaining an appropriate decoding context, which is well-adapted to the audio content in a normal case (when the audio content fulfills the expectations used for the design of the context-based selection of a mapping rule), and avoids an excessive increase of the bitrate in an abnormal case (when the audio content strongly deviates from said expectations).
In an advantageous embodiment, the context resetter is configured to selectively reset the context-based entropy decoder at a transition between subsequent time portions (e.g. audio frames) having associated spectral data of the same spectral resolution (e.g. number of frequency bins). This embodiment is based on the finding that a reset of the context may have an advantageous effect (in terms of reducing the bitrate) even if the spectral resolution remains unchanged. In other words, it has been found that it should be possible to perform a reset of the context independent from a change of the spectral resolution, because the context may be inappropriate even if it is not necessary to change the spectral resolution (for example, by switching from a “long window” per frame to a plurality of “short windows” per frame). That is, a context may be inappropriate (which raises the desire to reset the context) even in a situation in which it is not desirable to change from a low temporal resolution (e.g. long window, in combination with high spectral resolution) to a high temporal resolution (e.g. short windows, in combination with a low spectral resolution).
In an advantageous embodiment, the audio decoder is configured to receive, as the encoded audio information, an information describing spectral values in a first audio frame and in a second audio frame subsequent to the first audio frame. In this case, the audio decoder advantageously comprises a spectral-domain-to-time-domain transformer configured to overlap-and-add a first windowed time domain signal, which is based on the spectral values of the first audio frame, and a second windowed time domain signal, which is based on the spectral values of the second audio frame. The audio decoder is configured to separately adjust a window shape of a window for obtaining the first windowed time domain signal and of a window for obtaining the second windowed time domain signal. The audio decoder is also advantageously configured to perform, in response to the side information, a reset of the context between a decoding of the spectral values of the first audio frame and a decoding of the spectral values of the second audio frame, even if the second window shape is identical to the first window shape, such that the context used for decoding the encoded audio information of the second audio frame is independent of the decoded audio information of the first audio frame in the case of a reset.
This embodiment allows for a reset of the context between a decoding (using mapping information selected on the basis of the context) of spectral values of the first audio frame and a decoding (using mapping information selected on the basis of the context) of spectral values of the second audio frame, even if windowed time domain signals of the first and second audio frames are overlapped-and-added, and even if identical window shapes are selected for deriving the first windowed time domain signal and the second windowed time domain signal from the spectral values of the first audio frame and the second audio frame.
Thus, the reset of the context may be introduced as an additional degree of freedom, which can be applied by the context resetter even between a decoding of spectral values of closely-related audio frames, the windowed time domain signals of which are derived using identical window shapes and are overlapped-and-added.
Thus, it is advantageous that the reset of context is independent from used window shapes and also independent from the fact that windowed time domain signals of subsequent frames belong to a contiguous audio content, i.e. are overlapped-and-added.
In an advantageous embodiment, the entropy decoder is configured to reset, in response to side information, the context between the decoding of audio information of adjacent frames of the audio information having identical frequency resolutions. In this embodiment, a reset of the context is performed independent from a change of the frequency resolution.
In yet another advantageous embodiment, the audio decoder is configured to receive a context-reset side information for signaling a reset of the context. In this case, the audio decoder is also configured to additionally receive a window-shape side information to adjust the window shapes of windows for obtaining the first and second windowed time signals independent from performing the reset of the context.
In an advantageous embodiment, the audio decoder is configured to receive, as the side information for resetting the context, a one-bit context reset flag per audio frame of the encoded audio information. In this case, the audio decoder is advantageously configured to receive, in addition to the context reset flag, a side information describing a spectral resolution of spectral values represented by the encoded audio information or a window length of a time window, for windowing time domain values represented by the encoded audio information. The context resetter is configured to perform a reset of the context in response to the one-bit context-reset-flag at a transition between two audio frames of the encoded audio information representing spectral values of identical spectral resolutions. In this case, the one-bit context reset-flag typically results in a single reset of the context between a decoding of encoded audio information of subsequent audio frames.
In another advantageous embodiment, the audio decoder is configured to receive, as a side information for resetting the context, a one-bit context reset flag per audio frame of the encoded audio information. Also, the audio decoder is configured to receive an encoded audio information comprising a plurality of sets of spectral values per audio frame (such that a single audio frame is subdivided into multiple sub frames, to which individual short windows may be associated). In this case, the context-based entropy decoder is configured to decode the entropy-encoded audio information of a subsequent set of spectral values of a given audio frame in dependence on a context, which context is based on a previously decoded audio information of a preceding set of spectral values of the given audio frame in a non-reset state of operation. However, the context resetter is configured to reset the context to the default context before a decoding of a first set of spectral values of the given audio frame and between a decoding of any two subsequent sets of spectral values of the given audio frame in response to the one-bit context reset flag (i.e. if, and only if, the one-bit context reset flag is active), such that an activation of the one-bit context reset flag of the given audio frame causes a multiple-times resetting of the context when decoding the multiple sets of spectral values of the audio frame.
This embodiment is based on the finding that it is typically inefficient, in terms of bitrate, to perform only a single reset of the context in an audio frame comprising a plurality of “short windows,” for which individual sets of spectral values are encoded. Rather, an audio frame comprising multiple sets of spectral values typically comprises a strong discontinuity of the audio content, such that it is advisable, in order to reduce the bitrate, to reset the context between each of the subsequent sets of spectral values. Such a solution has been found to be more efficient than a one-time reset of the context (for example, only at the beginning of the frame) and than individually signaling (e.g. using extra one-bit flags) multiple context reset times within the (multiple-short-window) frame.
In an advantageous embodiment, the audio decoder is configured to also receive a grouping side information when using so-called “short windows” (i.e. transmitting multiple sets of spectral values, which are overlapped-and-added using multiple short windows being shorter than an audio frame). In this case, the audio decoder is advantageously configured to group two or more of the sets of spectral values for a combination with a common scale factor information in dependence on the grouping side information. In this case, the context resetter is advantageously configured to reset the context to the default context between a decoding of sets of spectral values grouped together in response to the one-bit context reset flag. This embodiment is based on the finding that, in some cases, there may be a strong variation of the decoded audio values (e.g. decoded spectral values) of a grouped sequence of sets of spectral values, even if the initial scale factors are applicable to the subsequent sets of spectral values. For example, if there is a steady yet significant frequency variation between subsequent sets of spectral values, the scale factors of the subsequent sets of spectral values may be equal (for example, if the frequency variation does not exceed a scale factor band), while it is nevertheless appropriate to reset the context at the transition between the different sets of spectral values. Thus, the described embodiment allows for a bitrate efficient encoding and decoding even in the presence of such frequency-variation audio signal transitions. Also, this concept still allows for good performance when encoding rapid volume changes in the presence of strongly correlated spectral values. In this case, a reset of the context can be avoided by deactivating the context reset flag, even though different scale factors may be associated with subsequent sets of spectral values (which are not grouped together in this case, because the scale factors differ).
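The multiple-times reset within a short-window frame can be sketched as follows. This is a hedged illustration only: the function name `contexts_used`, the use of `None` to stand for the default context, and the simplification that the context is simply "the previously decoded set" are assumptions made for the example.

```python
DEFAULT = None  # stands for the default context in this sketch


def contexts_used(frame_sets, reset_flag, prev_context=DEFAULT):
    """List the context seen by each set of spectral values of one frame."""
    used = []
    context = prev_context
    for spectral_set in frame_sets:
        if reset_flag:
            context = DEFAULT   # active flag: reset before every set of the frame
        used.append(context)
        context = spectral_set  # non-reset state: context follows the decoded data
    return used


# One frame with three short windows, preceded by some earlier frame "prev".
print(contexts_used(["w0", "w1", "w2"], reset_flag=True, prev_context="prev"))
print(contexts_used(["w0", "w1", "w2"], reset_flag=False, prev_context="prev"))
```

With the flag active every set is decoded from the default context; with the flag inactive each set is decoded from the context left by its predecessor, so a single bit per frame controls all window transitions.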
In another embodiment, the audio decoder is configured to receive, as the side information for resetting the context, a one-bit context reset flag per audio frame of the encoded audio information. In this case, the audio decoder is also configured to receive, as the encoded audio information, a sequence of encoded audio frames, the sequence of encoded frames comprising a linear-prediction-domain audio frame. The linear-prediction-domain audio frame comprises, for example, a selectable number of transform-coded-excitation portions for exciting a linear-prediction-domain audio synthesizer. The context-based entropy decoder is configured to decode spectral values of the transform-coded-excitation portions in dependence on a context, which context is based on a previously-decoded audio information in a non-reset state of operation. The context resetter is configured to reset, in response to the side information, the context to the default context before a decoding of a set of spectral values of a first transform-coded-excitation portion of a given audio frame, while omitting a reset of the context to the default context between a decoding of sets of spectral values of different transform-coded-excitation portions of (i.e. within) the given audio frame. This embodiment is based on the finding that a combination of a context-based decoding and a context reset brings along a reduction of the bitrate when encoding a transform-coded-excitation for a linear-prediction-domain audio synthesizer. In addition, it has been found that a temporal granularity for resetting the context when encoding a transform-coded-excitation typically can be chosen larger than a temporal granularity of resetting the context in the presence of a transition (short windows) of a pure frequency domain encoding (e.g. an Advanced-Audio-Coding-type audio coding).
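The coarser reset granularity for transform-coded-excitation portions can be sketched in the same simplified style. The function name `contexts_used_lpd` and the modeling of a context as "the previously decoded portion" (with `None` standing for the default context) are assumptions for illustration, not part of the described bitstream.

```python
def contexts_used_lpd(tcx_portions, reset_flag, prev_context=None):
    """List the context seen by each transform-coded-excitation portion."""
    used = []
    context = prev_context
    for index, portion in enumerate(tcx_portions):
        if reset_flag and index == 0:
            context = None  # reset only before the first portion of the frame
        used.append(context)
        context = portion   # later portions keep the context of their predecessor
    return used


print(contexts_used_lpd(["t0", "t1", "t2"], reset_flag=True, prev_context="prev"))
```

Compared with a short-window frequency-domain frame, an active flag here causes one reset at the start of the frame and none between the portions within the frame.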
In another advantageous embodiment, the audio decoder is configured to receive an encoded audio information comprising a plurality of sets of spectral values per audio frame. In this case, the audio decoder is also advantageously configured to receive a grouping side information. The audio decoder is configured to group two or more of the sets of spectral values for a combination with a common scale factor information in dependence on the grouping side information. In the advantageous embodiment, the context resetter is configured to reset the context to the default context in response to (i.e. in dependence on) the grouping side information. The context resetter is configured to reset the context between a decoding of sets of spectral values of subsequent groups, and to avoid resetting the context between a decoding of sets of spectral values of a single group (i.e. within a group). This embodiment of the invention is based on the finding that it is not necessary to use a dedicated context reset side information if there is a signaling of sets of spectral values having high similarity (and being grouped together for this reason). In particular, it has been found that there are many cases in which it is appropriate to reset the context whenever the scale factor data change (for example at a transition from one set of spectral values to another set of spectral values within a window, particularly if the sets of spectral values are not grouped, or at a transition from one window to another window). If, however, it is desired to reset the context between two sets of spectral values, to which the same scale factors are associated, it is still possible to enforce the reset by signaling the presence of a new group. This brings along the price of retransmitting identical scale factors, but might be advantageous if a missing reset of the context would significantly degrade the coding efficiency.
Nevertheless, an evaluation of the grouping side information for the reset of the context may be an efficient concept to avoid the need to transmit a dedicated context reset side information while still allowing a reset of the context whenever appropriate. In those cases in which the context may (or should) be reset even when the same scale factor information could be used, there is a penalty in terms of bitrate (caused by the need to use an additional group and retransmit the scale factor information), which penalty in bitrate may be compensated by a bitrate reduction in other frames.
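Deriving the reset positions from the grouping side information alone can be sketched as follows. The representation of the grouping as a list of group lengths, the function name `reset_before_window`, and the choice to leave the first window of the frame without a reset are assumptions of this example; the text above only specifies resets between subsequent groups and no resets within a group.

```python
def reset_before_window(group_lengths):
    """For each window of a frame, True if the context is reset before it.

    group_lengths: number of windows in each group, in decoding order
    (a hypothetical representation of the grouping side information).
    """
    flags = []
    for group_index, length in enumerate(group_lengths):
        for position in range(length):
            # Reset only at the first window of every group after the first one.
            flags.append(group_index > 0 and position == 0)
    return flags


# Eight short windows grouped as 3 + 2 + 3.
print(reset_before_window([3, 2, 3]))
```

In this sketch no dedicated reset flag is needed; forcing an extra reset would mean splitting a group, at the price of retransmitting the scale factor information.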
Another embodiment according to the invention creates an audio encoder for providing an encoded audio information on the basis of an input audio information. The audio encoder comprises a context-based entropy encoder configured to encode a given audio information of the input audio information in dependence on a context, which context is based on an adjacent audio information, temporally or spectrally adjacent to the given audio information, in a non-reset state of operation. The context-based entropy encoder is also configured to select a mapping information, for deriving the encoded audio information from the input audio information, in dependence on the context. The context-based entropy encoder also comprises a context resetter configured to reset the context for selecting the mapping information to a default context, which is independent from the previously decoded audio information, within a contiguous piece of input audio information in response to the occurrence of a context reset condition. The context-based entropy encoder is also configured to provide a side information of the encoded audio information indicating the presence of a context reset condition. This embodiment according to the invention is based on the finding that the combination of a context-based entropy encoding and an occasional reset of the context, which is signaled by an appropriate side information, allows for a bitrate-efficient encoding of an input audio information.
In an advantageous embodiment, the audio encoder is configured to perform a regular context reset at least once per n frames of the input audio information. It has been found that a regular context reset brings along the chance to synchronize to an audio signal very rapidly, because a reset of the context introduces a temporal limitation of inter-frame dependencies (or at least contributes to such a limitation of the inter-frame dependencies).
In another advantageous embodiment, the audio encoder is configured to switch between a plurality of different coding modes (for example, frequency domain encoding mode and linear-prediction-domain encoding mode). In this case, the audio encoder may advantageously be configured to perform a context reset in response to a change between two coding modes. This embodiment is based on the finding that the change between two coding modes is typically connected with a significant change of the input audio signal, such that there is typically only a very limited correlation between the audio content before the switching of the coding mode and after the switching of the coding mode.
In another advantageous embodiment, the audio encoder is configured to compute or estimate a first number of bits that may be used for encoding a certain audio information (e.g. a specific frame or portion of the input audio information, or at least one or more specific spectral values of the input audio information) of the input audio information in dependence on a non-reset context, which non-reset context is based on an adjacent audio information temporally or spectrally adjacent to the certain audio information, and compute or estimate a second number of bits that may be used for encoding the certain audio information using the default context (e.g. the state of the context to which the context is reset). The audio encoder is further configured to compare the first number of bits and the second number of bits to decide whether to provide the encoded audio information corresponding to the certain audio information on the basis of the non-reset context or on the basis of the default context. The audio encoder is also configured to signal the result of said decision using the side information. This embodiment is based on the finding that it is sometimes difficult to decide a priori whether it is advantageous, in terms of bitrate, to reset the context. A reset of the context may result in a selection of a mapping information (for deriving the encoded audio information from a certain input audio information), which is better suited (in terms of providing a lower bitrate) for the encoding of the certain audio information or worse-suited (in terms of providing a higher bitrate) for encoding the certain audio information. In some cases, it has been found to be advantageous to decide, whether or not to reset the context, by determining the number of bits that may be used for the encoding using both variants, with and without resetting the context.
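The two encoder-side reset conditions described above (a regular reset at least once per n frames, and a reset on a change of the coding mode) can be combined in a small sketch. The function name `reset_flags`, the mode labels "FD" and "LPD", and the decision to reset on the very first frame are assumptions of this example.

```python
def reset_flags(modes, n):
    """One reset decision per frame.

    modes: coding mode of each frame (hypothetical labels such as "FD", "LPD").
    n: maximum distance, in frames, between two context resets.
    """
    flags = []
    frames_since_reset = n  # assumption: forces a reset on the very first frame
    prev_mode = None
    for mode in modes:
        mode_changed = prev_mode is not None and mode != prev_mode
        reset = mode_changed or frames_since_reset >= n
        flags.append(reset)
        frames_since_reset = 1 if reset else frames_since_reset + 1
        prev_mode = mode
    return flags


print(reset_flags(["FD", "FD", "FD", "FD", "LPD", "LPD"], n=3))
```

A decoder that joins the stream at an arbitrary frame then has to wait at most n frames for a frame whose context does not depend on earlier frames.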
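The comparison of the two bit counts can be sketched as follows. This is an illustration under assumptions, not the estimator of the embodiments: bit counts are approximated by ideal code lengths of -log2(p) per value, and the names `estimate_bits`, `choose_context` and the two toy probability models are hypothetical.

```python
import math


def estimate_bits(values, model):
    """Estimated size in bits of `values` under a probability model."""
    return sum(-math.log2(model[v]) for v in values)


def choose_context(values, non_reset_model, default_model):
    """Return (reset_flag, bits) for the cheaper of the two variants."""
    bits_non_reset = estimate_bits(values, non_reset_model)  # first number of bits
    bits_default = estimate_bits(values, default_model)      # second number of bits
    if bits_default < bits_non_reset:
        return True, bits_default   # signal a context reset in the side information
    return False, bits_non_reset    # keep the context derived from adjacent data


# Assumed toy models for quantized values 0 and 1.
non_reset_model = {0: 0.8, 1: 0.2}  # context expects mostly zeros
default_model = {0: 0.5, 1: 0.5}    # default context: no expectation

print(choose_context([1, 1, 1, 0], non_reset_model, default_model))
print(choose_context([0, 0, 0, 0], non_reset_model, default_model))
```

Since the one-bit flag is transmitted in either case, its cost does not affect the comparison in this sketch.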
Further embodiments according to the invention create a method for providing a decoded audio information on the basis of an encoded audio information, and a method for providing an encoded audio information on the basis of an input audio information.
Further embodiments according to the invention create corresponding computer programs.
Further embodiments according to the invention create an audio signal.
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
a shows a graphical representation, in the form of a syntax representation, of information comprised by a frequency domain channel stream, which can be provided by the inventive audio encoder and which can be used by the inventive audio decoder;
b shows a graphical representation, in the form of a syntax representation, of information representing arithmetically coded spectral data of the frequency domain channel stream of
a shows a pseudo program code—in a C-language like form—of a method for resetting a context of an arithmetic coding;
b shows a pseudo program code of a method for mapping a context of an arithmetic decoding between frames or windows of identical spectral resolution and also between frames or windows of different spectral resolution;
c shows a pseudo program code of a method for deriving a state value from a context;
d shows a pseudo program code of a method for deriving an index of a cumulative frequencies table from a value describing the state of the context;
e shows a pseudo program code of a method for arithmetically decoding arithmetically encoded spectral values;
f shows a pseudo program code of a method for updating the context subsequent to a decoding of a tuple of spectral values;
a shows a graphical representation of a context reset in the presence of audio frames having associated therewith “long windows” (one long window per audio frame);
b shows a graphical representation of a context reset for audio frames having associated therewith a plurality of “short windows” (e.g. eight short windows per audio frame);
c shows a graphical representation of a context reset at a transition between a first audio frame having associated therewith a “long start window” and an audio frame having associated therewith a plurality of “short windows;”
a shows a graphical representation, in the form of a syntax representation, of information comprised by a linear prediction-domain channel stream;
b shows a graphical representation, in the form of a syntax representation, of information comprised by a transform coded-excitation coding, which transform-coded-excitation coding is part of the linear-prediction-domain channel stream of
c and 11d show a legend defining information items and help elements used in the syntax representations of
Thus, in operation, the context resetter 130 resets the context 122 whenever it detects a context-reset side information (e.g. a context reset flag) associated with the entropy-encoded audio information 110. A reset of the context 122 to the default context may have the consequence that a default mapping information (e.g. a default Huffman codebook, in the case of a Huffman coding, or a default (cumulative) frequency information “cum_freq” in the case of an arithmetic coding) is selected for deriving the decoded audio information 112 (e.g. decoded spectral values a, b, c, d) from the entropy-encoded audio information 110 (comprising, e.g. encoded spectral values a, b, c, d).
Accordingly, in a non-reset state of operation, the context 122 is affected by previously decoded audio information, for example spectral values of previously decoded audio frames. Consequently, the selection of the mapping information (which is performed on the basis of the context), for decoding a current audio frame (or for decoding one or more spectral values of the current audio frame), is typically dependent on decoded audio information of a previously decoded frame (or a previously decoded “window”).
In contrast, if the context is reset (i.e. in a context reset state of operation), the impact of the previously decoded audio information (e.g. decoded spectral values) of a previously decoded audio frame onto the selection of the mapping information, for decoding a current audio frame, is eliminated. Thus, after a reset, the entropy decoding of the current audio frame (or at least of some spectral values) is typically no longer dependent on the audio information (e.g. spectral values) of the previously decoded audio frame. Nevertheless, a decoding of an audio content (e.g. one or more spectral values) of the current audio frame may (or may not) comprise some dependencies on previously decoded audio information of the same audio frame.
Accordingly, the consideration of the context 122 may improve the mapping information 124 used for deriving the decoded audio information 112 from the encoded audio information 110 in the absence of a reset condition. The context 122 may be reset if the side information 132 indicates a reset condition in order to avoid the consideration of an inappropriate context, which would typically result in an increased bitrate. Accordingly, the audio decoder 100 allows for a decoding of an entropy-encoded audio information with a good bitrate efficiency.
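The interplay of context, mapping information and context reset outlined above may be illustrated by the following sketch. It is a simplified model under assumptions made here: the class `ContextModel`, the two-state classification ("low"/"high"), the threshold and the table names are hypothetical and do not correspond to the actual tables or state computation of the decoder.

```python
DEFAULT_CONTEXT = (0, 0, 0, 0)  # assumed default, independent of earlier frames

# Hypothetical mapping information: one table name per coarse context state.
MAPPING_TABLES = {
    "low": "table_for_quiet_context",
    "high": "table_for_loud_context",
}

class ContextModel:
    """Holds the context and selects mapping information from it."""

    def __init__(self):
        self.context = DEFAULT_CONTEXT

    def reset(self):
        # Context reset: discard the influence of previously decoded values.
        self.context = DEFAULT_CONTEXT

    def select_mapping(self):
        # Coarse state derived from the magnitudes stored in the context.
        state = "high" if sum(abs(v) for v in self.context) > 4 else "low"
        return MAPPING_TABLES[state]

    def update(self, decoded_values):
        # Non-reset operation: the context follows the decoded values.
        self.context = tuple(decoded_values)

def mapping_for_frame(model, reset_flag):
    """Evaluate the reset side information, then select the mapping."""
    if reset_flag:
        model.reset()
    return model.select_mapping()

model = ContextModel()
model.update((3, -2, 1, 2))                   # previous frame was loud
without_reset = mapping_for_frame(model, False)
with_reset = mapping_for_frame(model, True)
```

Without a reset, the selection depends on the previously decoded values; with a reset, the selection falls back to the table associated with the default context.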
In the following, an overview will be given of an audio decoder which allows for a decoding of both frequency-domain encoded audio content and linear-prediction-domain encoded audio content, thereby allowing for the dynamic (e.g. frame-wise) choice of the most appropriate coding mode. The audio decoder discussed in the following combines frequency-domain decoding and linear-prediction-domain decoding. However, it should be noted that the functionalities discussed in the following can also be used separately in a frequency-domain audio decoder and in a linear-prediction-domain audio decoder.
The audio decoder 200 also comprises a time domain signal reconstruction. In the case of a frequency-domain encoding, the time domain signal reconstruction may, for example, comprise an inverse quantizer 250, which is configured to receive the frequency-domain decoded spectral values provided by the entropy decoder 240 and to provide, on the basis thereof, inversely quantized frequency-domain decoded spectral values to a frequency-domain-to-time-domain audio signal reconstruction 252. The frequency-domain-to-time-domain audio signal reconstruction 252 may be configured to receive the frequency-domain control information 228 and, optionally, additional information (like, for example, control information). The frequency-domain-to-time-domain audio signal reconstruction 252 may be configured to provide, as an output signal, a frequency-domain coded time domain audio signal 254. Regarding the linear prediction domain, the audio decoder 200 comprises a linear-prediction-domain-to-time-domain audio signal reconstruction 262, which is configured to receive the linear-prediction-domain transform-coded-excitation stimulus decoded spectral values 244, the linear-prediction-domain control information 226 and, optionally, additional linear-prediction-domain information (for example coefficients of the linear prediction models, or an encoded version thereof), and to provide, on the basis thereof, a linear-prediction-domain coded time domain audio signal 264.
The audio decoder 200 also comprises a selector 270 for selecting between the frequency-domain coded time domain audio signal 254 and the linear-prediction-domain coded time domain audio signal 264 in dependence on the domain selection information 230, to decide whether the decoded audio signal 212 (or a temporal portion thereof) is based on the frequency-domain coded time domain audio signal 254 or the linear-prediction-domain coded time domain audio signal 264. At the transition between the domains, a cross fade may be performed by the selector 270 to provide the selector output signal 272. The decoded audio signal 212 may be equal to the selector output signal 272, or may advantageously be derived from the selector output signal 272 using an audio signal postprocessor 280. The audio signal postprocessor 280 may take into consideration the post processing control information 232 provided by the bitstream demultiplexer 220.
To summarize the above, the audio decoder 200 may provide the decoded audio signal 212 on the basis of either the frequency-domain channel stream data 222 (in combination with possible additional control information), or the linear-prediction-domain channel stream data 224 (in combination with additional control information), wherein the audio decoder 200 may switch between the frequency-domain and the linear-prediction-domain using the selector 270. The frequency-domain coded time domain audio signal 254 and the linear-prediction-domain coded time domain audio signal 264 may be generated independently from each other. However, the same entropy decoder/context resetter 240 may be applied (possibly in combination with different, domain-specific mapping information, like cumulative frequencies tables) for the derivation of the frequency domain decoded spectral values 242, which form the basis of the frequency-domain coded time domain audio signal 254, and for the derivation of the linear-prediction-domain transform-coded-excitation stimulus decoded spectral values 244, which form the basis for the linear-prediction-domain coded time-domain audio signal 264.
In the following, details regarding the provision of the frequency-domain decoded spectral values 242 and regarding the provision of the linear-prediction-domain transform-coded-excitation stimulus decoded spectral values 244 will be discussed.
It should be noted that details regarding the derivation of the frequency-domain coded time domain audio signal 254 from frequency-domain decoded spectral values 242 can be found in international standard ISO/IEC 14496-3:2005, part 3: audio, part 4: general audio coding (GA) - AAC, Twin VQ, BSAC, and the documents referenced therein.
It should also be noted that details regarding the computation of the linear-prediction-domain coded time-domain audio signal 264 on the basis of the linear-prediction-domain transform-coded-excitation stimulus decoded spectral values 244 may, for example, be found in the international standards 3GPP TS 26.090, 3GPP TS 26.190 and 3GPP TS 26.290.
Said standards also comprise information regarding some of the symbols used in the following.
In the following, it will be described how the frequency-domain decoded spectral values 242 can be derived from the frequency-domain channel stream data, and how the inventive context reset is involved in this calculation.
In the following, the relevant data structures of a frequency domain channel stream will be described taking reference to
a shows a graphical representation, in the form of a table, of the syntax of the frequency domain channel stream. As can be seen, the frequency domain channel stream may comprise a “global_gain” information. In addition, the frequency domain channel stream may comprise scale factor data (“scale_factor_data”), which define scale factors for different frequency bins. Regarding the global gain and the scale factor data, and their usage, reference is made to international standard ISO/IEC 14496-3(2005), part 3, sub part 4, and to the documents referenced therein.
The frequency domain channel stream may also comprise arithmetically coded spectral data (“ac_spectral_data”) which will be explained in detail in the following. It should be noted that the frequency-domain channel stream may comprise additional optional information, like noise filling information, configuration information, time warp information and temporal noise shaping information, which are not of relevance for the present invention.
In the following, details regarding the arithmetically coded spectral data will be discussed taking reference to
Taking reference again to
Taking reference now to
In addition, for groups of 4-tuples having a cardinal larger than one, an arithmetic codeword “acod_ne” for decoding the index ne of the tuple within the group ng may be included within the arithmetically encoded data “arith_data.” The codeword “acod_ne” may be encoded, for example, in dependence on a context.
In addition, one or more arithmetically encoded code words “acod_r” encoding one or more of the least significant bits of the values a,b,c,d of the tuple may be included in the arithmetically encoded data “arith_data.”
To summarize, the arithmetically encoded data “arith_data” comprise one (or in the presence of an arithmetic escape sequence, more) arithmetic codeword “acod_ng” for encoding a group index ng taking into account a cumulative frequencies table having index pki. Optionally (depending on the cardinal of the group designated by the group index ng) the arithmetically encoded data also comprise an arithmetic codeword “acod_ne” for encoding an element index ne. Optionally, the arithmetically encoded data may also comprise one or more arithmetic code words for encoding one or more least significant bits.
The context, which determines the index (e.g. pki) of the cumulative frequencies table used for the encoding/decoding of the arithmetic codeword “acod_ng” is based on context data q[0], q[1], qs not shown in
It should be noted that the initialization of the context, which is described in the section “obtain-inter-window context information,” is performed once (and advantageously only once) per audio frame (if the audio frame comprises only one window) or once (and advantageously only once) per window (if the current audio frame comprises more than one window).
Accordingly, a reset of the entire context information q[0], q[1], qs (or the alternative initialization of the context information q[0] on the basis of the decoded spectral values of the previous frame (or previous window)) is advantageously performed only once per block of arithmetically encoded data (i.e. only once per frame, if the present frame comprises only one window, or only once per window, if the present frame comprises more than one window).
In contrast, the context information q[1] (which is based on the previously decoded spectral values of the current frame or window) is updated upon completion of a decoding of a single tuple of spectral values a, b, c, d, for example as defined by the procedure “arith_update_context.”
For further details regarding the payloads of the “spectral noiseless coder” (i.e. for encoding the arithmetically encoded spectral values) reference is made to the definitions as given in the tables of
To summarize, spectrum coefficients (e.g. a, b, c, d) from both the “linear prediction domain” coded signal 224 and the “frequency-domain” coded signal 222 are scalar quantized and then noiselessly coded by an adaptively context dependent arithmetic coding (for example by an encoder providing the entropy coded audio signal 210). The quantized coefficients (e.g. a, b, c, d) are gathered together in 4-tuples before being transmitted (by the encoder) from the lowest frequency to the highest frequency. Each 4-tuple is split into the most significant 3-bits wise plane (one bit for the sign and two bits for the amplitude) and the remaining less significant bit-planes. The most significant 3-bits wise plane is coded according to its neighborhood (i.e. taking into consideration the “context”) by means of the group index, ng, and the element index, ne. The remaining less significant bit-planes are entropy coded without considering the context. The indexes ng and ne and the less significant bit-planes form the samples of the arithmetic coder (which are evaluated by the entropy decoder 240). Details regarding the arithmetic coding will be described below in the section 1.2.2.2.
In the following, the functionality of the context-based entropy decoder 120, 240, comprising the context resetter 130, will be described in detail taking reference to
It should be noted that it is the function of the context-based entropy decoder to reconstruct (decode) an entropy decoded (advantageously arithmetically decoded) audio information (e.g. spectral values a, b, c, d of a frequency-domain representation of the audio signal, or of a linear-prediction-domain transform-coded-excitation representation of the audio signal) on the basis of an entropy encoded (advantageously arithmetically encoded) audio information (e.g. encoded spectral values). The context-based entropy decoder (comprising the context resetter) may for example be configured to decode spectral values a, b, c, d encoded as described by the syntax shown in
It should also be noted that the syntax shown in
Taking reference now to
Subsequently, a plurality of arithmetically encoded spectral values (or tuples of such values) may be decoded by performing steps 620, 630, 640 one or more times. In step 620, a mapping information (for example a Huffman codebook, or a cumulative frequencies table “cum_freq”) is selected on the basis of the context as established in step 610 (and optionally updated in the step 640). The step 620 may comprise a one-or-more step method for determining the mapping information. For example, the step 620 may comprise a step 622 of computing the state of the context on the basis of the context information (e.g. q[0], q[1]). The computation of the state of the context may for example be performed by the function “arith_get_context,” which is defined below. Optionally, an auxiliary mapping may be performed (for example as seen in the pseudo code portion labeled “compute state of context” of
Subsequently, the context may be updated in the step 640 using the newly decoded audio information (for example using one or more spectral values a, b, c, d). For example, a portion of the context representing previously decoded audio information of the present frame or window (e.g. q[1]) may be updated. For this purpose, the function “arith_update_context” detailed below may be used.
As mentioned above, steps 620, 630, 640 may be repeated.
Entropy decoding the encoded audio information may comprise using one or more arithmetic code words (e.g. “acod_ng,” “acod_ne” and/or “acod_r”) comprised by the entropy encoded audio information 222, 224, for example as represented in
In the following, an example of the context considered for the state calculation (state of the context) will be described taking reference to
Taking reference now to
Regarding the arithmetic encoding and arithmetic decoding, it should be noted that the arithmetic coder produces a binary code for a given set of symbols (e.g. spectral values a, b, c, d) and their respective probabilities (as defined, for example, by the cumulative frequencies tables). The binary code is generated by mapping a probability interval, in which a set of symbols (e.g. a, b, c, d) lies, to a code word. Inversely, the set of samples (e.g. a, b, c, d) is derived from the binary code by an inverse mapping, wherein the probability of the samples (e.g. a, b, c, d) is taken into account (for example by selecting a mapping information, like a cumulative frequencies distribution, on the basis of the context). In the following, the decoding process, i.e. the process of arithmetic decoding, which may be performed by the context based entropy decoder 120 or by the entropy decoder/context resetter 240, and which has been generally described taking reference to
For this purpose, reference is made to the definitions shown in the table of
Regarding the decoding process, it can be said that 4-tuples of quantized spectral coefficients are noiselessly coded (by the encoder) and transmitted (via a transmission channel or storage medium between the encoder and the decoder discussed here) starting from the lowest-frequency coefficient and progressing to the highest-frequency coefficient.
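The mapping between a probability interval and a code value, which underlies the arithmetic coding mentioned above, can be illustrated with a toy example. The sketch below uses exact rational arithmetic for clarity and is therefore not the integer implementation with scaling used by the decoder described here; the table `CUM_FREQ`, its values and the function names `encode` and `decode` are assumptions made for this example only.

```python
from fractions import Fraction

# Hypothetical cumulative frequencies for three symbols (total count 8).
# Symbol k occupies the range [CUM_FREQ[k], CUM_FREQ[k + 1]) out of TOTAL.
CUM_FREQ = [0, 5, 7, 8]
TOTAL = 8

def encode(symbols):
    """Narrow the interval [low, low + width) once per symbol."""
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        low += width * Fraction(CUM_FREQ[s], TOTAL)
        width *= Fraction(CUM_FREQ[s + 1] - CUM_FREQ[s], TOTAL)
    # Any value inside the final interval identifies the symbol sequence.
    return low + width / 2

def decode(value, count):
    """Inverse mapping: locate the sub-interval holding `value`, then rescale."""
    symbols = []
    for _ in range(count):
        scaled = value * TOTAL
        # Largest symbol index whose lower bound does not exceed the value.
        s = max(k for k in range(len(CUM_FREQ) - 1) if CUM_FREQ[k] <= scaled)
        symbols.append(s)
        value = (scaled - CUM_FREQ[s]) / (CUM_FREQ[s + 1] - CUM_FREQ[s])
    return symbols

message = [0, 2, 1, 0]
code_value = encode(message)
decoded = decode(code_value, len(message))
```

In a context-based coder, the table playing the role of `CUM_FREQ` would be chosen anew for each symbol in dependence on the context, identically in the encoder and in the decoder.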
The coefficients from the advanced audio coding (AAC) (i.e. the coefficients of the frequency-domain channel stream data) are stored in an array “x_ac_quant[g][win][sfb][bin],” and the order of transmission of the noiseless coding code words is such that, when they are decoded in the order received and stored in the array, [bin] is the most rapidly incrementing index and [g] is the most slowly incrementing index. Within a codeword, the order of decoding is a, b, c, d.
The coefficients from the transform-coded-excitation (TCX) (e.g. of the linear-prediction-domain channel stream data) are stored directly in the array “x_tcx_invquant[win][bin],” and the order of transmission of the noiseless coding code words is such that, when they are decoded in the order received and stored in the array, [bin] is the most rapidly incrementing index and [win] is the most slowly incrementing index. Within a codeword, the order of decoding is a, b, c, d.
First, the flag “arith_reset_flag” is evaluated. The flag “arith_reset_flag” determines if the context may be reset. If the flag is TRUE, the function “arith_reset_context,” which is shown in the pseudo program code representation of
The noiseless decoder (or entropy decoder) outputs 4-tuples of signed quantized spectral coefficients. At first, the state of the context is calculated based on the four previously decoded groups “surrounding” (or, more precisely, neighboring) the 4-tuple to decode (as shown in
Once the state s is known, the group to which the most significant 2-bits wise plane of the 4-tuple belongs is decoded using the function “arith_decode( )” fed with (or configured to use) the appropriate (selected) cumulative frequencies table corresponding to the context state. The correspondence is made by the function “arith_get_pk( )” which is represented by the pseudo code representation of
To summarize, the functions “arith_get_context” and “arith_get_pk” make it possible to obtain a cumulative frequencies table index pki on the basis of the context (namely q[0][1+i], q[1][1+i−1], q[s][1+i−1], q[0][1+i+1]). Thus, it is possible to select mapping information (namely one of the cumulative frequencies tables) in dependence on the context.
Then (once the cumulative frequencies table is selected) the “arith_decode( )” function is called with the cumulative frequencies table corresponding to the index returned by the function “arith_get_pk( ).” The arithmetic decoder is an integer implementation generating tag with scaling. The pseudo C-code shown in
Taking reference to the algorithm “arith_decode” shown in
This can be seen when considering the graphic representation of the syntax of “arith_data”, which is given in
While the decoded group index ng is the “escape” symbol, “ARITH_ESCAPE,” an additional group index ng is decoded and the variable lev is incremented by two. Once the decoded group index is not the escape, “ARITH_ESCAPE,” the number of elements, mm, within the group and the group offset, og, are deduced by looking up to the table “dgroups[ ]”:
mm=dgroups[ng]&255
og=dgroups[ng]>>8
The element index ne is then decoded by calling the function “arith_decode( )” with the cumulative frequencies table “arith_cf_ne+((mm*(mm−1))>>1)[ ].” Once the element index is decoded, the most significant 2-bits wise plane of the 4-tuple can be derived with the table “dgvectors[ ]:”
a=dgvectors[4*(og+ne)]
b=dgvectors[4*(og+ne)+1]
c=dgvectors[4*(og+ne)+2]
d=dgvectors[4*(og+ne)+3]
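The two table lookups above (unpacking the group entry and then reading the 4-tuple) can be sketched as follows. The miniature tables are hypothetical and far smaller than the actual “dgroups” and “dgvectors” tables; only the packing (element count in the low 8 bits, group offset in the bits above) and the indexing follow the expressions given above. The group index is called `ng` here, and the range check on `ne` is an addition made for the example.

```python
# Hypothetical miniature table of 4-tuples, stored flat (4 entries per vector).
DGVECTORS = [
    0, 0, 0, 0,   # vector 0 (group 0, element 0)
    1, 0, 0, 0,   # vector 1 (group 1, element 0)
    0, 1, 0, 0,   # vector 2 (group 1, element 1)
    0, 0, 1, 0,   # vector 3 (group 1, element 2)
    0, 0, 0, 1,   # vector 4 (group 1, element 3)
]

# Each entry packs the group offset og (upper bits) and element count mm (low 8 bits).
DGROUPS = [
    (0 << 8) | 1,  # group 0: offset 0, one element
    (1 << 8) | 4,  # group 1: offset 1, four elements
]

def unpack_group(ng):
    """Return (mm, og) for the group index ng."""
    mm = DGROUPS[ng] & 255
    og = DGROUPS[ng] >> 8
    return mm, og

def most_significant_plane(ng, ne):
    """Look up the 4-tuple (a, b, c, d) for group index ng and element index ne."""
    mm, og = unpack_group(ng)
    if not 0 <= ne < mm:
        raise ValueError("element index outside the group")
    base = 4 * (og + ne)
    a = DGVECTORS[base]
    b = DGVECTORS[base + 1]
    c = DGVECTORS[base + 2]
    d = DGVECTORS[base + 3]
    return a, b, c, d

example_tuple = most_significant_plane(1, 2)
```

For a group with a single element (mm equal to one), no element index needs to be transmitted, which is consistent with “acod_ne” being present only for groups having a cardinal larger than one.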
The remaining bit planes (for example the least significant bits) are then decoded from the most significant to the least significant level by calling “arith_decode( )” lev times with the cumulative frequencies table “arith_cf_r[ ]” (which is a predefined cumulative frequencies table for the decoding of the least significant bits, and which may indicate equal frequencies of the bit combinations). The decoded bit plane r allows the decoded 4-tuple to be refined in the following way:
a=(a<<1)|(r&1)
b=(b<<1)|((r>>1)&1)
c=(c<<1)|((r>>2)&1)
d=(d<<1)|(r>>3)
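The refinement step above can be written as a small function. The sketch assumes that each decoded bit plane r is a value between 0 and 15 carrying one bit per component (bit 0 for a, bit 1 for b, bit 2 for c, bit 3 for d), as the expressions above suggest; the function names `refine` and `refine_all` are chosen here for illustration.

```python
def refine(tuple4, r):
    """Append one less significant bit to each component of the 4-tuple."""
    a, b, c, d = tuple4
    a = (a << 1) | (r & 1)
    b = (b << 1) | ((r >> 1) & 1)
    c = (c << 1) | ((r >> 2) & 1)
    d = (d << 1) | (r >> 3)
    return a, b, c, d

def refine_all(tuple4, planes):
    """Apply the bit planes in the order decoded (most significant first)."""
    for r in planes:
        tuple4 = refine(tuple4, r)
    return tuple4

# Two refinement planes: after the first, the tuple is (3, 0, 3, 0).
refined = refine_all((1, 0, 1, 0), [0b0101, 0b1000])
```

Each call doubles the magnitude range of the components, so lev calls extend the tuple decoded from the most significant plane by lev less significant bits.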
Once the 4-tuple (a, b, c, d) is completely decoded, the context tables q and qs are updated by calling the function “arith_update_context( )” which is represented by the pseudo program code representation of
As can be seen from
To summarize, the function “arith_update_context” comprises two main functionalities, namely to update the context portion (e.g. q[1]) representing previously decoded spectral values of the current frame or window, as soon as a new spectral value of the current frame or window is decoded, and to update the context history (e.g. qs) in response to the completion of the decoding of a frame or window, such that the context history qs can be used to derive a context portion (e.g. q[0]) which represents an “old” context when decoding the next frame or window.
As can be seen in the pseudo program code representation of
In the following, the method of arithmetic decoding will be briefly summarized taking reference to
In the following, the course of the decoding will be briefly discussed for different scenarios taking reference to
a shows a graphical representation of the course of the decoding for an audio frame being frequency-domain encoded using a so-called “long window.” Regarding the encoding, reference is made to International Standard ISO/IEC 14496-3(2005), part 3, sub-part 4. As can be seen, the audio contents of a first frame 1010 and of a second frame 1012 are closely related, and the time-domain signals reconstructed for the audio frames 1010, 1012 are overlapped-and-added (as defined in said standard). One set of spectral coefficients is associated to each of the frames 1010, 1012, as is known from the above referenced standard. Further, a novel 1-bit context reset flag (“arith_reset_flag”) is associated with each of the frames 1010, 1012. If the context reset flag associated with the first frame 1010 is set, the context is reset (e.g. according to the algorithm shown in
Taking reference now to
Taking reference now to
To summarize the above, the evaluation of the context reset flag provides the inventive entropy decoder with a very large flexibility. In an advantageous embodiment, the entropy decoder is capable of:
In other words, the entropy decoder is configured to perform the context reset independent from a change of the window shape and/or spectral resolution, by evaluating the context reset side information separate from the window shape/spectral resolution side information.
In the following, the syntax of a linear-prediction-domain channel stream will be described taking reference to
Taking reference now to
Furthermore, it should be noted that the linear-prediction-domain channel stream may comprise up to four “blocks” (having indices k=0 to k=3) which comprise either an ACELP-encoded excitation or a transform-coded-excitation (which may itself be arithmetically coded). Taking reference again to
Regarding the TCX stimulus encoding, it should be noted that different encodings are used for encoding a first TCX “block” (also designated as “TCX frame”) of the current audio frame and for the encoding of any subsequent TCX “blocks” (TCX frames) of the current audio frame. This is indicated by the so-called “first_tcx_flag,” which indicates if the currently processed TCX “block” (TCX frame) is the first in the present frame (also designated as “super frame” in the terminology of linear-prediction-domain coding).
Taking reference now to
The spectral values representing the transform-coded-excitation stimulus of a first tcx “block” of an audio frame are encoded using a reset context (default context) if the context reset flag (“arith_reset_flag”) of said tcx “block” is active. The arithmetically encoded spectral values of a first tcx “block” of an audio frame are encoded using a non-reset context if the context reset flag of said audio frame is inactive. The arithmetically encoded values of any subsequent tcx “blocks” (subsequent to the first tcx “block”) of an audio frame are encoded using a non-reset context (i.e. using a context derived from a previous tcx block). Said details regarding the arithmetic encoding of the spectral values (or spectral coefficients) of the transform-coded-excitation can be seen in
The transform-coded-excitation spectral values, which are arithmetically encoded, can be decoded taking into account the context. For example, if the context reset flag of a tcx “block” is active, the context may be reset, for example, in accordance with the algorithm shown in
For the decoding of tcx excitation stimulus spectral values, the decoder may therefore use the algorithm, which has been explained, for example, with reference to
Accordingly, the tcx excitation stimulus spectral value decoder may be configured to decode spectral values encoded according to the syntax shown in
In the following, a decoding of a linear-prediction-domain excitation audio information will be described taking reference to
Accordingly, when decoding the linear-prediction-domain stimulus shown in
It should also be noted that an audio stream may comprise a combination of frequency-domain audio frames and linear-prediction-domain audio frames, such that the decoder may be configured to properly decode such an alternating sequence. At a transition between different encoding modes (frequency-domain vs. linear-prediction-domain), a reset of the context may or may not be enforced by the context resetter.
In the following another audio decoder concept will be described, which allows for a bitrate-efficient resetting of the context even in the absence of a dedicated context reset side information.
It has been found that the side information, which accompanies the entropy encoded spectral values, can be exploited for deciding whether to reset the context for entropy-decoding (e.g. arithmetical decoding) of the entropy encoded spectral values.
An efficient concept for resetting the context of the arithmetic decoding has been found for audio frames in which sets of spectral values associated with a plurality of windows are comprised. For example, the so-called “advanced audio coding” (also briefly designated as “AAC”), which is defined in the international standard ISO/IEC 14496-3:2005, part 3, subpart 4, uses audio frames comprising eight sets of spectral coefficients, wherein each set of spectral coefficients is associated to one “short window”. Accordingly, eight short windows are associated with such an audio frame, wherein the eight short windows are used in an overlap-and-add procedure for overlapping-and-adding windowed time domain signals reconstructed on the basis of the sets of spectral coefficients. For details, reference is made to said international standard. However, in an audio frame comprising a plurality of sets of spectral coefficients, two or more of the sets of spectral coefficients may be grouped, such that common scale factors are associated with the grouped sets of spectral coefficients (and are applied thereto in the decoder). The grouping of sets of spectral coefficients may for example be signaled using a grouping side information (e.g. “scale_factor_grouping” bits). For details, reference is made, for example, to ISO/IEC 14496-3:2005(E), part 3, subpart 4, tables 4.6, 4.44, 4.45, 4.46 and 4.47. Nevertheless, in order to provide a full understanding, reference is made to the above-mentioned international standard in its entirety.
However, in an audio decoder according to an embodiment of the invention, the information regarding the grouping of different sets of spectral values (for example, by associating them with common scale factors) may be used for determining when to reset the context for the arithmetic encoding/decoding of the spectral values. For example, an inventive audio decoder according to the third embodiment might be configured to reset the context of the entropy decoding (e.g. of a context-based Huffman decoding or a context-based arithmetic decoding, as described above) whenever it is found that there is a transition from one group of sets of encoded spectral values to another group of sets of spectral values (to which other group of sets new scale factors are associated). Accordingly, rather than using a context reset flag, the scale factor grouping side information may be exploited to determine when to reset the context of the arithmetic decoding.
In the following, an example of this concept will be explained taking reference to
In contrast, the second audio frame is of type “EIGHT_SHORT_SEQUENCE” and may accordingly comprise eight sets of encoded spectral values. However, the first three sets of encoded spectral values may be grouped together to form one group 1322A (to which a common scale factor information is associated). Another group 1322B may be defined by a single set of spectral values. A third group 1322C may comprise two sets of spectral values associated therewith, and a fourth group 1322D may comprise another two sets of spectral values associated therewith. The grouping of sets of spectral values of the audio frame 1320 may be signaled by the so-called “scale_factor_grouping” bits defined, for example, in table 4.6 of the above-referenced standard. Similarly, the audio frame 1330 may comprise four groups 1330A, 1330B, 1330C, 1330D.
However, the audio frames 1320, 1330 may, for example, not comprise a dedicated context reset flag. For entropy decoding the spectral values of the audio frame 1320, the decoder may, for example unconditionally or in dependence on a context reset flag, reset the context before decoding the first set of spectral coefficients of the first group 1322A. Subsequently, the audio decoder may avoid resetting the context between the decoding of different sets of spectral coefficients of the same group. However, whenever the audio decoder detects the beginning of a new group within the audio frame 1320 comprising a plurality of groups (of sets of spectral coefficients), the audio decoder may reset the context for the entropy decoding of the spectral coefficients. Thus, the audio decoder may effectively reset the context before the decoding of the spectral coefficients of the first group 1322A, before the decoding of the spectral coefficients of the second group 1322B, before the decoding of the spectral coefficients of the third group 1322C, and before the decoding of the spectral coefficients of the fourth group 1322D.
Accordingly, a separate transmission of a dedicated context reset flag may be avoided within such audio frames in which there are a plurality of sets of spectral coefficients. Accordingly, the extra bit load produced by the transmission of the grouping bits may at least partly be compensated by the omission of the transmission of a dedicated context reset flag in such a frame, which may be unnecessary in some applications.
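The group-boundary reset described above can be sketched as follows. This is only a minimal illustration and not a normative procedure: the function name `reset_points` is hypothetical, and the bit convention (one bit for each short window after the first, where 1 means "same group as the previous window" and 0 means "a new group starts here") is an assumption modeled on the AAC "scale_factor_grouping" field.

```python
def reset_points(grouping_bits, reset_at_frame_start=True):
    """Return the window indices before which the context is reset.

    grouping_bits: one bit for each short window after the first
    (assumed convention: 1 = same group as previous window, 0 = new group).
    """
    points = [0] if reset_at_frame_start else []
    for window, bit in enumerate(grouping_bits, start=1):
        if bit == 0:  # a new group (with new scale factors) begins here
            points.append(window)
    return points


# Grouping of the example frame: groups of 3, 1, 2 and 2 short windows.
example_bits = [1, 1, 0, 0, 1, 0, 1]
print(reset_points(example_bits))
```

With the grouping of the example frame (three, one, two and two short windows), the sketch places the resets before windows 0, 3, 4 and 6, i.e. once at the start of each group and never inside a group.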
To summarize, a reset strategy has been described which can be implemented as a decoder feature (and also as an encoder feature). The strategy described here does not need the transmission of any additional information (like a dedicated side information for resetting the context) to a decoder. It uses the side information already sent to the decoder (e.g. by an encoder providing an AAC encoded audio stream corresponding to the above industry standard). As described herein, the content of the signal (audio signal) can change from frame to frame of, for example, 1024 samples. In this case, the reset flag is already available, which can control the context-adaptive coding and mitigate the impact on its performance. However, within a frame of 1024 samples, the content can change as well. In such a case, when an audio coder (for example according to the unified speech and audio coding “USAC”) uses a frequency domain (FD) coding, the decoder will usually switch to short blocks. In short blocks, grouping information is sent (as discussed above) which already gives information about the position of a transition or a transient (of the audio signal). Such information can be reused for resetting the context, as discussed in this section.
On the other hand, when an audio coder (like, for example, according to the unified speech and audio coding “USAC”) uses linear prediction domain (LPD) coding, a change of content will affect the selected coding modes. When different transform-coding-excitations occur within one frame of 1024 samples, a context mapping may be used, as described above. (See, for example, the context mapping of
In other words, taking reference, for example, to
Also, optionally, the decoder may be configured to evaluate a context reset flag, for example once per audio frame, if a TCX block precedes the present audio frame, to allow for a reset of the context even in the presence of an extended segment of TCX “blocks”.
In the following, the basic concept of a context-based entropy encoder will be discussed in order to facilitate the understanding of the specific procedures for the reset of the context which will be discussed in detail in the following.
Noiseless coding can be based on quantized spectral values and may use context dependent cumulative frequency tables derived from, for example, four previously decoded neighbouring tuples.
Note that the previous and current segments referred to in the above described embodiments may correspond to a tuple in the present embodiment, in other words, the segments may be processed band wise in the frequency or spectral domain. As illustrated in
In the present embodiment context based arithmetic coding may be carried out on the basis of 4-tuples (i.e. on four spectral coefficient indices), which are also labelled q(n,m), or q[m][n], representing the spectral coefficients after quantization, which are neighboured in the frequency or spectral domain and which are entropy coded in one step. According to the above description, coding may be carried out based on the coding context. As indicated in
a shows a flow-chart of a USAC (USAC=Unified Speech and Audio Coding) context-dependent arithmetic coder for the encoding scheme of spectral coefficients. The encoding process depends on the current 4-tuple plus the context, where the context is used for selecting the probability distribution of the arithmetic coder and for predicting the amplitude of the spectral coefficients. In
Generally, in embodiments the entropy encoder can be adapted for encoding the current segment in units of a 4-tuple of spectral coefficients and for predicting an amplitude range of the 4-tuple based on the coding context.
In the present embodiment the encoding scheme comprises several stages. First, the literal codeword is encoded using an arithmetic coder and a specific probability distribution. The codeword represents four neighbouring spectral coefficients (a,b,c,d), however, each of a, b, c, d is limited in range:
−5<a,b,c,d<4.
Generally, in embodiments the entropy encoder can be adapted for dividing the 4-tuple by a predetermined factor as often as is useful to fit a result of the division in the predicted range or in a predetermined range and for encoding a number of divisions useful, a division remainder and the result of the division when the 4-tuple does not lie in the predicted range, and for encoding a division remainder and the result of the division otherwise.
In the following, if the term (a,b,c,d), i.e. any coefficient a, b, c, d, exceeds the given range in this embodiment, this can in general be considered by dividing (a,b,c,d) as often by a factor (e.g. 2 or 4) as is useful, for fitting the resulting codeword in the given range. The division by a factor of 2 corresponds to a binary shifting to the right-hand side, i.e. (a,b,c,d)>>1. This diminution is done in an integer representation, i.e. information may be lost. The least significant bits, which may get lost by the shifting to the right, are stored and later on coded using the arithmetic coder and a uniform probability distribution. The process of shifting to the right is carried out for all four spectral coefficients (a,b,c,d).
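The range reduction by right-shifting can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `reduce_tuple` and `restore_tuple` are hypothetical, a division factor of 2 (a shift by one bit) is used by default, and the coding of the stored bits by the arithmetic coder is left out.

```python
LOW, HIGH = -5, 4  # exclusive bounds of the range: -5 < x < 4


def in_range(tup):
    return all(LOW < x < HIGH for x in tup)


def reduce_tuple(tup, shift=1):
    """Shift all four values right until they fit the range.

    Returns the reduced tuple and the list of stored least significant bits,
    one entry per shift operation.
    """
    mask = (1 << shift) - 1
    stored = []
    while not in_range(tup):
        stored.append(tuple(x & mask for x in tup))  # bits lost by the shift
        tup = tuple(x >> shift for x in tup)         # arithmetic right shift
    return tup, stored


def restore_tuple(reduced, stored, shift=1):
    """Undo reduce_tuple by re-attaching the stored bits, last shift first."""
    for lsbs in reversed(stored):
        reduced = tuple((x << shift) | b for x, b in zip(reduced, lsbs))
    return reduced


reduced, stored = reduce_tuple((10, -7, 0, 3))
print(reduced, stored, restore_tuple(reduced, stored))
```

Because the removed bits are kept, the shift itself is lossy but the pair of reduced tuple and stored bits is not: in this sketch `restore_tuple` recovers the original 4-tuple exactly, including negative values.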
In general embodiments, the entropy encoder can be adapted for encoding the result of the division or the 4-tuple using a group index ng, the group index ng referring to a group of one or more code words for which a probability distribution is based on the coding context, and an element index ne in case the group comprises more than one codeword, the element index ne referring to a codeword within the group and the element index can be assumed uniformly distributed, and for encoding the number of divisions by a number of escape symbols, an escape symbol being a specific group index ng only used for indicating a division and for encoding the remainders of the divisions based on a uniform distribution using an arithmetic coding rule. The entropy encoder can be adapted for encoding a sequence of symbols into the encoded audio stream using a symbol alphabet comprising the escape symbol, and group symbols corresponding to a set of available group indices, a symbol alphabet comprising the corresponding element indices, and a symbol alphabet comprising the different values of the remainders.
In the embodiment of
In
In step 2120 it is checked whether (a,b,c,d) exceeds the given range and if so, the range of (a,b,c,d) is reduced by a factor of 4 in step 2125. In other words, in step 2125 (a,b,c,d) are shifted by 2 to the right and the removed bitplanes are stored for later usage in step 2150.
In order to indicate this reduction step, ng is set to 544 in step 2130, i.e. ng=544 serves as an escape codeword. This codeword is then written to the bitstream in step 2155, where for deriving the codeword in step 2130 an arithmetic coder with a probability distribution derived from the context is used. In case this reduction step was applied the first time, i.e. if lev==lev0, the context is slightly adapted. In case the reduction step is applied more than once, the context is discarded and a default distribution is used further on. The process then continues with step 2120.
If in step 2120 a match for the range is detected, more specifically if (a,b,c,d) matches the range condition, (a,b,c,d) is mapped to a group ng and, if applicable, the group element index ne. This mapping is unambiguous, that is, (a,b,c,d) can be derived from ng and ne. The group index ng is then coded by the arithmetic coder in step 2135, using a probability distribution derived from the adapted/discarded context. The group index ng is then inserted into the bitstream in step 2155. In a following step 2140, it is checked whether the number of elements in the group is larger than 1. If so, that is, if the group indexed by ng consists of more than one element, the group element index ne is coded by the arithmetic coder in step 2145, assuming a uniform probability distribution in the present embodiment.
Following step 2145, the group element index ne is inserted into the bitstream in step 2155. Finally, in step 2150, all stored bitplanes are coded using the arithmetic coder, assuming a uniform probability distribution. The coded stored bitplanes are then also inserted into the bitstream in step 2155.
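The order of the symbols produced by steps 2120 to 2155 can be sketched as follows. This is a simplified illustration only: the mapping `toy_group` from a 4-tuple to (ng, ne) is hypothetical and is not the mapping of the embodiment, the order in which the stored bitplanes are emitted is an assumption, and the arithmetic coder itself (including the context adaptation) is omitted, so the function returns a list of symbols instead of bits.

```python
ESCAPE_NG = 544  # escape group index named in the text


def in_range(tup):
    return all(-5 < x < 4 for x in tup)


def toy_group(tup):
    """Hypothetical invertible mapping (a,b,c,d) -> (ng, ne or None)."""
    a, b, c, d = (x + 4 for x in tup)  # each value becomes a digit 0..7
    if a == b == c == d:
        return 512 + a, None           # toy group with a single element
    index = ((a * 8 + b) * 8 + c) * 8 + d
    return index // 8, index % 8       # toy group with several elements


def encode_tuple(tup):
    symbols, planes = [], []
    while not in_range(tup):                        # step 2120
        planes.append(tuple(x & 3 for x in tup))    # step 2125: keep 2 bits
        tup = tuple(x >> 2 for x in tup)            # reduce by a factor of 4
        symbols.append(("ng", ESCAPE_NG))           # step 2130: escape
    ng, ne = toy_group(tup)
    symbols.append(("ng", ng))                      # step 2135
    if ne is not None:                              # step 2140
        symbols.append(("ne", ne))                  # step 2145
    symbols.extend(("bits", p) for p in planes)     # step 2150
    return symbols


print(encode_tuple((9, -6, 1, 2)))
```

In this sketch, a 4-tuple that is already in range yields only a group index (plus an element index where the toy group has more than one element), while each reduction step adds one escape symbol in front and one stored bitplane at the end.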
To summarize the above, an entropy encoder, in which the context reset concepts described in the following can be used, receives one or more spectral values and provides a code word, typically of variable length, on the basis of the one or more received spectral values. The mapping of the received spectral values onto the code word is dependent on an estimated probability distribution of code words, such that, generally speaking, short code words are associated with spectral values (or combinations thereof) having a high probability and such that long code words are associated with spectral values (or combinations thereof) having a low probability. The context is taken into consideration in that it is assumed that the probability of the spectral values (or combinations thereof) is dependent on previously encoded spectral values (or combinations thereof). Accordingly, the mapping rule (also designated “mapping information” or “codebook” or “cumulative frequencies table”) is selected in dependence on the context, i.e. on the previously encoded spectral values (or combinations thereof). However, the context is not always considered. Rather, the context is sometimes reset by the “context reset” functionality described herein. By resetting the context, it can be considered that the spectral values (or combinations thereof) to be currently encoded differ strongly from what would be expected on the basis of the context.
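The relation between context, mapping rule and context reset can be sketched as follows. This is a toy model under stated assumptions: the class `ContextState`, the table names and the selection rule (sum of magnitudes of the last four coded values) are hypothetical and merely stand in for the context-dependent selection of a cumulative frequencies table.

```python
DEFAULT_CONTEXT = (0, 0, 0, 0)  # assumed default: independent of past data

# Hypothetical mapping rules (e.g. cumulative frequencies tables), by name only.
TABLES = ["low_energy", "medium_energy", "high_energy"]


class ContextState:
    def __init__(self):
        self.previous = DEFAULT_CONTEXT

    def select_table(self):
        """Pick a mapping rule from the magnitude of the previous values."""
        energy = sum(abs(v) for v in self.previous)
        return TABLES[min(energy // 4, len(TABLES) - 1)]

    def update(self, value):
        """Slide the most recently coded value into the context."""
        self.previous = self.previous[1:] + (value,)

    def reset(self):
        """Forget the previously coded values (context reset)."""
        self.previous = DEFAULT_CONTEXT


state = ContextState()
for value in (3, 3, 3, 3):
    state.update(value)
print(state.select_table())  # selection driven by previously coded values
state.reset()
print(state.select_table())  # selection driven by the default context
```

After the reset, the selected mapping rule no longer depends on the previously coded values, which is the behaviour wanted when the values to be coded next are expected to differ strongly from the past.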
In the following, an audio encoder will be described taking reference to
The audio encoder 1400 of
The advantage of such a regular reset is that it limits the dependence of the coding of the present frame on the previous frames. Resetting the context every n frames (which is achieved by the counter 1460 and the reset flag generator 1470) allows the decoder to resynchronize its states with the encoder even when a transmission error occurs. The decoded signal can then be recovered after a reset point. Further, the “regular reset” strategy allows the decoder to randomly access the bitstream at any reset point without considering the past information. The interval between the reset points is a trade-off against the coding performance, which is made at the encoder according to the targeted receiver and the transmission channel characteristics.
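The regular reset can be sketched with a simple frame counter. This is a minimal illustration: the class name `RegularResetFlagGenerator` is hypothetical, and it is assumed here that the very first frame also carries an active reset flag.

```python
class RegularResetFlagGenerator:
    """Activates the reset flag once every `interval` frames."""

    def __init__(self, interval):
        if interval < 1:
            raise ValueError("interval must be at least 1 frame")
        self.interval = interval
        self.counter = 0  # counts frames since the last reset

    def next_flag(self):
        flag = self.counter == 0
        self.counter = (self.counter + 1) % self.interval
        return flag


generator = RegularResetFlagGenerator(interval=4)
print([generator.next_flag() for _ in range(9)])
```

A smaller interval gives more resynchronization and random access points at the price of discarding the context more often; the value 4 above is chosen only for the illustration.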
In the following, another reset strategy as an encoder feature will be described. The following strategy triggers, at the encoder side, the reset flag, which is sent as 1 bit for every frame of 1024 samples. In the embodiment of
As can be seen in
For example, in a unified speech and audio coder (USAC) such a reset may be triggered when going from frequency domain coding to linear-prediction-domain coding, or vice versa. In other words, a context reset of the context-adaptive arithmetic coder 1420 may be performed and signalled whenever the coding mode changes between frequency domain coding and linear prediction domain coding. Such a reset of the context may or may not be signalled by a dedicated context reset flag. Alternatively, a different side information, for example side information indicating the coding mode, may be exploited at the decoder side to trigger the reset of the context.
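A reset derived from the coding mode can be sketched as follows. This is a minimal illustration: the function name `mode_change_resets` and the mode labels "FD" and "LPD" are hypothetical, and it is assumed that the first frame is not treated as a mode change.

```python
def mode_change_resets(modes):
    """Return one flag per frame: True where the coding mode changes.

    modes: per-frame coding modes, e.g. "FD" (frequency domain) or
    "LPD" (linear prediction domain).
    """
    flags = []
    previous = None
    for mode in modes:
        flags.append(previous is not None and mode != previous)
        previous = mode
    return flags


print(mode_change_resets(["FD", "FD", "LPD", "LPD", "FD"]))
```

Since the same rule can be evaluated from the coding mode side information at the decoder, encoder and decoder can reset the context at the same frames without a dedicated flag.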
The audio encoder 1600 of
To summarize the above,
It has been found that the characteristics of a signal sometimes change abruptly from frame to frame. For such non-stationary parts of the signal, the context from the past frame is often meaningless. Furthermore, it has been found that taking the past frames into account in the context-adaptive coding can be more penalizing than beneficial. A solution is then to trigger the reset flag when this happens. One way to detect such a case is to compare the coding efficiency with the reset flag on and with the reset flag off. The flag value corresponding to the best coding is then used (to determine the new state of the encoder context) and transmitted. This mechanism was implemented in the unified speech and audio coding (USAC), and the following average gain of performance was measured:
12 kbps mono: 1.55 bit/frame (max: 54)
16 kbps mono: 1.97 bit/frame (max: 57)
20 kbps mono: 2.85 bit/frame (max: 69)
24 kbps mono: 3.25 bit/frame (max: 122)
16 kbps stereo: 2.27 bit/frame (max: 70)
20 kbps stereo: 2.92 bit/frame (max: 80)
24 kbps stereo: 2.88 bit/frame (max: 119)
32 kbps stereo: 3.01 bit/frame (max: 121)
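The comparison of the coding efficiency with and without reset can be sketched as follows. This is not the USAC implementation: the symbol alphabet, the smoothed histogram used as a stand-in for the context, and the function names are all assumptions, and the ideal code length (minus log2 of the model probability) replaces a real arithmetic coder.

```python
import math
from collections import Counter

ALPHABET = range(8)  # hypothetical symbol alphabet


def ideal_bits(frame, model):
    """Ideal code length of a frame under a probability model, in bits."""
    return sum(-math.log2(model[s]) for s in frame)


def adapted_model(previous_frame):
    """Stand-in for a context: smoothed histogram of the previous frame."""
    counts = Counter(previous_frame)
    total = len(previous_frame) + len(ALPHABET)
    return {s: (counts[s] + 1) / total for s in ALPHABET}


def default_model():
    """Stand-in for the default context: uniform distribution."""
    return {s: 1 / len(ALPHABET) for s in ALPHABET}


def choose_reset_flag(previous_frame, frame):
    """Return True if coding the frame after a reset is cheaper."""
    bits_keep = ideal_bits(frame, adapted_model(previous_frame))
    bits_reset = ideal_bits(frame, default_model())
    return bits_reset < bits_keep


print(choose_reset_flag([0] * 20, [0] * 20))  # stationary content
print(choose_reset_flag([0] * 20, [7] * 20))  # abrupt change of content
```

In this toy setting the flag stays off while the frame resembles its predecessor and is switched on when the content changes abruptly, which mirrors the encoder-side decision described above.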
In the following, another audio encoder 1700 will be described taking reference to
However, the audio encoder 1700 comprises a different reset flag generator 1770 when compared to the other audio encoders. The reset flag generator 1770 receives a side information, which is provided by the audio processor 1410, and provides, on the basis thereof, the reset flag 1772, which is provided to the context generator 1440. However, it should be noted that the audio encoder 1700 avoids including the reset flag 1772 in the encoded audio stream. Rather, only the audio processor side information 1780 is included in the encoded audio stream.
The reset flag generator 1770 may, for example, be configured to derive the context reset flag 1772 from the audio processor side information 1780. For example, the reset flag generator 1770 may evaluate a grouping information (already described above) to decide whether to reset the context. Thus, the context may be reset between an encoding of different groups of sets of spectral coefficients, as explained, for example, for the decoder taking reference to
Accordingly, the encoder 1700 uses a reset strategy, which may be identical to a reset strategy at a decoder. However, the reset strategy may avoid the transmission of a dedicated context reset flag. In other words, the reset strategy described here does not need the transmission of any additional information to the decoder. It uses the side information which is already sent to the decoder (for example, a grouping side information). It should be noted here that for the present strategy, identical mechanisms for determining whether to reset the context or not are used at the encoder and at the decoder. Accordingly, reference is made to the discussion with respect to
First of all, it should be noted that different reset strategies discussed herein, for example, in sections 2.1 to 2.5, can be combined. In particular, the reset strategies as an encoder feature, which have been discussed with reference to
In addition, it should be noted that the reset of the context at the encoder side should occur synchronously with the reset of the context at the decoder side. Accordingly, the encoder is configured to provide the context reset flag discussed above at the time (or for the frames, or windows) discussed above (e.g. with reference to
In the following, a method for providing a decoded audio information on the basis of encoded audio information will be briefly discussed taking reference to
The method 1800 can be supplemented by any of the functionalities discussed herein regarding the decoding of an audio information, also regarding the inventive apparatus.
In the following, a method 1900 for providing an encoded audio information on the basis of an input audio information will be described taking reference to
The method 1900 comprises encoding 1910 a given audio information of the input audio information in dependence on a context, which context is based on an adjacent audio information, temporally or spectrally adjacent to the given audio information, in a non-reset state of operation.
The method 1900 also comprises selecting 1920 a mapping information, for deriving the encoded audio information from the input audio information, in dependence on the context.
Also, the method 1900 comprises resetting 1930 the context for selecting the mapping information to a default context, which is independent from the previously encoded audio information, within a contiguous piece of input audio information (e.g. between encoding two frames, the time domain signals of which are overlapped-and-added) in response to the occurrence of a context reset condition.
The method 1900 also comprises providing 1940 a side information (e.g. a context reset flag, or a grouping information) of the encoded audio information indicating the presence of such a context reset condition.
The method 1900 can be supplemented by any of the features and functionalities described herein with respect to the inventive audio encoding concept.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
The inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-ray disc, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.
The above described embodiments are merely illustrative for the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
This application is a continuation of copending International Patent Application No. PCT/EP2009/007169 filed Oct. 6, 2009, which claims priority to U.S. Patent Application No. 61/103,820, filed Oct. 8, 2008, which is incorporated herein by reference in its entirety.
| Number | Date | Country |
|---|---|---|
| 61103820 | Oct 2008 | US |

| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/EP2009/007169 | Oct 2009 | US |
| Child | 13081241 | | US |