1. Field of the Invention
This invention relates to the scalable encoding of an audio signal, and more specifically to methods for performing this data rate scaling in an efficient manner for multichannel audio signals, including hierarchical filtering, joint coding of tonal components, and joint channel coding of time-domain components in the residual signal.
2. Description of the Related Art
The main objective of an audio compression algorithm is to create a sonically acceptable representation of an input audio signal using as few digital bits as possible. This permits a low data rate version of the input audio signal to be delivered over limited bandwidth transmission channels, such as the Internet, and reduces the amount of storage necessary to store the input audio signal for future playback. For those applications in which the data capacity of the transmission channel is fixed, and non-varying over time, or the amount, in terms of minutes, of audio that needs to be stored is known in advance and does not increase, traditional audio compression methods fix the data rate and thus the level of audio quality at the time of compression encoding. No further reduction in data rate can be effected without either recoding the original signal at a lower data rate or decompressing the compressed audio signal and then recompressing this decompressed signal at a lower data rate. These methods are not “scalable” to address issues of varying channel capacity, storing additional content on a fixed memory, or sourcing bit streams at varying data rates for different applications.
One technique used to create a bit stream with scalable characteristics, and circumvent the limitations previously described, encodes the input audio signal as a high data rate bit stream composed of subsets of low data rate bit streams. These encoded low data rate bit streams can be extracted from the coded signal and combined to provide an output bit stream whose data rate is adjustable over a wide range of data rates. One approach to implement this concept is to first encode data at a lowest supported data rate, then encode an error between the original signal and a decoded version of this lowest data rate bit stream. This encoded error is stored and also combined with the lowest supported data rate bit stream to create a second-to-lowest data rate bit stream. Error between the original signal and a decoded version of this second-to-lowest data rate signal is encoded, stored and added to the second-to-lowest data rate bit stream to form a third-to-lowest data rate bit stream, and so on. This process is repeated until the sum of the data rates associated with the bit streams of each of the error signals so derived and the data rate of the lowest supported data rate bit stream is equal to the highest data rate bit stream to be supported. The final scalable high data rate bit stream is composed of the lowest data rate bit stream and each of the encoded error bit streams.
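The layered error-encoding scheme described above can be sketched numerically. The following minimal illustration uses a uniform scalar quantizer as a stand-in for each compression layer (the quantizer and step sizes are assumptions for illustration only, not the codec described herein):

```python
import numpy as np

def quantize(x, step):
    # Uniform scalar quantizer standing in for a real compression layer.
    return np.round(x / step) * step

rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)

# Lowest supported data rate: coarse base layer.
base = quantize(signal, 0.5)

# First enhancement layer: encode the error of the decoded base layer.
err1 = quantize(signal - base, 0.1)

# Second enhancement layer: encode the remaining error.
err2 = quantize(signal - base - err1, 0.02)

# A scaler may deliver base, base+err1, or base+err1+err2;
# each added layer tightens the worst-case reconstruction error.
for layers in ([base], [base, err1], [base, err1, err2]):
    decoded = np.sum(layers, axis=0)
    print("max error: %.4f" % np.max(np.abs(signal - decoded)))
```

Each enhancement layer halves-or-better the worst-case error bound (step/2 of the finest layer used), which is the sense in which the composed bit stream scales in quality with data rate.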
A second technique, usually used to support a small number of different data rates between widely spaced lowest and highest data rates, employs the use of more than one compression algorithm to create a “layered” scalable bit stream. The apparatus that performs the scaling operation on a bit stream coded in this manner chooses, depending on output data rate requirements, which one of the multiple bit streams carried in the layered bit stream to use as the coded audio output. To improve coding efficiency and provide for a wider range of scaled data rates, data carried in the lower rate bit streams can be used by higher rate bit streams to form additional higher quality, higher rate bit streams.
The present invention provides a method for encoding audio input signals to form a master bit stream that can be scaled to form a scaled bit stream having an arbitrarily prescribed data rate and for decoding the scaled bit stream to reconstruct the audio signals.
This is generally accomplished by compressing the audio input signals and arranging them to form a master bit stream. The master bit stream includes quantized components that are ranked on the basis of their relative contribution to decoded signal quality. The input signal is suitably compressed by separating it into a plurality of tonal and residual components, and ranking and then quantizing the components. The separation is suitably performed using a hierarchical filterbank. The components are suitably ranked and quantized with reference to the same masking function or different psychoacoustic criteria. The components may then be ordered based on their ranking to facilitate efficient scaling. The master bit stream is scaled by eliminating a sufficient number of the low ranking components to form the scaled bit stream having a scaled data rate less than or approximately equal to a desired data rate. The scaled bit stream includes information that indicates the position of the components in the frequency spectrum. A scaled bit stream is suitably decoded using an inverse hierarchical filterbank by arranging the quantized components based on the position information, ignoring the missing components and decoding the arranged components to produce an output bit stream.
In one embodiment, the encoder uses a hierarchical filterbank to decompose the input signal into a multi-resolution time/frequency representation. The encoder extracts tonal components at each iteration of the HFB at different frequency resolutions, removes those tonal components from the input signal to pass a residual signal to the next iteration of the HFB, and then extracts residual components from the final residual signal. The tonal components are grouped into at least one frequency sub-domain per frequency resolution and ranked according to their psychoacoustic importance to the quality of the coded signal. The residual components include time-sample components (e.g. a Grid G) and scale factor components (e.g. grids G0, G1) that modify the time-sample components. The time-sample components are grouped into at least one time-sample sub-domain and ranked according to their contribution to the quality of the decoded signal.
At the decoder, the inverse hierarchical filterbank may be used to extract both the tonal components and the residual components within one efficient filterbank structure. All components are inverse quantized and the residual signal is reconstructed by applying the scale factors to the time samples. The frequency samples are reconstructed and added to the reconstructed time samples to produce the output audio signal. Note the inverse hierarchical filterbank may be used at the decoder regardless of whether the hierarchical filterbank was used during the encoding process.
In an exemplary embodiment, the selected tonal components in a multichannel audio signal are encoded using differential coding. For each tonal component, one channel is selected as the primary channel. The channel number of the primary channel and its amplitude and phase are stored in the bit stream. A bit-mask is stored that indicates which of the other channels include the indicated tonal component, and should therefore be coded as secondary channels. The difference between the primary and secondary amplitudes and phases are then entropy-coded and stored for each secondary channel in which the tonal component is present.
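The primary/secondary differential coding of one tonal component across channels can be sketched as follows. The record layout, function names and the plain-dict representation are hypothetical illustrations (a real bit stream would entropy-code the deltas, as stated above):

```python
def encode_tone(amplitudes, phases, present):
    # Select the primary channel: the channel with the highest amplitude
    # among channels in which the tone is present.
    primary = max((c for c in range(len(amplitudes)) if present[c]),
                  key=lambda c: amplitudes[c])
    # Bit-mask marks which other channels carry the tone as secondaries.
    bitmask = 0
    for c in range(len(amplitudes)):
        if present[c] and c != primary:
            bitmask |= 1 << c
    return {
        "primary": primary,
        "amp": amplitudes[primary],
        "phase": phases[primary],
        "bitmask": bitmask,
        # Secondaries store only the differences from the primary
        # (these would be entropy-coded in the bit stream).
        "deltas": [(amplitudes[c] - amplitudes[primary],
                    phases[c] - phases[primary])
                   for c in range(len(amplitudes))
                   if present[c] and c != primary],
    }

def decode_tone(record, n_channels):
    amps = [0.0] * n_channels
    phs = [0.0] * n_channels
    amps[record["primary"]] = record["amp"]
    phs[record["primary"]] = record["phase"]
    i = 0
    for c in range(n_channels):
        if record["bitmask"] >> c & 1:
            da, dp = record["deltas"][i]
            i += 1
            amps[c] = record["amp"] + da
            phs[c] = record["phase"] + dp
    return amps, phs
```

Because the same tone tends to have similar amplitude and phase across channels, the deltas cluster near zero and entropy-code compactly, which is the motivation for this joint coding.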
In an exemplary embodiment, the time-sample and scale factor components that make up the residual signal are encoded using joint channel coding (JCC) extended to multichannel audio. A channel grouping process first determines which of the multiple channels may be jointly coded and all channels are formed into groups with the last group possibly being incomplete.
Additional objects, features and advantages of the present invention are included in the following discussion of exemplary embodiments, which discussion should be read with the accompanying drawings. Although these exemplary embodiments pertain to audio data, it will be understood that video, multimedia and other types of data may also be processed in similar manners.
FIGS. 2a and 2b are frequency and time domain representations of a Shmunk window for use with the hierarchical filterbank;
FIGS. 5a through 5c illustrate an ‘overlap-add’ windowing;
FIGS. 8a and 8b are a simplified block diagram of a 3-stage hierarchical filterbank and a more detailed block diagram of a single stage;
FIGS. 17a and 17b are a simplified block diagram of a 3-stage inverse hierarchical filterbank and a more detailed block diagram of a single stage;
The present invention provides a method for compressing and encoding audio input signals to form a master bit stream that can be scaled to form a scaled bit stream having an arbitrarily prescribed data rate and for decoding the scaled bit stream to reconstruct the audio signals. A hierarchical filterbank (HFB) provides a multi-resolution time/frequency representation of the input signal from which the encoder can efficiently extract both the tonal and residual components. For multichannel audio, joint coding of tonal components and joint channel coding of time-domain components in the residual signal is implemented. The components are ranked on the basis of their relative contribution to decoded signal quality and quantized with reference to a masking function. The master bit stream is scaled by eliminating a sufficient number of the low ranking components to form the scaled bit stream having a scaled data rate less than or approximately equal to a desired data rate. The scaled bit stream is suitably decoded using an inverse hierarchical filterbank by arranging the quantized components based on position information, ignoring the missing components and decoding the arranged components to produce an output bit stream. In one possible application, the master bit stream is stored and then scaled down to a desired data rate for recording on another media or for transmission over a bandlimited channel. In another application, in which multiple scaled bit streams are stored on media, the data rate of each stream is independently and dynamically controlled to maximize perceived quality while satisfying an aggregate data rate constraint on all of the bit streams.
As used herein, the terms “Domain”, “sub-domain”, and “component” describe the hierarchy of scalable elements in the bit stream. Examples include:
As shown in
The input signal 100 is applied to both Masking Calculator 101 and Multi-Order Tone Extractor 102. Masking Calculator 101 analyzes input signal 100 and identifies a masking level as a function of frequency below which frequencies present in input signal 100 are not audible to the human ear. Multi-Order Tone Extractor 102 identifies frequencies present in input signal 100 that meet psychoacoustic criteria that have been defined for tones, using, for example, multiple overlapping FFTs or, as shown, a hierarchical filterbank based on MDCTs; selects tones according to these criteria; quantizes the amplitude, frequency, phase and position components of these selected tones; and places these tones into a tone list. At each iteration or level, the selected tones are removed from the input signal to pass a residual signal forward. Once complete, all other frequencies that do not meet the criteria for tones are extracted from the input signal and output from Multi-Order Tone Extractor 102, specifically from the last stage of the hierarchical filterbank MDCT(256), in the time domain on line 111 as the final residual signal.
Multi-Order Tone Extractor 102 uses, for example, five orders of overlapping transforms, starting from the largest and working down to the smallest, to detect tones through the use of a base function. Transforms of size: 8192, 4096, 2048, 1024, and 512 are used respectively, for an audio signal whose sampling rate is 44100 Hz. Other transform sizes could be chosen.
where:
Tones detected at each transform size are locally decoded using the same decode process as used by the decoder of the present invention, to be described later. These locally decoded tones are phase inverted and combined with the original input signal through time domain summation to form the residual signal that is passed to the next iteration or level of the HFB.
The masking level from Masking Calculator 101 and the tone list from Multi-Order Tone Extractor 102 are inputs to the Tone Selector 103. The Tone Selector 103 first sorts the tone list provided to it from Multi-Order Tone Extractor 102 by relative power over the masking level provided by Masking Calculator 101. It then uses an iterative process to determine which tonal components will fit into a frame of encoded data in the master bit stream. The amount of space available in a frame for tonal components depends on the predetermined (pre-scaling) data rate of the encoded master bit stream. If the entire frame is allocated for tonal components then no residual coding is performed. In general, some portion of the available data rate is allocated for the tonal components with the remainder (minus overhead) reserved for the residual components.
Channel groups are suitably selected for multichannel signals and primary/secondary channels identified within each channel group according to a metric such as contribution to perceptual quality. The selected tonal components are preferably stored using differential coding. For stereo audio, a two-bit field indicates the primary and secondary channels. The amplitude/phase and differential amplitude/phase are stored for the primary and secondary channels, respectively. For multichannel audio the primary channel is stored with its amplitude and phase and a bit-mask (See
During this iterative process, some or all of the tonal components that are determined not to fit in a frame may be converted back into the time domain and combined with residual signal 111. If, for example, the data rate is sufficiently high, then typically all of the deselected tonal components are recombined. If, however, the data rate is lower, the relatively strong ‘deselected’ tonal components are suitably left out of the residual. This has been found to improve perceptual quality at lower data rates. The deselected tonal components, represented by signal 110, are locally decoded via Local Decoder 104 to convert them back into the time domain on line 114 and combined with Residual Signal 111 from Multi-Order Tone Extractor 102 in Combiner 105 to form a combined Residual Signal 113. Note that the signals appearing on 114 and 111 are both time domain signals so that this combining process can be easily effected. The combined Residual Signal 113 is further processed by the Residual Encoder 107.
The first action performed by Residual Encoder 107 is to process the combined Residual Signal 113 through a filter bank which subdivides the signal into critically sampled time domain frequency sub-bands. In a preferred embodiment, when the hierarchical filterbank is used to extract the tonal components, these time-sample components can be read directly out of the hierarchical filterbank thereby eliminating the need for a second filterbank dedicated to the residual signal processing. In this case, as shown in
The Code String Generator 108 takes input from the Tone Selector 103, on line 120, and Residual Encoder 107 on line 122, and encodes values from these two inputs using entropy coding well known in the art into bit stream 124. The Bit Stream Formatter 109 assures that psychoacoustic elements from the Tone Selector 103 and Residual Encoder 107, after being coded through the Code String Generator 108, appear in the proper position in the master bit stream 126. The ‘rankings’ are implicitly included in the master bit stream by the ordering of the different components.
A scaler 115 eliminates a sufficient number of the lowest ranked encoded components from each frame of the master bit stream 126 produced by the encoder to form a scaled bit stream 116 having a data rate less than or approximately equal to a desired data rate.
Hierarchical Filterbank
The Multi-Order Tone Extractor 102 preferably uses a ‘modified’ hierarchical filterbank to provide a multi-resolution time/frequency representation from which both the tonal components and the residual components can be efficiently extracted. The HFB decomposes the input signal into transform coefficients at successively lower frequency resolutions and back into time-domain sub-band samples at successively finer time scale resolution at each successive iteration. The tonal components generated by the hierarchical filterbank are exactly the same as those generated by multiple overlapping FFTs; however, the computational burden is much less. The Hierarchical Filterbank addresses the problem of modeling the unequal time/frequency resolution of the human auditory system by simultaneously analyzing the input signal at different time/frequency resolutions in parallel to achieve a nearly arbitrary time/frequency decomposition. The hierarchical filterbank makes use of a windowing and overlap-add step in the inner transform not found in known decompositions. This step and the novel design of the window function allow this structure to be iterated in an arbitrary tree to achieve the desired decomposition, and could be done in a signal-adaptive manner.
As shown in
A fundamental challenge in audio coding is the modeling of the time/frequency resolution of human perception. Transient signals, such as a handclap, require a high resolution in the time domain, while harmonic signals, such as a horn, require high resolution in the frequency domain to be accurately represented by an encoded bit stream. But it is a well-known principle that time and frequency resolution are inverses of each other and no single transform can simultaneously render high accuracy in both domains. The design of an effective audio codec requires balancing this tradeoff between time and frequency resolution.
Known solutions to this problem utilize window switching, adapting the transform size to the transient nature of the input signal (See K. Brandenburg et al., “The ISO-MPEG-Audio Codec: A Generic Standard for Coding of High Quality Digital Audio”, Journal of Audio Engineering Society, Vol. 42, No. 10, October, 1994). This adaptation of the analysis window size introduces additional complexity and requires a detection of transient events in the input signal. To manage algorithmic complexity, the prior art window switching methods typically limit the number of different window sizes to two. The hierarchical filterbank discussed herein avoids this coarse adjustment to the signal/auditory characteristics by representing/processing the input signal by a filterbank which provides multiple time/frequency resolutions in parallel.
There are many filterbanks, known as hybrid filterbanks, which decompose the input signal into a given time/frequency representation. For example, the MPEG Layer 3 algorithm described in ISO/IEC 11172-3 utilizes a Pseudo-Quadrature Mirror Filterbank followed by an MDCT transform in each subband to provide the desired frequency resolution. In our hierarchical filterbank we utilize a transform, such as an MDCT, followed by the inverse transform (e.g. IMDCT) on groups of spectral lines to perform a flexible time/frequency transformation of the input signal.
Unlike hybrid filterbanks, the hierarchical filterbank uses results from two consecutive, overlapped outer transforms to compute ‘overlapped’ inner transforms. With the hierarchical filterbank it is possible to aggregate more than one transform on top of the first transform. This is also possible with prior-art filterbanks (e.g. tree-like filterbanks), but is impractical due to the fast degradation of frequency-domain separation with increase in the number of levels. The hierarchical filterbank avoids this frequency-domain degradation at the expense of some time-domain degradation. This time-domain degradation can, however, be controlled through the proper selection of window shape(s). With the selection of the proper analysis window, the coefficients of the inner transform can also be made invariant to time shifts equal to the size of the inner transform (not to the size of the outermost transform as in conventional approaches).
A suitable window W(x) referred to herein as the “Shmunk Window”, for use with the hierarchical filterbank is defined by:
where x is the time-domain sample index (0<x<=L), and L is the length of the window in samples.
The frequency response 2603 of the Shmunk window in comparison with the commonly used Kaiser-Bessel derived window 2602 is shown in
A hierarchical filterbank of general applicability for providing a time/frequency decomposition is illustrated in
(a) As shown in
(b) As shown in
(c) Optionally ringing reduction is applied to one or more of the transform coefficients 2804 by applying a linear combination of one or more adjacent transform coefficients (step 2904);
(d) The N/2 transform coefficients 2804 are divided into P groups of Mi coefficients, such that the sum of the Mi over all P groups is N/2;
(e) For each of P groups, a (2*Mi)-point inverse transform (represented by the upward arrow 2806 in
(f) In each sub-band, the (2*Mi) sub-band samples are multiplied by a (2*Mi)-point window function 2706 (step 2908);
(g) In each sub-band, the Mi previous samples are overlapped and added to corresponding current values to produce Mi new samples for each sub-band (step 2910);
(h) N is set equal to the previous Mi and new values for P and Mi are selected; and
(i) The above steps are repeated (step 2912) on one or more of the sub-bands of Mi new samples using the successively smaller transform sizes for N until the desired time/transform resolution is achieved (step 2914). Note, the steps may be iterated on all of the sub-bands, only the lowest sub-bands or any desired combination thereof. If the steps are iterated on all of the sub-bands the HFB is uniform, otherwise it is non-uniform.
The frequency response plot 3300 of an implementation of the filterbank of
The potential applications for this hierarchical filterbank go beyond audio, to processing of video and other types of signals (e.g. seismic, medical, other time-series signals). Video coding and compression have similar requirements for time/frequency decomposition, and the arbitrary nature of the decomposition provided by the Hierarchical Filterbank may have significant advantages over current state-of-the-art techniques based on Discrete Cosine Transform and Wavelet decomposition. The filterbank may also be applied in analyzing and processing seismic or mechanical measurements, biomedical signal processing, analysis and processing of natural or physiological signals, speech, or other time-series signals. Frequency domain information can be extracted from the transform coefficients produced at each iteration at successively lower frequency resolutions. Likewise time domain information can be extracted from the time-domain sub-band samples produced at each iteration at successively finer time scales.
Hierarchical Filterbank: Uniformly Spaced Sub-Bands
1. Input time samples 3902 are windowed in N-point, 50% overlapping frames 3904.
2. A N-point MDCT 3906 is performed on each frame.
3. The resulting MDCT coefficients are grouped in P groups 3908 of M coefficients in each group.
4. A (2*M)-point IMDCT 3910 is performed on each group to form (2*M) sub-band time samples 3911.
5. The resulting time samples 3911 are windowed in (2*M)-point, 50% overlapping frames and overlap-added (OLA) 3912 to form M time samples in each sub-band 3914.
In an exemplary implementation, N=256, P=32, and M=4. Note that different transform sizes and sub-band groupings represented by different choices for N, P, and M can also be employed to achieve a desired time/frequency decomposition.
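Steps 1-5 above can be sketched numerically. The following illustration uses direct (O(N²)) MDCT/IMDCT formulas for clarity and a sine analysis window in place of the Shmunk window (an assumption made here so the sketch is self-contained; the sine window also satisfies the Princen-Bradley perfect-reconstruction condition):

```python
import numpy as np

def mdct(frame):
    # Direct-form MDCT: L windowed samples -> L/2 coefficients.
    L = len(frame)
    N = L // 2
    n = np.arange(L)
    k = np.arange(N)[:, None]
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)) @ frame

def imdct(coeffs):
    # Direct-form IMDCT: N coefficients -> 2N (time-aliased) samples.
    N = len(coeffs)
    n = np.arange(2 * N)[:, None]
    k = np.arange(N)
    return (2.0 / N) * (np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)) @ coeffs)

def sine_window(L):
    # Satisfies w[n]^2 + w[n + L/2]^2 = 1 (Princen-Bradley condition).
    return np.sin(np.pi * (np.arange(L) + 0.5) / L)

def hfb_uniform_stage(x, N=256, P=32, M=4):
    # One uniform stage: N-point frames with 50% overlap (step 1); N-point MDCT
    # yielding N/2 coefficients (step 2); split into P groups of M (step 3);
    # (2*M)-point IMDCT per group (step 4); window + overlap-add, producing
    # M new time samples per sub-band per hop (step 5).
    assert P * M == N // 2
    w_outer = sine_window(N)
    w_inner = sine_window(2 * M)
    hop = N // 2
    n_frames = len(x) // hop - 1
    sub = np.zeros((P, (n_frames + 1) * M))
    for h in range(n_frames):
        coeffs = mdct(x[h * hop : h * hop + N] * w_outer)
        for p in range(P):
            grp = coeffs[p * M : (p + 1) * M]
            sub[p, h * M : h * M + 2 * M] += imdct(grp) * w_inner
    return sub
```

With the window condition satisfied at both the outer and inner transforms, forward MDCT followed by windowed IMDCT and 50% overlap-add cancels the time-domain aliasing exactly, which is what allows the structure to be iterated without accumulating error.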
Hierarchical Filterbank: Non-Uniformly Spaced Sub-Bands
Another embodiment of a Hierarchical Filterbank 3000 is shown in
As shown in
In the example shown in
Thus, assuming an input 3002 with a sample rate of 44100 Hz, the filterbank shown produces 96 coefficients representing the frequency range 5513 to 22050 Hz at “Out1” 3008, 96 coefficients representing the frequency range 1379 to 5512 Hz at “Out2” 3014, and 128 coefficients representing the frequency range 0 to 1378 Hz at “Out3” 3018.
It should be noted that the use of MDCT/IMDCT for the frequency transform/inverse transform are exemplary and other time/frequency transformations can be applied as part of the present invention. Other values for the transform sizes are possible, and other decompositions are possible with this approach, by selectively expanding any branch in the hierarchy described above.
Multichannel Joint Coding of Tonal and Residual Components
The Tone Selector 103 in
where:
Tone Selector 103 then uses an iterative process to determine which tonal components from the sorted tone list for the frame will fit into the bit stream. In stereo or multichannel audio signals, where the amplitude of a tone is about the same in more than one channel, the full amplitude and phase are stored only for the primary channel, the primary channel being the channel with the highest amplitude for the tonal component. Other channels having similar tonal characteristics store the difference from the primary channel.
The data for each transform size encompasses a number of sub-frames, the smallest transform size covering 2 sub-frames; the second 4 sub-frames; the third 8 sub-frames; the fourth 16 sub-frames; and the fifth 32 sub-frames. There are 16 sub-frames to 1 frame. Tone data is grouped by size of the transform in which the tone information was found. For each transform size, the following tonal component data is quantized, entropy-encoded and placed into the bit stream: entropy-coded sub-frame position, entropy-coded spectral position, entropy-coded quantized amplitude, and quantized phase.
In the case of multichannel audio, for each tonal component, one channel is selected as the primary channel. The determination of which channel should be the primary channel may be fixed or may be made based on the signal characteristics or perceptual criteria. The channel number of the primary channel and its amplitude and phase are stored in the bit stream. As shown in
The output 4211 of Multi-Order Tone Extractor 102 is made up of frames of MDCT coefficients at one or more resolutions. The Tone Selector 103 determines which tonal components can be retained for insertion into the bit stream output frame by Code String Generator 108, based on their relevance to decoded signal quality. Those tonal components determined not to fit in the frame are output 110 to the Local Decoder 104. The Local Decoder 104 takes the output 110 of the Tone Selector 103 and synthesizes all tonal components by adding each tonal component scaled with synthesis coefficients 2000 from a lookup table (
As shown in
For stereo or multichannel audio, several calculations are made in Channel Selection block 501 to determine the primary and secondary channel for encoding tonal components, as well as the method for encoding tonal components (for example, Left-Right, or Middle-Side). As shown in
The grouping mode is also determined as shown in step 3704 of
Pm > 2·Ps
For multichannel signals, the above is performed for each channel group.
For a stereo signal, Grid Calculation 502 provides a stereo panning grid in which stereo panning can roughly be reconstructed and applied to the residual signal. The stereo grid is 4 sub-bands by 4 time intervals, each sub-band in the stereo grid covers 4 sub-bands and 32 samples from the output of Filter Bank 500, starting with frequency bands above 3 k Hz. Other grid sizes, frequency sub-bands covered, and time divisions could be chosen. Values in the cells of the stereo grid are the ratio of the power of the given channel to that of the primary channel, for the range of values covered by the cell. The ratio is then quantized to the same table as that used to encode tonal components. For multichannel signals, the above stereo grid is calculated for each channel group.
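The panning-grid cell values described above (the ratio of the power of a given channel to that of the primary channel per cell of 4 sub-bands by 32 samples) can be sketched as follows. The function name is illustrative, and the quantization to the tonal component table is omitted here for brevity:

```python
import numpy as np

def panning_grid(secondary, primary, bands_per_cell=4, samples_per_cell=32):
    # secondary, primary: (n_subbands, n_samples) arrays of sub-band
    # time samples from the filter bank. Each grid cell holds the ratio
    # of the secondary channel's power to the primary channel's power
    # over that cell's sub-bands and samples.
    nb, ns = primary.shape
    rows = nb // bands_per_cell
    cols = ns // samples_per_cell
    grid = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * bands_per_cell, (r + 1) * bands_per_cell),
                  slice(c * samples_per_cell, (c + 1) * samples_per_cell))
            p_pow = np.sum(primary[sl] ** 2) + 1e-12  # guard against /0
            s_pow = np.sum(secondary[sl] ** 2)
            grid[r, c] = s_pow / p_pow
    return grid
```

For 16 sub-bands and 128 samples this yields the 4-by-4 grid described in the text; at the decoder the ratios allow the stereo panning to be roughly reconstructed and applied to the residual signal.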
For multichannel signals, Grid Calculation 502 provides multiple scale factor grids, one per each channel group, that are inserted into the bit stream in order of their psychoacoustic importance in the spatial domain. The ratio of the power of the given channel to the primary channel for each group of 4 sub-bands by 32 samples is calculated. This ratio is then quantized and this quantized value plus logarithm sign of the power ratio is inserted into the bit stream.
Scale Factor Grid Calculation 503 calculates grid G1, which is placed in the bit stream. The method for calculating G1 is now described. G0 is first derived from G. G0 contains all 32 sub-bands but only half the time resolution of G. The contents of the cells in G0 are quantized values of the maximum of two neighboring values of a given sub-band from G. Quantization (referred to in the following equations as Quantize) is performed using the same modified logarithmic quantization table as was used to encode the tonal components in the Multi-Order Tone Extractor 102. Each cell in G0 is thus determined by:
G0m,n = Quantize(Maximum(Gm,2n, Gm,2n+1)), n ∈ [0 . . . 63]
where:
G1 is derived from G0. G1 has 11 overlapping sub-bands and ⅛ the time resolution of G0, forming a grid 11×8 in dimension. Each cell in G1 is quantized using the same table as used for tonal components and found using the following formula:
Wl is a weight value obtained from the Table 1 in
G0 is recalculated from G1 in Local Grid Decoder 506. In Time Sample Quantization Block 507, output time samples (“time-sample components”) extracted from the hierarchical filterbank (Grid G) pass through Quantization Level Selection Block 504, are scaled by dividing the time-sample components by the respective values in the recalculated G0 from Local Grid Decoder 506, and are quantized to the number of quantization levels, as a function of sub-band, determined by Quantization Level Selection Block 504. These quantized time samples are then placed into the encoded bit stream along with the quantized grid G1. In all cases, a model reflecting the psychoacoustic importance of these components is used to determine priority for the bit stream storage operation.
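The derivation of G0 from G given by the formula above (maximum of two neighboring time values per sub-band, then quantization) can be sketched as follows. The logarithmic quantizer here is a stand-in for the patent's modified logarithmic quantization table, whose exact values are not reproduced in this text:

```python
import numpy as np

def quantize_log(x, step=0.1):
    # Stand-in for the modified logarithmic quantization table used
    # for tonal components; the step size is an assumption.
    return np.round(np.log1p(np.abs(x)) / step) * step

def derive_g0(G):
    # G: (32, 128) scale-factor grid. G0 keeps all 32 sub-bands at half
    # the time resolution: G0[m, n] = Quantize(max(G[m, 2n], G[m, 2n+1])).
    pairs = np.maximum(G[:, 0::2], G[:, 1::2])
    return quantize_log(pairs)
```

Taking the maximum of each neighboring pair before quantizing ensures the decoded scale factors never undershoot the residual envelope within the merged time slots.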
In an additional enhancement step to improve the coding gain for some signals, grids including G, G1 and partial grids may be further processed by applying a two-dimensional Discrete Cosine Transform (DCT) prior to quantization and coding. The corresponding Inverse DCT is applied at the decoder following inverse quantization to reconstruct the original grids.
Typically, each frame of the master bit stream will include (a) a plurality of quantized tonal components representing frequency domain content at different frequency resolutions of the input signal, (b) quantized residual time-sample components representing the time-domain residual formed from the difference between the reconstructed tonal components and the input signal, and (c) scale factor grids representing the signal energies of the residual signal, which span a frequency range of the input signal. For a multichannel signal each frame may also contain (d) partial grids representing the signal energy ratios of the residual signal channels within channel groups and (e) a bitmask for each primary specifying the joint-encoding of secondary channels for tonal components. Usually a portion of the available data rate in each frame is allocated for the tonal components (a) and a portion is allocated for the residual components (b, c). However, in some cases all of the available rate may be allocated to encode the tonal components. Alternately, all of the available rate may be allocated to encode the residual components. In extreme cases, only the scale factor grids may be encoded, in which case the decoder uses a noise signal to reconstruct an output signal. In almost any actual application, the scaled bit stream will include at least some frames that contain tonal components and some frames that include scale factor grids.
The structure and order of components placed in the master bit stream, as defined by the present invention, provides for fine-grained bit stream scalability over a wide range of data rates. It is this structure and order that allows the bit stream to be smoothly scaled by external mechanisms.
Not every psychoacoustic component utilizes the same number of bits. The scaling resolution for the current implementation of the present invention ranges from 1 bit for components of lowest psychoacoustic importance to 32 bits for those components of highest psychoacoustic importance. The mechanism for scaling the bit stream does not need to remove entire chunks at a time. As previously mentioned, components within each chunk are arranged so that the most psychoacoustically important data is placed at the beginning of the chunk. For this reason, components can be removed from the end of the chunk, one component at a time, by a scaling mechanism while maintaining the best audio quality possible with each removed component. In one embodiment of the present invention, entire components are eliminated by the scaling mechanism, while in other embodiments, portions of components may also be eliminated. The scaling mechanism removes components within a chunk as required, updating the Chunk Length field of the particular chunk from which the components were removed, the Frame Chunk Length 915 and the Frame Checksum 901. As will be seen from the detailed discussion of the exemplary embodiments of the present invention, with an updated Chunk Length for each chunk scaled, as well as updated Frame Chunk Length and Frame Checksum information available to the decoder, the decoder can properly process the scaled bit stream and automatically produce a fixed-sample-rate audio output signal for delivery to the DAC, even though there are chunks within the bit stream that are missing components, as well as chunks that are completely missing from the bit stream.
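The removal of trailing components from a chunk by an external scaling mechanism can be sketched as follows. The dictionary layout, field names, and bit-budget loop are hypothetical; the essential behavior is that components are popped from the end (least important last) and the Chunk Length field is updated accordingly.

```python
def scale_chunk(chunk, bits_to_remove):
    """Hypothetical sketch of an external scaling mechanism: drop
    whole components from the end of a chunk until the requested
    number of bits has been removed, updating the chunk's length."""
    removed = 0
    while chunk["components"] and removed < bits_to_remove:
        comp = chunk["components"].pop()   # least important component is last
        removed += comp["bits"]
        chunk["length"] -= comp["bits"]    # keep Chunk Length field valid
    return chunk
```

In a full implementation the Frame Chunk Length and Frame Checksum fields would be updated in the same pass, as described above.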
To reduce computational burden, the Inverse Frequency Transform 604 and Inverse Filter Bank 605, which convert the signals back into the time domain, can be implemented with an inverse Hierarchical Filterbank, which integrates these operations with the Combiner 607 to form decoded time-domain output audio signal 614. The use of the hierarchical filterbank in the decoder is novel in the way in which the tonal components are combined with the residual in the hierarchical filterbank at the decoder. The residual signals are forward transformed using MDCTs in each sub-band, and then the tonal components are reconstructed and combined prior to the last-stage IMDCT. The multi-resolution approach can be generalized to other applications; for example, configurations with multiple levels or different decompositions would still be covered by this aspect of the invention.
Inverse Hierarchical Filterbank
In order to reduce complexity of the decoder, the hierarchical filterbank may be used to combine the steps of Inverse Frequency Transform 604, Inverse Filterbank 605, Overlap-Add 608, and Combiner 607. As shown in
The overall operation of the decoder for a single channel using the HFB 2400 is shown in
The tonal components (T5) 2407 at the lowest frequency resolution (P=16, M=256) are read from the bit stream by Bit stream Parser 600. Tone Decoder 601 inverse quantizes 2408 and synthesizes 2409 the tonal components to produce P groups of M frequency domain coefficients.
The Grid G time samples 4002 are windowed and overlap-added 2410 as shown in
The next lowest frequency resolution tonal components (T4) are read from the bit stream, and combined with the output of the previous stage of the hierarchical filterbank as described above, and then this iteration continues for P=8, 4, 2, 1 and M=512, 1024, 2048, and 4096 until all frequency components have been read from the bit stream, combined and reconstructed.
At the final stage of the decoder, the inverse transform produces N full-bandwidth time samples which are output as Decoded Output 614. The preceding values of P, M and N are for one exemplary embodiment only and do not limit the scope of the present invention. Other buffer, window and transform sizes and other transform types may also be used.
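The stage-by-stage iteration described above, with P halving and M doubling so that the total number of coefficients per frame stays constant, can be enumerated as a simple schedule (values from the exemplary embodiment; other sizes are equally valid):

```python
def inverse_hfb_schedule():
    """Enumerate the exemplary decoder iteration: each stage halves
    the number of groups P and doubles the transform size M until a
    single full-bandwidth transform (P=1, M=4096) remains."""
    stages = []
    p, m = 16, 256          # T5: lowest frequency resolution
    while p >= 1:
        stages.append((p, m))
        p //= 2             # fewer, wider groups at each stage
        m *= 2
    return stages
```

Note that P × M = 4096 at every stage, so each iteration redistributes, rather than adds, time/frequency resolution.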
As described, the decoder anticipates receiving a frame that includes tonal components, time-sample components and scale factor grids. However, if one or more of these are missing from the scaled bit stream, the decoder seamlessly reconstructs the decoded output. For example, if the frame includes only tonal components, then the time samples at 4002 are zero and no residual is combined 2403 with the synthesized tonal components in the first stage of the inverse HFB. If one or more of the tonal components T5, . . . T1 are missing, then a zero value is combined 2403 at that iteration. If the frame includes only the scale-factor grids, then the decoder substitutes a noise signal for Grid G to decode the output signal. As a result, the decoder can seamlessly reconstruct the decoded output signal as the composition of each frame of the scaled bit stream may change due to the content of the signal, changing data rate constraints, etc.
The general form of the Inverse Hierarchical Filterbank 2850 is shown in
In
In an exemplary implementation, N=256, P=32, and M=4. Note that different transform sizes and sub-band groupings represented by different choices for N, P, and M can also be employed to achieve a desired time/frequency decomposition.
Inverse Hierarchical Filterbank: Non-Uniformly Spaced Sub-Bands
Another embodiment of the Inverse Hierarchical Filterbank is shown in
In this case, the first synthesis element 3110 omits the steps of buffering 3122, windowing 3124, and the MDCT 3126 of the detailed element shown in
The output of the first element 3110 and 96 coefficients 3106 are input to the second element 3112 and combined as shown in
Further details regarding the decoder blocks will now be described.
Bit stream Parser 600
The Bit stream Parser 600 reads IFF chunk information from the bit stream and passes elements of that information on to the appropriate decoder, Tone Decoder 601 or Residual Decoder 602. It is possible that the bit stream may have been scaled before reaching the decoder. Depending on the method of scaling employed, psychoacoustic data elements at the end of a chunk may be invalid due to missing bits. Tone Decoder 601 and Residual Decoder 602 appropriately ignore data found to be invalid at the end of a chunk. An alternative to Tone Decoder 601 and Residual Decoder 602 ignoring whole psychoacoustic data elements, when bits of the element are missing, is to have these decoders recover as much of the element as possible by reading in the bits that do exist and filling in the remaining missing bits with zeros, random patterns or patterns based on preceding psychoacoustic data elements. Although more computationally intensive, the use of data based on preceding psychoacoustic data elements is preferred because the resulting decoded audio can more closely match the original audio signal.
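The preferred recovery strategy for a truncated psychoacoustic data element can be sketched as below. The function and argument names are hypothetical; the logic simply keeps the bits that survived scaling and fills the missing tail from the preceding element, falling back to zeros when no predecessor exists.

```python
def recover_element(bits, expected_len, previous=None):
    """Hedged sketch: recover a partially received data element by
    reading its surviving bits and filling the missing tail from the
    preceding element of the same kind, or with zeros otherwise."""
    missing = expected_len - len(bits)
    if missing <= 0:
        return bits[:expected_len]          # element arrived intact
    if previous is not None and len(previous) >= expected_len:
        return bits + previous[len(bits):expected_len]
    return bits + [0] * missing             # fallback: zero fill
```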
Tone Decoder 601
Tone information found by the Bit stream Parser 600 is processed via Tone Decoder 601. Re-synthesis of tonal components is performed using the hierarchical filterbank as previously described. Alternatively, an Inverse Fast Fourier Transform whose size matches the smallest transform size used to extract the tonal components at the encoder can be used.
The following steps are performed for tonal decoding:
a) Initialize the frequency domain sub-frame with zero values
b) Re-synthesize the required portion of tonal components from the smallest transform size into the frequency domain sub-frame
c) Re-synthesize and add at the required positions, tonal components from the other four transform sizes into the same sub-frame. The re-synthesis of these other four transform sizes can occur in any order.
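The three steps above can be sketched as follows. This is an illustrative accumulator only: the `(position, contribution)` pairs stand in for the re-synthesized spectra of each transform-size group, and the order of the groups is immaterial, as step (c) states.

```python
def decode_tones(subframe_len, tone_groups):
    """Sketch of steps (a)-(c): zero-initialize the frequency-domain
    sub-frame, then accumulate re-synthesized tonal contributions
    from each transform-size group, in any order."""
    re = [0.0] * subframe_len    # (a) zero-initialized sub-frame
    im = [0.0] * subframe_len
    for group in tone_groups:    # (b) smallest size, (c) remaining sizes
        for pos, contrib in group:
            re[pos] += contrib.real
            im[pos] += contrib.imag
    return re, im
```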
Tone Decoder 601 decodes the following values for each transform size grouping: quantized amplitude, quantized phase, spectral distance from the previous tonal component for the grouping, and the position of the component within the full frame. For multichannel signals, the secondary information is stored as differences from the primary channel values and needs to be restored to absolute values by adding the values obtained from the bit stream to the value obtained for the primary channel. For multichannel signals, per-channel ‘presence’ of the tonal component is also provided by the bit mask 3602 which is decoded from the bit stream. Further processing on secondary channels is done independently of the primary channel. If Tone Decoder 601 is not able to fully acquire the elements necessary to reconstruct a tone from the chunk, that tonal element is discarded. The quantized amplitude is dequantized using the inverse of the table used to quantize the value in the encoder. The quantized phase is dequantized using the inverse of the linear quantization used to quantize the phase in the encoder. The absolute frequency spectral position is determined by adding the difference value obtained from the bit stream to the previously decoded value. Defining Amplitude to be the dequantized amplitude, Phase to be the dequantized phase, and Freq to be the absolute frequency position, the following pseudo-code describes the re-synthesis of tonal components of the smallest transform size:
Re[Freq]+=Amplitude*sin(2*Pi*Phase/8);
Im[Freq]+=Amplitude*cos(2*Pi*Phase/8);
Re[Freq+1]+=Amplitude*sin(2*Pi*Phase/8);
Im[Freq+1]+=Amplitude*cos(2*Pi*Phase/8);
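The pseudo-code above can be expressed as a runnable sketch (Python used for illustration; the `re`/`im` arrays and the 8-level phase convention follow the pseudo-code directly):

```python
import math

def synth_smallest(re, im, freq, amplitude, phase):
    """Runnable form of the pseudo-code above for the smallest
    transform size: a tonal component contributes to two adjacent
    spectral lines, with the dequantized phase index taken from
    an 8-level quantizer."""
    s = amplitude * math.sin(2 * math.pi * phase / 8)
    c = amplitude * math.cos(2 * math.pi * phase / 8)
    re[freq] += s
    im[freq] += c
    re[freq + 1] += s      # same contribution spills into the next line
    im[freq + 1] += c
```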
Re-synthesis of longer base functions is spread over more sub-frames; therefore, the amplitude and phase values need to be updated according to the frequency and length of the base function. The following pseudo-code describes how this is done:
where:
Re-synthesis of lower frequencies in the largest three transform sizes via the method described above causes audible distortion in the output audio; therefore, the following empirically based correction is applied to spectral lines less than 60 in groups 3, 4, and 5:
where:
Since the bit stream does not contain any information as to the number of tonal components encoded, the decoder simply reads tone data for each transform size until it runs out of data for that size. Thus, tonal components removed from the bit stream by external means have no effect on the decoder's ability to handle data still contained in the bit stream. Removing elements from the bit stream merely degrades audio quality by the amount of the data removed. Tonal chunks can also be removed entirely, in which case the decoder does not perform any reconstruction of tonal components for that transform size.
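The read-until-exhausted behavior can be sketched as a simple loop. The `reader` object and its `next_tone()` accessor are hypothetical stand-ins for the chunk-reading machinery; the point is that no component count is needed, so externally truncated chunks terminate the loop naturally.

```python
def read_tone_chunk(reader):
    """Sketch: read tonal components for one transform size until the
    chunk data is exhausted; a truncated trailing element simply ends
    the loop (hypothetical `reader.next_tone()` returns None then)."""
    tones = []
    while True:
        tone = reader.next_tone()
        if tone is None:          # end of chunk or invalid partial element
            break
        tones.append(tone)
    return tones
```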
Inverse Frequency Transform 604
The Inverse Frequency Transform 604 is the inverse of the transform used to create the frequency domain representation in the encoder. The current embodiment employs the inverse hierarchical filterbank described above. Alternately, an Inverse Fast Fourier Transform which is the inverse of the smallest FFT used to extract tones by the encoder may be used, provided overlapping FFTs were used at encode time.
Residual Decoder 602
A detailed block diagram of Residual Decoder 602 is shown in
where:
Time samples found by Bit stream Parser 600 are dequantized in Dequantizer 700. Dequantizer 700 dequantizes time samples from the bit stream using the inverse process of the encoder. Time samples from sub-band zero are dequantized to 16 levels, sub-bands 1 and 2 to 8 levels, sub-bands 11 through 25 to 3 levels, and sub-bands 26 through 31 to 2 levels. Any missing or invalid time samples are replaced with a pseudo-random sequence of values in the range of −1 to 1 having a white-noise spectral energy distribution. This improves scaled bit stream audio quality since such a sequence of values has characteristics that more closely resemble the original signal than replacement with zero values.
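Dequantization with white-noise substitution for missing samples can be sketched as follows. The uniform inverse mapping over [−1, 1] is an assumed simplification of the encoder's actual quantization tables; missing samples are marked with `None` for illustration.

```python
import random

def dequantize(indices, levels):
    """Sketch of Dequantizer 700: map quantization indices back to
    [-1, 1] per sub-band level count; missing samples (None) are
    replaced with uniform pseudo-random values in [-1, 1]."""
    out = []
    for idx, n in zip(indices, levels):
        if idx is None:
            out.append(random.uniform(-1.0, 1.0))  # white-noise fill
        else:
            out.append(-1.0 + idx * 2.0 / (n - 1)) # inverse uniform map
    return out
```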
Channel Demuxer 701
Secondary channel information in the bit stream is stored as the difference from the primary channel for some sub-bands, depending on flags set in the bit stream. For these sub-bands, Channel Demuxer 701 restores values in the secondary channel from the values in the primary channel and the difference values in the bit stream. If secondary channel information is missing from the bit stream, secondary channel information can be roughly recovered from the primary channel by duplicating the primary channel information into the secondary channels and using the stereo grid, to be subsequently discussed.
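The difference-restoration step of Channel Demuxer 701 can be sketched per sub-band as below. The flag list is a stand-in for the per-sub-band joint-coding flags carried in the bit stream; unflagged sub-bands are assumed to carry absolute secondary values.

```python
def demux_secondary(primary, decoded, diff_flags):
    """Sketch of Channel Demuxer 701: where the joint-coding flag is
    set, the secondary value is primary plus the decoded difference;
    otherwise the decoded value is already absolute."""
    return [p + d if flagged else d
            for p, d, flagged in zip(primary, decoded, diff_flags)]
```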
Channel Reconstruction 706
Stereo Reconstruction 706 is applied to secondary channels when no secondary channel information (time samples) is found in the bit stream. The stereo grid, reconstructed by Grid Decoder 702, is applied to the secondary time samples, recovered by duplicating the primary channel time sample information, to maintain the original stereo power ratio between channels.
Multichannel Reconstruction
Multichannel Reconstruction 706 is applied to secondary channels when no secondary information (either time samples or grids) for the secondary channels is present in the bit stream. The process is similar to Stereo Reconstruction 706, except that the partial grid, reconstructed by Grid Decoder 702, is applied to the time samples of the secondary channels within each channel group, recovered by duplicating primary channel time sample information, to maintain the proper power level in each secondary channel. The partial grid is applied individually to each secondary channel in the reconstructed channel group, following scaling by the other scale factor grid(s) including grid G0 in scaling step 703, by multiplying the time samples of Grid G by the corresponding elements of the partial grid for each secondary channel. The Grid G0 and partial grids may be applied in any order in keeping with the present invention.
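The core of this reconstruction reduces to an element-wise scaling of duplicated primary samples, sketched below. This omits the earlier Grid G0 scaling step 703 and shows only the partial-grid multiplication for one secondary channel.

```python
def reconstruct_secondary(primary_samples, partial_grid):
    """Sketch of Multichannel Reconstruction 706 for one secondary
    channel: duplicate the primary channel time samples and multiply
    by the partial grid to restore the original power ratio."""
    return [s * g for s, g in zip(primary_samples, partial_grid)]
```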
While several illustrative embodiments of the invention have been shown and described, numerous variations and alternate embodiments will occur to those skilled in the art. Such variations and alternate embodiments are contemplated, and can be made without departing from the spirit and scope of the invention as defined in the appended claims.
This application claims benefit of priority under 35 U.S.C. 119(e) to U.S. Provisional Application No. 60/691,558 entitled “Scalable Compressed Audio Bit Stream and Codec Using a Hierarchical Filterbank” and filed on Jun. 17, 2005, the entire contents of which are incorporated by reference.
| Number | Date | Country |
|---|---|---|
| 60691558 | Jun 2005 | US |