The present invention relates generally to the field of digital signal processing. More specifically, the present invention is related to efficient coding of high frequency signal information.
In prior art audio compression schemes, such as perceptual audio coding (PAC), audio is typically coded as the output of a filterbank. The filterbank provides a frequency or a time-frequency representation of the signal. The filterbank outputs are quantized using a quantization function based on a psychoacoustic model, wherein the psychoacoustic model accounts for the non-linear frequency sensitivity of the human ear (the destination) by using a non-linear frequency resolution (the Bark scale) in the quantizer. However, there are often non-linearities at the signal production stage (i.e., in the source), which result in interdependencies between the low and high frequency components of a signal. The linear filterbanks employed in PAC or similar codecs (e.g., the modified discrete cosine transform (MDCT) and/or wavelets) are not capable of taking advantage of the redundancies that arise from such non-linearities at the signal production stage.
Furthermore, although the linear filterbank used in PAC or similar codecs (i.e., wavelet/MDCT) does a good job of de-correlating the signal in the time domain, significant correlation often remains in the frequency domain representation of the signal. This correlation may be both short term (i.e., between samples located in adjacent frequency bins) and long term (i.e., between frequency bins which are far apart in frequency). This is particularly true for musical instruments and voiced speech, which have a clearly defined harmonic structure. Conventional audio coding schemes nevertheless make little, if any, effort to take advantage of this correlation.
Furthermore, in prior art PAC systems, several features, such as Huffman scale factor quantization or multidimensional peaks, had to be permanently selected or deselected prior to the system being deployed in the field. Additionally, the present invention's enhanced PAC algorithm incorporates techniques for efficient coding of higher frequency components in the signal. These techniques are often suitable for only a segment of higher frequencies. Furthermore, separate systems that incorporated PAC with differing pre-selected feature sets were not functionally interoperable.
High quality speech is produced via various coding techniques, one of which is code-excited linear prediction or CELP. The CELP coder is a model wherein the vocal tract is modeled via short-term synthesis filters, and the glottal excitation is modeled via long-term synthesis filters. Thus, the CELP encoder synthesizes speech via these short-term and long-term synthesis filters in a feedback loop.
A basic CELP coder is illustrated in
$$P_l(z) = \beta z^{-p}$$
where p is the pitch period, and β is the predictor tap.
On the other hand, the short-term predictor (often referred to as the linear prediction coding (LPC) predictor) is an nth order predictor with a transfer function of:
$$P_s(z) = \sum_{i=1}^{n} a_i z^{-i}$$
wherein $a_1$ through $a_n$ are the predictor coefficients.
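The action of the two predictors on a sample sequence may be sketched as follows. This is a minimal illustration only, assuming a single-tap long-term predictor of the form β·z^-p and an nth order short-term predictor whose ith coefficient multiplies the sample i steps in the past; the function names are hypothetical and the code shows only the analysis (prediction removal) side, not a complete CELP coder.

```python
def long_term_residual(x, beta, p):
    """Apply the analysis filter 1 - beta * z^-p: remove the pitch prediction."""
    return [x[n] - (beta * x[n - p] if n >= p else 0.0) for n in range(len(x))]


def short_term_residual(x, a):
    """Apply the analysis filter 1 - sum_i a_i * z^-i, with a[0] holding a_1."""
    out = []
    for n in range(len(x)):
        # Predict from up to len(a) past samples; samples before n = 0 count as zero.
        pred = sum(a[i - 1] * x[n - i] for i in range(1, len(a) + 1) if n - i >= 0)
        out.append(x[n] - pred)
    return out


# A waveform that repeats every 4 samples is fully predicted by beta = 1, p = 4.
periodic = [1.0, -2.0, 0.5, 3.0] * 3
pitch_residual = long_term_residual(periodic, beta=1.0, p=4)

# A sequence that doubles each step is fully predicted by a first-order tap a_1 = 2.
doubling = [1.0, 2.0, 4.0, 8.0]
lpc_residual = short_term_residual(doubling, [2.0])
```

In both toy cases only the first samples, for which no history exists, survive in the residual.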
As illustrated in
The present invention provides for a method and a system that takes advantage of interdependencies between the higher frequency and lower frequency signal components that may arise due to non-linearities in signal production or because of a periodic harmonic structure. This results in a more efficient coding scheme than the prior art, which is therefore capable of generating higher audio bandwidth and/or better audio quality at lower bit rates. Long-term and short-term frequency domain correlation is eliminated in a signal via frequency domain predictors. The prediction efficiency can be potentially and adaptively increased with the help of a non-linear model. Thus, the present invention's coding scheme compresses information consisting of coded low frequency components (from a low pass filter with a cut-off frequency of f1) as well as a parametric representation for the high frequency components (from a high pass filter with a cut-off frequency of fh) based on a linear/non-linear model. The parametric representation requires significantly fewer bits than conventional coding of the higher frequency components. These parameters for the high frequency model representation are updated every audio frame.
Additionally, the present invention works in the frequency domain representations of the signal (such as the MDCT representation which is naturally available to the PAC encoder and decoder), wherein low pass and high pass signal components are easily obtained by windowing the appropriate ranges of frequencies in the signal. Furthermore, the power functions (in a non-linear model) of the signal are replaced by corresponding convolution functions in the frequency domain of the same order. Also, the model of the present invention can be adapted to different frequency bands (i.e., a separate set of model parameters can be estimated and transmitted for different frequency regions, thereby reducing the overall estimation error). Furthermore, the convolution operation adds less to the decoder complexity than the power function.
In an extended embodiment of the present invention, the high frequency component is represented as the model output plus a residual component, wherein the reconstruction error or residual R(f) is coded separately using the conventional PAC coding scheme. With a high degree of model fit, the resulting residual is significantly less complex to encode, thus requiring fewer bits than the original high frequency component. The present invention also allows for compression mechanisms to be determined “on-the-fly” and transmitted via the header at playback time. The types of features which may be adaptively chosen include techniques such as lattice quantization of scale factors, multidimensional coding of the peaks, and selection of a frequency range most amenable to efficient high frequency coding.
As noted above, prior art systems make little effort to exploit the strong frequency domain correlation that is exhibited by many signals containing a strong harmonic structure. This aspect is illustrated in
In the present invention, long term and short term frequency domain correlation is eliminated in the signal with the help of frequency domain predictors. This is done for every audio frame (an audio frame in PAC consists of 1024 pulse code modulated (PCM) samples). The focus is primarily on the high frequency components of the signal, denoted as XHFC(f), and on inter-harmonic correlation removal. It should further be noted that the inter-harmonic correlation is eliminated with the help of a long-term prediction filter, such as a three-tap filter shown below:
In the above equation, βi represent the filter taps and M is the optimum correlation lag, i.e., the lag for which frequency components exhibit maximum inter-frequency correlation. This filter is illustrated in
The predictor taps βi (β1, β2, β3 in case of the three-tap filter in
R·a=r
The estimation of the optimal predictor coefficients is described in detail later in the specification.
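A frequency-domain long-term predictor of this kind may be sketched as follows. The sketch assumes, purely for illustration, that the three taps weight the bins at lags M+1, M, and M−1 below the bin being predicted, that M is chosen by maximizing the squared correlation between X(f) and X(f−M), and that the taps follow from least-squares normal equations of the form R·a=r. The function names, the toy spectrum, and the search range are hypothetical.

```python
def solve_linear(R, r):
    """Solve R a = r by Gaussian elimination with partial pivoting."""
    n = len(r)
    aug = [list(R[i]) + [r[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    a = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * a[k] for k in range(i + 1, n))
        a[i] = (aug[i][n] - tail) / aug[i][i]
    return a


def best_lag(X, start, end, min_lag, max_lag):
    """Return the lag M in [min_lag, max_lag] with the largest squared correlation."""
    def score(M):
        num = sum(X[f] * X[f - M] for f in range(start, end))
        den = sum(X[f - M] ** 2 for f in range(start, end))
        return num * num / den if den > 0.0 else 0.0
    return max(range(min_lag, max_lag + 1), key=score)


def three_tap_coefficients(X, start, end, M):
    """Least-squares taps for predicting X[f] from X[f-M-1], X[f-M], X[f-M+1]."""
    offsets = (-1, 0, 1)
    bins = range(start, end)
    R = [[sum(X[f - M + oj] * X[f - M + ok] for f in bins) for ok in offsets]
         for oj in offsets]
    r = [sum(X[f] * X[f - M + oj] for f in bins) for oj in offsets]
    return solve_linear(R, r)


# Toy harmonic spectrum: a pattern repeating every 8 bins, scaled by 0.9 per repeat.
shape = [1.0, 0.6, 0.2, -0.1, -0.3, 0.1, 0.3, 0.4]
X = [0.9 ** (k // 8) * shape[k % 8] for k in range(64)]

M = best_lag(X, start=24, end=64, min_lag=4, max_lag=12)
taps = three_tap_coefficients(X, 24, 64, M)
```

For this toy spectrum the search recovers the 8-bin harmonic spacing and the middle tap carries essentially all of the prediction (about 0.9), so the prediction residual in the high band is close to zero.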
In an enhancement to this scheme, the “whitened” high frequency residual may be further whitened using a conventional short-term predictor. The resulting residual may then even be modeled as Gaussian white noise and coded with the help of a random code-book. In a further enhancement to the above scheme, the high frequency components in the signal are modeled as being derivable from another signal(s) that is (are) obtained by applying non-linear processing to a low pass filtered version of the same signal (baseband). The nature of the non-linear processing and/or the dependency of the high frequency components on the non-linearly processed baseband are adaptively estimated on a frame-by-frame basis. The scheme therefore takes advantage of any interdependencies between the higher frequency and lower frequency signal components that may arise due to non-linearities in the signal production. This results in a more efficient coding scheme than the prior art, which is capable of generating higher audio bandwidth and/or better audio quality at lower bit rates.
The above-described enhancement of the present invention is outlined in
In a practical coding scheme a convenient form for the non-linearity in
The parametric model description for high frequency components, therefore, consists of the order of the polynomial non-linearity N and the coefficients αi. For each frame of audio, one then needs to solve an identification problem to find optimal estimates for N and the αi so that the model in equation (1) provides the best description of the high frequency components in the signal (e.g., the power of the reconstruction error, RHFC, is minimized). A simple two-step solution to this identification problem works as follows. As mentioned above, for a fixed N, closed form expressions for the optimal αi can be obtained by solving a matrix equation of the form
R·a=r (2)
where $R = [R_{ij}]$, $i = 1, \ldots, N$, $j = 1, \ldots, N$, and $R_{ij} = \langle [x_{LFC}(t)]^i \cdot [x_{LFC}(t)]^j \rangle$; $a = [\alpha_1, \alpha_2, \ldots, \alpha_N]'$; and $r = [r_i]$, for $i = 1, \ldots, N$, and $r_i = \langle x_{HFC}(t) \cdot [x_{LFC}(t)]^i \rangle$. Therefore, for a given N, the above equation may be solved to obtain the set of optimal coefficients $\{\alpha_i\}$ and the corresponding minimum approximation error may then be computed. The model order N is obtained by examining the minimum approximation error over a small range of N and then choosing N for which the optimal approximation error is minimized.
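The two-step identification procedure may be sketched as follows. The sketch assumes that the model expresses the high frequency component as a weighted sum of the first N powers of the low frequency component, and that the angle brackets denote an average over the samples of a frame; the function names and the synthetic test signal are hypothetical.

```python
import math


def solve_linear(R, r):
    """Solve R a = r by Gaussian elimination with partial pivoting."""
    n = len(r)
    aug = [list(R[i]) + [r[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    a = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * a[k] for k in range(i + 1, n))
        a[i] = (aug[i][n] - tail) / aug[i][i]
    return a


def mean(values):
    return sum(values) / len(values)


def fit_polynomial_model(x_lfc, x_hfc, N):
    """Step 1: for a fixed order N, solve R a = r and return (alphas, error power)."""
    orders = range(1, N + 1)
    R = [[mean([x ** (i + j) for x in x_lfc]) for j in orders] for i in orders]
    r = [mean([h * x ** i for x, h in zip(x_lfc, x_hfc)]) for i in orders]
    alphas = solve_linear(R, r)
    error = mean([
        (h - sum(a * x ** i for i, a in enumerate(alphas, start=1))) ** 2
        for x, h in zip(x_lfc, x_hfc)
    ])
    return alphas, error


def select_order(x_lfc, x_hfc, max_order):
    """Step 2: examine orders 1..max_order and keep the one with the least error."""
    fits = {N: fit_polynomial_model(x_lfc, x_hfc, N) for N in range(1, max_order + 1)}
    best_N = min(fits, key=lambda N: fits[N][1])
    return best_N, fits[best_N][0], fits[best_N][1]


# Synthetic frame in which the "high band" is an exact quadratic function of the baseband.
x_lfc = [math.sin(0.3 * t) for t in range(200)]
x_hfc = [0.5 * x + 0.25 * x * x for x in x_lfc]

alphas_2, error_2 = fit_polynomial_model(x_lfc, x_hfc, 2)
best_N, best_alphas, best_error = select_order(x_lfc, x_hfc, 3)
```

Because the least-squares error cannot increase as N grows, a plain minimum tends to favor the largest order examined; a practical encoder would presumably also weigh the cost of transmitting additional coefficients, which this sketch does not model.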
In the development of the proposed scheme it was further realized that it is advantageous to work with the frequency domain representations of the signal. In a frequency domain representation (such as the MDCT representation which is naturally available to the PAC encoder and decoder), low pass and high pass signal components are easily obtained by windowing the appropriate ranges of frequencies in the signal. Furthermore, the power functions in (1) are replaced by corresponding convolution functions of the same order. In other words, if XLFC(f) and XHFC(f) denote the frequency transforms of xLFC(t) and xHFC(t) respectively, then equation (1) in the frequency domain may be rewritten as
where (X*X* . . . *X)i represents the ith order convolution of X with itself; e.g., for i=2, (X*X* . . . *X)i=X*X.
Working in the frequency domain offers several additional advantages. One advantage is that the model itself can be adapted to different frequency bands (i.e., a separate set of model parameters can be estimated and transmitted for different frequency regions, thereby reducing the overall estimation error). Furthermore, the convolution operation adds less to the decoder complexity than the power function. When the frequency domain representations are used, the model parameters may be estimated using exactly the same procedure as outlined above with the time domain representation.
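The frequency domain form of the model may be sketched as follows. The sketch assumes that the ith order term is the linear (full) convolution of the baseband coefficient sequence with itself, evaluated only over the bins of one high frequency band, and that coefficients are plain real numbers; the function names, bin ranges, and coefficient values are hypothetical. The correspondence between time domain powers and frequency domain convolutions is exact for the Fourier transform, and the sketch makes no claim about how closely it holds for other transforms.

```python
def convolve(a, b):
    """Full linear convolution of two coefficient sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def self_convolution(X, order):
    """Return X convolved with itself so that `order` copies of X are combined."""
    result = list(X)
    for _ in range(order - 1):
        result = convolve(result, X)
    return result


def model_high_band(X_lfc, alphas, start, end):
    """Evaluate sum_i alphas[i-1] * (X*...*X)_i over bins start .. end-1."""
    out = [0.0] * (end - start)
    for i, alpha in enumerate(alphas, start=1):
        conv = self_convolution(X_lfc, i)
        for f in range(start, min(end, len(conv))):
            out[f - start] += alpha * conv[f]
    return out


# Baseband occupying bins 2 and 3 of an 8-bin spectrum (windowed low pass component).
X_lfc = [0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0]

# A purely second-order model fills bins 4..7 from the baseband alone.
predicted_high = model_high_band(X_lfc, alphas=[0.0, 0.5], start=4, end=8)
```

A separate coefficient list could be passed for each band of interest, which corresponds to estimating and transmitting a separate set of model parameters per frequency region.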
In summary, in the extended embodiment of the present invention, the high-frequency component is represented as
Wherein, in the first part of the present invention,
X′LFC(f)=XLFC(f) (4a)
and in the second (optional) part of the present invention,
It should be noted that the non-linear part is a beautification/refinement and is not “essential” to the invention. Therefore, various embodiments can be envisioned, depending on the processing power available.
In this coding scheme, model parameters are estimated as above. In addition, the model reconstruction error or residual R(f) is coded separately using either (i) the conventional PAC coding scheme or (ii) efficient vector quantization techniques. Assuming a high degree of model fit, the resulting residual is significantly less complex to encode, thus requiring fewer bits than the original high frequency component. A modified scheme is illustrated in
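The relationship between the model output, the residual, and the reconstructed high band may be sketched as follows. A uniform scalar quantizer is used here only as a stand-in for the residual coder, which in the described scheme would be PAC coding or vector quantization; the function names, the step size, and the numeric values are hypothetical.

```python
def model_residual(X_hfc, model_out):
    """R(f): the part of the high band that the model output does not explain."""
    return [x - m for x, m in zip(X_hfc, model_out)]


def energy(values):
    return sum(v * v for v in values)


def quantize(values, step):
    """Uniform scalar quantizer, a stand-in for the actual residual coder."""
    return [round(v / step) * step for v in values]


def reconstruct(model_out, decoded_residual):
    """Decoder side: high band = model output + decoded residual."""
    return [m + r for m, r in zip(model_out, decoded_residual)]


X_hfc = [0.5, 2.1, 1.9, 0.1]      # actual high band coefficients
model_out = [0.5, 2.0, 2.0, 0.0]  # output of a well-fitting model

R = model_residual(X_hfc, model_out)
step = 0.05
X_rec = reconstruct(model_out, quantize(R, step))
```

With a good model fit the residual carries only a small fraction of the energy of the original high band, which is why it can be expected to need fewer bits.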
Audio signal content can have a wide array of characteristics that change over time, e.g., from speech only, to voice over music, to all genres of music. Most compression algorithms allow for a single method of compression to be used, i.e., transform based, model based, etc. However, this does not capture the time-varying nature of audio, nor does it contain the capability of representing the audio efficiently. A flexible content-based compressed audio bitstream header allows the processing to change along with the audio signal. Improvements in the overall audio quality and interoperability between systems are achieved by allowing the systems to choose compression mechanisms “on-the-fly” and transmit the processing state via the bitstream header.
A flexible content-based compressed audio bitstream header allows the system to produce additional coding gains by changing or using a combination of algorithms that produces the best compression ratio while maintaining a high-level of subjective audio quality. That compression mechanism can then be determined “on-the-fly” and transmitted via the header at playback time. The type of features which may be adaptively chosen include techniques such as lattice quantization of scale factors, multidimensional coding of the peaks, and selection of a frequency range most amenable towards efficient high frequency coding.
This section gives a general description of the header content of the PAC V4 bitstream. Each field of the header provides information from the encoder to the decoder on what processing to perform while reconstructing a frame of compressed audio data.
M (Mono) Field 702—This 1-bit field defines whether one or two channels are to be decoded to produce stereo outputs. If the value of this field is “0”, then two channels are to be decoded (“stereo”), and if the value of this field is “1”, then only one channel is decoded (“mono”).
Q (Huffman Scale Factor Lattice Quantization) 704—This 1-bit field defines which codebooks to use to decode the Huffman scale factors. If the value of this field is “0”, then non-lattice codebooks are used; and if the value of this field is “1”, then lattice codebooks are used.
P (Multi-dimensional Peaks) 706—This 1-bit field defines whether to decode the spectrum peaks using the multidimensional (MD) peaks codebook. Thus, a value of “1” in this field decodes the spectrum peaks using the MD peaks codebook, and a value of “0” in this field decodes the spectrum using the non-MD peaks codebook.
PM (Prediction Mode) 708—This 2-bit field defines whether high frequency prediction will be used and which method will be implemented (e.g., a value of “00” corresponds to an unused field; a value of “01” corresponds to a recursive prediction mode; a value of “10” corresponds to a non-recursive prediction mode; and a value of “11” corresponds to a spread/conv prediction mode).
SB (Start Bin) 710—This 2-bit field indicates at what frequency bin the high frequency prediction should begin.
EB (End Bin) 712—This 2-bit field indicates at what frequency bin the high frequency prediction should end.
R (Residue Coding) 714—This 1-bit field defines whether to decode the high frequency residue if it has been included. A value of “0” indicates no residue, and therefore no decoding is necessary. On the other hand, a value of “1” indicates a residue and thus requires residue coding.
N (Non-Linear Companding) 716—This 1-bit field defines whether or not to perform non-linear companding. A value of “0” indicates no companding, and a value of “1” indicates companding.
U (Upsampling) 718—This 1-bit field indicates whether or not to upsample and compand audio data.
SN (Sequence Number) 720—This 2-bit field indicates whether a different sequence set exists for different upsampling ratios.
X (Expansion) 722—This 1-bit field provides for future upgrades and backwards compatibility. If the bit is set, it is interpreted to be the S bit and indicates additional data.
S (Stereo High Frequency Coding) 724—This bit indicates that the high frequency content is stereo. A value of “0” indicates that stereo coding is not necessary and a value of “1” indicates that stereo coding is necessary.
H (HF Stability) 726—This 1-bit field indicates whether or not to use the stable parameters for the recursive prediction mode.
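The header fields listed above may be packed and parsed as sketched below. The field widths are those stated above; the bit ordering (fields in the listed order, most significant bit first) and the treatment of every field as always present are assumptions made only for illustration, and the function names are hypothetical.

```python
# (name, width in bits), in the order the fields are listed above.
HEADER_FIELDS = [
    ("M", 1), ("Q", 1), ("P", 1), ("PM", 2), ("SB", 2), ("EB", 2), ("R", 1),
    ("N", 1), ("U", 1), ("SN", 2), ("X", 1), ("S", 1), ("H", 1),
]
HEADER_BITS = sum(width for _, width in HEADER_FIELDS)

PREDICTION_MODES = {0: "unused", 1: "recursive", 2: "non-recursive", 3: "spread/conv"}


def pack_header(values):
    """Pack a dict of field values into one integer; missing fields default to 0."""
    word = 0
    for name, width in HEADER_FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
        word = (word << width) | value
    return word


def unpack_header(word):
    """Split a packed header integer back into a dict of field values."""
    values = {}
    shift = HEADER_BITS
    for name, width in HEADER_FIELDS:
        shift -= width
        values[name] = (word >> shift) & ((1 << width) - 1)
    return values


# Mono frame using non-recursive prediction from start bin 1 to end bin 3, with residue.
word = pack_header({"M": 1, "PM": 2, "SB": 1, "EB": 3, "R": 1})
header = unpack_header(word)
mode_name = PREDICTION_MODES[header["PM"]]
```

A decoder built along these lines would branch on the parsed values, for example skipping residue decoding when R is 0.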
It should be noted that the Shaded fields (SB 710, EB 712, R 714, S 724, and H 726) in
The present invention incorporates a computer program code based product, which is a storage medium having program code stored therein, which can be used to instruct a computer to perform any of the methods associated with the present invention. The computer storage medium includes any of, but not limited to, the following: CD-ROM, DVD, magnetic tape, optical disc, hard drive, floppy disk, ferroelectric memory, flash memory, ferromagnetic memory, optical storage, charge coupled devices, magnetic or optical cards, smart cards, EEPROM, EPROM, RAM, ROM, DRAM, SRAM, SDRAM, or any other appropriate static or dynamic memory, or data storage devices.
Implemented in computer program code based products are software modules for: extracting low-frequency components of said signal; receiving said extracted high and low frequency components and producing a set of linear predictive filter coefficients by modeling said high frequency components as a function of low frequency components, said function given by either:
or a combination of the above two functions, wherein (X*X* . . . *X)i represents the ith order convolution of X onto itself; XHFC(f) and XLFC(f) denote the frequency transform of said high and low frequency components respectively; M is the optimum correlation lag; N represents the model order; encoding said extracted low-frequency components, and multiplexing said set of linear predictive filter coefficients and said encoded contents and forming an encoded output signal.
A system and method have been shown in the above embodiments for the effective implementation of an efficient coding of high frequency signal information in a signal using non-linear prediction based on a low pass baseband. The above system and method may be implemented in various computing environments. For example, the present invention may be implemented on a conventional IBM PC or equivalent, multi-nodal system (e.g., LAN) or networking system (e.g., Internet, WWW, wireless web). All programming and data related thereto are stored in computer memory, static or dynamic, and may be retrieved by the user in any of: conventional computer storage, display (i.e., CRT) and/or hardcopy (i.e., printed) formats. The programming of the present invention may be implemented by one of skill in the art of digital signal processing.
While various preferred embodiments have been shown and described, it will be understood that there is no intent to limit the invention by such disclosure, but rather, it is intended to cover all modifications and alternate constructions falling within the spirit and scope of the invention, as defined in the appended claims. For example, the present invention should not be limited by the order of the tap filter used, number of fields in the bitstream header, software/program, computing environment, or specific hardware.
Number | Date | Country | |
---|---|---|---|
20040064311 A1 | Apr 2004 | US |