I. Technical Field
This disclosure generally relates to digital signal processing, and more specifically, to techniques for encoding and decoding signals for storage and/or communication.
II. Background
In digital communications, signals are typically coded for transmission and decoded for reception. Coding of signals concerns converting the original signals into a format suitable for propagation over a transmission medium. The objective is to preserve the quality of the original signals, but at a low consumption of the medium's bandwidth. Decoding of signals involves the reverse of the coding process.
A known coding scheme uses the technique of pulse-code modulation (PCM).
To conserve bandwidth, the digital values of the PCM pulses 20 can be compressed using a logarithmic companding process prior to transmission. At the receiving end, the receiver merely performs the reverse of the coding process mentioned above to recover an approximate version of the original time-varying signal x(t). Apparatuses employing the aforementioned scheme are commonly called a-law or μ-law codecs.
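As an illustration of logarithmic companding, the following is a minimal μ-law compress/expand pair in Python (μ = 255 as in North American telephony; the function names are illustrative and not part of this disclosure):

```python
import math

MU = 255.0  # mu-law constant used in North American telephony

def mulaw_compress(x):
    """Compress a sample x in [-1, 1] using logarithmic mu-law companding."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y):
    """Invert the compression to recover an approximation of the sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

The compressed value spends more of its range on small amplitudes, so a subsequent uniform quantizer effectively quantizes quiet samples more finely than loud ones.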
As the number of users increases, there is a further practical need for bandwidth conservation. For instance, in a wireless communication system, a multiplicity of users are often limited to sharing a finite amount of frequency spectrum. Each user is normally allocated a limited bandwidth among the other users. Thus, as the number of users increases, so does the need to further compress digital information in order to conserve the bandwidth available on the transmission channel.
For voice communications, speech coders are frequently used to compress voice signals. In the past decade or so, considerable progress has been made in the development of speech coders. A commonly adopted technique employs the method of code excited linear prediction (CELP). Details of CELP methodology can be found in publications entitled “Digital Processing of Speech Signals,” by Rabiner and Schafer, Prentice Hall, ISBN: 0132136031, September 1978; and “Discrete-Time Processing of Speech Signals,” by Deller, Proakis and Hansen, Wiley-IEEE Press, ISBN: 0780353862, September 1999. The basic principles underlying the CELP method are briefly described below.
Referring to
For simplicity, take only the three PCM pulse groups 22A-22C for illustration. During encoding prior to transmission, the digital values of the PCM pulse groups 22A-22C are consecutively fed to a linear predictor (LP) module. The resultant output is a set of frequency values, also called an “LP filter” or simply “filter” which basically represents the spectral content of the pulse groups 22A-22C. The LP filter is then quantized.
The LP module generates an approximation of the spectral representation of the PCM pulse groups 22A-22C. As such, during the predicting process, errors or residual values are introduced. The residual values are mapped to a codebook which carries entries of various combinations available for close matching of the coded digital values of the PCM pulse groups 22A-22C. The best fitted values in the codebook are mapped. The mapped values are the values to be transmitted. The overall process is called time-domain linear prediction (TDLP).
Thus, using the CELP method in telecommunications, the encoder (not shown) merely has to generate the LP filters and the mapped codebook values. The transmitter needs only to transmit the LP filters and the mapped codebook values, instead of the individually coded PCM pulse values as in the a- and μ-law encoders mentioned above. Consequently, a substantial amount of communication channel bandwidth can be saved.
The receiver also has a codebook similar to that in the transmitter. The decoder (not shown) in the receiver, relying on the same codebook, merely has to reverse the encoding process described above. Along with the received LP filters, the time-varying signal x(t) can be recovered.
Heretofore, many of the known speech coding schemes, such as the CELP scheme mentioned above, are based on the assumption that the signals being coded are short-time stationary. That is, the schemes are based on the premise that frequency contents of the coded frames are stationary and can be approximated by simple (all-pole) filters and some input representation in exciting the filters. The various TDLP algorithms, in arriving at the codebooks as mentioned above, are based on such a model. Nevertheless, voice patterns among individuals can be very different. Non-speech audio signals, such as sounds emanated from various musical instruments, are also distinguishably different from speech signals. Furthermore, in the CELP process as described above, to expedite real-time signal processing, a short time frame is normally chosen. More specifically, as shown in
As an improvement over TDLP algorithms, frequency domain linear prediction (FDLP) schemes have been developed to improve preservation of signal quality, applicable not only to human speech but also to a variety of other sounds, and further, to more efficiently utilize communication channel bandwidth. FDLP is basically the frequency-domain analogue of TDLP; however, FDLP coding and decoding schemes are capable of processing much longer temporal frames than TDLP. Just as TDLP fits an all-pole model to the power spectrum of an input signal, FDLP fits an all-pole model to the squared Hilbert envelope of an input signal. Although FDLP represents a significant advance in audio and speech coding techniques, there exists a need to improve the compression efficiency of FDLP codecs.
Disclosed herein is a new and improved approach to FDLP audio encoding and decoding. The techniques disclosed herein apply temporal masking to an estimated Hilbert carrier produced by an FDLP encoding scheme. Temporal masking is a property of the human auditory system whereby sounds occurring up to 100-200 ms after a strong, transient temporal signal are masked by the auditory system due to that strong temporal component. It has been discovered that modeling the temporal masking property of the human ear in an FDLP codec improves the compression efficiency of the codec.
According to an aspect of the approach disclosed herein, a method of encoding a signal includes providing a frequency transform of the signal, applying a frequency domain linear prediction (FDLP) scheme to the frequency transform to generate a carrier, determining a temporal masking threshold, and quantizing the carrier based on the temporal masking threshold.
According to another aspect of the approach, a system for encoding a signal includes a frequency transform component configured to produce a frequency transform of the signal, an FDLP component configured to generate a carrier in response to the frequency transform, a temporal mask configured to determine a temporal masking threshold, and a quantizer configured to quantize the carrier based on the temporal masking threshold.
According to another aspect of the approach, a system for encoding a signal includes means for providing a frequency transform of the signal, means for applying an FDLP scheme to the frequency transform to generate a carrier, means for determining a temporal masking threshold, and means for quantizing the carrier based on the temporal masking threshold.
According to another aspect of the approach, a computer-readable medium embodying a set of instructions executable by one or more processors includes code for providing a frequency transform of the signal, code for applying an FDLP scheme to the frequency transform to generate a carrier, code for determining a temporal masking threshold, and code for quantizing the carrier based on the temporal masking threshold.
According to another aspect of the approach, a method of decoding a signal includes providing quantization information determined according to a temporal masking threshold, inverse quantizing a portion of the signal, based on the quantization information, to recover a carrier, and applying an inverse-FDLP scheme to the carrier to recover a frequency transform of a reconstructed signal.
According to another aspect of the approach, a system for decoding a signal includes: a de-packetizer configured to provide quantization information determined according to a temporal masking threshold; an inverse-quantizer configured to inverse quantizing a portion of the signal, based on the quantization information, to recover a carrier; and an inverse-FDLP component configured to output a frequency transform of a reconstructed signal in response to the carrier.
According to another aspect of the approach, a system for decoding a signal includes means for providing quantization information determined according to a temporal masking threshold; means for inverse quantizing a portion of the signal, based on the quantization information, to recover a carrier; and means for applying an inverse-FDLP scheme to the carrier to recover a frequency transform of a reconstructed signal.
According to another aspect of the approach, a computer-readable medium embodying a set of instructions executable by one or more processors includes code for providing quantization information determined according to a temporal masking threshold; code for inverse quantizing a portion of the signal, based on the quantization information, to recover a carrier; and code for applying an inverse-FDLP scheme to the carrier to recover a frequency transform of a reconstructed signal.
According to another aspect of the approach, a method of determining a temporal masking threshold includes providing a first-order masking model of a human auditory system, determining the temporal masking threshold by applying a correction factor to the first-order masking model, and providing the temporal masking threshold in a codec.
According to another aspect of the approach, a system for determining a temporal masking threshold includes a modeler configured to provide a first-order masking model of a human auditory system, a processor configured to determine the temporal masking threshold by applying a correction factor to the first-order masking model, and a temporal mask configured to provide the temporal masking threshold in a codec.
According to another aspect of the approach, a system for determining a temporal masking threshold includes means for providing a first-order masking model of a human auditory system, means for determining the temporal masking threshold by applying a correction factor to the first-order masking model, and means for providing the temporal masking threshold in a codec.
According to another aspect of the approach, a computer-readable medium embodying a set of instructions executable by one or more processors includes code for providing a first-order masking model of a human auditory system, code for determining the temporal masking threshold by applying a correction factor to the first-order masking model, and code for providing the temporal masking threshold in a codec.
Other aspects, features, embodiments and advantages of the audio coding technique will be or will become apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional features, embodiments, processes and advantages be included within this description and be protected by the accompanying claims.
It is to be understood that the drawings are solely for purpose of illustration. Furthermore, the components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the disclosed audio coding technique. In the figures, like reference numerals designate corresponding parts throughout the different views.
The following detailed description, which refers to and incorporates the drawings, describes and illustrates one or more specific embodiments. These embodiments, offered not to limit but only to exemplify and teach, are shown and described in sufficient detail to enable those skilled in the art to practice what is claimed. Thus, for the sake of brevity, the description may omit certain information known to those of skill in the art.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment or variant described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or variants. All of the embodiments and variants described in this description are exemplary embodiments and variants provided to enable persons skilled in the art to make and use the invention, and not necessarily to limit the scope of legal protection afforded the appended claims.
In this specification and the appended claims, unless otherwise specified, the term “signal” is broadly construed. Thus the term signal includes continuous and discrete signals, as well as frequency-domain and time-domain signals. In addition, the terms “frequency transform” and “frequency-domain transform” are used interchangeably. Likewise, the terms “time transform” and “time-domain transform” are used interchangeably.
A novel and non-obvious audio coding technique based on modeling spectral dynamics is disclosed. Briefly, frequency decomposition of the input audio signal is employed to obtain multiple frequency sub-bands that closely follow a critical-band decomposition. Then, in each sub-band, a so-called analytic signal is computed, the squared magnitude of the analytic signal is transformed using a discrete Fourier transform (DFT), and linear prediction is applied, resulting in a Hilbert envelope and a Hilbert carrier for each of the sub-bands. Because linear prediction is applied to frequency components, the technique is called Frequency Domain Linear Prediction (FDLP). The Hilbert envelope and the Hilbert carrier are analogous to the spectral envelope and excitation signals in Time Domain Linear Prediction (TDLP) techniques. Disclosed in further detail below is a technique of temporal masking to improve the compression efficiency of FDLP codecs. Specifically, the concept of forward masking is applied to the encoding of sub-band Hilbert carrier signals. By doing this, the bit-rate of an FDLP codec may be substantially reduced without significantly degrading signal quality.
More specifically, the FDLP coding scheme is based on processing long (hundreds of ms) temporal segments. A full-band input signal is decomposed into sub-bands using QMF analysis. In each sub-band, FDLP is applied and line spectral frequencies (LSFs) representing the sub-band Hilbert envelopes are quantized. The residuals (sub-band carriers) are processed using DFT and the corresponding spectral parameters are quantized. In the decoder, spectral components of the sub-band carriers are reconstructed and transformed into the time domain using an inverse DFT. The reconstructed FDLP envelopes (from LSF parameters) are used to modulate the corresponding sub-band carriers. Finally, the inverse QMF block is applied to reconstruct the full-band signal from the frequency sub-bands.
Turning now to the drawings, and in particular to
In the encoding section 32, there is an encoder 38 connected to a data packetizer 40. The encoder 38 implements an FDLP technique for encoding input signals as described herein. The packetizer 40 formats and encapsulates an encoded input signal and other information for transport through the data handler 36. A time-varying input signal x(t), after being processed through the encoder 38 and the data packetizer 40 is directed to the data handler 36.
In a somewhat similar manner but in the reverse order, in the decoding section 34, there is a decoder 42 coupled to a data de-packetizer 44. Data from the data handler 36 are fed to the data de-packetizer 44 which in turn sends the de-packetized data to the decoder 42 for reconstruction of the original time-varying signal x(t). The reconstructed signal is represented by x′(t). The de-packetizer 44 extracts the encoded input signal and other information from incoming data packets. The decoder 42 implements an FDLP technique for decoding the encoded input signal as described herein.
The QMF 302 performs a QMF analysis on the discrete input signal. Essentially, the QMF analysis decomposes the discrete input signal into thirty-two non-uniform, critically sampled sub-bands. For this purpose, the input audio signal is first decomposed into sixty-four uniform sub-bands using a uniform QMF decomposition. The sixty-four uniform QMF sub-bands are then merged to obtain the thirty-two non-uniform sub-bands. An FDLP codec based on uniform QMF decomposition producing the sixty-four sub-bands may operate at about 130 kbps. The QMF filter bank can be implemented in a tree-like structure, e.g., a six stage binary tree. The merging is equivalent to tying some branches in the binary tree at particular stages to form the non-uniform bands. This tying may follow the human auditory system, i.e., more bands at higher frequencies are merged together than at the lower frequencies since the human ear is generally more sensitive to lower frequencies. Specifically, the sub-bands are narrower at the low-frequency end than at the high-frequency end. Such an arrangement is based on the finding that the sensory physiology of the mammalian auditory system is more attuned to the narrower frequency ranges at the low end than the wider frequency ranges at the high end of the audio frequency spectrum. A graphical schematic of perfect reconstruction non-uniform QMF decomposition resulting from an exemplary merging of the sixty-four sub-bands into thirty-two sub-bands is shown in
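The merge from sixty-four uniform bands to thirty-two non-uniform bands can be illustrated with a hypothetical grouping pattern; the actual tying of QMF tree branches and the total bandwidth (24 kHz here, assuming 48 kHz sampling) are illustrative assumptions, not values taken from this disclosure:

```python
# Hypothetical merge pattern: 64 uniform bands -> 32 non-uniform bands.
# Single bands remain at low frequencies; progressively wider groups are
# merged toward high frequencies, mimicking the auditory system.
GROUP_SIZES = [1] * 20 + [2] * 6 + [4] * 4 + [8] * 2  # 64 bands, 32 groups

def band_edges(group_sizes, total_bandwidth_hz=24000.0):
    """Return (low, high) frequency edges of each merged band, assuming the
    uniform bands evenly split the full bandwidth."""
    uniform_bw = total_bandwidth_hz / sum(group_sizes)
    edges, lo = [], 0.0
    for g in group_sizes:
        hi = lo + g * uniform_bw
        edges.append((lo, hi))
        lo = hi
    return edges
```

With this pattern the lowest merged bands are 375 Hz wide while the highest are 3 kHz wide, reflecting the narrower low-frequency resolution described above.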
Each of the thirty-two sub-bands output from the QMF 302 is provided to the tonality detector 304. The tonality detector applies a technique of spectral noise shaping (SNS) to overcome spectral pre-echo. Spectral pre-echo is a type of undesirable audio artifact that occurs when tonal signals are encoded using an FDLP codec. As is understood by those of ordinary skill in the art, a tonal signal is one that has strong impulses in the frequency domain. In an FDLP codec, tonal sub-band signals can cause errors in the quantization of an FDLP carrier that spread across the frequencies around the tone. In the reconstructed audio signal output by an FDLP decoder, this appears as audio framing artifacts occurring with the period of a frame duration. This problem is referred to as spectral pre-echo.
To reduce or eliminate the problem of spectral pre-echo, the tonality detector 304 checks each sub-band signal before it is processed by the FDLP component 308. If a sub-band signal is identified as tonal, it is passed through the TDLP filter 306. If not, the non-tonal sub-band signal is passed to the FDLP component 308 without TDLP filtering.
Since tonal signals are highly predictable in the time domain, the residual of the time-domain linear prediction (the TDLP filter output) of a tonal sub-band signal has frequency characteristics that can be efficiently modeled by the FDLP component 308. Thus, for a tonal sub-band signal, the FDLP encoded sub-band signal is output from the encoder 38 along with TDLP filter parameters (LPC coefficients) for the sub-band. At the receiver, inverse-TDLP filtering is applied on the FDLP-decoded sub-band signal, using the transported LPC coefficients, to reconstruct the sub-band signal. Further details of the decoding process are described below in connection with
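The TDLP filtering and its inverse described above can be sketched as a generic LPC analysis/synthesis pair (a standard textbook formulation with illustrative names, not the codec's actual implementation):

```python
def tdlp_residual(x, a):
    """Forward TDLP (LPC analysis): residual e[n] = x[n] + sum_k a[k]*x[n-1-k]."""
    p = len(a)
    return [x[n] + sum(a[k] * x[n - 1 - k] for k in range(min(p, n)))
            for n in range(len(x))]

def inverse_tdlp(e, a):
    """Inverse TDLP (all-pole synthesis): reconstruct x from the residual
    and the transported LPC coefficients a."""
    p = len(a)
    x = []
    for n in range(len(e)):
        x.append(e[n] - sum(a[k] * x[n - 1 - k] for k in range(min(p, n))))
    return x
```

The round trip is lossless when the same coefficients are used on both sides, which is why only the residual and the LPC coefficients need to be transported for tonal sub-bands.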
The FDLP component 308 processes each sub-band in turn. Specifically, the sub-band signal is predicted in the frequency domain and the prediction coefficients form the Hilbert envelope. The residual of the prediction forms the Hilbert carrier signal. The FDLP component 308 splits an incoming sub-band signal into two parts: an approximation part represented by the Hilbert envelope coefficients and an error in approximation represented by the Hilbert carrier. The Hilbert envelope is quantized in the line spectral frequency (LSF) domain by the FDLP component 308. The Hilbert carrier is passed to the DFT component 310, where it is encoded into the DFT domain.
The line spectral frequencies (LSFs) correspond to an auto-regressive (AR) model of the Hilbert envelope and are computed from the FDLP coefficients. The LSFs are vector quantized by the first split VQ 312. A 40th-order all-pole model may be used by the first split VQ 312 to perform the split quantization.
The DFT component 310 receives the Hilbert carrier from the FDLP component 308 and outputs a DFT magnitude signal and DFT phase signal for each sub-band Hilbert carrier. The DFT magnitude and phase signals represent the spectral components of the Hilbert carrier. The DFT magnitude signal is provided to the second split VQ 316, which performs a vector quantization of the magnitude spectral components. Since a full-search VQ would likely be computationally infeasible, a split VQ approach is employed to quantize the magnitude spectral components. The split VQ approach reduces computational complexity and memory requirements to manageable limits without severely affecting the VQ performance. To perform split VQ, the vector space of spectral magnitudes is divided into separate partitions of lower dimension. The VQ codebooks are trained (on a large audio database) for each partition, across all the frequency sub-bands, using the Linde-Buzo-Gray (LBG) algorithm. The bands below 4 kHz have a higher resolution VQ codebook, i.e., more bits are allocated to the lower sub-bands, than the higher frequency sub-bands.
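A minimal sketch of the split VQ idea, assuming the per-partition codebooks have already been trained (e.g., by the LBG algorithm); each partition is quantized by nearest-neighbor search in its own codebook:

```python
def split_vq_encode(vector, codebooks):
    """Quantize a long vector by splitting it into low-dimensional partitions,
    each matched against its own codebook (nearest neighbor in squared error)."""
    indices, pos = [], 0
    for cb in codebooks:
        dim = len(cb[0])
        part = vector[pos:pos + dim]
        best = min(range(len(cb)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(part, cb[i])))
        indices.append(best)
        pos += dim
    return indices

def split_vq_decode(indices, codebooks):
    """Reassemble the quantized vector from the per-partition codebook entries."""
    out = []
    for idx, cb in zip(indices, codebooks):
        out.extend(cb[idx])
    return out
```

Searching several small codebooks instead of one huge one is what keeps the computational and memory cost manageable, at a modest cost in quantization performance.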
The scalar quantizer 318 performs a non-uniform scalar quantization (SQ) of DFT phase signals corresponding to the Hilbert carriers of the sub-bands. Generally, the DFT phase components are uncorrelated across time. The DFT phase components have a distribution close to uniform, and therefore, have high entropy. To prevent excessive consumption of bits required to represent DFT phase coefficients, those corresponding to relatively low DFT magnitude spectral components are transmitted using lower resolution SQ, i.e., the codebook vector selected from the DFT magnitude codebook is processed by adaptive thresholding in the scalar quantizer 318. The threshold comparison is performed by the phase bit-allocator 320. Only the DFT spectral phase components whose corresponding DFT magnitudes are above a predefined threshold are transmitted using high resolution SQ. The threshold is adapted dynamically to meet a specified bit-rate of the encoder 38.
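The magnitude-driven bit allocation and the adaptive threshold search might be sketched as follows; the bit resolutions (`hi_bits`, `lo_bits`) and the simple search over candidate thresholds are illustrative assumptions, not parameters given by this disclosure:

```python
def allocate_phase_bits(dft_magnitudes, threshold, hi_bits=5, lo_bits=2):
    """Give each DFT phase component high-resolution SQ only if its
    corresponding DFT magnitude is at or above the threshold."""
    return [hi_bits if m >= threshold else lo_bits for m in dft_magnitudes]

def adapt_threshold(dft_magnitudes, bit_budget, hi_bits=5, lo_bits=2):
    """Raise the threshold until the total phase bits fit the budget."""
    for threshold in sorted(dft_magnitudes) + [float('inf')]:
        bits = allocate_phase_bits(dft_magnitudes, threshold, hi_bits, lo_bits)
        if sum(bits) <= bit_budget:
            return threshold
    return float('inf')
```

Raising the threshold demotes more phase components to low-resolution quantization, so the search converges on the lowest threshold that still meets the specified bit-rate.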
The temporal mask 314 is applied to the DFT phase and magnitude signals to adaptively quantize these signals. The temporal mask 314 allows the audio signal to be further compressed by reducing, in certain circumstances, the number of bits required to represent the DFT phase and magnitude signals. The temporal mask 314 includes one or more threshold values that generally define the maximum level of noise allowed in the encoding process so that the audio remains perceptually acceptable to users. For each sub-band frame processed by the encoder 38, the quantization noise introduced into the audio by the encoder 38 is determined and compared to a temporal masking threshold. If the quantization noise is less than the temporal masking threshold, the number of quantization levels of the DFT phase and magnitude signals (i.e., number of bits used to represent the signals) is reduced, thereby increasing the quantization noise level of the encoder 38 to approach or equal the noise level indicated by the temporal mask 314. In the exemplary encoder 38, the temporal mask 314 is specifically used to control the bit-allocation for the DFT magnitude and phase signals corresponding to each of the sub-band Hilbert carriers.
The application of the temporal mask 314 may be done in the following specific manner. An estimation of the mean quantization noise present in the baseline codec (the version of the codec without temporal masking) is performed for each sub-band sub-frame. The quantization noise of the baseline codec may be introduced by quantizing the DFT signal components, i.e., the DFT magnitude and phase signals output from the DFT component 310, and is preferably measured from these signals. The sub-band sub-frames may be 200 milliseconds in duration. If the mean of the quantization noise in a given sub-band sub-frame is above the temporal masking threshold (e.g., the mean value of the temporal mask), no bit-rate reduction is applied to the DFT magnitude and phase signals for that sub-band frame. If the mean value of the temporal mask is above the quantization noise mean, the number of bits used to encode the DFT magnitude and phase signals for that sub-band frame (i.e., the split VQ bits for DFT magnitude and SQ bits for DFT phase) is reduced by an amount such that the quantization noise level approaches or equals the maximum permissible threshold given by the temporal mask 314.
The amount of bit-rate reduction is determined based on the difference in dB sound pressure level (SPL) between the baseline codec quantization noise and the temporal masking threshold. If the difference is large, the bit-rate reduction is great. If the difference is small, the bit-rate reduction is small.
The temporal mask 314 configures the second split VQ 316 and SQ 318 to adaptively effect the mask-based quantizations of the DFT phase and magnitude parameters. If the mean value of the temporal mask is above the noise mean for a given sub-band sub-frame, the number of bits used to encode the sub-band sub-frame (split VQ bits for the DFT magnitude parameters and scalar quantization bits for the DFT phase parameters) is reduced in such a way that the noise level in a given sub-frame (e.g., 200 milliseconds) may become equal (on average) to the permissible threshold (e.g., mean, median, rms) given by the temporal mask. In the exemplary encoder 38 disclosed herein, eight different quantizations are available, so that the bit-rate reduction is at eight different levels (in which one level corresponds to no bit-rate reduction).
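One plausible mapping from the dB SPL headroom between the quantization noise and the masking threshold onto the eight reduction levels is sketched below; the 3 dB step size is an assumed parameter, not a value specified by this disclosure:

```python
def masking_bit_reduction_level(noise_db, mask_db, num_levels=8, step_db=3.0):
    """Map the SPL headroom (mask minus noise) onto one of num_levels
    bit-rate reduction levels; level 0 means no reduction."""
    headroom = mask_db - noise_db
    if headroom <= 0:
        return 0  # noise already at or above the mask: keep full bit-rate
    # Larger headroom -> more aggressive reduction, capped at the top level.
    return min(num_levels - 1, int(headroom // step_db) + 1)
```

The chosen level is exactly the side information that must accompany each sub-band sub-frame so the decoder can apply the matching inverse quantization.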
Information regarding the temporal masking quantization of the DFT magnitude and phase signals is transported to the decoding section 34 so that it may be used in the decoding process to reconstruct the audio signal. The level of bit-rate reduction for each sub-band sub-frame is transported as side information along with the encoded audio to the decoding section 34.
The components of the decoder 42 essentially perform the inverse operation of those included in the encoder 38. The decoder 42 includes a first inverse vector quantizer (VQ) 504, a second inverse VQ 506, and an inverse scalar quantizer (SQ) 508. The first inverse split VQ 504 receives encoded data representing the Hilbert envelope, and the second inverse split VQ 506 and inverse SQ 508 receive encoded data representing the Hilbert carrier. The decoder 42 also includes an inverse DFT component 510, an inverse FDLP component 512, a tonality selector 514, an inverse TDLP filter 516, and a synthesis QMF 518.
For each sub-band, received vector quantization indices for the LSFs corresponding to the Hilbert envelope are inverse quantized by the first inverse split VQ 504. The DFT magnitude parameters are reconstructed from the vector quantization indices that are inverse quantized by the second inverse split VQ 506. DFT phase parameters are reconstructed from scalar values that are inverse quantized by the inverse SQ 508. The temporal masking quantization value(s) are applied by the second inverse split VQ 506 and inverse SQ 508. The inverse DFT component 510 produces the sub-band Hilbert carrier in response to the outputs of the second inverse split VQ 506 and inverse SQ 508. The inverse FDLP component 512 modulates the sub-band Hilbert carrier using the reconstructed Hilbert envelope.
The tonality flag is provided to tonality selector 514 in order to allow the selector 514 to determine whether inverse TDLP filtering should be applied. If the sub-band signal is tonal, as indicated by the flag transmitted from the encoder 38, the sub-band signal is sent to the inverse TDLP filter 516 for inverse TDLP filtering prior to QMF synthesis. If not, the sub-band signal bypasses the inverse TDLP filter 516 to the synthesis QMF 518.
The synthesis QMF 518 performs the inverse operation of the QMF 302 of the encoder 38. All sub-bands are merged to obtain the full-band signal using QMF synthesis. The discrete full-band signal is converted to a continuous signal using appropriate D/A conversion techniques to obtain the time-varying reconstructed continuous signal x′(t).
A non-tonal sub-band signal is provided directly to the FDLP codec 602, bypassing the TDLP filter 306; and the output of the FDLP codec 602 represents the reconstructed sub-band signal, without any further filtering by the inverse TDLP filter 516.
Next, in step 704, the discrete input signal x(n) is partitioned into frames. One such frame of the time-varying signal x(t) is signified by the reference numeral 460 as shown in
The discrete version of the signal s(t) is represented by s(n), where n is an integer indexing the sample number. The time-continuous signal s(t) is related to the discrete signal s(n) by the following algebraic expression:
s(n)=s(nτ)   (1)
where τ is the sampling period as shown in
In step 706, each frame is decomposed into a plurality of frequency sub-bands. QMF analysis may be applied to each frame to produce the sub-band frames. Each sub-band frame represents a predetermined bandwidth slice of the input signal over the duration of a frame.
In step 708, a determination is made for each sub-band frame whether it is tonal. This can be performed by a tonality detector, such as the tonality detector 304 described above in connection with
In step 712, the sampled signal, or the TDLP residual if the signal is tonal, within each sub-band frame undergoes a frequency transform to obtain a frequency-domain signal for the sub-band frame. The sub-band sampled signal is denoted as sk(n) for the kth sub-band. In the exemplary encoder 38 disclosed herein, k is an integer value between 1 and 32, and the method of discrete Fourier transform (DFT) is preferably employed for the frequency transformation. A DFT of sk(n) can be expressed as:
Tk(f)=F{sk(n)}   (2)

where sk(n) is as defined above, F{·} denotes the DFT operation, f is a discrete frequency within the sub-band in which 0≦f≦N−1, Tk is the linear array of the N transformed values of the N pulses of sk(n), and N is an integer.
At this juncture, it helps to make a digression to define and distinguish the various frequency-domain and time-domain terms. The discrete time-domain signal in the kth sub-band sk(n) can be obtained by an inverse discrete Fourier transform (IDFT) of its corresponding frequency counterpart Tk(f). The time-domain signal in the kth sub-band sk(n) essentially comprises two parts, namely, the time-domain Hilbert envelope hk(n) and the Hilbert carrier ck(n). Stated in another way, modulating the Hilbert carrier ck(n) with the Hilbert envelope hk(n) will result in the time-domain signal in the kth sub-band sk(n). Algebraically, it can be expressed as follows:
sk(n)=hk(n)·ck(n)   (3)
Thus, from equation (3), if the time-domain Hilbert envelope hk(n) and the Hilbert carrier ck(n) are known, the time-domain signal in the kth sub-band sk(n) can be reconstructed. The reconstructed signal approximates that of a lossless reconstruction.
FDLP is applied to each sub-band frequency-domain signal to obtain a Hilbert envelope and Hilbert carrier corresponding to the respective sub-band frame (step 714). The Hilbert envelope portion is approximated by the FDLP scheme as an all-pole model. The Hilbert carrier portion, which represents the residual of the all-pole model, is then estimated.
As mentioned earlier, the time-domain term Hilbert envelope hk(n) in the kth sub-band can be derived from the corresponding frequency-domain parameter Tk(f). In step 714, the process of frequency-domain linear prediction (FDLP) of the parameter Tk(f) is employed to accomplish this. Data resulting from the FDLP process can be more streamlined, and consequently more suitable for transmission or storage.
In the following paragraphs, the FDLP process is briefly described followed with a more detailed explanation.
Briefly stated, in the FDLP process, the frequency-domain counterpart of the Hilbert envelope hk(n) is estimated, which counterpart is algebraically expressed as {tilde over (T)}k(f). However, the signal intended to be encoded is sk(n). The frequency-domain counterpart of the parameter sk(n) is Tk(f). To obtain Tk(f) from sk(n), an excitation signal, such as white noise, is used. As will be described below, since the parameter {tilde over (T)}k(f) is an approximation, the difference between the approximated value {tilde over (T)}k(f) and the actual value Tk(f) can also be estimated, which difference is expressed as Ck(f). The parameter Ck(f) is called the frequency-domain Hilbert carrier, and is also sometimes called the residual value. After performing an inverse FDLP process, the signal sk(n) is directly obtained.
Hereinbelow, further details of the FDLP process for estimating the Hilbert envelope and the Hilbert carrier parameter Ck(f) are described.
An auto-regressive (AR) model of the Hilbert envelope for each sub-band may be derived using the method shown by flowchart 500 of
In this method, an analytic-signal spectrum Xk(f) is first constructed from the frequency-domain transform Tk(f) as:
Xk(f)=Tk(0), for f=0,
=2Tk(f), for 1≦f≦N/2−1,
=Tk(N/2), for f=N/2,
=0, for N/2+1≦f≦N  (4)
The N-point inverse DFT of Xk(f) is then computed to obtain the analytic signal vk(n).
Next, in step 505, the Hilbert envelope is estimated from the analytic signal vk(n). The Hilbert envelope is essentially the squared magnitude of the analytic signal, i.e.,
hk(n)=|vk(n)|²=vk(n)vk*(n),  (5)
where vk*(n) denotes the complex conjugate of vk(n).
In step 507, the spectral auto-correlation function of the Hilbert envelope is obtained as a discrete Fourier transform (DFT) of the Hilbert envelope of the discrete signal. The DFT of the Hilbert envelope can be written as:
r(f)=F{hk(n)}  (6)
where Xk(f) denotes the DFT of the analytic signal and r(f) denotes the spectral auto-correlation function. The Hilbert envelope of the discrete signal sk(n) and the auto-correlation in the spectral domain form a Fourier transform pair. In a manner similar to the computation of the auto-correlation of a signal as the inverse Fourier transform of its power spectrum, the spectral auto-correlation function can thus be obtained as the Fourier transform of the Hilbert envelope. In step 509, these spectral auto-correlations are used by a selected linear prediction technique to perform AR modeling of the Hilbert envelope by solving, for example, a linear system of equations. As discussed in further detail below, the Levinson-Durbin algorithm can be employed for the linear prediction. Once the AR modeling is performed, the resulting estimated FDLP Hilbert envelope is made causal to correspond to the original causal sequence sk(n). In step 511, the Hilbert carrier is computed from the model of the Hilbert envelope. Some of the techniques described hereinbelow may be used to derive the Hilbert carrier from the Hilbert envelope model.
In general, the spectral auto-correlation function produced by the method of flowchart 500 can be complex valued. A real valued spectral auto-correlation can be obtained by operating on the even-symmetric part of the signal:
se(n)=(s(n)+s(−n))/2,  (7)
where se(n) denotes the even-symmetric part of s(n). The Hilbert envelope of se(n) will also be even-symmetric and hence will result in a real valued auto-correlation function in the spectral domain. This step of generating a real valued spectral auto-correlation is done for computational simplicity, although the linear prediction can be done equally well for complex valued signals.
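A sketch of the even-symmetrization in Equation (7). Treating the negative index circularly, i.e. s(−n) as s((N−n) mod N), is an assumption made here for a finite-length frame:

```python
def even_part(s):
    """se(n) = (s(n) + s(-n)) / 2 per Equation (7); s(-n) is taken
    circularly as s((N - n) mod N) in this finite-length sketch."""
    N = len(s)
    return [(s[n] + s[-n % N]) / 2 for n in range(N)]
```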
In an alternative configuration of the encoder 38, a different process, relying instead on a DCT, can be used to arrive at the estimated Hilbert envelope for each sub-band. In this configuration, the transform of the discrete signal sk(n) from the time domain into the frequency domain can be expressed mathematically as follows:
Tk(f)=c(f) Σn=0…N−1 sk(n) cos[π(2n+1)f/(2N)]  (8)
where sk(n) is as defined above, f is the discrete frequency within the sub-band in which 0≦f≦N, Tk is the linear array of the N transformed values of the N pulses of sk(n), and the coefficients c are given by c(0)=√(1/N) and c(f)=√(2/N) for 1≦f≦N−1, where N is an integer.
The N pulsed samples of the frequency-domain transform Tk(f) are called DCT coefficients.
The discrete time-domain signal in the kth sub-band sk(n) can be obtained by an inverse discrete cosine transform (IDCT) of its corresponding frequency counterpart Tk(f). Mathematically, it is expressed as follows:
sk(n)=Σf=0…N−1 c(f) Tk(f) cos[π(2n+1)f/(2N)]  (9)
where sk(n) and Tk(f) are as defined above. Again, f is the discrete frequency in which 0≦f≦N, and the coefficients c are given by c(0)=√(1/N) and c(f)=√(2/N) for 1≦f≦N−1.
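An orthonormal DCT/IDCT pair matching the c(f) scaling given above can be sketched as follows; this is an illustration (a real codec would use a fast transform), and the round trip recovers the input:

```python
import math

def dct(s):
    """Orthonormal DCT-II of a frame, with c(0)=sqrt(1/N), c(f)=sqrt(2/N)."""
    N = len(s)
    c = [math.sqrt(1.0 / N)] + [math.sqrt(2.0 / N)] * (N - 1)
    return [c[f] * sum(s[n] * math.cos(math.pi * (2 * n + 1) * f / (2 * N))
                       for n in range(N))
            for f in range(N)]

def idct(T):
    """Inverse transform (DCT-III with the same scaling); round-trips dct."""
    N = len(T)
    c = [math.sqrt(1.0 / N)] + [math.sqrt(2.0 / N)] * (N - 1)
    return [sum(c[f] * T[f] * math.cos(math.pi * (2 * n + 1) * f / (2 * N))
                for f in range(N))
            for n in range(N)]
```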
Using either of the DFT or DCT approaches discussed above, the Hilbert envelope may be modeled using the Levinson-Durbin algorithm. Mathematically, the parameters to be estimated by the Levinson-Durbin algorithm can be expressed as follows:
H(z)=1/(Σi=0…K−1 a(i) z^−i)  (10)
in which H(z) is a transfer function in the z-domain, approximating the time-domain Hilbert envelope hk(n); z is a complex variable in the z-domain; a(i) is the ith coefficient of the all-pole model which approximates the frequency-domain counterpart {tilde over (T)}k(f) of the Hilbert envelope hk(n); and i=0, . . . , K−1. The time-domain Hilbert envelope hk(n) has been described above.
Fundamentals of the Z-transform in the z-domain can be found in a publication entitled “Discrete-Time Signal Processing,” 2nd Edition, by Alan V. Oppenheim, Ronald W. Schafer, John R. Buck, Prentice Hall, ISBN: 0137549202, and are not further elaborated here.
In Equation (10), the value of K can be selected based on the length of the frame 460 (
In essence, in the FDLP process as exemplified by Equation (10), the DCT coefficients of the frequency-domain transform in the kth sub-band Tk(f) are processed via the Levinson-Durbin algorithm, resulting in a set of coefficients a(i), where 0≦i≦K−1, of the frequency counterpart {tilde over (T)}k(f) of the time-domain Hilbert envelope hk(n).
The Levinson-Durbin algorithm is well known in the art and is not repeated here. The fundamentals of the algorithm can be found in a publication entitled “Digital Processing of Speech Signals,” by Rabiner and Schafer, Prentice Hall, ISBN: 0132136031, September 1978.
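For reference, a textbook sketch of the Levinson-Durbin recursion is given below. The predictor convention A(z)=1+Σ a(i)z^−i is assumed here, and the autocorrelation values in the usage example are hypothetical:

```python
def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for all-pole coefficients from
    autocorrelation values r[0..order]. A textbook sketch, not the codec's
    exact implementation. Returns (coefficients, prediction error)."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                    # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

For an AR(1)-like autocorrelation r = [1, 0.5, 0.25], the recursion recovers a single nonzero coefficient of −0.5.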
Returning now to the method of
As mentioned above, since the parameter {tilde over (T)}k(f) is a lossy approximation of the original parameter Tk(f), the difference of the two parameters is called the residual value, which is algebraically expressed as Ck(f). Differently put, in the fitting process via the Levinson-Durbin algorithm to arrive at the all-pole model, some information about the original signal cannot be captured. If signal encoding of high quality is intended, that is, if a lossless encoding is desired, the residual value Ck(f) needs to be estimated. The residual value Ck(f) basically comprises the frequency components of the Hilbert carrier ck(n) of the signal sk(n).
There are several approaches in estimating the Hilbert carrier ck(n).
In the time domain, the Hilbert carrier can be estimated as a residual value ck(n) derived simply from a scalar division of the original time-domain sub-band signal sk(n) by its Hilbert envelope hk(n). Mathematically, it is expressed as follows:
ck(n)=sk(n)/hk(n)  (11)
where all the parameters are as defined above.
It should be noted that Equation (11) shows a straightforward way of estimating the residual value. Other approaches can also be used for estimation. For instance, the frequency-domain residual value Ck(f) can be generated from the difference between the parameters Tk(f) and {tilde over (T)}k(f). Thereafter, the time-domain residual value ck(n) can be obtained by an inverse transform of the value Ck(f) into the time domain.
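A minimal sketch of the division in Equation (11). The small epsilon guarding against a vanishing envelope is an implementation detail assumed here, and the sample values are hypothetical:

```python
# Sample-wise division of the sub-band signal by its Hilbert envelope,
# per Equation (11); EPS avoids division by a near-zero envelope value.
EPS = 1e-12
s_k = [1.0, -0.8, 0.6, -0.4]    # hypothetical sub-band signal sk(n)
h_k = [1.0, 0.8, 0.6, 0.4]      # hypothetical Hilbert envelope hk(n)
c_k = [sn / (hn + EPS) for sn, hn in zip(s_k, h_k)]
```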
Another straightforward approach is to assume the Hilbert carrier ck(n) is mostly composed of white noise. One way to obtain the white noise information is to band-pass filter the original signal x(t) (
If the original signal x(t) (
As another alternative in estimating the residual signal, each sub-band k can be assigned, a priori, a fundamental frequency component. By analyzing the spectral components of the Hilbert carrier ck(n), the fundamental frequency component or components of each sub-band can be estimated and used along with their multiple harmonics.
For a more faithful signal reconstruction irrespective of whether the original signal source is voiced or unvoiced, a combination of the above mentioned methods can be used. For instance, via simple thresholding on the Hilbert carrier in the frequency domain Ck(f), it can be determined whether the original signal segment s(t) is voiced or unvoiced. Thus, if the signal segment s(t) is determined to be voiced, the “peak picking” spectral estimation method can be adopted. On the other hand, if the signal segment s(t) is determined to be unvoiced, the white noise reconstruction method as aforementioned can be adopted.
There is yet another approach that can be used in the estimation of the Hilbert carrier ck(n). This approach involves the scalar quantization of the spectral components of the Hilbert carrier in the frequency domain Ck(f). Here, after quantization, the magnitude and phase of the Hilbert carrier are represented by a lossy approximation such that the distortion introduced is minimized.
The estimated time-domain Hilbert carrier output from the FDLP for each sub-band frame is broken down into sub-frames. Each sub-frame represents a 200 millisecond portion of a frame, so there are five sub-frames per frame. Slightly longer, overlapping 210 ms sub-frames (5 sub-frames created from 1000 ms frames) may be used in order to diminish transition effects or noise at frame boundaries. On the decoder side, a window which averages the overlapping areas to recover the 1000 ms long Hilbert carrier may be applied.
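The sub-framing step can be sketched as follows, assuming one sample per millisecond for illustration. The exact boundary handling and extension amounts are assumptions, chosen so that interior sub-frames come out to roughly 210 ms:

```python
def split_subframes(frame, n_sub=5, overlap=0.05):
    """Split a frame into n_sub overlapping sub-frames, extending each
    side by (overlap/2) of the nominal sub-frame length so 200 ms pieces
    become roughly 210 ms. Boundary handling here is an assumption."""
    N = len(frame)
    step = N // n_sub                 # nominal sub-frame length
    ext = int(step * overlap / 2)     # extension on each side
    subs = []
    for i in range(n_sub):
        lo = max(0, i * step - ext)
        hi = min(N, (i + 1) * step + ext)
        subs.append(frame[lo:hi])
    return subs
```

With a 1000-sample frame, the interior sub-frames are 210 samples long and the edge sub-frames 205, since they cannot extend past the frame boundary.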
The time-domain Hilbert carrier for each sub-band sub-frame is frequency transformed using DFT (step 720).
In step 722, a temporal mask is applied to determine the bit-allocations for quantization of the DFT phase and magnitude parameters. For each sub-band sub-frame, a comparison is made between a temporal mask value and the quantization noise determined for the baseline encoding process. The quantization of the DFT parameters may be adjusted as a result of this comparison, as discussed above in connection with
In step 728, the encoded data and side information for each sub-band frame are concatenated and packetized in a format suitable for transmission or storage. As needed, various algorithms well known in the art, including data compression and encryption, can be implemented in the packetization process. Thereafter, the packetized data can be sent to the data handler 36, and then a recipient for subsequent decoding, as shown in step 730.
In step 806, the DFT magnitude parameters representing the Hilbert carrier for each sub-band sub-frame are reconstructed from the VQ indices received by the decoder 42. The DFT magnitude parameters are inverse quantized using inverse split VQ, and the DFT phase parameters are inverse quantized using inverse scalar quantization. The inverse quantizations of the DFT phase and magnitude parameters are performed using the bit-allocations assigned to each by the temporal masking that occurred in the encoding process.
In step 808, an inverse DFT is applied to each sub-band sub-frame to recover the time domain Hilbert carrier for the sub-band sub-frame. The sub-frames are then reassembled to form the Hilbert carriers for each sub-band frame.
In step 810, the received VQ indices for LSFs corresponding to Hilbert envelope for each sub-band frame are inverse quantized.
In step 812, each sub-band Hilbert carrier is modulated using the corresponding reconstructed Hilbert envelope. This may be performed by inverse FDLP component 512. The Hilbert envelope may be reconstructed by performing the steps of
In decision step 814, a check is made for each sub-band frame to determine whether it is tonal. This may be done by checking to determine whether a tonal flag sent from the encoder 38 is set. If the sub-band signal is tonal, inverse TDLP filtering is applied to the sub-band signal to recover the sub-band frame. If the sub-band signal is not tonal, the TDLP filtering is bypassed for the sub-band frame.
In step 818, all of the sub-bands are merged to obtain the full-band signal using QMF synthesis. This is performed for each frame.
In step 820, the recovered frames are combined to yield a reconstructed discrete input signal x′(n). Using suitable digital-to-analog conversion processes, the reconstructed discrete input signal x′(n) may be converted to a time-varying reconstructed input signal x′(t).
In step 902, a first-order temporal masking model of the human ear provides the starting point for determining exact threshold values. The temporal masking of the human ear can be explained as a change in the time course of recovery from masking or as a change in the growth of masking at each signal delay. The amount of forward masking is determined by the interaction of a number of factors, including the masker level, the temporal separation of the masker and the signal, the frequency of the masker and the signal, and the duration of the masker and the signal. A simple first-order mathematical model, which provides a sufficient approximation of the amount of temporal masking, is given in Equation (12).
M[n]=a(b−log10 Δt)(s[n]−c) (12)
where M is the temporal mask in dB Sound Pressure Level (SPL), s is the dB SPL level of a sample indicated by integer index n, Δt is the time delay in milliseconds, and a, b, and c are constants, with c representing an absolute threshold of hearing.
The optimal values of a and b are predefined and known to those of ordinary skill in the art. The parameter c is the Absolute Threshold of Hearing (ATH) given by the graph 950 shown in
The temporal mask is calculated using Equation (12) for every discrete sample in a sub-band sub-frame, resulting in a plurality of temporal mask values. For any given sample, multiple mask estimates corresponding to several previous samples are present. The maximum among these prior sample mask estimates is chosen as the temporal mask value, in units of dB SPL, for the current sample.
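The per-sample masking computation can be sketched as below. The constants a, b, c and the search window are placeholders (the text notes that the optimal a and b are predefined and that c is the ATH), and one sample per millisecond is assumed:

```python
import math

def temporal_mask(s_db, a=0.2, b=2.0, c=10.0, window=10):
    """Sketch of Equation (12) plus the max-over-prior-samples rule.
    For each sample n, evaluate M = a*(b - log10(dt))*(s[m] - c) for
    several previous samples m and keep the maximum, in dB SPL.
    Constants and the 1-sample-per-ms assumption are placeholders."""
    masks = []
    for n in range(len(s_db)):
        cands = []
        for m in range(max(0, n - window), n):
            dt = float(n - m)          # delay in milliseconds (assumed)
            cands.append(a * (b - math.log10(dt)) * (s_db[m] - c))
        masks.append(max(cands) if cands else 0.0)
    return masks
```

For a constant 60 dB SPL input, the maximizing delay is the most recent sample (dt = 1 ms), giving a mask of a·b·(60 − c) at every sample after the first.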
In step 904, a correction factor is applied to the first-order masking model (Eq. 12) to yield adjusted temporal masking thresholds. The correction factor can be any suitable adjustment to the first-order masking model, including but not limited to the exemplary set of Equations (13) shown hereinbelow.
One technique for correcting the first-order model is to determine the actual thresholds of imperceptible noise resulting from temporal masking. These thresholds may be determined by adding white noise with the power levels specified by the first-order mask model. The actual amount of white noise that can be added to an original input signal, so that audio included in the original input signal is perceptually transparent, may be determined using a set of informal listening tests with a variety of people. The amount of power (in dB SPL) to be reduced from the first-order temporal masking threshold is made dependent on the ATH in that frequency band. From informal listening tests with added white noise, it was empirically found that the maximum power of the white noise that can be added to the original input signal, so that the audio is still perceptually transparent, is given by the following exemplary set of equations:
where T[n] represents the adjusted temporal masking threshold at sample n, Lm is a maximum value of the first-order temporal masking model (Eq. 12) computed at a plurality of previous samples, c represents an absolute threshold of hearing in dB, and n is an integer index representing the sample. On average, the noise threshold is about 20 dB below the first-order temporal masking threshold estimated using Equation (12). As an example,
The set of Equations (13) is only one example of a correction factor that can be applied to the linear model (Eq. 12). Other forms and types of correction factors are contemplated by the coding scheme disclosed herein. For example, the threshold constants, i.e., 35, 25, 15, of Equations 13 can be other values, and/or the number of equations (partitions) in the set and their corresponding applicable ranges can vary from those shown in Equations 13.
The adjusted temporal masking thresholds also show the maximum permissible quantization noise in the time domain for a particular sub-band. The objective is to reduce the number of bits required to quantize the DFT parameters of the sub-band Hilbert carriers. Note that the sub-band signal is a product of its Hilbert envelope and its Hilbert carrier. As previously described, the Hilbert envelope is quantized using scalar quantization. In order to account for the envelope information while applying temporal masking, the logarithm of the inverse quantized Hilbert envelope of a given sub-band is calculated in the dB SPL scale. This value is then subtracted from the adjusted temporal masking thresholds obtained from Equations (13).
The various methods, systems, apparatuses, components, functions, state machines, devices and circuitry described herein may be implemented in hardware, software, firmware or any suitable combination of the foregoing. For example, the methods, systems, apparatuses, components, functions, state machines, devices and circuitry described herein may be implemented, at least in part, with one or more general purpose processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), intellectual property (IP) cores or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The functions, state machines, components and methods described herein, if implemented in software, may be stored or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer processor. Also, any transfer medium or connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.
The above description of the disclosed embodiments is provided to enable any person skilled in the art to make or use that which is defined by the appended claims. The following claims are not intended to be limited to the disclosed embodiments. Other embodiments and modifications will readily occur to those of ordinary skill in the art in view of these teachings. Therefore, the following claims are intended to cover all such embodiments and modifications when viewed in conjunction with the above specification and accompanying drawings.
The present application for patent claims priority to Provisional Application No. 60/957,977 entitled “Temporal Masking in Audio Coding Based on Spectral Dynamics in Sub-Bands” filed Aug. 24, 2007, and assigned to the assignee hereof and hereby expressly incorporated by reference herein. The present application relates to U.S. application Ser. No. 11/696,974, entitled “Processing of Excitation in Audio Coding and Decoding”, filed on Apr. 5, 2007, and assigned to the assignee hereof and expressly incorporated by reference herein; and relates to U.S. application Ser. No. 11/583,537, entitled “Signal Coding and Decoding Based on Spectral Dynamics”, filed Oct. 18, 2006, and assigned to the assignee hereof and expressly incorporated by reference herein; and relates to U.S. application Ser. No. ______, entitled “SPECTRAL NOISE SHAPING IN AUDIO CODING BASED ON SPECTRAL DYNAMICS IN FREQUENCY SUB-BANDS”, filed ______, 2008, with Docket No. 072260, and assigned to the assignee hereof and expressly incorporated by reference herein.