The present invention relates generally to audio signal coding or compression, and more particularly to frequency domain audio signal post-processing.
In modern audio/speech digital signal communication systems, a digital signal is compressed at an encoder and the compressed information is packetized and sent to a decoder through a communication channel, frame by frame in real time. A system made of an encoder and decoder together is called a CODEC.
In some applications, speech/audio compression is used to reduce the number of bits that represent the speech/audio signal, thereby reducing the bandwidth (bit rate) needed for transmission. However, speech/audio compression may result in degradation of the quality of the decompressed signal. In general, a higher bit rate results in higher sound quality, while a lower bit rate results in lower sound quality. Modern speech/audio compression techniques, however, can produce a decompressed speech/audio signal of relatively high quality at relatively low bit rates by exploiting the perceptual masking effect of the human hearing system.
In general, modern coding/compression techniques attempt to represent the perceptually significant features of the speech/audio signal without preserving the actual speech/audio waveform. Numerous algorithms have been developed for speech/audio CODECs that reduce the number of bits required to digitally encode the original signal while attempting to maintain the high quality of the reconstructed signal.
Perceptual weighting filtering is a technology that exploits the masking effect of the human ear through time domain filtering to improve the perceptual quality of signal coding or speech coding. This technology has been widely used in many standards in recent decades. One typical application of perceptual weighting is shown in
With a limited bit rate, perceptually significant areas such as spectral peaks should not be overly compromised in order to improve perceptually less significant areas such as spectral valleys. Therefore, another method, called post-processing, is used to improve the perceptual quality at the decoder side.
The psychoacoustic principle, or perceptual masking effect, is used in some audio compression algorithms for audio/speech equipment. Traditional audio equipment attempts to reproduce signals with fidelity to the original sample or recording. Perceptual coders, on the other hand, reproduce signals to achieve good fidelity as perceived by the human ear. Although one main goal of digital audio perceptual coders is data reduction, perceptual coding can also be used to improve the representation of digital audio through advanced bit allocation. One example of a perceptual coder is a multiband system that divides the audio spectrum in a fashion that mimics the critical bands of psychoacoustics. By modeling human perception, perceptual coders process signals much the way humans do, and take advantage of phenomena such as masking. Such systems, however, rely on accurate algorithms. Because it is difficult to construct a perceptual model that accurately covers common human hearing behavior, the accuracy of any mathematical perceptual model is limited. Even with limited accuracy, however, the perceptual coding concept has been implemented in some audio CODECs; hence, numerous MPEG audio coding schemes have benefitted from exploiting the perceptual masking effect. Several ITU standard CODECs also use the perceptual concept; for example, ITU G.729.1 performs so-called dynamic bit allocation based on the perceptual masking concept.
Even though perceptual masking concepts have been applied to CODECs, sound quality still has room for improvement due to various reasons and limitations. For example, decoder side post-processing (see
The ITU-T G.729.1 standard defines a frequency domain post-processing module for the high band from 4000 Hz to 8000 Hz. This post-processing technology is described in U.S. Pat. No. 7,590,523, entitled “Speech Post-processing Using MDCT Coefficients,” which is incorporated herein by reference in its entirety.
Because the proposed frequency domain post-processing is improved by exploiting the perceptual masking principle, it is helpful to briefly describe the perceptual masking principle itself.
Auditory perception is based on critical band analysis in the inner ear, where a frequency-to-place transformation occurs along the basilar membrane. In response to sinusoidal pressure, the basilar membrane vibrates, producing the phenomenon of traveling waves. The basilar membrane is internally formed by thin elastic fibers tensed across the cochlear duct. As shown in
Simultaneous masking is a frequency domain phenomenon in which a low level signal, e.g., a small band of noise (the maskee), can be made inaudible by a simultaneously occurring stronger signal (the masker), e.g., a pure tone, if the masker and maskee are close enough to each other in frequency. A masking threshold can be measured, below which any signal will not be audible. As an example shown in
FIGS. 6a and 6b illustrate the asymmetric nature of simultaneous masking.
The masking effect can be summarized in a few points:
G.722 is an ITU standard CODEC that provides 7 kHz wideband audio at data rates of 48, 56 and 64 kbit/s. This is useful, for example, in fixed network voice over IP applications, where the required bandwidth is typically not prohibitive, and it offers an improvement in speech quality over older narrowband CODECs such as G.711 without an excessive increase in implementation complexity. The coding system uses sub-band adaptive differential pulse code modulation (SB-ADPCM) with a bit rate of 64 kbit/s. In the SB-ADPCM technique used, the frequency band is split into two sub-bands (higher and lower) and the signals in each sub-band are encoded using ADPCM technology. The system has three basic modes of operation corresponding to the bit rates used for 7 kHz audio coding: 64, 56 and 48 kbit/s. The latter two modes allow auxiliary data channels of 8 and 16 kbit/s, respectively, to be provided within the 64 kbit/s stream by making use of bits from the lower sub-band.
FIG. 7a is a block diagram of the SB-ADPCM encoder. The transmit quadrature mirror filters (QMFs) comprise two linear-phase non-recursive digital filters that split the frequency band of 0 to 8000 Hz into two sub-bands: the lower sub-band from 0 to 4000 Hz, and the higher sub-band from 4000 to 8000 Hz. Input signal xin 701 to the transmit QMFs 720 is sampled at 16 kHz. The outputs, xH 702 and xL 703 for the higher and lower sub-bands, respectively, are sampled at 8 kHz. In the lower sub-band, the difference signal obtained after subtracting an estimate of the input signal is adaptively quantized with 6 binary digits per sample to produce a 48 kbit/s signal IL 705. A 4-bit operation, instead of a 6-bit operation, is used in both the lower sub-band ADPCM encoder 722 and the lower sub-band ADPCM decoder 732 (
FIG. 7b is a block diagram of the SB-ADPCM decoder. De-multiplexer (DMUX) 730 decomposes the received 64 kbit/s octet-formatted signal Ir 707 into two signals, ILr 709 and IH 708, which form the codeword inputs to the lower and higher sub-band ADPCM decoders, respectively. Lower sub-band ADPCM decoder 732 reconstructs rL 711 and follows the same structure as ADPCM encoder 722 (See
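For illustration only, the following is a minimal sketch of a two-band analysis/synthesis pair in the spirit of the transmit and receive QMFs. The trivial Haar filter pair is used as a stand-in; the actual G.722 QMFs are 24-tap linear-phase filters whose coefficients are not reproduced here, and an even-length input is assumed.

```python
import numpy as np

def qmf_split(x):
    # Two-band analysis with the trivial Haar QMF pair (illustrative only,
    # not the 24-tap G.722 filters): a 16 kHz input yields two 8 kHz bands.
    x = np.asarray(x, dtype=float)  # even-length input assumed
    xL = (x[0::2] + x[1::2]) / 2.0  # lower band, roughly 0-4000 Hz
    xH = (x[0::2] - x[1::2]) / 2.0  # higher band, roughly 4000-8000 Hz
    return xL, xH

def qmf_merge(xL, xH):
    # Receive-side synthesis: perfect reconstruction for the Haar pair.
    x = np.empty(2 * len(xL))
    x[0::2] = xL + xH
    x[1::2] = xL - xH
    return x
```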
G.722 Super Wideband Extension means that the wideband portion from 0 to 8000 Hz is still coded with the G.722 CODEC, while the super wideband portion from 8000 to 14000 Hz of the input signal is coded using a different coding approach; the decoded output of the super wideband portion is combined with the output of the G.722 decoder to enhance the quality of the final output, sampled at 32 kHz. Higher layers at higher bit rates of the G.722 Super Wideband Extension can also be used to further enhance the quality of the wideband portion from 0 to 8000 Hz.
The ITU-T G.729.1/G.718 super wideband extension is a recently developed standard that is based on a G.729.1 or G.718 CODEC as the core layer of the extended scalable CODEC. The core layer of G.729.1 or G.718 encodes and decodes the wideband portion from 50 to 7000 Hz and outputs a signal sampled at 16 kHz. The extended layers add the encoding and decoding of the super wideband portion from 7000 to 14000 Hz and output a final signal sampled at 32 kHz. The higher layers of the extended scalable CODEC also enhance and improve the wideband portion (50-7000 Hz) by coding the error produced by the G.729.1 or G.718 CODEC.
The ITU-T G.729.1 encoder is also called a G.729EV coder, which is an 8-32 kbit/s scalable wideband (50-7000 Hz) extension of ITU-T Rec. G.729. By default, the encoder input and decoder output are sampled at 16 kHz. The bitstream produced by the encoder is scalable and has 12 embedded layers, referred to as Layers 1 to 12. Layer 1 is the core layer, corresponding to a bit rate of 8 kbit/s. This layer is compliant with the G.729 bitstream, which makes G.729EV interoperable with G.729. Layer 2 is a narrowband enhancement layer adding 4 kbit/s, while Layers 3 to 12 are wideband enhancement layers adding 20 kbit/s in steps of 2 kbit/s.
This coder operates on a digital signal sampled at 16000 Hz, converted to 16-bit linear PCM for input to the encoder. An 8000 Hz input sampling frequency is also supported. Similarly, the format of the decoder output is 16-bit linear PCM with a sampling frequency of 8000 Hz or 16000 Hz. Other input/output characteristics are converted to 16-bit linear PCM with 8000 or 16000 Hz sampling before encoding, or from 16-bit linear PCM to the appropriate format after decoding.
The G.729EV coder is built upon a three-stage structure: embedded Code-Excited Linear-Prediction (CELP) coding, Time-Domain Bandwidth Extension (TDBWE), and predictive transform coding that will be referred to as Time-Domain Aliasing Cancellation (TDAC). The embedded CELP stage generates Layers 1 and 2, which yield a narrowband synthesis (50-4000 Hz) at 8 and 12 kbit/s. The TDBWE stage generates Layer 3 and produces a wideband output (50-7000 Hz) at 14 kbit/s. The TDBWE algorithm is also borrowed to perform Frame Erasure Concealment (FEC), or Packet Loss Concealment (PLC), for layers higher than 14 kbit/s. The TDAC stage operates in the Modified Discrete Cosine Transform (MDCT) domain and generates Layers 4 to 12 to improve quality from 16 to 32 kbit/s. TDAC coding jointly represents the weighted CELP coding error signal in the 50-4000 Hz band and the input signal in the 4000-7000 Hz band. The G.729EV coder operates on 20 ms frames; however, the embedded CELP coding stage operates on 10 ms frames, like G.729. As a result, two 10 ms CELP frames are processed per 20 ms frame.
G.718 is an ITU-T standard embedded scalable speech and audio CODEC providing high quality narrowband (250 Hz to 3500 Hz) speech at the lower bit rates and high quality wideband (50 Hz to 7000 Hz) speech over the complete range of bit rates. In addition, G.718 is designed to be robust to frame erasures, thereby enhancing speech quality when used in internet protocol (IP) transport applications on fixed, wireless and mobile networks. The CODEC has an embedded scalable structure, enabling maximum flexibility in the transport of voice packets through IP networks of today and in future media-aware networks. In addition, the embedded structure of G.718 allows the CODEC to be extended to provide super-wideband coverage (50 Hz to 14000 Hz). The bitstream may be truncated at the decoder side or by any component of the communication system to instantaneously adjust the bit rate to the desired value without the need for out-of-band signaling. The encoder produces an embedded bitstream structured in five layers corresponding to the five available bit rates: 8, 12, 16, 24, and 32 kbit/s.
The G.718 encoder can accept wideband signals sampled at 16 kHz, or narrowband signals sampled at either 16 kHz or 8 kHz. Similarly, the decoder output can be 16 kHz wideband, in addition to 16 kHz or 8 kHz narrowband. Input signals sampled at 16 kHz, but with bandwidth limited to narrowband, are detected by the encoder. The output of the G.718 CODEC operates with a bandwidth of 50 Hz to 4000 Hz at 8 and 12 kbit/s, and 50 Hz to 7000 Hz from 8 to 32 kbit/s. The CODEC operates on 20 ms frames and has a maximum algorithmic delay of 42.875 ms for wideband input and wideband output signals. The maximum algorithmic delay for narrowband input and narrowband output signals is 43.875 ms. The CODEC can also be employed in a low-delay mode when the encoder and decoder maximum bit rates are set to 12 kbit/s; in this case, the maximum algorithmic delay is reduced by 10 ms.
The CODEC also incorporates an alternate coding mode, with a minimum bit rate of 12.65 kbit/s, whose bitstream is interoperable with the ITU-T Recommendation G.722.2, 3GPP AMR-WB, and 3GPP2 VMR-WB mobile wideband speech coding standards. This option replaces Layers 1 and 2, and Layers 3-5 are similar to the default option, with the exception that in Layer 3 a few bits are used to compensate for the extra bits of the 12.65 kbit/s core. The decoder further decodes other G.722.2 operating modes. G.718 also includes discontinuous transmission (DTX) and comfort noise generation (CNG) algorithms that enable bandwidth savings during inactive periods. An integrated noise reduction algorithm can be used, provided that the communication session is limited to 12 kbit/s.
The underlying algorithm is based on a two-stage coding structure: the lower two layers are based on Code-Excited Linear Prediction (CELP) coding of the band 50-6400 Hz, where the core layer takes advantage of signal classification to use optimized coding modes for each frame. The higher layers encode the weighted error signal from the lower layers using overlap-add modified discrete cosine transform (MDCT) coding. Several technologies are used to encode the MDCT coefficients to maximize the performance for both speech and music.
In one embodiment, a method of frequency domain post-processing includes applying an adaptive modification gain factor to each frequency coefficient and determining the gain factors based on Local Masking Magnitude, Local Masked Magnitude, and Average Magnitude. In an embodiment, Local Masking Magnitude M0(i) is estimated according to the perceptual masking effect by taking a weighted sum around the location of the specific frequency i:
where the weighting window w0i(k) is frequency dependent and F0(i) are the frequency coefficients before the post-processing is applied. Local Masked Magnitude M1(i) is estimated by taking a weighted sum around the location of the specific frequency i, similar to M0(i):

where the weighting window w1i(k) is frequency dependent, and is flatter and longer than w0i(k). Average Magnitude Mav is calculated on the whole spectrum band before the post-processing is performed.
In one example, the initial gain factor for each frequency is calculated as
where α (0≦α≦1) is a value close to 1. The gain factors can be further normalized to maintain the energy. In one embodiment, the normalized gain factors Gain1(i) are controlled by a parameter:
Gain2(i)=β·Gain1(i)+(1−β)
where β (0≦β≦1) is a parameter that controls strong or weak post-processing; this controlling parameter can be replaced by a smoothed one.
The foregoing has outlined, rather broadly, features of the present invention. Additional features of the invention will be described, hereinafter, which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures or processes for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIGS. 1a and 1b illustrate a typical time domain CODEC;
FIGS. 3a and 3b illustrate a typical frequency domain CODEC with a perceptual masking model in the encoder and post-processing in the decoder;
FIGS. 6a and 6b illustrate the asymmetry of simultaneous masking;
FIGS. 7a and 7b illustrate block diagrams of a G.722 encoder and decoder;
Corresponding numerals and symbols in different figures generally refer to corresponding parts unless otherwise indicated. The figures are drawn to clearly illustrate the relevant aspects of embodiments of the present invention and are not necessarily drawn to scale. To more clearly illustrate certain embodiments, a letter indicating variations of the same structure, material, or process step may follow a figure number.
The making and using of the presently preferred embodiments are discussed in detail below. It should be appreciated, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the invention, and do not limit the scope of the invention.
In an embodiment, a post-processor working in the frequency domain at the decoder side is proposed to enhance the perceptual quality of music, audio, or speech output signals. In one embodiment, the post-processing is implemented by multiplying each frequency coefficient by an adaptive gain factor. The adaptive gain factors are estimated using the principle of the perceptual masking effect.
In one aspect, the initial gain factors are calculated by comparing the mathematical values of three defined parameters, named Local Masking Magnitude, Local Masked Magnitude, and Average Magnitude. The gain factors are then normalized to maintain proper overall energy. In another aspect, the degree of the post-processing can be strong or weak, controlled depending on the real quality of the decoded signal and other possible factors.
In some embodiments, frequency domain post-processing is used rather than time domain post-processing. For example, when frequency domain coefficients are already available at the decoder, frequency domain post-processing may be simpler to perform than time domain post-processing. Also, in some cases, time domain post-processing may have difficulty improving quality for music signals, so frequency domain post-processing is used instead. Furthermore, if no time domain parameters are available to support time domain post-processing and frequency domain post-processing is no more complex than time domain post-processing, frequency domain processing is used in some embodiments.
In
One of the advantages of embodiments that incorporate frequency domain post-processing over time-domain post-processing is the ability to enhance not only regular harmonics (equally spaced harmonics) but also irregular harmonics (not equally spaced harmonics). Equally spaced harmonics correspond to periodic signals, as is the case for voiced speech. Music signals, on the other hand, often have irregular harmonics.

The ITU-T G.729.1/G.718 super-wideband extension standard decoder receives three portions of a bitstream: the first portion is used to decode the core of G.729.1 or G.718; the second portion is used to decode the MDCT enhancement layers for improving the band from 50 to 7000 Hz; and the third portion is transmitted to reconstruct the super-wideband from 7000 Hz to 14000 Hz.
In embodiments using a G.729.1 core, G.729.1 CELP decoder 901 outputs a time domain signal representing the narrow band, sampled at 8 kHz, and output 905 from enhancement layers 920 adds high band MDCT coefficients (4000-7000 Hz) and narrow band MDCT coefficients (50-4000 Hz) that improve the coding of the CELP error in the weighted domain. In embodiments that use a G.718 core, G.718 CELP decoder 901 outputs the time domain signal representing the band from 50 Hz to 6400 Hz, sampled at 16 kHz. Output 905 from the enhancement layers 920 adds high band MDCT coefficients (6400-7000 Hz) and improvement MDCT coefficients of the band from 50 Hz to 6400 Hz in the weighted domain. The time domain signal from the core CELP output is weighted through weighting filter 902 and then transformed into the MDCT domain by block 903. Coefficients 904 obtained from MDCT block 903 are added to the reconstructed coefficients 905 of the enhancement layers to form a complete set of MDCT coefficients 906 representing frequencies from 50 Hz to 7000 Hz in the weighted domain.
In some embodiments, MDCT coefficients 906 are ready to be post-processed by the embodiment frequency domain post-processing block 907. In an embodiment, the post-processed coefficients are inverse-transformed back into the time domain by inverse MDCT block 908. This time domain signal is still in the weighted domain, and it can be further post-processed for special purposes such as echo reduction. The weighted time domain signal is then filtered with inverse weighting filter 909 to obtain the output signal in the normal time domain.
In an embodiment that uses a G.729.1/G.718 super-wideband extension CODEC, the signal in the normal time domain is post-processed again with time domain post-processing block 910 and then up-sampled to the final output sampling rate of 32 kHz before being added to super-wideband output 914. Super-wideband MDCT coefficients 913 are decoded in the MDCT domain by block 924 and transformed into the time domain by inverse MDCT transformation 922. The final time domain output 915, sampled at 32 kHz, covers the decoded spectrum from 50 Hz to 14000 Hz.
In some embodiments, the CELP post-processing filters of the ITU-T G.729.1/G.718 super-wideband extension may perform well for normal speech signals; however, for some music signals, frequency domain post-processing can increase output sound quality. In the decoder of the ITU-T G.729.1/G.718 super-wideband extension, the MDCT coefficients of the frequency region [0-7 kHz] are available in the weighted domain, with 280 coefficients in total: F0(i)=M̂16(i), i=0, 1, . . . , 279. In embodiments, these frequency coefficients are used to perform frequency domain post-processing for music signals before the signals are transformed back into the time domain. In further embodiments, such processing can also be used for other audio signals besides music.
Since the gain factor for each frequency coefficient may be different for different frequencies, the spectrum shape is modified after the post-processing. In embodiments, a gain factor estimation algorithm is used in frequency domain post-processing. In some embodiments, the gain factor estimation algorithm is based on the perceptual masking principle.
When encoding the signal in the time domain using a perceptual weighting filter, as shown in
Turning back to
As described hereinabove with respect to
where F0(i) represents the frequency coefficients before the post-processing is applied. In some embodiments, the weighting window w0i(k) is not symmetric. One example of the weighting window w0i(k) 1101 is shown in
In some embodiments, the weighting window w0i(k) is different for every i. In other embodiments, however, the window is kept the same over a small interval of the frequency index for the sake of simplicity. In embodiments, the window coefficients can be pre-calculated, normalized, and saved in tables.
Local Masked Magnitude M1(i) is viewed as the estimated local “perceptual error floor.” Because the encoder encodes the signal in the perceptual domain, high energy frequency coefficients at the decoder side can have low relative error but high absolute error, and low energy frequency coefficients at the decoder side can have high relative error but low absolute error. The errors at different frequencies also perceptually influence each other, in a way similar to the masking effect of a normal signal. Therefore, in some embodiments, the Local Masked Magnitude M1(i) is estimated similarly to M0(i):
Here, the shape of the weighting window w1i(k) 1102 is flatter and longer than that of w0i(k), as shown in
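As a concrete sketch of these two estimates, the code below computes M0(i) and M1(i) as weighted sums of coefficient magnitudes around each frequency index i. The window shapes are assumptions chosen only to match the qualitative description above (w0 short and peaked, w1 flatter and longer, both with a longer tail toward higher frequencies), and a single window per estimate is used for simplicity rather than a fully frequency dependent one.

```python
import numpy as np

def local_magnitudes(F0, half0=4, half1=8):
    # Sketch: Local Masking Magnitude M0(i) and Local Masked Magnitude M1(i)
    # as weighted sums of |F0| around each frequency index i.
    absF = np.abs(np.asarray(F0, dtype=float))
    NF = len(absF)

    def make_window(half, peaked):
        # Assumed asymmetric window: the tail toward higher frequencies is
        # twice as long, loosely mimicking the upward spread of masking.
        k = np.arange(-half, 2 * half + 1)
        w = np.where(k < 0, 1.0 + k / (half + 1.0),
                     1.0 - k / (2.0 * half + 1.0))
        w = w ** (2.0 if peaked else 0.5)  # peaked for w0, flatter for w1
        return k, w / w.sum()              # pre-normalized, as in the text

    k0, w0 = make_window(half0, peaked=True)
    k1, w1 = make_window(half1, peaked=False)
    M0, M1 = np.empty(NF), np.empty(NF)
    for i in range(NF):
        M0[i] = np.dot(w0, absF[np.clip(i + k0, 0, NF - 1)])
        M1[i] = np.dot(w1, absF[np.clip(i + k1, 0, NF - 1)])
    return M0, M1
```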
In embodiments, the ratio M0(i)/M1(i) reflects the local relative perceptual quality at location i. Considering the possible influence of global energy, one way to initialize the estimate of the gain factor along the frequency axis is described in block 1007:
where α (0≦α≦1) is a value close to 1. In some embodiments, α=15/16. In further embodiments, other values of α can be used, for example, between 0.9 and 1.0. In some embodiments, α is used to control the influence of the global energy, which is represented here by the overall spectrum average magnitude 1006:
where NF is the total number of frequency coefficients. In some embodiments, for example, to avoid too much overall energy change after the post-processing, gain normalization 1008 is applied. The whole spectrum band can be divided into a few sub-bands, and the gain normalization is then performed on each sub-band by multiplying by a factor Norm, as shown in block 1008:
Gain1(i)=Gain0(i)·Norm. (4)
In embodiments that apply full gain normalization, normalization factor Norm is defined as,
If partial normalization is used, the real normalization factor could be a value between the Norm of Equation (5) and 1. Alternatively, if it is known that the quality of some sub-band is poor, for example, due to rough quantization precision and a low signal level, the real normalization factor could be below the Norm of Equation (5).
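A sketch of the gain initialization (block 1007) and full per-sub-band normalization (block 1008) follows. Since the exact expressions for Gain0(i) and Norm are not reproduced above, the forms below are assumptions consistent with the description: a masking-to-masked ratio in which α controls how strongly the average magnitude Mav moderates both terms, and an energy-preserving Norm per sub-band.

```python
import numpy as np

def initial_and_normalized_gains(F0, M0, M1, alpha=15.0 / 16.0, n_subbands=4):
    absF = np.abs(np.asarray(F0, dtype=float))
    Mav = absF.mean()  # average magnitude over the whole band (block 1006)
    eps = 1e-12
    # Assumed Gain0: local ratio M0/M1, with the (1 - alpha) terms letting
    # the global average Mav pull the ratio toward 1.
    gain0 = (alpha * M0 + (1.0 - alpha) * Mav) / \
            (alpha * M1 + (1.0 - alpha) * Mav + eps)
    # Assumed full normalization: preserve each sub-band's energy after the
    # gains are applied, i.e., Gain1(i) = Gain0(i) * Norm per sub-band.
    gain1 = np.empty_like(gain0)
    for band in np.array_split(np.arange(len(absF)), n_subbands):
        norm = np.sqrt(np.sum(absF[band] ** 2) /
                       (np.sum((gain0[band] * absF[band]) ** 2) + eps))
        gain1[band] = gain0[band] * norm
    return gain1
```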
In some embodiments, the gain factor estimated with Equation (3) indicates that strong post-processing is needed. In other embodiments, and in some real applications, only weak post-processing or even no post-processing is used, depending on the decoded signal quality. Therefore, in some embodiments, overall control of the post-processing is introduced through the controlling parameter β (0≦β≦1), with β=0 meaning no post-processing and β=1 meaning full post-processing. For example, in an embodiment, block 1009 calculates:
Gain2(i)=β·Gain1(i)+(1−β), (6)
where β (0≦β≦1) is a parameter that controls strong or weak post-processing. In some embodiments, the parameter β is constant; in other embodiments, it varies in real time depending on factors such as the transmitted bit rate, the real-time CODEC quality, speech/music characteristics, and/or noisy/clean signal characteristics.
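Equation (6) is straightforward to apply; the short sketch below blends each normalized gain toward 1 under the control of β.

```python
import numpy as np

def control_strength(gain1, beta):
    # Equation (6): beta = 0 disables the post-processing (all gains become
    # 1), while beta = 1 applies the normalized gains at full strength.
    return beta * np.asarray(gain1, dtype=float) + (1.0 - beta)
```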
As an example, the setting of β for ITU-T G.729.1/G.718 super-wideband extension is related to the output of the signal type classifier:
where β0 is a constant value of about 0.5. The Category determination algorithm is as follows.
A sound signal is separated into categories that provide information on the nature of the sound signal. In one embodiment, the mean of the past 40 values of the total frame energy variation is found by averaging EΔ[i] over the past 40 frames,
where EΔ[i]=Et[i]−Et[i−1], for i=−40, . . . , −1.
The index i denotes a particular past frame. Then, a statistical deviation Edev is calculated between the past 15 values of the total energy variation and the 40-value mean:
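The mean and deviation formulas themselves are not reproduced above; the sketch below assumes conventional forms, i.e., a mean of the last 40 frame-energy differences EΔ[i] and an RMS deviation of the most recent 15 differences from that mean.

```python
import numpy as np

def energy_deviation(Et):
    # Et holds the total frame energies of past frames, most recent last;
    # at least 41 values are needed to form the 40 differences below.
    E_delta = np.diff(np.asarray(Et, dtype=float))  # EΔ[i] = Et[i] − Et[i−1]
    mean40 = np.mean(E_delta[-40:])   # mean of the past 40 energy variations
    # Assumed deviation: RMS distance of the past 15 variations from the mean.
    return np.sqrt(np.mean((E_delta[-15:] - mean40) ** 2))
```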
In an embodiment, the resulting energy deviation is compared to four thresholds to determine the efficiency of the inter-tone noise reduction for the specific frame. The output of the signal type classifier module is an index corresponding to one of five categories, numbered 0 to 4. The first type (Category 0) corresponds to a non-tonal sound, like speech, which is not affected by the inter-tone noise reduction algorithm; this type of sound signal generally has a large statistical deviation. The three middle categories (1 to 3) include sounds with intermediate degrees of statistical deviation. The last category (Category 4) includes sounds that exhibit minimal statistical deviation.
In an embodiment, the thresholds are adaptive in order to prevent wrong classification. Typically, a tonal sound like music exhibits a much lower statistical deviation than a non-tonal sound like speech, but music can at times exhibit higher statistical deviation and, similarly, speech can exhibit lower statistical deviation.
In an embodiment, two counters of consecutive categories are used to increase or decrease the respective thresholds. The first counter is incremented in frames where Category 3 or 4 is selected; it is set to zero if Category 0 is selected and is left unchanged otherwise. The other counter has the inverse behavior: it is incremented if Category 0 is selected, set to zero if Category 3 or 4 is selected, and left unchanged otherwise. The initial values for both counters are zero. If the counter for Category 3 or Category 4 reaches 30, all thresholds are increased by 0.15625 to allow more frames to be classified in Category 4. On the other hand, if the counter for Category 0 reaches 30, all thresholds are decreased by 0.15625 to allow more frames to be classified in Category 0. In alternative embodiments, more or fewer categories can be determined, and other threshold counter and determination schemes can be used.
The thresholds are limited by a maximal and minimal value to ensure that the sound type classifier is not locked to a fixed category. The initial, minimal and maximal values of the thresholds are defined as follows:
where the superscript [j]=0, . . . , 3 denotes the category j. In alternative embodiments, other initial, minimal and maximal threshold values can be used.
The categories are selected based on a comparison between the calculated value of statistical deviation, Edev, and the four thresholds. The selection algorithm proceeds as follows:
In case of frame erasure, in one embodiment, all thresholds are reset to their minimum values and the output of the classifier is forced to Category 0 for 2 consecutive frames after the erased frame (3 frames including the erased frame).
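A sketch of the classifier with its adaptive thresholds, the two counters, and the frame-erasure reset is given below. The initial, minimal, and maximal threshold values are placeholders (the text gives the 0.15625 step and the counter limit of 30, but the threshold values themselves are not reproduced here), and resetting a counter after its threshold adjustment is an assumption.

```python
class SignalTypeClassifier:
    STEP, LIMIT = 0.15625, 30

    def __init__(self, thr_init, thr_min, thr_max):
        # Four ascending thresholds separate the five categories 0..4;
        # the actual initial/min/max values are placeholders.
        self.thr = list(thr_init)
        self.thr_min, self.thr_max = list(thr_min), list(thr_max)
        self.cnt_tonal = 0   # consecutive frames in Category 3 or 4
        self.cnt_speech = 0  # consecutive frames in Category 0
        self.forced = 0      # frames left with Category 0 forced (erasure)

    def on_erasure(self):
        self.thr = list(self.thr_min)  # reset thresholds to their minima
        self.forced = 2                # force Category 0 for 2 more frames

    def classify(self, e_dev):
        if self.forced > 0:
            self.forced -= 1
            return 0
        # A low deviation falls below more thresholds => more tonal category.
        cat = sum(e_dev < t for t in self.thr)
        if cat >= 3:
            self.cnt_tonal, self.cnt_speech = self.cnt_tonal + 1, 0
        elif cat == 0:
            self.cnt_speech, self.cnt_tonal = self.cnt_speech + 1, 0
        if self.cnt_tonal >= self.LIMIT:   # let more frames reach Category 4
            self.thr = [min(t + self.STEP, m)
                        for t, m in zip(self.thr, self.thr_max)]
            self.cnt_tonal = 0             # reset after adjusting (assumption)
        if self.cnt_speech >= self.LIMIT:  # let more frames reach Category 0
            self.thr = [max(t - self.STEP, m)
                        for t, m in zip(self.thr, self.thr_min)]
            self.cnt_speech = 0
        return cat
```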
In some embodiments, β is slightly reduced in the following way:
where Voicing is a smoothed value of the normalized voicing factor from the CELP:
Voicing=0.5·Voicing+0.5·Gp, where Gp=Ep/(Ep+Ec), Ep is the energy of the adaptive codebook excitation component, and Ec is the energy of the fixed codebook excitation component.
In embodiments, Sharpness is a spectral sharpness parameter defined as the ratio between the average magnitude and the peak magnitude in a frequency subband. For some embodiments processing typical music signals, if the Sharpness and Voicing values are small, strong post-processing is needed. In some embodiments, better CELP performance creates a larger Voicing value and, hence, a smaller β value and weaker post-processing. Therefore, when Voicing is close to 1, it could mean that the CELP CODEC works well in some embodiments. When Sharpness is large, the spectrum of the decoded signal could be noise-like.
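The exact reduction formula for β is not reproduced above; the sketch below is an illustrative form consistent with the description, in which β starts from a category-dependent base and is reduced as Voicing and Sharpness grow. The category gating and the constants c1 and c2 are hypothetical.

```python
def adapt_beta(beta0, category, voicing, sharpness, c1=0.2, c2=0.2):
    # Hypothetical gating: enable post-processing only for the more tonal
    # categories; the actual category-to-beta mapping is not given here.
    beta = beta0 if category >= 2 else 0.0
    # Larger Voicing (CELP performing well) and larger Sharpness (noise-like
    # spectrum) both weaken the post-processing (illustrative constants).
    return max(0.0, beta - c1 * voicing - c2 * sharpness)
```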
In some embodiments, additional gain factor processing is performed before the gain factors are multiplied with the frequency coefficients F0(i). For example, for the ITU-T G.729.1/G.718 super-wideband extension, some extra processing of the current controlling parameter is added, such as smoothing the current controlling parameter with the previous controlling parameter:

Gain2(i)=β·Gain1(i)+(1−β),

where β here denotes the smoothed controlling parameter.
The current gain factors are then further smoothed with the previous gain factors:
Finally, the determined modification gain factors are multiplied with the frequency coefficients F0(i) to obtain the post-processed frequency coefficients F1(i), as shown in blocks 1011 and 1012:

F1(i)=F0(i)·Gain2(i).
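Putting the last steps together, the sketch below smooths the current gains with the previous frame's gains and applies them to the coefficients (blocks 1011 and 1012). The smoothing weight of 0.5 is an assumption; the text states that smoothing occurs but does not reproduce the coefficient.

```python
import numpy as np

def apply_gains(F0, gain2, prev_gain, smooth=0.5):
    # Inter-frame gain smoothing (weight assumed), then apply the gains:
    # F1(i) = F0(i) * gain(i), as in blocks 1011 and 1012.
    gain = smooth * np.asarray(prev_gain) + (1.0 - smooth) * np.asarray(gain2)
    F1 = np.asarray(F0, dtype=float) * gain
    return F1, gain
```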
In some embodiments, inverse transformation block 1013 is optional; its use depends on whether the original decoder already includes an inverse transformation.
In embodiments that use ITU-T G.729.1, a frequency domain post-processing module for the high band from 4000 Hz to 8000 Hz is implemented. In some embodiments of the present invention, however, the post-processing is performed in one step without distinguishing between envelope and fine structure. Furthermore, in embodiments, the modification gain factors are generated based on sophisticated perceptual masking effects.
Audio access device 6 uses microphone 12 to convert sound, such as music or a person's voice into analog audio input signal 28. Microphone interface 16 converts analog audio input signal 28 into digital audio signal 32 for input into encoder 22 of CODEC 20. Encoder 22 produces encoded audio signal TX for transmission to network 36 via network interface 26 according to embodiments of the present invention. Decoder 24 within CODEC 20 receives encoded audio signal RX from network 36 via network interface 26, and converts encoded audio signal RX into digital audio signal 34. Speaker interface 18 converts digital audio signal 34 into audio signal 30 suitable for driving loudspeaker 14.
In embodiments of the present invention where audio access device 6 is a VOIP device, some or all of the components within audio access device 6 are implemented within a handset. In some embodiments, however, microphone 12 and loudspeaker 14 are separate units, and microphone interface 16, speaker interface 18, CODEC 20 and network interface 26 are implemented within a personal computer. CODEC 20 can be implemented either in software running on a computer or a dedicated processor, or by dedicated hardware, for example, on an application specific integrated circuit (ASIC). Microphone interface 16 is implemented by an analog-to-digital (A/D) converter, as well as other interface circuitry located within the handset and/or within the computer. Likewise, speaker interface 18 is implemented by a digital-to-analog converter and other interface circuitry located within the handset and/or within the computer. In further embodiments, audio access device 6 can be implemented and partitioned in other ways known in the art.
In embodiments of the present invention where audio access device 6 is a cellular or mobile telephone, the elements within audio access device 6 are implemented within a cellular handset. CODEC 20 is implemented by software running on a processor within the handset or by dedicated hardware. In further embodiments of the present invention, the audio access device may be implemented in other devices, such as peer-to-peer wireline and wireless digital communication systems, including intercoms and radio handsets. In applications such as consumer audio devices, the audio access device may contain a CODEC with only encoder 22 or decoder 24, for example, in a digital microphone system or music playback device. In other embodiments of the present invention, CODEC 20 can be used without microphone 12 and speaker 14, for example, in cellular base stations that access the PSTN. In some embodiments, decoder 24 performs embodiment audio post-processing algorithms.
In an embodiment, a method of frequency domain post-processing includes applying an adaptive modification gain factor to each frequency coefficient, and determining the gain factors based on Local Masking Magnitude and Local Masked Magnitude. In a further embodiment, the post-processing is performed in an MDCT domain or an FFT domain. In some embodiments, the post-processing is performed with an audio post-processor.
In some embodiments, Local Masking Magnitude M0(i) is estimated according to the perceptual masking effect. M0(i) is estimated by taking a weighted sum around the location of the specific frequency i:
where the weighting window w0i(k) is frequency dependent, and F0(i) are the frequency coefficients before the post-processing is applied. In some embodiments, w0i(k) is asymmetric.
In some embodiments, Local Masked Magnitude M1(i) is estimated according to the perceptual masking effect. M1(i) can be estimated by taking a weighted sum around the location of the specific frequency i, similar to M0(i):
where the weighting window w1i(k) is frequency dependent, and w1i(k) is flatter and longer than w0i(k). In some embodiments, w1i(k) is asymmetric.
In an embodiment, a method of frequency domain post-processing includes applying an adaptive modification gain factor to each frequency coefficient and determining the gain factors based on Local Masking Magnitude, Local Masked Magnitude, and Average Magnitude. In an embodiment, the post-processing is performed in a frequency domain comprising an MDCT domain or an FFT domain.
In an embodiment, Local Masking Magnitude M0(i) is estimated according to the perceptual masking effect. In one example, M0(i) is estimated by taking a weighted sum around the location of the specific frequency i:
where the weighting window w0i(k) is frequency dependent, and F0(i) are the frequency coefficients before the post-processing is applied. In some embodiments, w0i(k) is asymmetric.
In a further embodiment, Local Masked Magnitude M1(i) is estimated according to the perceptual masking effect. In an example, Local Masked Magnitude M1(i) is estimated by taking a weighted sum around the location of the specific frequency i, similar to M0(i):
where the weighting window w1i(k) is theoretically asymmetric and frequency dependent, and flatter and longer than w0i(k). In some embodiments, w0i(k) and/or w1i(k) are asymmetric.
In an embodiment, Average Magnitude Mav is calculated on the whole spectrum band that needs to be post-processed. In one example, the Average Magnitude Mav is calculated by
where NF is the total number of the frequency coefficients.
In an embodiment, one way to calculate the initial gain factor for each frequency is
where α (0≦α≦1) is a value close to 1. In some embodiments, α is 15/16. In further embodiments, α is between 0.9 and 1.0. In a further embodiment, the gain factors can be further normalized to maintain the energy:
Gain1(i)=Gain0(i)·Norm,
where the normalization factor Norm is defined as,
In a further embodiment, the normalized gain factors can be controlled by a parameter:
Gain2(i)=β·Gain1(i)+(1−β)
where β (0≦β≦1) is a parameter that controls strong or weak post-processing. In a further embodiment, this controlling parameter can be replaced by one smoothed with the previous controlling parameter, such as:
β=0.75·βprevious+0.25·βcurrent.
In a further embodiment, the finally determined gain factors are multiplied with the frequency coefficients to obtain the post-processed frequency coefficients. Further embodiment methods include, for example, receiving the frequency domain audio signal from a mobile telephone network, and converting the post-processed frequency domain signal into a time domain audio signal.
In some embodiments, the method is implemented by a system configured to operate over a voice over internet protocol (VOIP) system or a cellular telephone network. In further embodiments, the system has a receiver that includes an audio decoder configured to receive the audio parameters and produce an output audio signal based on the received audio parameters. Frequency domain post-processing according to embodiments is included in the system.
Although the embodiments and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. For example, it is contemplated that the circuitry disclosed herein can be implemented in software, or vice versa.
This patent application claims priority to U.S. Provisional Application No. 61/175,573 filed on May 5, 2009, entitled “Frequency Domain Post-processing Based on Perceptual Masking,” which application is incorporated by reference herein.