The present invention relates generally to audio classification based on perceptual quality for low or medium bit rates.
Audio signals are typically encoded prior to being stored or transmitted in order to achieve audio data compression, which reduces the transmission bandwidth and/or storage requirements of audio data. Audio compression algorithms reduce information redundancy through coding, pattern recognition, linear prediction, and other techniques. Audio compression algorithms can be either lossy or lossless in nature, with lossy compression algorithms achieving greater data compression than lossless compression algorithms.
Technical advantages are generally achieved by embodiments of this disclosure, which describe methods and techniques for improving AUDIO/VOICED classification based on perceptual quality for low or medium bit rates.
In accordance with an embodiment, a method for classifying signals prior to encoding is provided. In this example, the method includes receiving a digital signal comprising audio data. The digital signal is initially classified as an AUDIO signal. The method further includes re-classifying the digital signal as a VOICED signal when one or more periodicity parameters of the digital signal satisfy a criterion, and encoding the digital signal in accordance with a classification of the digital signal. The digital signal is encoded in the frequency-domain when the digital signal is classified as an AUDIO signal. The digital signal is encoded in the time-domain when the digital signal is re-classified as a VOICED signal. An apparatus for performing this method is also provided.
In accordance with another embodiment, another method for classifying signals prior to encoding is provided. In this example, the method includes receiving a digital signal comprising audio data. The digital signal is initially classified as an AUDIO signal. The method further includes determining normalized pitch correlation values for subframes in the digital signal, determining an average normalized pitch correlation value by averaging the normalized pitch correlation values, and determining pitch differences between subframes in the digital signal by comparing the normalized pitch correlation values associated with the respective subframes. The method further includes re-classifying the digital signal as a VOICED signal when each of the pitch differences is below a first threshold and the averaged normalized pitch correlation value exceeds a second threshold, and encoding the digital signal in accordance with a classification of the digital signal. The digital signal is encoded in the frequency-domain when the digital signal is classified as an AUDIO signal. The digital signal is encoded in the time-domain when the digital signal is classified as a VOICED signal.
Corresponding numerals and symbols in the different figures generally refer to corresponding parts unless otherwise indicated. The figures are drawn to clearly illustrate the relevant aspects of the embodiments and are not necessarily drawn to scale.
The embodiments of the invention are described below with reference to the accompanying drawings.
Audio signals are typically encoded in either the time-domain or the frequency domain. More specifically, audio signals carrying speech data are typically classified as VOICE signals and are encoded using time-domain encoding techniques, while audio signals carrying non-speech data are typically classified as AUDIO signals and are encoded using frequency-domain encoding techniques. Notably, the term “audio (lowercase) signal” is used herein to refer to any signal carrying sound data (speech data, non-speech data, etc.), while the term “AUDIO (uppercase) signal” is used herein to refer to a specific signal classification. This traditional manner of classifying audio signals typically generates higher quality encoded signals because speech data is generally periodic in nature, and therefore more amenable to time-domain encoding, while non-speech data is typically aperiodic in nature, and therefore more amenable to frequency-domain encoding. However, some non-speech signals exhibit enough periodicity to warrant time-domain encoding.
Aspects of this disclosure re-classify audio signals carrying non-speech data as VOICED signals when a periodicity parameter of the audio signal exceeds a threshold. In some embodiments, only low and/or medium bit-rate AUDIO signals are considered for re-classification. In other embodiments, all AUDIO signals are considered. The periodicity parameter can include any characteristic or set of characteristics indicative of periodicity. For example, the periodicity parameter may include pitch differences between subframes in the audio signal, a normalized pitch correlation for one or more subframes, an average normalized pitch correlation for the audio signal, or combinations thereof. Audio signals that are re-classified as VOICED signals may be encoded in the time-domain, while audio signals that remain classified as AUDIO signals may be encoded in the frequency-domain.
Generally speaking, it is better to use time-domain coding for speech signals and frequency-domain coding for music signals in order to achieve the best quality. However, for some specific music signals, such as very periodic signals, it may be better to use time-domain coding in order to benefit from a very high Long-Term Prediction (LTP) gain. The classification of audio signals prior to encoding should therefore be performed carefully, and may benefit from the consideration of various supplemental factors, such as the bit rate of the signals and/or characteristics of the coding algorithms. At low or medium bit rates, the perceptual quality of some specific AUDIO or music signals can be improved significantly simply by improving the classification, that is, the selection between time-domain coding and frequency-domain coding.
Speech data is typically characterized by a fast changing signal in which the spectrum and/or energy varies faster than in other signal types (e.g., music, etc.). Speech signals can be classified as UNVOICED signals, VOICED signals, GENERIC signals, or TRANSITION signals depending on the characteristics of their audio data. Non-speech data (e.g., music, etc.) is typically defined as a slow changing signal, the spectrum and/or energy of which changes more slowly than that of a speech signal. Normally, a music signal may include tone and harmonic types of AUDIO signal. For high bit rate coding, it may typically be advantageous to use a frequency-domain coding algorithm to code non-speech signals. However, when low or medium bit rate coding algorithms are used, it may be advantageous to use time-domain coding to encode tone or harmonic types of non-speech signals that exhibit strong periodicity, as frequency-domain coding may be unable to precisely encode the entire frequency band at a low or medium bit rate. In other words, encoding non-speech signals that exhibit strong periodicity in the frequency domain may result in some frequency sub-bands not being encoded or being only roughly encoded. On the other hand, CELP-type time-domain coding has an LTP function that can benefit greatly from strong periodicity. The following description will give a detailed example.
Several parameters are defined first. For a pitch lag P, a normalized pitch correlation is often defined in mathematical form as

$$R(P) = \frac{\sum_{n} s_w(n)\, s_w(n-P)}{\sqrt{\sum_{n} s_w(n)^2 \cdot \sum_{n} s_w(n-P)^2}}$$
In this equation, sw(n) is a weighted speech signal, the numerator is a correlation, and the denominator is an energy normalization factor. Suppose Voicing denotes the average normalized pitch correlation value of the four subframes in a current speech frame: Voicing=[R1(P1)+R2(P2)+R3(P3)+R4(P4)]/4. R1(P1), R2(P2), R3(P3), and R4(P4) are the four normalized pitch correlations calculated for the respective subframes of the current speech frame; P1, P2, P3, and P4 are the best pitch candidates found for each subframe in the pitch range from P=PIT_MIN to P=PIT_MAX. The smoothed pitch correlation from a previous frame to the current frame can be found using the following expression: Voicing_sm ⇐ (3·Voicing_sm+Voicing)/4.
Pitch differences between subframes can be defined using the following expressions:
dpit1=|P1−P2|
dpit2=|P2−P3|
dpit3=|P3−P4|
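As a small illustration, the three pitch differences can be computed from the four subframe pitch lags as sketched below. The function name and the choice of `double` for pitch lags (which allows fractional lags) are assumptions made for this example.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* dpit[i] = |P(i+1) - P(i+2)| for the four subframe pitch lags P1..P4. */
void pitch_differences(const double pitch[4], double dpit[3])
{
    assert(pitch != NULL && dpit != NULL);
    for (int i = 0; i < 3; i++)
        dpit[i] = fabs(pitch[i] - pitch[i + 1]);
}
```

Small values in all three entries indicate a stable pitch across the frame, which is one sign of strong periodicity.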
Suppose an audio signal is originally classified as an AUDIO signal and would be coded with frequency domain coding algorithm such as the algorithm shown in
Accordingly, at low or medium bit rates, the perceptual quality of some AUDIO signal or music signals can be improved by re-classifying them as VOICED signals prior to encoding. The following is a C-code example for re-classifying signals:
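A minimal sketch of such a re-classification is given here, assuming the rule stated in the summary: a signal initially classified as AUDIO becomes VOICED when each pitch difference is below a first threshold and the average normalized pitch correlation exceeds a second threshold. The type names, the function name, and the parameter structure are hypothetical, and no threshold values are taken from the embodiment; the caller supplies them.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CODER_AUDIO, CODER_VOICED } coder_type_t;

/* Hypothetical tuning parameters supplied by the caller. */
typedef struct {
    double max_pitch_diff;  /* first threshold: upper bound on each dpit */
    double min_voicing;     /* second threshold: lower bound on Voicing */
} reclass_params_t;

/* Returns the (possibly corrected) classification for one frame. */
coder_type_t reclassify(coder_type_t type, const double dpit[3],
                        double voicing, const reclass_params_t *p)
{
    assert(dpit != NULL && p != NULL);
    if (type != CODER_AUDIO)
        return type;                    /* only AUDIO frames are reconsidered */
    for (int i = 0; i < 3; i++)
        if (dpit[i] >= p->max_pitch_diff)
            return CODER_AUDIO;         /* pitch not stable enough */
    return voicing > p->min_voicing ? CODER_VOICED : CODER_AUDIO;
}
```

An encoder using this sketch would then select time-domain coding when the result is `CODER_VOICED` and frequency-domain coding when it remains `CODER_AUDIO`. Additional conditions, such as a bit-rate check or a test on the smoothed voicing value, could be added in the same way.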
Audio signals can be encoded in the time-domain or the frequency domain. Traditional time-domain parametric audio coding techniques make use of redundancy inherent in the speech/audio signal to reduce the amount of encoded information as well as to estimate the parameters of speech samples of a signal at short intervals. This redundancy primarily arises from the repetition of speech wave shapes at a quasi-periodic rate, and from the slowly changing spectral envelope of the speech signal. The redundancy of speech waveforms may be considered with respect to several different types of speech signal, such as voiced and unvoiced. For voiced speech, the speech signal is essentially periodic; however, this periodicity may vary over the duration of a speech segment, and the shape of the periodic wave usually changes gradually from segment to segment. Time-domain speech coding can benefit greatly from exploiting such periodicity. The voiced speech period is also called pitch, and pitch prediction is often named Long-Term Prediction (LTP). As for unvoiced speech, the signal is more like random noise and has a smaller amount of predictability. Voiced and unvoiced speech are defined as follows.
In either case, parametric coding may be used to reduce the redundancy of the speech segments by separating the excitation component of the speech signal from the spectral envelope component. The slowly changing spectral envelope can be represented by Linear Prediction Coding (LPC), also called Short-Term Prediction (STP). Time-domain speech coding can also benefit greatly from exploiting such Short-Term Prediction. The coding advantage arises from the slow rate at which the parameters change. Yet, it is rare for the parameters to be significantly different from the values held within a few milliseconds. Accordingly, at a sampling rate of 8 kHz, 12.8 kHz or 16 kHz, the speech coding algorithm is such that the nominal frame duration is in the range of ten to thirty milliseconds. A frame duration of twenty milliseconds seems to be the most common choice. In more recent well-known standards such as G.723.1, G.729, G.718, EFR, SMV, AMR, VMR-WB or AMR-WB, the Code Excited Linear Prediction Technique (“CELP”) has been adopted; CELP is commonly understood as a technical combination of Coded Excitation, Long-Term Prediction and Short-Term Prediction. CELP speech coding is a very popular algorithm principle in the speech compression area, although the details of CELP may differ significantly between codecs.
The weighting filter 110 is somewhat related to the above short-term prediction filter. An embodiment weighting filter is represented by the following equation:

$$W(z) = \frac{A(z/\alpha)}{1 - \beta \cdot z^{-1}}$$
where β<α, 0<β<1, 0<α≤1. The long-term prediction 105 depends on pitch and pitch gain. A pitch can be estimated from the original signal, a residual signal, or a weighted original signal. The long-term prediction function can in principle be expressed as follows: $B(z) = 1 - g_p \cdot z^{-\mathrm{Pitch}}$.
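In the time domain, a long-term prediction function of the form 1 − gp·z^(−Pitch) corresponds to subtracting a scaled copy of the signal delayed by one pitch period. The sketch below illustrates this for an integer pitch lag; the function name and the buffer layout (the input pointer addresses the current segment inside a buffer that also holds at least `pitch` past samples) are assumptions, and practical codecs typically also support fractional lags, which are not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Applies y(n) = x(n) - gp * x(n - pitch) to one segment.
 * Assumed layout: x[n - pitch] is a valid sample for 0 <= n < len. */
void ltp_residual(const double *x, double *y, int len, int pitch, double gp)
{
    assert(x != NULL && y != NULL && len > 0 && pitch > 0);
    for (int n = 0; n < len; n++)
        y[n] = x[n] - gp * x[n - pitch];
}
```

For a perfectly periodic input with period equal to `pitch` and a gain of 1, the output is zero, which is why strongly periodic signals yield a high long-term prediction gain.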
The coded excitation 108 normally comprises a pulse-like signal or a noise-like signal, which can be mathematically constructed or saved in a codebook. Finally, the coded excitation index, quantized gain index, quantized long-term prediction parameter index, and quantized short-term prediction parameter index are transmitted to the decoder.
Long-Term Prediction can play an important role in voiced speech coding because voiced speech has strong periodicity. The adjacent pitch cycles of voiced speech are similar to each other, which means mathematically that the pitch gain Gp in the following excitation expression is high or close to 1: e(n)=Gp·ep(n)+Gc·ec(n), where ep(n) is one subframe of a sample series indexed by n, coming from the adaptive codebook 307 which comprises the past excitation 304; ep(n) may be adaptively low-pass filtered, as the low frequency area is often more periodic or more harmonic than the high frequency area. ec(n) is from the coded excitation codebook 308 (also called the fixed codebook) and is the current excitation contribution; ec(n) may also be enhanced, for example by high-pass filtering enhancement, pitch enhancement, dispersion enhancement, formant enhancement, etc. For voiced speech, the contribution of ep(n) from the adaptive codebook could be dominant and the pitch gain Gp 305 is around a value of 1. The excitation is usually updated for each subframe. A typical frame size is 20 milliseconds (ms) and a typical subframe size is 5 milliseconds.
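The excitation expression e(n)=Gp·ep(n)+Gc·ec(n) can be illustrated with the short C sketch below. The function and parameter names are hypothetical; the codebook searches, gain quantization, and the optional filtering or enhancement of ep(n) and ec(n) are outside the scope of the example.

```c
#include <assert.h>
#include <stddef.h>

/* Builds one subframe of excitation from the adaptive codebook
 * contribution ep and the fixed (coded) codebook contribution ec. */
void build_excitation(const double *ep, const double *ec, double *e,
                      int len, double gp, double gc)
{
    assert(ep != NULL && ec != NULL && e != NULL && len > 0);
    for (int n = 0; n < len; n++)
        e[n] = gp * ep[n] + gc * ec[n];
}
```

With a 5 ms subframe, `len` would be 40 samples at 8 kHz, 64 at 12.8 kHz, or 80 at 16 kHz, and the function would be called once per subframe.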
For voiced speech, one frame typically contains more than 2 pitch cycles.
In a modern audio/speech digital signal communication system, a digital signal is compressed at an encoder, and the compressed information or bit-stream can be packetized and sent to a decoder frame by frame through a communication channel. The combined encoder and decoder is often referred to as a codec. Speech/audio compression may be used to reduce the number of bits that represent a speech/audio signal, thereby reducing the bandwidth and/or bit rate needed for transmission. In general, a higher bit rate will result in higher audio quality, while a lower bit rate will result in lower audio quality.
Audio coding based on filter bank technology is widely used. In signal processing, a filter bank is an array of band-pass filters that separates the input signal into multiple components, each one carrying a single frequency sub-band of the original input signal. The process of decomposition performed by the filter bank is called analysis, and the output of filter bank analysis is referred to as a sub-band signal having as many sub-bands as there are filters in the filter bank. The reconstruction process is called filter bank synthesis. In digital signal processing, the term filter bank is also commonly applied to a bank of receivers, which also may down-convert the sub-bands to a low center frequency that can be re-sampled at a reduced rate. The same synthesized result can sometimes be also achieved by under-sampling the band-pass sub-bands. The output of filter bank analysis may be in a form of complex coefficients; each complex coefficient having a real element and imaginary element respectively representing a cosine term and a sine term for each sub-band of filter bank.
Filter-Bank Analysis and Filter-Bank Synthesis form one kind of transformation pair that transforms a time domain signal into frequency domain coefficients and inverse-transforms frequency domain coefficients back into a time domain signal. Other popular analysis/synthesis pairs may be used in speech/audio signal coding, including pairs based on Cosine/Sine transformation, such as the Fast Fourier Transform (FFT) and inverse FFT, the Discrete Fourier Transform (DFT) and inverse DFT, the Discrete Cosine Transform (DCT) and inverse DCT, as well as the modified DCT (MDCT) and inverse MDCT.
In the application of filter banks for signal compression or frequency domain audio compression, some frequencies are perceptually more important than others. After decomposition, perceptually significant frequencies can be coded with a fine resolution, as small differences at these frequencies are perceptually noticeable enough to warrant using a coding scheme that preserves these differences. On the other hand, less perceptually significant frequencies are not replicated as precisely; a coarser coding scheme can therefore be used, even though some of the finer details will be lost in the coding. A typical coarser coding scheme may be based on the concept of Bandwidth Extension (BWE), also known as High Band Extension (HBE). One recently popular specific BWE or HBE approach is known as Sub Band Replica (SBR) or Spectral Band Replication (SBR). These techniques are similar in that they encode and decode some frequency sub-bands (usually high bands) with little or no bit rate budget, thereby yielding a significantly lower bit rate than a normal encoding/decoding approach. With the SBR technology, the spectral fine structure in the high frequency band is copied from the low frequency band, and random noise may be added. Next, the spectral envelope of the high frequency band is shaped by using side information transmitted from the encoder to the decoder.
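The copy-then-shape idea described above can be illustrated with the highly simplified C sketch below. It is not the SBR algorithm as standardized: the function name is hypothetical, the side information is assumed to be one gain per high-band sub-band of fixed width, the low-band coefficients are simply repeated cyclically to fill the high band, and the optional noise addition is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Fills n_high high-band coefficients by copying low-band fine structure
 * and scaling each sub-band of band_width coefficients by its gain.
 * band_gain must hold at least ceil(n_high / band_width) entries. */
void replicate_high_band(const double *low, int n_low,
                         double *high, int n_high,
                         const double *band_gain, int band_width)
{
    assert(low != NULL && high != NULL && band_gain != NULL);
    assert(n_low > 0 && n_high > 0 && band_width > 0);
    for (int k = 0; k < n_high; k++) {
        const double fine = low[k % n_low];          /* copied fine structure */
        high[k] = band_gain[k / band_width] * fine;  /* envelope shaping */
    }
}
```

Because only the per-band gains need to be transmitted for the high band, the bit cost is far lower than coding every high-band coefficient directly, at the price of a less precise reconstruction.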
Using psychoacoustic principles or the perceptual masking effect in the design of audio compression makes sense. Audio/speech equipment or communication is intended for interaction with humans, with all their abilities and limitations of perception. Traditional audio equipment attempts to reproduce signals with the utmost fidelity to the original. A more appropriately directed and often more efficient goal is to achieve the fidelity perceivable by humans. This is the goal of perceptual coders. Although one main goal of digital audio perceptual coders is data reduction, perceptual coding can also be used to improve the representation of digital audio through advanced bit allocation. One example of a perceptual coder is a multiband system that divides up the spectrum in a fashion that mimics the critical bands of psychoacoustics (Ballman 1991). By modeling human perception, perceptual coders can process signals much the way humans do, and take advantage of phenomena such as masking. While this is their goal, the process relies upon an accurate algorithm. Because it is difficult to build a very accurate perceptual model that covers common human hearing behavior, the accuracy of any mathematical expression of a perceptual model is still limited. Nevertheless, even with limited accuracy, the perception concept has helped the design of audio codecs considerably. Numerous MPEG audio coding schemes have benefited from exploiting the perceptual masking effect. Several ITU standard codecs also use the perceptual concept; for example, ITU G.729.1 performs so-called dynamic bit allocation based on the perceptual masking concept, and the dynamic bit allocation concept based on perceptual importance is also used in the recent 3GPP EVS codec.
For low or medium bit rate audio coding, short-term linear prediction (STP) and long-term linear prediction (LTP) can be combined with a frequency domain excitation coding.
The bus may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus, video bus, or the like. The CPU may comprise any type of electronic data processor. The memory may comprise any type of system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like. In an embodiment, the memory may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs.
The mass storage device may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus. The mass storage device may comprise, for example, one or more of a solid state drive, hard disk drive, a magnetic disk drive, an optical disk drive, or the like.
The video adapter and the I/O interface provide interfaces to couple external input and output devices to the processing unit. As illustrated, examples of input and output devices include the display coupled to the video adapter and the mouse/keyboard/printer coupled to the I/O interface. Other devices may be coupled to the processing unit, and additional or fewer interface cards may be utilized. For example, a serial interface such as Universal Serial Bus (USB) (not shown) may be used to provide an interface for a printer.
The processing unit also includes one or more network interfaces, which may comprise wired links, such as an Ethernet cable or the like, and/or wireless links to access nodes or different networks. The network interface allows the processing unit to communicate with remote units via the networks. For example, the network interface may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas. In an embodiment, the processing unit is coupled to a local-area network or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, remote storage facilities, or the like.
Although the description has been described in detail, it should be understood that various changes, substitutions and alterations can be made without departing from the spirit and scope of this disclosure as defined by the appended claims. Moreover, the scope of the disclosure is not intended to be limited to the particular embodiments described herein, as one of ordinary skill in the art will readily appreciate from this disclosure that processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, may perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.
This application is a continuation of U.S. patent application Ser. No. 14/027,052, filed on Sep. 13, 2013, which claims the benefit of U.S. Provisional Application No. 61/702,342 filed on Sep. 18, 2012, entitled “Improving AUDIO/VOICED Classification Based on Perceptual Quality for Low or Medium Bit Rates,” which is incorporated herein by reference as if reproduced in its entirety.
Number | Name | Date | Kind |
---|---|---|---|
6298322 | Lindemann | Oct 2001 | B1 |
6456965 | Yeldener | Sep 2002 | B1 |
6496797 | Redkov et al. | Dec 2002 | B1 |
6549885 | Ehara et al. | Apr 2003 | B2 |
8224657 | Jelinek et al. | Jul 2012 | B2 |
8447620 | Neuendorf et al. | May 2013 | B2 |
9015039 | Gao | Apr 2015 | B2 |
9037456 | Mittal et al. | May 2015 | B2 |
9037457 | Geiger et al. | May 2015 | B2 |
9037474 | Gao | May 2015 | B2 |
9099099 | Gao et al. | Aug 2015 | B2 |
20010023396 | Gersho et al. | Sep 2001 | A1 |
20020111797 | Gao | Aug 2002 | A1 |
20020177994 | Chang et al. | Nov 2002 | A1 |
20030009325 | Kirchherr et al. | Jan 2003 | A1 |
20030088401 | Terez | May 2003 | A1 |
20030125935 | Zinser et al. | Jul 2003 | A1 |
20040260545 | Gao | Dec 2004 | A1 |
20040267525 | Lee et al. | Dec 2004 | A1 |
20050114124 | Liu et al. | May 2005 | A1 |
20050154584 | Jelinek et al. | Jul 2005 | A1 |
20070143107 | Ben-David et al. | Jun 2007 | A1 |
20080147414 | Sonchang-Yong et al. | Jun 2008 | A1 |
20080249784 | Stachurski | Oct 2008 | A1 |
20090037168 | Gao | Feb 2009 | A1 |
20090119097 | Master et al. | May 2009 | A1 |
20100268530 | Sun et al. | Oct 2010 | A1 |
20110218800 | Zhang et al. | Sep 2011 | A1 |
20120101813 | Vaillancourt et al. | Apr 2012 | A1 |
20130166287 | Gao | Jun 2013 | A1 |
20130185063 | Atti et al. | Jul 2013 | A1 |
20130246068 | Lee | Sep 2013 | A1 |
20140081629 | Gao et al. | Mar 2014 | A1 |
20140330415 | Ramo et al. | Nov 2014 | A1 |
20160027450 | Gao | Jan 2016 | A1 |
Number | Date | Country |
---|---|---|
101256772 | Sep 2008 | CN |
2014-500521 | Jan 2014 | JP |
20080055026 | Jun 2008 | KR |
20080097684 | Nov 2008 | KR |
02065457 | Aug 2002 | WO |
2008072913 | Jun 2008 | WO |
2010003521 | Jan 2010 | WO |
2012055016 | May 2012 | WO |
Entry |
---|
Milan Jelinek et al., “G.718: A New Embedded Speech and Audio Coding Standard with High Resilience to Error-Prone Transmission Channels,” ITU-T Standards, IEEE Communications Magazine, Oct. 2009, pp. 117-123, total 7 pages. |
ITU-T G.718. Series G: Transmission Systems and Media,Digital Systems and Networks, Digital terminal equipments—Coding of voice and audio signals, “Frame error robust narrow-band and wideband embedded variable bit-rate coding of speech and audio from 8-32 kbit/s,” Jun. 2008, total 257 pages. |
Number | Date | Country | |
---|---|---|---|
20170116999 A1 | Apr 2017 | US |
Number | Date | Country | |
---|---|---|---|
61703342 | Sep 2012 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 14027052 | Sep 2013 | US |
Child | 15398321 | US |