1. Field
This disclosure relates generally to a communication system and, more specifically, to techniques for comfort noise generation in a communication system.
2. Related Art
The process of distinguishing conversational speech from silence, music, noise, or other non-speech signals is generally known as voice activity detection (VAD). VAD may be implemented in a communication system using various speech processing algorithms that facilitate detection of speech. VAD may also indicate whether speech is voiced, unvoiced, or sustained. In general, known VAD algorithms trade off delay, sensitivity, accuracy, and computational cost. To detect voice, a VAD algorithm usually extracts measured features from an input signal and compares values associated with the features with predetermined thresholds. When VAD is employed with non-stationary noise, a time-varying threshold (calculated during voice-inactive segments) is usually employed. VAD algorithms usually formulate decision rules on a frame-by-frame basis using instantaneous measures of divergence distance between speech and noise. The different measures which are used in VAD algorithms may include spectral slope, correlation coefficients, log-likelihood ratio, cepstral, weighted cepstral, and modified distance measures.
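By way of illustration only, the frame-by-frame comparison against a time-varying threshold described above may be sketched as follows. This is a minimal sketch and not a listing from any particular VAD algorithm: it assumes mean frame energy as the single measured feature, and the class name EnergyVad, the margin factor, and the smoothing factor alpha are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical energy-based VAD. The feature (mean frame energy) and all
// names and constants are illustrative assumptions, not taken from the source.
class EnergyVad {
public:
    // margin: factor by which frame energy must exceed the noise floor.
    // alpha:  smoothing factor for the noise-floor update (0 < alpha <= 1).
    EnergyVad(double initial_floor, double margin, double alpha)
        : noise_floor_(initial_floor), margin_(margin), alpha_(alpha) {
        assert(initial_floor > 0.0 && margin > 1.0 && alpha > 0.0 && alpha <= 1.0);
    }

    // Returns true when the frame is classified as voice-active.
    bool process(const std::vector<double>& frame) {
        assert(!frame.empty());
        double energy = 0.0;
        for (double s : frame) energy += s * s;
        energy /= static_cast<double>(frame.size());

        const bool active = energy > margin_ * noise_floor_;
        if (!active) {
            // The threshold is recalculated only during voice-inactive frames.
            noise_floor_ += alpha_ * (energy - noise_floor_);
        }
        return active;
    }

    double noise_floor() const { return noise_floor_; }

private:
    double noise_floor_;
    double margin_;
    double alpha_;
};
```

In such a sketch, freezing the noise-floor update during active frames keeps speech energy from inflating the threshold, which is one reason the threshold is usually calculated during voice-inactive segments.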
Most modern telephone systems (such as wireless and voice over Internet protocol (VoIP) systems) use VAD as a form of squelching, such that low-level signals are ignored. In digital transmissions, ignoring low-level signals conserves bandwidth of a communication channel by discontinuing transmission when a signal level is below a threshold. When a telephony customer detects silence, especially for a prolonged time period, the customer may believe that a transmission has been dropped and hang up prematurely. In order to prevent premature hang-up, comfort noise has been added (e.g., at a receiver-end in wireless and VoIP systems) between voice transmissions. The generated comfort noise has usually been at a relatively low audible level, and has typically varied based on an average of a received signal.
Echo cancellation is used in telephony to remove echo from a voice communication in order to improve voice quality. Echo cancellation involves first recognizing an originally transmitted signal that re-appears, with some delay, in a transmitted or received signal. Upon recognition, an echo can be removed by subtracting the echo from a transmitted or received signal. Echo cancellation is generally implemented using a digital signal processor (DSP).
Two primary sources of echo in telephony are acoustic echo and hybrid echo. Acoustic echo arises when sound from a speaker of a telephone handset is picked up by a microphone of the telephone handset. For example, acoustic echo may occur in conjunction with hands-free car phone systems, a standard telephone in speakerphone or hands-free mode, conference telephones, installed room systems that use ceiling speakers and table-top microphones, video conferencing systems, etc. Direct acoustic path echo is attributable to sound from a speaker of a handset that enters a microphone of the handset substantially unaltered. When indirect acoustic path echo (reverberation) occurs, the echo can be difficult to effectively cancel (unlike echo associated with a direct acoustic path) as the original sound is altered by ambient space. The altered echo may be attributed to certain frequencies being absorbed by soft furnishings and reflection of different frequencies at varying strength.
Acoustic echo cancellers are usually designed to deal with changes and additions to an original signal caused by imperfections of a speaker, imperfections of a microphone, reverberant space, and physical coupling. In general, acoustic echo cancellation (AEC) algorithms approximate results of a next sample by comparing the difference between current and one or more previous samples. This information is then used to predict how sound is altered by an acoustic space. In this case, the model of the acoustic space is continually updated. The changing nature of a sampled signal is mainly due to changes in the acoustic environment, not changes in the characteristics of a loudspeaker, a microphone, or physical coupling. That is, changes in a sampled signal are usually attributable to objects moving in an acoustic environment and movement of a microphone within the environment. For example, when a door is closed or opened, a chair is pulled in closer to a table, or drapes are opened or closed, a change in reverberation of sound in an acoustic space occurs. To address changes in acoustic space, an echo cancellation algorithm may employ non-linear processing (NLP), which allows an algorithm to make changes to an acoustic space model that are suggested (but not yet confirmed) by signal comparison.
Hybrid (electric) echo is generated in public switched telephone networks (PSTNs) as a result of the reflection of electrical energy by a hybrid circuit. Hybrid echo may also be generated in voice-over-packet network systems, if the systems contain network elements (such as access gateways) that are equipped with access loop interfaces. As is known, most telephone local loops are two-wire circuits, while transmission facilities are usually four-wire circuits. A hybrid circuit or hybrid (typically, a part of an electronic device called a subscriber line interface circuit (SLIC)) converts a signal between the two and four-wire circuits. Unfortunately, when an impedance mismatch occurs, a hybrid produces a hybrid echo signal. An adaptive filter (included in a line echo canceller or a network echo canceller) learns about characteristics of the hybrid during an adaptation process. The output signal from the adaptive filter is inverted and combined with the hybrid echo signal. When the adaptation process is performed correctly, the result of combination of the hybrid echo signal and the inverted output signal of the adaptive filter produces a very small signal (called an error signal). Ideally, the error signal is small such that the error signal is not perceived audibly.
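By way of illustration only, the adaptation process described above may be sketched as follows. The source does not name an adaptation rule; the normalized least-mean-squares (NLMS) update used here is an assumption, and the class name NlmsEchoCanceller and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical adaptive filter for hybrid echo. NLMS is an assumed update
// rule chosen for illustration; the source does not specify one.
class NlmsEchoCanceller {
public:
    NlmsEchoCanceller(std::size_t taps, double step_size)
        : weights_(taps, 0.0), history_(taps, 0.0), mu_(step_size) {
        assert(taps > 0 && step_size > 0.0 && step_size < 2.0);
    }

    // far_end:  sample sent toward the hybrid.
    // near_end: sample returning from the hybrid (contains the hybrid echo).
    // Returns the error signal that remains after the echo estimate is removed.
    double process(double far_end, double near_end) {
        // Shift the far-end history and place the newest sample at the front.
        for (std::size_t i = history_.size() - 1; i > 0; --i)
            history_[i] = history_[i - 1];
        history_[0] = far_end;

        double estimate = 0.0;
        double power = 1e-9;  // small constant avoids division by zero
        for (std::size_t i = 0; i < weights_.size(); ++i) {
            estimate += weights_[i] * history_[i];
            power += history_[i] * history_[i];
        }

        // Subtracting the estimate is equivalent to combining the inverted
        // filter output with the hybrid echo signal.
        const double error = near_end - estimate;
        for (std::size_t i = 0; i < weights_.size(); ++i)
            weights_[i] += mu_ * error * history_[i] / power;
        return error;
    }

private:
    std::vector<double> weights_;  // learned model of the hybrid
    std::vector<double> history_;  // most recent far-end samples
    double mu_;                    // adaptation step size
};
```

When the hybrid is well modeled by the available taps and no near-end talker is present, the returned error signal in this sketch decays toward zero as the weights converge.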
In practice, the adaptation process usually never produces an ideal characteristic of the hybrid and the error signal is often so large that other approaches for reducing the error signal are needed. A typical method of reducing the energy of the error signal is based on NLP. NLP also usually reduces natural/environmental background noise injected at a near-end of a network connection. As a result, a far-end talker is not exposed to the natural/environmental background noise injected to the telephone connection at the near-end. To compensate and produce more natural conditions, under which the far-end talker participates in the telephone call, an injection of comfort noise by the echo canceller has been employed. Ideally, comfort noise should be indistinguishable from the natural/environmental background noise present at the near-end.
The present invention is illustrated by way of example and is not limited by the accompanying figures, in which like references indicate similar elements. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale.
In the following detailed description of exemplary embodiments of the invention, specific exemplary embodiments in which the invention may be practiced are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical, architectural, programmatic, mechanical, electrical and other changes may be made without departing from the spirit or scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims and their equivalents. In particular, although the preferred embodiment is described below in conjunction with comfort noise generation in a network/line echo canceller, it will be appreciated that the present invention is not so limited and may be embodied in various devices in a wired or wireless communication system where the introduction of comfort noise is perceived to improve voice communication quality.
Various techniques according to the present disclosure address limitations in conventional comfort noise generation (CNG) for voice processing and transmission. Today, CNG is widely used in telecommunication voice processing in conjunction with network echo cancellation, acoustic echo control, voice activity detection (VAD), etc. According to the present disclosure, CNG is enhanced by providing both signal spectrum and signal level matching capabilities at relatively low computational expense. In general, conventional spectrum matching (SM) CNG approaches are impractical in cost-effective digital signal processor (DSP) implementations, due to the computational complexity of the conventional SM CNG approaches. For example, conventional SM CNG approaches have employed uniformly distributed filters, which require more filters to cover a given bandwidth than when non-uniformly distributed filters are employed. As another example, conventional SM CNG approaches have employed finite impulse response (FIR) filters, which require more coefficients than infinite impulse response (IIR) filters.
According to various aspects of the present disclosure, practical and effective techniques are disclosed that analyze and synthesize background noise to produce comfort noise that substantially duplicates background noise in both spectral content and level. The disclosed SM CNG techniques generally improve overall voice quality of voice solutions, while at the same time incurring relatively low computational cost (in million cycles per second (MCPS)) and relatively low memory usage (when embodied in a digital signal processor (DSP) or a general purpose processor). While the discussion herein is primarily directed to implementations that employ infinite impulse response (IIR) filters, many of the techniques disclosed herein are broadly applicable to implementations that employ other filter-types (e.g., finite impulse response (FIR) filters), albeit at increased computational cost in many cases.
According to the present disclosure, a number of techniques are provided to effectively analyze dominant spectrum components in a frequency band (e.g., a telephony band ranging from 0 Hz to 4 kHz) in order to efficiently synthesize far-end comfort noise that substantially matches near-end background noise (in both spectral content and in level). According to various embodiments, an analysis task block (ATB) and a synthesis task block (STB) are employed to substantially match comfort noise with background noise. In one or more embodiments, the STB includes a global adaptive signal gain driven by data generated in the ATB. In another embodiment, the STB includes a global adaptive signal gain, as well as individual adaptive signal gains (one for each frequency sub-band), driven by data generated in the ATB.
The ATB and STB may incorporate uniformly distributed filter banks (e.g., when discrete Fourier transform (DFT) filters (such as fast Fourier transform (FFT) filters) and inverse DFT filters (such as inverse FFT (IFFT) filters) are employed) or non-uniformly distributed filter banks (e.g., when infinite impulse response (IIR) filters are employed). For example, a voice band may be sub-divided into six sub-bands, with each sub-band employing a non-uniformly distributed IIR filter in the ATB and STB and six white noise generators (one for each sub-band) in the STB. It should be appreciated that a frequency band may be sub-divided into more or fewer than six sub-bands, depending upon a voice quality desired. It should also be appreciated that as the number of sub-bands is increased, the computational complexity of a solution increases. The present techniques are particularly advantageous in applications where one or more fixed-point DSPs are implemented to facilitate CNG. It should be appreciated that the ATB may be operated in an on/off manner to reduce power requirements or when computational power is required for another task, particularly when background noise varies in a relatively slow manner.
Location of the CNG device/function in a telephony network is application specific. The CNG function may be implemented solely in hardware, solely in software, or in a combination of hardware and software in various communication devices. For example, the CNG function may be implemented within software that executes on a digital signal processor (DSP) or a general purpose processor, or within hardware of an application specific integrated circuit (ASIC) or a programmable logic device (PLD). In a typical application, the CNG device/function is configured such that a low-level background noise signal is not directly transmitted through an entire communication path. Typically, at a transmitting-end, a background noise signal is identified in terms of level and spectral content (the operations are performed by an ATB) by temporarily breaking a signal path. Parametric information (e.g., individual level estimates (ILEs) and a global level estimate (GLE)) about the background noise signal is then passed (e.g., in a control packet or a data packet) to a receiving-end. Based upon the parametric information, the STB generates a comfort noise signal that is similar (in level and spectral content) to the background noise signal at the transmitting-end. CNG, according to the present disclosure, may be integrated in, for example, voice codecs, echo controllers and echo cancellers. While many conventional CNG techniques merely match a global level of an incoming low-level background noise signal, CNG according to the present disclosure substantially matches both global level and individual levels associated with a frequency band and sub-bands, respectively, of the background noise signal.
The present disclosure is generally directed to a spectrum matching (SM) CNG solution that is a relatively inexpensive technique (in terms of MCPS) for identifying background noise signal level and spectral content. The disclosed SM CNG solutions also provide a relatively inexpensive and accurate technique for generating comfort noise at a receiving-end. Various SM CNG solutions disclosed herein employ independent noise signal generation for each individual sub-band and may include automatic signal gain adjustment, which may be particularly advantageous in fixed-point DSP implementations (due to accuracy). In one or more embodiments, the ATB and the STB each include IIR filter banks and the STB includes a random signal source array (including a white noise signal source for each IIR filter in an IIR filter bank of the STB).
In various embodiments, an STB includes a dynamic global gain adjustment mechanism (i.e., a global gain control (GGC)) that operates on a composite output of the STB. In various embodiments, the STB also includes individual gain controls (IGCs), one for each sub-band, that operate on individual filter outputs (F<n>, where n=1, 2, . . . , N) to provide dynamic local gain adjustment. According to one or more embodiments, the ATB produces a total level estimate (i.e., a composite signal that corresponds to an integrated sum of the filter outputs) and individual level estimates (i.e., individual signals that each correspond to individual filter outputs). The filters may, for example, operate at a decimated rate D>1 (where D=1 corresponds to a sampling rate used in digital telephony/voice over Internet protocol (VoIP) systems).
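By way of illustration only, the production of individual level estimates and a total level estimate from the filter outputs may be sketched as follows. This is a minimal sketch under stated assumptions: each filter output is rectified and smoothed by a leaky integrator, and the total level estimate integrates the sum of the rectified outputs. The class name LevelEstimator and the smoothing factor alpha are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical level estimator for an analysis task block (ATB).
// Rectification and leaky integration are illustrative assumptions.
class LevelEstimator {
public:
    LevelEstimator(std::size_t num_bands, double alpha)
        : ile_(num_bands, 0.0), gle_(0.0), alpha_(alpha) {
        assert(num_bands > 0 && alpha > 0.0 && alpha <= 1.0);
    }

    // band_outputs holds one sample from each sub-band filter output F1..FN.
    void update(const std::vector<double>& band_outputs) {
        assert(band_outputs.size() == ile_.size());
        double sum = 0.0;
        for (std::size_t n = 0; n < ile_.size(); ++n) {
            const double mag = std::fabs(band_outputs[n]);
            ile_[n] += alpha_ * (mag - ile_[n]);  // individual level estimate
            sum += mag;
        }
        gle_ += alpha_ * (sum - gle_);            // total (global) level estimate
    }

    const std::vector<double>& individual_levels() const { return ile_; }
    double global_level() const { return gle_; }

private:
    std::vector<double> ile_;
    double gle_;
    double alpha_;
};
```

A smaller alpha in this sketch gives smoother estimates that follow slowly varying background noise, at the cost of slower tracking.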
The selection of filter-types and filter coefficients may be performed in a number of different manners. In a typical filter selection process, filter sub-bands are first defined. For example, selection of non-uniformly distributed filter sub-bands may be based, at least loosely, on the Bark scale to provide sub-bands that are approximately equal on a (base ten) logarithmic scale. For a given application, experimentation may be employed to minimize a number of filter sub-bands, while at the same time producing adequate signal spectrum shaping. For example, sub-bands may be selected in consideration of relatively low-level background noise (e.g., generally lower than −40 decibels relative to 1 mW at point of zero reference level (dBm0)), limited bandwidth (e.g., a sample rate of 8 kHz), and/or relatively slow varying background noise, which reduces an accuracy needed for signal spectrum reproduction. Filter parameters, such as pass-band (Apass) and stop-band (Astop), may be selected in view of low-level signal application and cycle impact. Filters may then be synthesized using various filter types, e.g., IIR filter types such as Chebyshev Type I, Chebyshev Type II, and Elliptic filters, and a least computationally expensive filter that meets specifications may then be chosen for implementation. For example, filters may be implemented in a C++ model of an echo canceller.
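By way of illustration only, one manner of defining non-uniformly distributed sub-bands on the Bark scale may be sketched as follows. This is a minimal sketch under stated assumptions: it uses Traunmueller's closed-form approximation of the Bark scale and spaces the band edges equally in Bark, whereas the disclosure states only that the selection is loosely based on the Bark scale. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Traunmueller's approximation of the Bark scale and its inverse
// (an assumed choice of formula; the source does not specify one).
double hz_to_bark(double hz) { return 26.81 * hz / (1960.0 + hz) - 0.53; }
double bark_to_hz(double bark) { return 1960.0 * (bark + 0.53) / (26.28 - bark); }

// Returns num_bands + 1 band edges in Hz, equally spaced on the Bark scale
// between low_hz and high_hz.
std::vector<double> bark_band_edges(double low_hz, double high_hz,
                                    std::size_t num_bands) {
    assert(num_bands > 0 && low_hz >= 0.0 && high_hz > low_hz);
    const double z_low = hz_to_bark(low_hz);
    const double z_high = hz_to_bark(high_hz);
    std::vector<double> edges(num_bands + 1);
    for (std::size_t i = 0; i <= num_bands; ++i) {
        const double fraction =
            static_cast<double>(i) / static_cast<double>(num_bands);
        edges[i] = bark_to_hz(z_low + (z_high - z_low) * fraction);
    }
    return edges;
}
```

For a 0 Hz to 4 kHz band divided into six sub-bands, this sketch yields narrow sub-bands at the low end and progressively wider sub-bands toward 4 kHz. The resulting edges would be only a starting point, as the number and placement of sub-bands may be refined by experimentation.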
It should be appreciated that the above discussion provides an example for generating filter coefficients for the purpose of implementing low-cost analysis and synthesis filter banks within a SM CNG functional block. The SM CNG functional block may employ N−2 band-pass (BP) IIR filters, a low-pass (LP) IIR filter (at a low-end of a frequency band), and a high-pass (HP) IIR filter (at a high-end of a frequency band). With reference to
The SM CNG functionality may be implemented in various programming languages. For example, SM CNG functionality may be implemented in C++. Implementing the SM CNG functionality in C++ facilitates objective measurement of the disclosed techniques by comparing the spectrum of the input/output noise signals and by running special test vectors designed to facilitate evaluation of differences between level matching and spectrum matching from voice quality viewpoint. In general, spectrum matching in combination with level matching offers better voice quality than level matching alone.
Example C++ code (which is executed by, for example, a processor of an associated device, e.g., a network/line echo canceller) for performing an analysis task using an IIR filter bank is set forth below:
Example code (which is executed by, for example, a processor of an associated device, e.g., a network/line echo canceller) for performing a synthesis task using an IIR filter bank is set forth below:
Example code (which is executed by, for example, a processor of an associated device, e.g., a network/line echo canceller) for implementing an analysis task function using an IIR filter bank within an energy estimation function is set forth below:
Example code (which is executed by, for example, a processor of an associated device, e.g., a network/line echo canceller) for implementing a synthesis task function using an IIR filter bank with adaptive gain within nonlinear processing (NLP) functionality is set forth below:
In general, the techniques disclosed herein may be employed with IIR filter based analysis and synthesis tasks. Employing individual and global automatic level control elements to adjust sub-band levels and a global level, respectively, generally provides improved voice quality. As noted above, independent noise generators (one per sub-band) may be employed to reduce signal correlation in adjacent sub-bands (in the synthesis task). IIR filters in the analysis task may be configured to work continuously (e.g., during times indicated by double-talk functionality/nonlinear processor functionality) or in an on/off manner (e.g., in a variant of “sub-rate” approach). In general, the proposed solutions can be efficiently implemented in voice activity detection (VAD) or other functional components related to comfort noise generation. Tuning (adjusting gain coefficients, per sub-band and/or globally) may be readily performed during creation of a software version of an echo canceller.
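By way of illustration only, a synthesis task that employs independent noise generators (one per sub-band), sub-band filters, individual gains, and a global gain may be sketched as follows. This is a minimal sketch and not a listing from the disclosure: the class names Biquad and ComfortNoiseSynth are hypothetical, each sub-band filter is assumed to be a single second-order IIR section with coefficients supplied by the caller, and std::mt19937 with a uniform distribution stands in for a white noise signal source.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Second-order IIR section (Direct Form I), with a0 normalized to 1.
// Coefficients are supplied by the caller; none are taken from the source.
class Biquad {
public:
    Biquad(double b0, double b1, double b2, double a1, double a2)
        : b0_(b0), b1_(b1), b2_(b2), a1_(a1), a2_(a2) {}

    double process(double x) {
        const double y = b0_ * x + b1_ * x1_ + b2_ * x2_ - a1_ * y1_ - a2_ * y2_;
        x2_ = x1_; x1_ = x;
        y2_ = y1_; y1_ = y;
        return y;
    }

private:
    double b0_, b1_, b2_, a1_, a2_;
    double x1_ = 0.0, x2_ = 0.0, y1_ = 0.0, y2_ = 0.0;
};

// Hypothetical synthesis task: one noise generator and one filter per sub-band.
class ComfortNoiseSynth {
public:
    ComfortNoiseSynth(const std::vector<Biquad>& filters, unsigned base_seed)
        : filters_(filters), gains_(filters.size(), 0.0), global_gain_(1.0),
          uniform_(-1.0, 1.0) {
        assert(!filters.empty());
        // Distinct seeds give each sub-band its own independent noise source.
        for (std::size_t n = 0; n < filters.size(); ++n)
            generators_.emplace_back(base_seed + static_cast<unsigned>(n));
    }

    // individual_gains: one gain per sub-band; global_gain: applied to the sum.
    void set_levels(const std::vector<double>& individual_gains, double global_gain) {
        assert(individual_gains.size() == gains_.size());
        gains_ = individual_gains;
        global_gain_ = global_gain;
    }

    // Produces one comfort noise sample.
    double next_sample() {
        double sum = 0.0;
        for (std::size_t n = 0; n < filters_.size(); ++n) {
            const double noise = uniform_(generators_[n]);
            sum += gains_[n] * filters_[n].process(noise);
        }
        return global_gain_ * sum;
    }

private:
    std::vector<Biquad> filters_;
    std::vector<std::mt19937> generators_;
    std::vector<double> gains_;
    double global_gain_;
    std::uniform_real_distribution<double> uniform_;
};
```

In such a sketch, the individual gains would be derived from the per-sub-band level estimates and the global gain from the global level estimate; how those mappings are tuned is left open here.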
With reference to
In
Respective outputs of the generators 130 are each coupled to respective inputs of the filters 124. It should be appreciated that the filters 110 correspond to the filters 124 in sub-band allocation and filter-type. That is, the filter blocks 108 and 122 are substantially the same. Signal levels provided at respective outputs of the filters 124 are based on the ILEs provided by the ATB 106. The respective outputs of the filters 124 are summed and provided to an input of a multiplier function 128. As is shown, a gain adjust (GA) function 126 (of the STB 120) receives an input that corresponds to the GLE and a feedback input that corresponds to an output of the multiplier function 128. The GA function 126 is configured to provide a control input to the multiplier function 128 to control a signal level at the output of the multiplier function 128 responsive to the GLE.
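By way of illustration only, a gain adjust function that receives a global level estimate and a feedback input from a multiplier output may be sketched as follows. This is a minimal sketch under stated assumptions: the output level is tracked by a leaky integrator of the rectified output, and the gain is nudged in proportion to the difference between the target level and the tracked level. The class name GainAdjust and the parameters smoothing and step are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical gain adjust (GA) loop. The level tracker and the update rule
// are illustrative assumptions, not taken from the source.
class GainAdjust {
public:
    GainAdjust(double smoothing, double step)
        : gain_(1.0), out_level_(0.0), smoothing_(smoothing), step_(step) {
        assert(smoothing > 0.0 && smoothing <= 1.0 && step > 0.0);
    }

    // summed:     sum of the sub-band filter outputs.
    // target_gle: global level estimate received from the analysis side.
    // Returns the multiplier output (the summed signal scaled by the gain).
    double process(double summed, double target_gle) {
        const double out = gain_ * summed;                          // multiplier
        out_level_ += smoothing_ * (std::fabs(out) - out_level_);   // feedback
        gain_ += step_ * (target_gle - out_level_);                 // control
        if (gain_ < 0.0) gain_ = 0.0;  // a negative gain would only flip sign
        return out;
    }

    double gain() const { return gain_; }

private:
    double gain_;
    double out_level_;
    double smoothing_;
    double step_;
};
```

With a small step relative to the smoothing factor, the loop in this sketch settles without oscillation at the gain for which the tracked output level equals the global level estimate.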
With reference to
It should be appreciated that more or fewer than six of the filters 210 may be employed, depending on the accuracy of the comfort noise desired. The filters 210 may be uniform or non-uniform filters. A global level estimator (e.g., an integrator function) 212 provides a global level estimate (GLE) of the background noise signal. The ILEs and the GLE are provided (in a data packet or a control packet) to a synthesis task block (STB) 220. The STB 220 includes multiple white noise generators 230 and multiple filters 224 (included in filter block 222) that are implemented to create a comfort noise signal (that is based on the background noise signal sampled by the ATB 206) for the far-end telephone during periods of silence (i.e., when a user of the telephone 202 is not talking). When a user of the near-end telephone 202 is not talking, the switch 214 (under NLP control) may disconnect the near-end telephone 202 from the canceller 205 to prevent echo. During this period, comfort noise may be provided from the STB 220 to a far-end telephone (not shown) via the switch 234.
Respective outputs of the generators 230 are each coupled to respective inputs of the filters (F1-F6) 224. It should be appreciated that the filters 224 correspond to the filters 210 in sub-band allocation and filter-type. Signal levels provided at respective outputs of the filters 224 are based on the ILEs provided by the ATB 206. The respective outputs of the filters 224 are summed (by adder 232) and provided to an input of a multiplier function 228. As is shown, a gain adjust (GA) function 226 (of the STB 220) receives an input that corresponds to the GLE and a feedback input that corresponds to an output of the multiplier function 228. The GA function 226 is configured to provide a control input to the multiplier function 228 to control a signal level at the output of the multiplier function 228 responsive to the GLE.
With reference to
It should be appreciated that more or fewer than six of the filters 310 may be employed, depending on the accuracy of the comfort noise desired. The filters 310 may be uniform or non-uniform filters. A global level estimator (e.g., an integrator function) 312 provides a global level estimate (GLE) of the background noise signal. The ILEs and the GLE are provided (in a data packet or a control packet) to a synthesis task block (STB) 320. The STB 320 includes multiple white noise generators 330 and multiple filters (included in filter and individual gain control (IGC) blocks 324) that are implemented to create a comfort noise signal (that is based on the background noise signal sampled by the ATB 306) for the far-end telephone during periods of silence (i.e., when a user of the telephone 302 is not talking).
Respective outputs of the generators 330 are each coupled to respective inputs of the blocks 324. As is discussed in further detail with respect to
With reference to
With reference to
Moving to
Accordingly, a number of comfort noise generation techniques have been disclosed herein that generally improve quality of a voice communication system.
As may be used herein, a software system can include one or more objects, agents, threads, subroutines, separate software applications, two or more lines of code or other suitable software structures operating in one or more separate software applications, on one or more different processors, or other suitable software architectures.
As will be appreciated, the processes in various embodiments of the present invention may be implemented using any combination of software, firmware or hardware. As a preparatory step to practicing the invention in software, code (whether software or firmware) according to a preferred embodiment will typically be stored in one or more machine readable storage mediums such as semiconductor memories such as read-only memories (ROMs), programmable ROMs (PROMs), etc., thereby making an article of manufacture in accordance with the invention. The article of manufacture containing the code is used by either executing the code directly from the storage device or by copying the code from the storage device into another storage device such as a random access memory (RAM), etc. An apparatus for practicing the techniques of the present disclosure could be one or more communication devices.
Although the invention is described herein with reference to specific embodiments, various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. For example, the comfort noise generation techniques disclosed herein are generally broadly applicable to wired and wireless communication systems that facilitate voice communication, in addition to data communication. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. Any benefits, advantages, or solutions to problems that are described herein with regard to specific embodiments are not intended to be construed as a critical, required, or essential feature or element of any or all the claims.
Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements.