This invention relates to audio signal processing and, more specifically, to systems and methods for enhancing receiver intelligibility.
Speech intelligibility is usually expressed as a percentage of words, sentences or phonemes correctly identified by a listener or a group of listeners. It is an important measure of the effectiveness or adequacy of a communication system or of the ability of people to communicate effectively in noisy environments.
Communication devices such as mobile phones, headsets, telephones and so forth may be used in vehicles or in other areas where there is often a high level of background noise. A high level of local background noise can make it difficult for a user of the communication device to understand the speech received from the far end of the communication network. The ability of the user to effectively understand the received speech is essential and is referred to as the intelligibility of the received speech.
In the past, the most common solution to overcome the background noise was to increase the volume at which the speaker of the communication device outputs speech. One problem with this solution is that the maximum output sound level that a phone's speaker can generate is limited. Due to the need to produce cost-competitive cell phones, companies often use low-cost speakers with limited power handling capabilities. The maximum sound level such phone speakers can generate is often insufficient to overcome high local background noise.
Attempts to overcome the local background noise by simply increasing the volume of the speaker output can also result in overloading the speaker. Overloading the loudspeaker introduces distortion into the speaker output and further decreases the intelligibility of the outputted speech. A technology is needed that increases the intelligibility of the received speech irrespective of the local background noise level.
Several attempts to improve intelligibility in communication devices are known in the related art. The requirements of such an intelligibility enhancement system include naturalness of the enhanced signal, short signal delay and computational simplicity.
During the past two decades, Linear Predictive Coding (LPC) has become one of the most prevalent techniques for speech analysis. In fact, this technique is the basis of many sophisticated algorithms used for estimating speech parameters, for example, pitch, formants, spectra, vocal tract shape and low bit-rate representations of speech. The basic principle of linear prediction states that speech can be modeled as the output of a linear time-varying system excited by either periodic pulses or random noise. The most general predictor form in linear prediction is the Auto-Regressive Moving Average (ARMA) model, where a speech sample s(n) is predicted from 'p' past speech samples s(n−1), . . . , s(n−p) with the addition of an excitation signal u(n), according to the following equation 1:
s(n) = \sum_{k=1}^{p} a_k \, s(n-k) + G \sum_{i=0}^{q} b_i \, u(n-i) \qquad \text{Equation 1}
where G is the gain factor for the input speech and a_k and b_i are filter coefficients. The related transfer function H(z) is given by the following equation 2:
H(z) = \frac{S(z)}{U(z)} = \frac{G \sum_{i=0}^{q} b_i z^{-i}}{1 - \sum_{k=1}^{p} a_k z^{-k}} \qquad \text{Equation 2}
For an all-pole or Autoregressive (AR) model, the transfer function reduces to the following equation 3:
H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)} \qquad \text{Equation 3}
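As an illustration of this source-filter model, the following is a minimal sketch (in Python with NumPy/SciPy) that synthesizes a signal by driving an all-pole filter 1/A(z) with either a periodic pulse train or white noise. The sampling rate, pitch period, predictor order and coefficient values are illustrative assumptions, not values taken from this disclosure.

import numpy as np
from scipy.signal import lfilter

fs = 8000                        # sampling rate (Hz), assumed for illustration
n = np.arange(fs // 10)          # one 100 ms frame

# Excitation u(n): periodic pulse train for voiced speech, white noise for unvoiced.
pitch_period = 80                # samples (100 Hz pitch), illustrative
voiced = np.zeros_like(n, dtype=float)
voiced[::pitch_period] = 1.0
unvoiced = np.random.randn(len(n))

# Example predictor coefficients a_k (p = 2), so A(z) = 1 - a1 z^-1 - a2 z^-2.
a = np.array([1.3, -0.6])
G = 0.5                          # gain factor

# s(n) = G u(n) filtered by 1/A(z): lfilter(b=[G], a=[1, -a1, -a2], u)
s_voiced = lfilter([G], np.concatenate(([1.0], -a)), voiced)
s_unvoiced = lfilter([G], np.concatenate(([1.0], -a)), unvoiced)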
Estimation of LPC
Two widely used methods exist for estimating the LP coefficients: the autocorrelation method and the covariance method. Both methods choose the LP coefficients a_k in such a way that the residual energy is minimized, and the classical least-squares technique is used for this purpose. Among the different variations of LP, the autocorrelation method of linear prediction is the most popular. In this method, a predictor (an FIR filter of order m) is determined by minimizing the square of the prediction error, the residual, over an infinite time interval. The popularity of the conventional autocorrelation method of LP is explained by its ability to compute, with a reasonable computational load, a stable all-pole model for the speech spectrum that is accurate enough for most applications while being represented by only a few parameters. The performance of LP in modeling the speech spectrum can be explained by the autocorrelation function of the all-pole filter, which exactly matches the autocorrelation of the input signal for lags 0 through m when the prediction order equals m. The energy in the residual signal is minimized. The residual energy is given by the following equation 4:
E = \sum_{n=-\infty}^{\infty} e^{2}(n) = \sum_{n=-\infty}^{\infty} \left[ s_n(n) - \sum_{k=1}^{p} a_k \, s_n(n-k) \right]^{2} \qquad \text{Equation 4}
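The least-squares step referred to above can be made explicit. Setting the partial derivative of E with respect to each coefficient to zero yields the standard normal (Yule-Walker) equations:

\frac{\partial E}{\partial a_i} = 0 \;\Longrightarrow\; \sum_{k=1}^{p} a_k \, R(|i-k|) = R(i), \quad i = 1, \dots, p, \qquad \text{where } R(i) = \sum_{n=-\infty}^{\infty} s_n(n) \, s_n(n-i)

The Toeplitz structure of this system allows an efficient solution, for example by the Levinson-Durbin recursion discussed later in this description.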
The covariance method is very similar to the autocorrelation method. The basic difference is the length of the analysis window. The covariance method windows the error signal instead of the original signal. The energy E of the windowed error signal is given by the following equation 5:
E = \sum_{n=-\infty}^{\infty} e^{2}(n) \, w(n) \qquad \text{Equation 5}
Comparing the autocorrelation and covariance methods, the covariance method is quite general and can be used without restrictions. The only problem is the stability of the resulting filter, which is generally not a severe problem. In the autocorrelation method, on the other hand, the filter is guaranteed to be stable, but problems of parameter accuracy can arise because of the necessity of windowing the time signal. This is usually a problem if the signal is a portion of an impulse response.
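For concreteness, the following is a minimal sketch of the covariance method under the description above: it builds the matrix of lagged products over one frame and solves the resulting normal equations directly. The function name, frame handling and predictor order are assumptions for illustration, and, as noted, the resulting filter is not guaranteed to be stable.

import numpy as np

def lpc_covariance(s, p):
    # Covariance-method LP coefficients a_1..a_p for one frame s (illustrative sketch).
    N = len(s)
    phi = np.empty((p, p))   # phi[i-1, k-1] = sum_n s(n-i) s(n-k), n = p..N-1
    psi = np.empty(p)        # psi[i-1]      = sum_n s(n)   s(n-i), n = p..N-1
    for i in range(1, p + 1):
        psi[i - 1] = np.dot(s[p:N], s[p - i:N - i])
        for k in range(1, p + 1):
            phi[i - 1, k - 1] = np.dot(s[p - i:N - i], s[p - k:N - k])
    return np.linalg.solve(phi, psi)   # stability of 1/A(z) is not guaranteed

A call such as a = lpc_covariance(frame, p=10) would return the predictor coefficients for a single analysis frame.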
Usually in environments with significant local background noise, the signal received from the far end becomes unintelligible due to a phenomenon called masking. There are several kinds of masking, including but not limited to, auditory masking, temporal masking, simultaneous masking and so forth.
Auditory masking is a phenomenon in which the perception of one sound is affected by the presence of another sound. Temporal masking is a phenomenon in which a sudden sound makes sounds occurring just before or after it inaudible. Simultaneous masking is the inability to hear a sound in the presence of another sound whose frequency components are very close to those of the desired sound.
In light of the above discussion, techniques are desirable for enhancing receiver intelligibility.
The present invention provides a communication device and method for enhancing audio signals. The communication device may monitor the local background noise in the environment and enhance the received communication signal in order to make listening easier. By monitoring the ambient or environmental noise in the location in which the communication device is operating and applying receiver intelligibility enhancement processing at the appropriate time, it is possible to significantly improve the intelligibility of the received communication signal.
In one aspect of the invention, the background noise in the environment in which the communication device is operating is monitored and analyzed.
In another aspect of the invention, the signals from the far end are modified based on the characteristics of the background noise at the near end.
In another aspect of the invention, the Linear Predictive Coding (LPC) spectrum of a first audio signal buffer acquired at the near end is used to modify the magnitude spectrum calculated from the Fast Fourier Transform (FFT) spectrum of a second audio signal buffer acquired from the far end, to generate an intelligibility-enhanced second audio signal.
Having thus described the invention in general terms, reference will now be made to the accompanying drawings, which are not necessarily drawn to scale.
The following detailed description is directed to certain specific embodiments of the invention. However, the invention can be embodied in a multitude of different ways as defined and covered by the claims and their equivalents. In this description, reference is made to the drawings wherein like parts are designated with like numerals throughout. Unless otherwise noted in this specification or in the claims, all of the terms used in the specification and the claims will have the meanings normally ascribed to these terms by workers in the art.
The present invention provides a novel and unique technique to improve intelligibility in the noisy environments experienced by communication devices such as a cellular telephone, wireless telephone, cordless telephone, and so forth. While the present invention has applicability to at least these types of communication devices, its principles are applicable to all types of communication devices, as well as to other devices that process speech in noisy environments, such as voice recorders, dictation systems, voice command and control systems, and the like. For simplicity, the following description may employ the terms “telephone” or “cellular telephone” as an umbrella term to describe the embodiments of the present invention, but those skilled in the art will appreciate that the use of such terms is not to be considered limiting to the scope of the invention, which is set forth by the claims appearing at the end of this description.
Generally, in conventional devices, the signals received from the far-end device 108 and outputted through an earpiece of the communication device 102 may not sound clear because of the background noise 106. The present invention provides techniques to generate and output clear and enhanced signals from the earpiece of communication device 102.
A Digital-to-Analog Converter (DAC) 218 connected to an earpiece 216 may convert digital audio signals to analog audio signals that may then be outputted by earpiece 216. Further, communication device 102 includes a receiver 210 that receives signals from a far-end device on communication channel 112. An enhancer 202 processes the signals received from microphones 212a-n and receiver 210 to enhance the signal received from receiver 210. Further, the enhanced signal is outputted from earpiece 216. Enhancer 202 may include a processor 204 and a memory 206. Processor 204 can be a general purpose fixed point or floating point Digital Signal Processor (DSP), or a specialized DSP (fixed point or floating point). Examples of processor 204 include, but are not limited to, the Texas Instruments (TI) TMS320VC5510, TMS320VC6713 and TMS320VC6416; the Analog Devices (ADI) Blackfin (BF) 531, BF532 and BF533; the Cambridge Silicon Radio (CSR) BlueCore 5 Multimedia (BC5-MM) or BlueCore 7 Multimedia (BC7-MM); and so forth. Memory 206 can be, for example, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), a solid state memory, a computer readable medium and so forth. Further, memory 206 may be implemented inside or outside communication device 102. Memory 206 may include instructions that can be executed by processor 204. Further, memory 206 may store data that may be used by processor 204. Processor 204 and memory 206 may communicate for data transfer through system bus 208.
The LPC coefficients are calculated based on the components of first audio signal buffer 302. In an embodiment of the invention, the LPC coefficients may be calculated using the Levinson-Durbin method.
However, those skilled in the art will appreciate that other techniques, such as the covariance method, the autocorrelation method or other methods, may be used to calculate the LPC coefficients. The LPC spectrum is calculated based on the LPC coefficients.
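As a hedged illustration of this step, the sketch below estimates the LPC coefficients of a near-end noise buffer with the Levinson-Durbin recursion applied to the autocorrelation sequence, and then evaluates the LPC magnitude spectrum as |G/A(z)| on a 128-point grid. The window choice, predictor order, FFT size and gain handling are assumptions for illustration rather than values mandated by this description.

import numpy as np

def levinson_durbin(r, p):
    # Solve the Toeplitz normal equations for a_1..a_p from autocorrelations r[0..p].
    a = np.zeros(p)
    e = r[0]                                         # prediction error energy
    for i in range(1, p + 1):
        acc = np.dot(a[:i - 1], r[i - 1:0:-1])       # sum_j a_j r[i-j]
        k = (r[i] - acc) / e                         # reflection coefficient
        a_prev = a[:i - 1].copy()
        a[i - 1] = k
        a[:i - 1] = a_prev - k * a_prev[::-1]
        e *= (1.0 - k * k)
    return a, e

def lpc_spectrum_db(noise_buffer, p=10, nfft=128):
    # LPC magnitude spectrum (in dB) of a windowed first (near-end) audio signal buffer.
    x = noise_buffer * np.hamming(len(noise_buffer))
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + p]   # autocorrelations r[0..p]
    a, err = levinson_durbin(r, p)
    A = np.fft.fft(np.concatenate(([1.0], -a)), nfft)            # A(z) sampled on the unit circle
    gain = np.sqrt(max(err, 1e-12))
    return 20.0 * np.log10(gain / (np.abs(A) + 1e-12))           # |G / A| in dB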
The FFT of the second audio signal buffer 310 may be calculated at block 312. An N-point FFT may be used (N ≥ 128). The magnitude spectrum of the FFT may be calculated at block 314. Block 316 performs the spectral domain processing, wherein selective frequencies of the second audio signal buffer are boosted by at least 3 decibels (dB). The difference between the LPC spectrum and the FFT magnitude spectrum, for all the N points, may be calculated. If the difference is more than K dB (K ≥ 5), the frequencies of the second audio signal buffer are boosted by at least 3 dB.
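One possible reading of blocks 312-316 is sketched below: the far-end frame is transformed with a 128-point FFT, its magnitude spectrum is expressed in dB, and any bin where the near-end LPC spectrum exceeds the far-end magnitude spectrum by more than K dB is boosted by 3 dB. The direction of the comparison, the constants and the function name are assumptions made for illustration.

import numpy as np

N_FFT = 128       # N-point FFT, N >= 128
K_DB = 5.0        # difference threshold K, K >= 5 dB
BOOST_DB = 3.0    # boost applied to the selected frequencies, at least 3 dB

def boost_masked_bins(far_frame, noise_lpc_db):
    # Blocks 312-316: FFT, magnitude spectrum and selective boost of the second buffer.
    spectrum = np.fft.fft(far_frame, N_FFT)                  # block 312
    mag_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)       # block 314
    mask = (noise_lpc_db - mag_db) > K_DB                    # block 316: compare all N points
    gain = np.where(mask, 10.0 ** (BOOST_DB / 20.0), 1.0)    # +3 dB where masked
    return spectrum * gain                                   # modified spectrum, phase preserved

Here noise_lpc_db would be the 128-point LPC spectrum of the first audio signal buffer computed as in the previous sketch.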
The third audio signal buffer 324 is an enhanced audio signal that may be converted from digital to analog and outputted from earpiece 216 of communication device 102.
In an embodiment of the invention, the first audio signal buffer 302, the second audio signal buffer 310 and the third audio signal buffer 324 may be stored in memory 206 for processing by processor 204.
First audio signal buffer 302 and second audio signal buffer 310 are processed by enhancer 202 to generate third audio signal buffer 324. The third audio signal buffer 324 may be converted from digital to analog and outputted from earpiece 216 of communication device 102. The third audio signal buffer 324 is an enhanced form of second audio signal buffer 310 that sounds clear to the user of communication device 102 even in the presence of background noise 106.
In an embodiment of the invention, communication device 102 may include a switch (not shown) to activate and/or deactivate enhancer 202. Therefore, once enhancer 202 is deactivated, first audio signal buffer 302 and second audio signal buffer 310 are not processed, and the signal received from the far-end device is outputted from earpiece 702.
Further, at step 814, the Linear Prediction Coding (LPC) coefficients of the first audio signal buffer are calculated. Thereafter, at step 816, the LPC spectrum is calculated from the LPC coefficients. In an embodiment of the invention, steps 808 and 814, and steps 810 and 816, may be performed simultaneously. At step 818, spectral domain processing may be performed wherein selective frequencies of the second audio signal buffer are boosted by at least 3 decibels (dB). The difference between the LPC spectrum and the FFT magnitude spectrum is calculated for all N points of the FFT (N ≥ 128). If the difference is more than K dB (K ≥ 5), the frequencies of the second audio signal buffer are boosted by at least 3 dB. Thereafter, at step 820, the inverse FFT may be calculated. Further, at step 822, the overlap-and-add method is performed for the second audio signal buffer to generate the third audio signal buffer 824. Subsequently, third audio signal buffer 324 may be converted from digital to analog and outputted from earpiece 216 of communication device 102.
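A minimal sketch of steps 818-822 as a frame-by-frame loop is given below, assuming 50% overlapping Hann-windowed frames of 128 samples; the overlap factor, window and function name are assumptions, since the description does not fix them, and process_spectrum stands for any per-frame spectral modifier such as the selective 3 dB boost sketched earlier.

import numpy as np

def enhance_far_end(far_signal, process_spectrum, frame_len=128, hop=64):
    # Steps 818-822: per-frame spectral processing, inverse FFT, and overlap-and-add.
    window = np.hanning(frame_len)                 # analysis window, sums to ~1 at 50% overlap
    out = np.zeros(len(far_signal))
    for start in range(0, len(far_signal) - frame_len + 1, hop):
        frame = far_signal[start:start + frame_len] * window
        spectrum = np.fft.fft(frame, frame_len)
        modified = process_spectrum(spectrum)      # step 818: spectral domain processing
        enhanced = np.fft.ifft(modified).real      # step 820: inverse FFT
        out[start:start + frame_len] += enhanced   # step 822: overlap and add
    return out                                     # third audio signal buffer (to DAC and earpiece)

A caller would pass, as process_spectrum, a function that applies the selective boost of the previous sketch to each frame's spectrum given the current near-end LPC spectrum.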
In one embodiment of the invention, a system and a method for enhancing audio signals are disclosed.
The system comprises a first receiver receiving noise signals from a near-end location of the system and a second receiver configured to receive audio signals from far-end communication devices. A signal enhancer comprising a processor and a memory is also provided in the system. The processor is configured to process the noise signals and the audio signals received by the first receiver and the second receiver, respectively, for enhancing the audio signals received from the second receiver. In one embodiment of the present invention, the audio signals are enhanced by generating a magnitude spectrum by calculating the Fast Fourier Transform (FFT) of the audio signals and further processing the magnitude spectrum based on the Linear Predictive Coding (LPC) spectrum of the noise signals.
According to the present invention, the processor is configured to segment the contents of the noise and audio signals, wherein the noise signals are continuously monitored and analyzed. The processor is also configured: to window the segmented contents, to calculate the Linear Prediction Coding (LPC) coefficients of the noise signals and to calculate the LPC spectrum from the LPC coefficients; to calculate the Fast Fourier Transform (FFT) of the audio signals; to calculate the magnitude spectrum from the FFT of the audio signals; to calculate the difference between the LPC spectrum and the FFT magnitude spectrum; to selectively boost frequencies of the audio signals by at least 3 decibels (dB) to modify the magnitude spectrum of the audio signals; to calculate the inverse FFT of the modified magnitude spectrum; and to overlap and add the audio signals to enhance the audio signals and output the enhanced audio signals by an earpiece.
According to the present invention, the memory is configured to store the noise and audio signals and the enhanced audio signals and to store one or more program instructions executable by the processor.
At least one microphone is configured to acquire the noise signals. The audio signals comprise speech signals received through a communication channel wherein the communication channel is a wireless communication channel.
This written description uses examples to disclose the invention, including the best mode, and also to enable any person skilled in the art to practice the invention, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the invention is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims if they have structural elements that do not differ from the literal language of the claims, or if they include equivalent structural elements with insubstantial differences from the literal language of the claims.
This application is a Continuation-in-Part (CIP) of U.S. application Ser. No. 12/951,027, filed on Nov. 20, 2010, which is a CIP of Ser. No. 12/946,468, filed on Nov. 15, 2010, which is a CIP of Ser. No. 12/941,827, filed on Nov. 8, 2010, which is a CIP of Ser. No. 12/705,296, filed on Feb. 12, 2010, which is a CIP of Ser. No. 12/139,489, filed on Jun. 15, 2008, which claims the benefit of provisional patent application 60/944,180, filed on Jun. 15, 2007. The entire teachings and contents of the above referenced applications are incorporated herein by reference.
| | Number | Date | Country |
|---|---|---|---|
| Parent | 12951027 | Nov 2010 | US |
| Child | 14468156 | | US |