The present application is related to co-pending U.S. patent application Ser. No. 13/975,344 entitled “METHOD FOR ADAPTIVE AUDIO SIGNAL SHAPING FOR IMPROVED PLAYBACK IN A NOISY ENVIRONMENT” filed on Aug. 25, 2013 by HUAN-YU SU, et al., co-pending U.S. patent application Ser. No. 14/193,606 entitled “IMPROVED ERROR CONCEALMENT FOR SPEECH DECODER” filed on Feb. 28, 2014 by HUAN-YU SU, co-pending U.S. patent application Ser. No. 14/534,531 entitled “ADAPTIVE DELAY FOR ENHANCED SPEECH PROCESSING” filed on Nov. 6, 2014 by HUAN-YU SU, co-pending U.S. patent application Ser. No. 14/534,472 entitled “ADAPTIVE SIDETONE TO ENHANCE TELEPHONIC COMMUNICATIONS” filed on Nov. 6, 2014 by HUAN-YU SU and co-pending U.S. patent application Ser. No. 14/629,864 entitled “IMPROVED NOISE SUPPRESSOR” filed concurrently herewith by HUAN-YU SU. The above referenced pending patent applications are incorporated herein by reference for all purposes, as if set forth in full.
The present invention is related to audio signal processing and more specifically to a system, method, and computer-program product for improving the audio quality of voice calls in a communication device.
The improved quality of voice communications over mobile telephone networks has contributed significantly to the growth of the wireless industry over the past two decades. Due to the mobile nature of the service, a user's quality of experience (QoE) can vary dramatically depending on many factors. Two such key factors are the wireless link quality and the background or ambient noise level. It should be appreciated that these factors are generally not within the user's control. In order to improve the user's QoE, the wireless industry continues to search for quality improvement solutions to address these key QoE factors.
Ambient noise is always present in our daily lives and, depending on its actual level, such noise can severely impact voice communications over wireless networks. A high noise level reduces the signal-to-noise ratio (SNR) of a talker's speech. Studies from members of speech standards organizations, such as 3GPP and ITU-T, show that lower SNR speech results in lower speech coding performance ratings, or a low MOS (mean opinion score). This has been found to be true for all LPC (linear predictive coding) based speech coding standards that are used in the wireless industry today.
Another problem with high level ambient noise is that it prevents the proper operation of certain bandwidth saving techniques, such as voice activity detection (VAD) and discontinuous transmission (DTX). These techniques operate by detecting periods of “silence” or background noise. The failure of such techniques due to high background noise levels results in unnecessary bandwidth consumption.
Since the standardization of EVRC (enhanced variable rate codec, IS-127) in 1997, the wireless industry has embraced speech enhancement techniques that operate to cancel or reduce background noise. Traditional noise suppression techniques are typically based on the manipulation of speech signals in the spectrum domain, including techniques such as spectrum subtraction and the like. The problem with such prior-art techniques is that they all require the speech signals to be converted from the time domain to the spectrum domain and back again. For example, speech signals in the time domain are converted to the spectrum or frequency domain using Discrete Fourier transform or Fast Fourier transform (DFT/FFT) techniques. The signals are then manipulated in the spectrum domain using techniques such as spectrum subtraction and the like. Finally, the signals are converted back into the time domain using inverse DFT/FFT techniques.
One problem with such conventional methods of noise reduction is that they are computationally complex. In addition, such methods typically introduce unwanted delay that worsens the mouth-to-ear latency.
Another problem with such conventional methods of spectrum domain manipulation is that unwanted spectrum distortion can be accidentally introduced, making the noise-reduced speech sound mechanical or ‘robotic’, which of course degrades the user-perceived QoE in a different and unintentional way.
Due to the poor performance of traditional noise suppression techniques, another trend in the wireless industry is to use two or more microphones to maintain reasonably acceptable noise suppression. While in theory, multi-microphone techniques (and therefore multi-source speech signals) allow for better noise suppression, these techniques carry with them significant cost and complexity increases that result in longer latency. In addition, such techniques still produce spectrally distorted voice quality.
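By way of illustration only, the conventional three-step procedure described above (conversion to the spectrum domain, spectrum subtraction, and conversion back to the time domain) may be sketched as follows. This is a minimal sketch and not an implementation taken from any particular standard. The function names `dft`, `inverse_dft` and `spectral_subtract`, and the per-bin `noise_magnitude` estimate, are hypothetical names introduced for this example, and a naive transform is used in place of an FFT for brevity.

```python
import cmath
import math


def dft(block):
    """Naive O(N^2) discrete Fourier transform of one block of samples."""
    n = len(block)
    return [
        sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Naive inverse DFT; returns the real part of each time-domain sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]


def spectral_subtract(block, noise_magnitude):
    """Subtract an estimated noise magnitude from each bin, keeping the phase.

    noise_magnitude is a hypothetical per-bin noise estimate supplied by
    the caller; how it is obtained is outside the scope of this sketch.
    """
    spectrum = dft(block)                          # time -> spectrum domain
    cleaned = []
    for value, noise in zip(spectrum, noise_magnitude):
        magnitude = max(abs(value) - noise, 0.0)   # floor the result at zero
        cleaned.append(cmath.rect(magnitude, cmath.phase(value)))
    return inverse_dft(cleaned)                    # spectrum -> time domain
```

It should be noted that the whole block of samples must be available before the first output sample can be produced, which is the source of the added delay associated with such block-based methods.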
In addition, at the receiving end of a communications system, the reconstructed (or down-link direction) speech signals are equivalent to a single source speech and as such, multi-source based noise suppression techniques are not applicable. Thus, there has been no attempt by the wireless industry to support noise suppression at the receiving end, or down-link direction, even though such an improvement will greatly enhance the user's perceived voice quality, especially when connected to another mobile device that does not support up-link noise suppression, such as older 2G/3G feature phones.
Accordingly, the present invention overcomes the deficiencies of prior-art systems and methods by providing a very low complexity and improved noise suppression system and method that can be used with low-cost single microphone systems in the up-link or down-link directions.
In addition, the present invention provides an improved noise suppression system and method that operates entirely in the time domain. Thus, the single-gain-based noise suppression technique of the present invention is extremely simple in terms of computational complexity, has zero additional latency, and is suitable for both up-link (Tx) and down-link (Rx) noise suppression.
The present invention may be described herein in terms of functional block components and various processing steps. It should be appreciated that such functional blocks may be realized by any number of hardware components or software elements configured to perform the specified functions. For example, the present invention may employ various integrated circuit components, e.g., memory elements, digital signal processing elements, logic elements, look-up tables, and the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices. In addition, those skilled in the art will appreciate that the present invention may be practiced in conjunction with any number of data and voice transmission protocols, and that the system described herein is merely one exemplary application for the invention.
It should be appreciated that the particular implementations shown and described herein are illustrative of the invention and its best mode and are not intended to otherwise limit the scope of the present invention in any way. Indeed, for the sake of brevity, conventional techniques for signal processing, data transmission, signaling, packet-based transmission, network control, and other functional aspects of the systems (and components of the individual operating components of the systems) may not be described in detail herein, but are readily known by skilled practitioners in the relevant arts. Furthermore, the connecting lines shown in the various figures contained herein are intended to represent exemplary functional relationships and/or physical couplings between the various elements. It should be noted that many alternative or additional functional relationships or physical connections may be present in a practical communication system. It should be noted that the present invention is described in terms of a typical mobile phone system. However, the present invention can be used with any type of communication device including non-mobile phone systems, laptop computers, tablets, game systems, desktop computers, personal digital infotainment devices and the like. Indeed, the present invention can be used with any system that supports digital voice communications. Therefore, the use of cellular mobile phones as example implementations should not be construed to limit the scope and breadth of the present invention.
On the far-end phone, the reverse processing takes place. The radio signal containing the compressed speech is received by the far-end phone's antenna in the far-end mobile phone receiver 240. Next, the signal is processed by the receiver radio circuitry 241, followed by the channel decoder 242 to obtain the received compressed speech, referred to as speech packets or frames 246. Depending on the speech coding scheme used, one compressed speech packet can typically represent 5-30 ms worth of a speech signal. After the speech decoder 243, the reconstructed speech (or down-link speech) 248 is output to the digital to analog convertor 254.
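As a worked illustration of the packet durations mentioned above, the number of speech samples represented by one packet follows directly from the frame duration and the sampling rate. The sampling rates of 8000 Hz and 16000 Hz used below are assumptions chosen as common narrowband and wideband examples, and `samples_per_frame` is a hypothetical helper name.

```python
def samples_per_frame(frame_ms, sample_rate_hz):
    """Number of PCM samples covered by one speech frame."""
    return sample_rate_hz * frame_ms // 1000


for rate in (8000, 16000):          # assumed narrowband and wideband rates
    for frame_ms in (5, 20, 30):    # 20 ms is picked as a mid-range example
        print(rate, frame_ms, samples_per_frame(frame_ms, rate))
```

For example, under these assumptions a 20 ms packet at 8000 Hz represents 160 samples.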
Due to the never-ending evolution of wireless access technology, it is worth mentioning that the combination of the channel encoder 216 and transmitter radio circuitry 217, as well as the reverse processing of the receiver radio circuitry 241 and channel decoder 242, can be seen as a wireless modem (modulator-demodulator). Newer standards in use today, including LTE, WiMax, WiFi, and others, comprise wireless modems in different configurations than described above and in
Referring now to
While such prior-art techniques using spectrum manipulation, as discussed above, can effectively remove the noise from the speech signal to produce an enhanced speech output, they have some well-known drawbacks. First, although quasi-stationary noises do exist, the large majority of real-life application conditions include noises that are rapidly changing. This results in an inevitable mismatch between the estimated noise spectrum and the actual noise spectrum. In addition, even when real-life quasi-stationary noises are present, there are inevitable signal variations at the millisecond level, resulting in local spectrum mismatch, which produces the well-known “musical tone” effect in the reproduced speech. Finally, when the noise spectrum estimate accidentally includes non-noise periods, i.e., when the voice activity detector misclassifies speech segments as noise and thereby corrupts the noise spectrum estimate 312, the spectrum manipulation 315 creates audible spectrum distortion in the output speech 325. With such unavoidable drawbacks, even though the noise might be largely reduced by such noise suppressors, the output speech 325 often sounds mechanical or has obvious artifacts that are objectionable to the human auditory system.
It should also be noted that multiple microphones are sometimes used to increase the detection accuracy and/or improve the noise spectrum estimate. From a signal processing point of view, having more reference data helps the detection accuracy. However, when the noise signal behavior inherently prevents the accurate detection of the true noise spectrum, such as fast changing noise having local spectrum variations, such traditional solutions still result in degraded output speech.
In addition, the noise suppressor in the prior-art models requires a block of speech samples to effectuate the conversion to the spectrum domain. This, as shown in
The digital input speech 435 is evaluated to determine the noise level 481. Techniques such as voice activity detection and the like are used to maintain a high accuracy of the noise level determination. However, mistakes are tolerated by the proposed technique quite well, as compared to prior-art methods. Noise is inherently time varying. Not only will its nature change from time to time (such as the case where car noise, for example, is combined with a nearby talker's low level voice), but its level will also change (such as the case where a truck suddenly approaches and passes by). Thus, an absolute and accurate detection of noise vs. speech is not practically possible. To overcome this inherent problem, the present invention uses a weighted mean factor as described below, with reference to
In parallel, the digital input speech signal 435 is also used to determine the actual signal level 484. It should be noted that when there is no active speech from the near-end talker, the signal level 484 and the noise level 481 are very close or identical. A large difference between these two levels indicates that the talker's active voice is present.
After the noise level and signal level determinations 481 and 484, those parameters are used by a multi-stage gain calculation module 485 to produce a signal gain factor 486. The output noise reduced signal 455 is the original speech signal 435 shaped by the gain 486.
Conventional voice activity detectors provide an indication of whether active speech is present. These conventional VAD devices work well with pure noise periods, but not so well with mixed speech and noise periods. While pure noise periods do exist, speech mixed with noise is also a very common phenomenon. Therefore, a simple binary decision mechanism cannot provide an accurate indication for the purposes of the present invention.
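The specific manner in which the noise level and the signal level are computed is not set out at this point in the description. Purely as a hedged sketch, one simple possibility is to smooth the sample magnitude to obtain the signal level, and to let the noise level follow that signal level quickly when it falls and only slowly when it rises. The class name `LevelTracker` and all constants below are assumptions introduced for illustration.

```python
class LevelTracker:
    """Tracks a smoothed signal level and a slowly adapting noise level.

    All constants are illustrative assumptions, not values from the source.
    """

    def __init__(self, signal_alpha=0.1, noise_rise=0.001, noise_fall=0.1):
        self.signal_alpha = signal_alpha  # smoothing of the signal level
        self.noise_rise = noise_rise      # slow upward drift of the noise level
        self.noise_fall = noise_fall      # fast downward tracking of the noise level
        self.signal_level = 0.0
        self.noise_level = None

    def update(self, sample):
        """Consume one sample and return (signal_level, noise_level)."""
        magnitude = abs(sample)
        # Signal level: one-pole smoothing of the sample magnitude.
        self.signal_level += self.signal_alpha * (magnitude - self.signal_level)
        if self.noise_level is None:
            self.noise_level = self.signal_level
        # Noise level: follow dips quickly and rises slowly, so a short
        # speech burst moves it far less than it moves the signal level.
        if self.signal_level < self.noise_level:
            rate = self.noise_fall
        else:
            rate = self.noise_rise
        self.noise_level += rate * (self.signal_level - self.noise_level)
        return self.signal_level, self.noise_level
```

With such a tracker, the two levels converge during noise-only periods and separate when a talker's voice begins, which is consistent with the behavior described above.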
Therefore, instead of using a typical VAD, the present invention provides a novel approach where the detected noise level and actual signal level are used as confidence parameters to calculate a gain factor. This concept is depicted in
The input speech (S) 401 is shown at the top of
In accordance with the present invention, an ideal gain factor (G) 472 is calculated. This is accomplished by comparing the actual signal level with the detected noise level. When the signal level is close to the detected noise level, confidence is high that the current signal is noise-only. Therefore, the gain factor remains close to 0.0 under these conditions. However, when the current signal level is larger than the detected noise level, the confidence is low that the current signal is noise-only, and therefore the gain factor is increased towards 1.0. This gain factor adaptation is performed on a sample-by-sample basis. An ideal gain factor should be close to 0.0 for pure noise, close to 1.0 when active speech is present, and take a value between 0.0 and 1.0 depending on the confidence about how much speech is present.
For normal applications, the gain factor will be close to 1.0 for signal periods where the near-end talker's speech is present. The gain factor will be very small, or even close to 0.0, for signal periods where there is only noise. For other segments, the gain factor will be between 0.0 and 1.0. For applications where AGC (automatic gain control) or ALC (automatic level control) is implemented in conjunction with the present invention, the gain factor can be larger than 1.0.
The present invention can be implemented as a sample-in/sample-out module, resulting in zero latency increase. Also, the complexity is extremely small, since only a few multiplication and addition operations are required per speech sample.
The enhanced digital speech signal 525 is next fed into the speech encoder 515. The enhanced digital speech 525 is compressed by the speech encoder 515 in accordance with whatever wireless speech coding standard is being implemented. Next, the enhanced compressed speech packets 526 go through a channel encoder 516 to prepare the packets for radio transmission. The channel encoder 516 is coupled with the transmitter radio circuitry 517, and the resulting signal is then transmitted over the near-end phone's antenna.
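The mapping from the two levels to a gain factor can be illustrated with a simple linear rule. The following is only one possible sketch and is not asserted to be the mapping used by the invention. The function name `gain_from_levels` and the tuning constant `full_speech_ratio` (the level ratio at which speech is assumed to be present with full confidence) are hypothetical.

```python
def gain_from_levels(signal_level, noise_level, full_speech_ratio=4.0):
    """Map the signal-to-noise level ratio to a gain between 0.0 and 1.0.

    full_speech_ratio is a hypothetical tuning constant: the ratio at which
    the signal is treated as speech with full confidence.
    """
    if noise_level <= 0.0:
        return 1.0                      # no noise detected, pass the signal through
    ratio = signal_level / noise_level
    # A ratio of 1.0 means noise-only (gain 0.0); a ratio at or above
    # full_speech_ratio means active speech (gain 1.0).
    confidence = (ratio - 1.0) / (full_speech_ratio - 1.0)
    return min(max(confidence, 0.0), 1.0)
```

Under these assumptions, equal levels give a gain of 0.0, a signal level of at least four times the noise level gives a gain of 1.0, and intermediate ratios give intermediate gains.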
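A sample-in/sample-out structure of this kind may be sketched as a generator that emits one output sample for every input sample without buffering a block. The level tracking and gain rules inside the loop, the function name `suppress`, and all constants are illustrative assumptions rather than the method of the invention. The sketch uses one division per sample for clarity.

```python
def suppress(samples, alpha=0.1, noise_rise=0.001, full_speech_ratio=4.0):
    """Yield one gain-shaped output sample per input sample, with no buffering.

    The constants and the level and gain rules are illustrative assumptions.
    """
    signal_level = 0.0
    noise_level = None
    for sample in samples:
        # Smoothed magnitude of the incoming signal.
        signal_level += alpha * (abs(sample) - signal_level)
        if noise_level is None or signal_level < noise_level:
            noise_level = signal_level          # drop to a new minimum at once
        else:
            noise_level += noise_rise * (signal_level - noise_level)
        if noise_level > 0.0:
            ratio = signal_level / noise_level
            confidence = (ratio - 1.0) / (full_speech_ratio - 1.0)
            gain = min(max(confidence, 0.0), 1.0)
        else:
            gain = 1.0
        yield gain * sample                     # output depends only on past input
```

Because each output depends only on the current and earlier input samples, the caller can consume the result of `suppress` one sample at a time as the input arrives.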
The present invention may be implemented using hardware, software or a combination thereof and may be implemented in a computer system or other processing system. Computers and other processing systems come in many forms, including wireless handsets, portable music players, infotainment devices, tablets, laptop computers, desktop computers and the like. In fact, in one embodiment, the invention is directed toward a computer system capable of carrying out the functionality described herein. An example computer system 701 is shown in
Computer system 701 also includes a main memory 706, preferably random access memory (RAM), and can also include a secondary memory 708. The secondary memory 708 can include, for example, a hard disk drive 710 and/or a removable storage drive 712, representing a magnetic disk or tape drive, an optical disk drive, etc. The removable storage drive 712 reads from and/or writes to a removable storage unit 714 in a well-known manner. Removable storage unit 714 represents magnetic or optical media, such as disks or tapes, which are read by and written to by removable storage drive 712. As will be appreciated, the removable storage unit 714 includes a computer usable storage medium having stored therein computer software and/or data.
In alternative embodiments, secondary memory 708 may include other similar means for allowing computer programs or other instructions to be loaded into computer system 701. Such means can include, for example, a removable storage unit 722 and an interface 720. Examples of such can include a USB flash disc and interface, a program cartridge and cartridge interface (such as that found in video game devices), other types of removable memory chips and associated socket, such as SD memory and the like, and other removable storage units 722 and interfaces 720 which allow software and data to be transferred from the removable storage unit 722 to computer system 701.
Computer system 701 can also include a communications interface 724. Communications interface 724 allows software and data to be transferred between computer system 701 and external devices. Examples of communications interface 724 can include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via communications interface 724 are in the form of signals which can be electronic, electromagnetic, optical or other signals capable of being received by communications interface 724. These signals 726 are provided to communications interface 724 via a channel 728. This channel 728 carries signals 726 and can be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link, such as WiFi or cellular, and other communications channels.
In this document, the terms “computer program medium” and “computer usable medium” are used to generally refer to media such as removable storage device 712, a hard disk installed in hard disk drive 710, and signals 726. These computer program products are means for providing software or code to computer system 701.
Computer programs (also called computer control logic or code) are stored in main memory and/or secondary memory 708. Computer programs can also be received via communications interface 724. Such computer programs, when executed, enable the computer system 701 to perform the features of the present invention as discussed herein. In particular, the computer programs, when executed, enable the processor 704 to perform the features of the present invention. Accordingly, such computer programs represent controllers of the computer system 701.
In an embodiment where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 701 using removable storage drive 712, hard drive 710 or communications interface 724. The control logic (software), when executed by the processor 704, causes the processor 704 to perform the functions of the invention as described herein.
In another embodiment, the invention is implemented primarily in hardware using, for example, hardware components such as application specific integrated circuits (ASICs). Implementation of the hardware state machine so as to perform the functions described herein will be apparent to persons skilled in the relevant art(s).
In yet another embodiment, the invention is implemented using a combination of both hardware and software.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
Number | Date | Country | |
---|---|---|---|
61948309 | Mar 2014 | US |