The present application is related to co-pending U.S. patent application Ser. No. 13/975,344 entitled “METHOD FOR ADAPTIVE AUDIO SIGNAL SHAPING FOR IMPROVED PLAYBACK IN A NOISY ENVIRONMENT” filed on Aug. 25, 2013 by HUAN-YU SU, et al., co-pending U.S. patent application Ser. No. 14/193,606 entitled “IMPROVED ERROR CONCEALMENT FOR SPEECH DECODER” filed on Feb. 28, 2014 by HUAN-YU SU, co-pending U.S. patent application Ser. No. 14/534,531 entitled “ADAPTIVE DELAY FOR ENHANCED SPEECH PROCESSING” filed on Nov. 6, 2014 by HUAN-YU SU, co-pending U.S. patent application Ser. No. 14/534,472 entitled “ADAPTIVE SIDETONE TO ENHANCE TELEPHONIC COMMUNICATIONS” filed on Nov. 6, 2014 by HUAN-YU SU, and co-pending U.S. patent application Ser. No. 14/629,819 entitled “NOISE SUPPRESSOR” filed concurrently herewith by HUAN-YU SU. The above-referenced pending patent applications are incorporated herein by reference for all purposes, as if set forth in full.
The present invention relates to audio signal processing and, more specifically, to a system, method, and computer-program product for improving the audio quality of voice calls in a communication device.
The improved quality of voice communications over mobile telephone networks has contributed significantly to the growth of the wireless industry over the past two decades. Due to the mobile nature of the service, a user's quality of experience (QoE) can vary dramatically depending on many factors. Two such key factors are the wireless link quality and the background or ambient noise level. It should be appreciated that these factors are generally not within the user's control. In order to improve the user's QoE, the wireless industry continues to search for quality improvement solutions that address these key QoE factors.
In practice, ambient noise is always present in our daily lives and, depending on its actual level, can severely impact our voice communications over wireless networks. A high noise level reduces the signal-to-noise ratio (SNR) of a talker's speech. Studies from members of speech standard organizations, such as 3GPP and ITU-T, show that lower-SNR speech results in lower speech coding performance ratings, i.e., a lower MOS (mean opinion score). This has been found to be true for all LPC (linear predictive coding) based speech coding standards used in the wireless industry today.
Another problem with high-level ambient noise is that it prevents the proper operation of certain bandwidth-saving techniques, such as voice activity detection (VAD) and discontinuous transmission (DTX). These techniques operate by detecting periods of “silence” or background noise. The failure of such techniques due to high background noise levels results in unnecessary bandwidth consumption and waste. One reason for this problem is that conventional systems tend to classify high-level noises as active voice. Since the standardization of EVRC (enhanced variable rate codec, IS-127) in 1997, the wireless industry has embraced speech enhancement based on noise cancellation or noise suppression techniques.
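For illustration only, the following minimal sketch shows the basic principle behind an energy-threshold VAD; the frame length, threshold margin, and function name are hypothetical placeholders, and standardized VADs such as the one in EVRC rely on many additional features:

```python
import numpy as np

def energy_vad(frame, noise_floor_db, threshold_db=6.0):
    """Flag a short frame (e.g., 20 ms of samples) as active speech.

    The frame is declared "active voice" when its energy exceeds the
    running noise-floor estimate by threshold_db. In high ambient
    noise the noise frames themselves clear this margin, which is why
    VAD/DTX schemes stop saving bandwidth exactly when noise is loud.
    """
    frame_energy_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
    return frame_energy_db > noise_floor_db + threshold_db
```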
Traditional noise suppression techniques are typically based on the manipulation of speech signals in the spectrum domain, using techniques such as spectrum subtraction and the like. While such prior-art techniques have gained broad acceptance and have been deployed in recent years by virtually all major mobile phone manufacturers, spectrum subtraction requires the speech signals to be converted from the time domain to the spectrum domain and back again. For example, speech signals in the time domain are converted to the spectrum or frequency domain using discrete Fourier transform or fast Fourier transform (DFT/FFT) techniques. The signals are then manipulated in the spectrum domain using techniques such as spectrum subtraction. Finally, the signals are converted back into the time domain using inverse DFT/FFT techniques. The amount of noise reduction applied in the spectrum domain is determined by the noise spectrum estimate, which is obtained during periods of the signal that are classified as being noise only.
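As a non-authoritative illustration of this analysis-modify-synthesis round trip, the following sketch performs one pass of magnitude spectrum subtraction on a single windowed frame; the spectral floor value is an assumed placeholder, not a value specified by any standard:

```python
import numpy as np

def spectral_subtraction(frame, noise_mag, floor=0.05):
    """One analysis/modify/synthesis pass over a windowed frame.

    frame     : time-domain samples (already windowed)
    noise_mag : current noise magnitude-spectrum estimate (len n//2+1)
    floor     : assumed spectral floor keeping bins from going negative
    """
    spectrum = np.fft.rfft(frame)                  # time -> frequency
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    # Subtract the estimated noise magnitude; clamp to a small floor.
    clean_mag = np.maximum(magnitude - noise_mag, floor * magnitude)
    # Reuse the noisy phase and convert back to the time domain.
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```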
Therefore, accurate estimates of the noise spectrum are vital to guaranteeing high-quality noise reduction in techniques based on traditional spectrum-domain subtraction. Such estimates generally assume that the noise is quasi-stationary; that is, it is assumed that the noise is not changing, or is changing only very slowly, over a certain short period of time. Under this assumption, one can monitor the noise spectrum during time periods where no talker's speech is present and only noise is observed.
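The following sketch shows one common way such a noise-only estimate might be maintained, namely recursive averaging over frames that a VAD has flagged as noise; the smoothing factor is an illustrative assumption:

```python
import numpy as np

def update_noise_estimate(noise_mag, frame, is_noise_only, alpha=0.95):
    """Refresh the noise spectrum estimate on noise-only frames.

    Under the quasi-stationarity assumption the estimate is updated
    only when the frame is classified as noise; during active speech
    the previous estimate is simply held, which is precisely why the
    estimate lags behind rapidly changing real-world noise.
    """
    if is_noise_only:
        frame_mag = np.abs(np.fft.rfft(frame))
        noise_mag = alpha * noise_mag + (1.0 - alpha) * frame_mag
    return noise_mag
```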
Unfortunately, in the real world noise is rarely time invariant. Consequently, noise spectrum estimates obtained from previous speech samples generally lag behind the true noise in the current input speech signal. This mismatch produces major quality degradations, including unwanted spectrum distortions known as “music tone,” which cause the noise-reduced speech to sound mechanical or “robotic.”
Another difficult problem with ambient noise is the noise type. While traditional noise suppressors handle stationary and quasi-stationary noises reasonably well, such prior-art techniques have problems with noise from sources like a secondary talker, or other dramatically time-varying sources, such as street and restaurant noises.
For voice communication applications, the use of a handset with a single microphone should be largely sufficient. However, due to the poor performance of traditional noise suppression techniques with a single speech source, a recent trend in the industry is to use dual microphones or even multiple microphones to maintain reasonably acceptable performance. Unfortunately, due to the traditional method of performing noise suppression, even with dual-microphone techniques and their associated cost increases, the resulting speech still exhibits the typical artifacts described above. Further, under such conditions, the noise suppressors in such prior-art systems generally require relatively long periods of time to converge, which leaves users exposed to un-removed noise.
Accordingly, the present invention provides improved noise suppression techniques that work well with both single and multiple input speech sources. Further, the present invention alleviates the prior-art degradation problems related to spectrum distortion and provides much faster convergence in order to further improve the user's perceived QoE across all application scenarios.
More particularly, the present invention provides a new and improved method and system that exploits a single or multiple input sources using a long-term noise spectrum estimate, which captures the time-invariant (or slowly varying) part of the noise, and a short-term noise spectrum estimate, which captures the more rapidly changing part of the noise. The present invention further includes a selectively applied, spectrum-gain-based shaping technique for reducing noise that completely eliminates artifacts such as “music tone” and other audible and objectionable distortions introduced by traditional methods.
The present invention also includes a largely relaxed dependency on accurate noise spectrum estimates, rendering the noise suppressor robust to the rapidly changing noise conditions that are common in daily life.
The present invention may be described herein in terms of functional block components and various processing steps. It should be appreciated that such functional blocks may be realized by any number of hardware components or software elements configured to perform the specified functions. For example, the present invention may employ various integrated circuit components, e.g., memory elements, digital signal processing elements, logic elements, look-up tables, and the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices. In addition, those skilled in the art will appreciate that the present invention may be practiced in conjunction with any number of data and voice transmission protocols, and that the system described herein is merely one exemplary application for the invention.
It should be appreciated that the particular implementations shown and described herein are illustrative of the invention and its best mode and are not intended to otherwise limit the scope of the present invention in any way. Indeed, for the sake of brevity, conventional techniques for signal processing, data transmission, signaling, packet-based transmission, network control, and other functional aspects of the systems (and components of the individual operating components of the systems) may not be described in detail herein, but are readily known by skilled practitioners in the relevant arts. Furthermore, the connecting lines shown in the various figures contained herein are intended to represent exemplary functional relationships and/or physical couplings between the various elements. It should be noted that many alternative or additional functional relationships or physical connections may be present in a practical communication system. It should be noted that the present invention is described in terms of a typical mobile phone system. However, the present invention can be used with any type of communication device including non-mobile phone systems, laptop computers, tablets, game systems, desktop computers, personal digital infotainment devices and the like. Indeed, the present invention can be used with any system that supports digital voice communications. Therefore, the use of cellular mobile phones as example implementations should not be construed to limit the scope and breadth of the present invention.
On the far-end phone, the reverse processing takes place. The radio signal containing the compressed speech is received by the far-end phone's antenna in the far-end mobile phone receiver 240. Next, the signal is processed by the receiver radio circuitry 241, followed by the channel decoder 242, to obtain the received compressed speech, referred to as speech packets or frames 246. Depending on the speech coding scheme used, one compressed speech packet typically represents 5-30 ms of speech signal. After the speech decoder 243, the reconstructed speech (or down-link speech) 248 is output to the digital-to-analog converter 254.
Due to the never-ending evolution of wireless access technology, it is worth mentioning that the combination of the channel encoder 216 and transmitter radio circuitry 217, as well as the reverse processing of the receiver radio circuitry 241 and channel decoder 242, can be seen as a wireless modem (modulator-demodulator). Newer standards in use today, including LTE, WiMax and WiFi, comprise wireless modems in different configurations than as described above and in
The enhanced digital speech signal 325 is next fed into the speech encoder 315. The enhanced digital speech 325 is compressed by the speech encoder 315 in accordance with whatever wireless speech coding standard is being implemented. Next, the enhanced compressed speech packets 326 go through a channel encoder 316 to prepare the packets for radio transmission. The channel encoder 316 is coupled to the transmitter radio circuitry 317, and the encoded signal is then transmitted over the near-end phone's antenna.
Referring now to
While such prior-art techniques using spectrum manipulation, as discussed above, can effectively remove noise from the speech signal to produce an enhanced speech output, they have some well-known drawbacks. First, although quasi-stationary noises do exist, the large majority of real-life application conditions include noises that are rapidly changing, resulting in an inevitable mismatch between the estimated noise spectrum and the actual noise spectrum. In addition, even when real-life quasi-stationary noises are present, there are inevitable signal variations at the millisecond level, resulting in local spectrum mismatches that produce the well-known “music tone” effect in the reproduced speech. Finally, when noise spectrum estimates accidentally include non-noise periods, i.e., when the voice activity detector misclassifies speech segments as noise and thereby corrupts the noise spectrum estimate 412, the spectrum manipulation 415 creates audible spectrum distortion in the output speech 425. With such unavoidable drawbacks, even though the noise might be largely reduced by such noise suppressors, the output speech 425 often sounds mechanical or has obvious artifacts that are objectionable to the human auditory system.
It should also be noted that multiple microphones are sometimes used to increase the detection accuracy and/or improve the noise spectrum estimate. From a signal processing point of view, having more reference data helps the detection accuracy. However, when the noise signal's behavior inherently prevents accurate detection of the true noise spectrum, such as fast-changing noise having local spectrum variations, such traditional solutions still result in degraded output speech. This is true even for conventional systems using multiple microphones.
In addition, traditional methods require accurate estimates of the noise spectrum. Such accurate estimates can only be obtained through certain periods of observation, known as the training or convergence period. Before the noise spectrum estimate has converged, zero or very little noise suppression is performed on the speech, leaving users to experience large variations in residual noise. For example, users generally experience loud residual noise while the noise spectrum estimate has not yet converged, followed by low residual noise once convergence is reached. Moreover, this condition repeats whenever the noise conditions change: during reconvergence periods, users are again exposed to loud residual noise followed by low residual noise until reconvergence is achieved.
It should be noted that in a typical dual-microphone configuration used on mobile phones, the main microphone (hereinafter referred to as “main-mic”) is placed close to the talker's mouth at the bottom of the phone. Thus, compared with the secondary microphone (“second-mic”), it picks up a much louder voice signal. The second-mic is usually placed at the opposite side of the phone, either on the top or the back, and is therefore only able to pick up the talker's voice at a reduced volume level. However, since ambient noises generally come from sources that are relatively far away from the phone, it is reasonable to assume that both microphones pick up the noises at comparable levels.
While the difference between the two microphone inputs has been used in conventional systems to improve voice activity detection, and thereby the noise spectrum estimate, the present invention takes this concept much further to provide a dramatically improved noise suppression technique. Specifically, the present invention further exploits the high correlation between the input signals from the two microphones as follows. The present invention uses the traditional noise spectrum estimate from the main microphone or both microphones as a long-term estimate of the noise (i.e., that part of the noise that is reasonably close to time invariant, or at least quasi-time-invariant), and additionally uses the secondary microphone's input signal spectrum as a short-term estimate of the noise. The present invention treats the short-term estimate as rapidly time varying, as in the case of a close-by interfering talker, where the noise is actually someone else's voice. Because the secondary microphone input speech also contains the talker's voice, a straightforward spectrum subtraction method should not be used.
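The following sketch is merely one hypothetical way to organize the two estimates just described; the function name and the smoothing constant are assumptions for illustration and do not represent the patented implementation itself:

```python
import numpy as np

def update_noise_estimates(main_frame, second_frame, long_term,
                           is_noise_only, alpha=0.98):
    """Maintain the long-term and short-term noise spectrum estimates.

    Long-term : slow recursive average over the main-mic spectrum,
                refreshed only on noise-only frames (captures the
                quasi-time-invariant part of the noise).
    Short-term: the current second-mic magnitude spectrum, which
                tracks rapidly varying noise (e.g., an interfering
                talker) but also contains some of the talker's own
                voice, so it must not be subtracted directly.
    """
    short_term = np.abs(np.fft.rfft(second_frame))
    if is_noise_only:
        main_mag = np.abs(np.fft.rfft(main_frame))
        long_term = alpha * long_term + (1.0 - alpha) * main_mag
    return long_term, short_term
```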
It is noted that the term “main-mic input speech” can also refer to a certain combination of the input speech from the two or multiple microphones. Similarly, the term “second-mic input speech” can also refer to a certain combination of the secondary microphone input speech with the main-microphone input speech, or of the secondary microphone input speech with other multiple microphone input speech signals.
In parallel, the input speech spectrum from the main-mic is compared with the long-term and short-term noise spectra, and a selective spectrum-gain-based shaping is performed 870 wherever the input speech spectrum is close to either the long-term or the short-term noise spectrum according to a predetermined threshold. The noise-reduced output speech 808 is obtained by converting the modified input speech spectrum back to the time domain 880.
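Again purely for illustration, a minimal sketch of such selective gain-based shaping follows; the threshold and attenuation values are placeholders standing in for the predetermined values discussed below:

```python
import numpy as np

def selective_gain_shaping(frame, long_term, short_term,
                           threshold=2.0, attenuation=0.3):
    """Attenuate only the frequency bins that look like noise.

    Any bin whose magnitude lies within `threshold` (a ratio) of the
    long-term or short-term noise estimate is scaled by `attenuation`;
    every other bin passes through untouched.
    """
    spectrum = np.fft.rfft(frame)
    magnitude = np.abs(spectrum)
    near_noise = ((magnitude < threshold * long_term) |
                  (magnitude < threshold * short_term))
    gains = np.where(near_noise, attenuation, 1.0)
    return np.fft.irfft(spectrum * gains, n=len(frame))
```

Because each bin is scaled rather than subtracted, no bin can overshoot to a negative magnitude, which is consistent with the avoidance of “music tone” artifacts described above.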
In a preferred embodiment of the present invention, the predetermined threshold and the applied gain may differ depending on various aspects of the speech signals and the design goals of the specific implementation of the present invention. For example, the predetermined threshold and/or the applied gain may differ for voiced and unvoiced segments, for highly voiced and weakly voiced segments, for signal-level-dependent and noise-level-dependent segments, and even for different frequencies in the spectrum domain. Any and all such variations may be implemented without departing from the scope and breadth of the present invention.
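By way of a hypothetical illustration only, such per-segment variations could be organized as a parameter table; all class names and values below are invented placeholders, not values taught by the present invention:

```python
# Hypothetical parameter table: one (threshold, attenuation) pair per
# segment class, chosen per the design goals discussed above.
SHAPING_PARAMS = {
    "highly_voiced": {"threshold": 1.5, "attenuation": 0.50},
    "weakly_voiced": {"threshold": 2.0, "attenuation": 0.40},
    "unvoiced":      {"threshold": 2.5, "attenuation": 0.25},
}

def params_for(segment_class):
    """Look up the shaping parameters for a classified segment."""
    return SHAPING_PARAMS[segment_class]
```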
As stated, the present invention may be implemented using multiple microphones or a single microphone. The single microphone implementation will now be described with reference to
The present invention may be implemented using hardware, software or a combination thereof and may be implemented in a computer system or other processing system. Computers and other processing systems come in many forms, including wireless handsets, portable music players, infotainment devices, tablets, laptop computers, desktop computers and the like. In fact, in one embodiment, the invention is directed toward a computer system capable of carrying out the functionality described herein. An example computer system 1101 is shown in
Computer system 1101 also includes a main memory 1106, preferably random access memory (RAM), and can also include a secondary memory 1108. The secondary memory 1108 can include, for example, a hard disk drive 1110 and/or a removable storage drive 1112, representing a magnetic disc or tape drive, an optical disk drive, etc. The removable storage drive 1112 reads from and/or writes to a removable storage unit 1114 in a well-known manner. Removable storage unit 1114 represents magnetic or optical media, such as disks or tapes, etc., which are read by and written to by removable storage drive 1112. As will be appreciated, the removable storage unit 1114 includes a computer usable storage medium having stored therein computer software and/or data.
In alternative embodiments, secondary memory 1108 may include other similar means for allowing computer programs or other instructions to be loaded into computer system 1101. Such means can include, for example, a removable storage unit 1122 and an interface 1120. Examples include a USB flash disk and interface, a program cartridge and cartridge interface (such as that found in video game devices), other types of removable memory chips and associated sockets (such as SD memory and the like), and other removable storage units 1122 and interfaces 1120 which allow software and data to be transferred from the removable storage unit 1122 to computer system 1101.
Computer system 1101 can also include a communications interface 1124. Communications interface 1124 allows software and data to be transferred between computer system 1101 and external devices. Examples of communications interface 1124 can include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via communications interface 1124 are in the form of signals which can be electronic, electromagnetic, optical or other signals capable of being received by communications interface 1124. These signals 1126 are provided to communications interface 1124 via a channel 1128. This channel 1128 carries signals 1126 and can be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link such as WiFi or cellular, and other communications channels.
In this document, the terms “computer program medium” and “computer usable medium” are used to generally refer to media such as removable storage drive 1112, a hard disk installed in hard disk drive 1110, and signals 1126. These computer program products are means for providing software or code to computer system 1101.
Computer programs (also called computer control logic or code) are stored in main memory 1106 and/or secondary memory 1108. Computer programs can also be received via communications interface 1124. Such computer programs, when executed, enable the computer system 1101 to perform the features of the present invention as discussed herein. In particular, the computer programs, when executed, enable the processor 1104 to perform the features of the present invention. Accordingly, such computer programs represent controllers of the computer system 1101.
In an embodiment where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 1101 using removable storage drive 1112, hard drive 1110 or communications interface 1124. The control logic (software), when executed by the processor 1104, causes the processor 1104 to perform the functions of the invention as described herein.
In another embodiment, the invention is implemented primarily in hardware using, for example, hardware components such as application specific integrated circuits (ASICs). Implementation of the hardware state machine so as to perform the functions described herein will be apparent to persons skilled in the relevant art(s).
In yet another embodiment, the invention is implemented using a combination of both hardware and software.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
| Number | Name | Date | Kind |
| --- | --- | --- | --- |
| 5,839,101 | Vahatalo | Nov. 1998 | A |
| 9,484,043 | Su | Nov. 2016 | B1 |
| 2006/0184362 | Preuss | Aug. 2006 | A1 |
| 2009/0063143 | Schmidt | Mar. 2009 | A1 |
| 2010/0088092 | Bruhn | Apr. 2010 | A1 |
| 2012/0197636 | Benesty | Aug. 2012 | A1 |
| 2013/0035933 | Hirohata | Feb. 2013 | A1 |
| 2013/0185078 | Tzirkel-Hancock | Jul. 2013 | A1 |
| 2013/0197904 | Hershey | Aug. 2013 | A1 |
| 2013/0238327 | Nonaka | Sep. 2013 | A1 |
| 2015/0262590 | Joder | Sep. 2015 | A1 |
| 2015/0302865 | Pilli | Oct. 2015 | A1 |
| Number | Date | Country |
| --- | --- | --- |
| 61/951,224 | Mar. 2014 | US |
| 61/951,239 | Mar. 2014 | US |