The present application is related to co-pending U.S. patent application Ser. No. 13/975,344 entitled “METHOD FOR ADAPTIVE AUDIO SIGNAL SHAPING FOR IMPROVED PLAYBACK IN A NOISY ENVIRONMENT” filed on Aug. 25, 2013 by HUAN-YU SU, et al., co-pending U.S. patent application Ser. No. 14/193,606 entitled “IMPROVED ERROR CONCEALMENT FOR SPEECH CODER” filed on Feb. 28, 2014 by HUAN-YU SU, and co-pending U.S. patent application Ser. No. 14/534,472 entitled “ADAPTIVE SIDETONE TO ENHANCE TELEPHONIC COMMUNICATIONS” filed concurrently herewith by HUAN-YU SU. The above-referenced pending patent applications are incorporated herein by reference for all purposes, as if set forth in full.
The improved quality of voice communications over mobile telephone networks has contributed significantly to the growth of the wireless industry over the past two decades. Due to the mobile nature of the service, a user's quality of experience (QoE) can vary dramatically depending on many factors. Two such key factors are the wireless link quality and the background or ambient noise level. It should be appreciated that these factors are generally not within the user's control. In order to improve the user's QoE, the wireless industry continues to search for quality improvement solutions that address these key QoE factors.
Ambient noise is always present in our daily lives and, depending on its actual level, can severely impact voice communications over wireless networks. A high noise level reduces the signal-to-noise ratio (SNR) of a talker's speech. Studies from members of speech standards organizations, such as 3GPP and ITU-T, show that lower-SNR speech results in lower speech coding performance ratings, i.e., a lower MOS (mean opinion score). This has been found to be true for all LPC (linear predictive coding) based speech coding standards used in the wireless industry today.
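By way of a non-limiting illustration, the SNR referred to here is the usual ratio of speech power to noise power expressed in decibels. A minimal sketch, assuming separately available speech and noise segments (the tone and noise below are purely synthetic examples):

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in dB, given speech and noise segments."""
    signal_power = np.mean(speech.astype(np.float64) ** 2)
    noise_power = np.mean(noise.astype(np.float64) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)

# Synthetic example: a 1 kHz tone sampled at 8 kHz vs. white noise.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
speech = np.sin(2 * np.pi * 1000.0 * t)        # power = 0.5
noise = 0.3 * rng.standard_normal(t.size)      # power ~= 0.09
print(f"SNR: {snr_db(speech, noise):.1f} dB")  # ~7.4 dB
```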
Another problem with high-level ambient noise is that it prevents the proper operation of certain bandwidth-saving techniques, such as voice activity detection (VAD) and discontinuous transmission (DTX). These techniques operate by detecting periods of “silence” or background noise. The failure of such techniques due to high background noise levels results in unnecessary bandwidth consumption and waste.
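VAD algorithms in the deployed standards are considerably more elaborate, but the underlying principle can be sketched as a simple energy test against an estimated noise floor; the function name and the 6 dB margin are illustrative assumptions, not taken from any standard:

```python
import numpy as np

def simple_vad(frame: np.ndarray, noise_floor: float, margin_db: float = 6.0) -> bool:
    """Flag a frame as active speech when its energy exceeds the estimated
    noise floor by a margin; frames flagged inactive are candidates for
    DTX (i.e., not transmitted, saving bandwidth)."""
    frame_energy = np.mean(frame.astype(np.float64) ** 2)
    threshold = noise_floor * 10.0 ** (margin_db / 10.0)
    return frame_energy > threshold
```

As the text notes, once the background level rises toward the speech level, such a margin test misclassifies frames and the DTX bandwidth savings are lost.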
Since the standardization of EVRC (enhanced variable rate codec, IS-127) in 1997, the wireless industry has embraced speech enhancement techniques that operate to cancel or reduce background noise. Of course, such speech enhancement techniques require processing time, which is always at odds with the requirement for low latency in voice communications. Due to the interactive nature of live voice conversations, mobile telephone calls require extremely low end-to-end (or mouth-to-ear) delay, or latency. Indeed, ITU-T Recommendation G.114 calls for such latency to be less than 300 ms; beyond that, users start to become dissatisfied with the voice quality. Because 2G/3G systems all have relatively long end-to-end latencies compared to the ITU-T Recommendations, the industry-standard approach is to limit the latency increase allowed for such speech enhancement techniques to a very small figure, such as 5 to 10 ms. As can be appreciated, this may severely limit the effectiveness of such speech enhancement techniques.
Unfortunately, modern speech processing techniques inevitably require a certain amount of signal analysis, which relies on the availability of the input signal for a fixed amount of time. When the latency requirement is very tight, a lack of sufficient observation time often results in incorrect analyses and bad decisions that translate into reduced performance. It is therefore intuitive that when more latency is allowed, better performance is possible. It is noted that low-latency implementations of signal detection techniques can perform adequately under low-noise conditions, but they struggle increasingly as noise levels rise.
In general, newer wireless access technologies such as LTE (Long-Term Evolution) have lower end-to-end latencies than previous generations, such as GSM or W-CDMA. The present invention takes advantage of this factor to further improve speech enhancement techniques while still meeting the overall latency requirements of the ITU-T Recommendations.
The present invention addresses the need for increased quality by providing an adaptive system that, based on the ambient noise level, dynamically adjusts the latency allocation to achieve a higher level of performance in preprocessing across all application scenarios.
More particularly, the present invention provides an adaptive latency system and method that, in low-noise conditions, provides the same or a shorter latency allocation to the speech enhancement module, while in high-noise conditions it provides a larger latency allotment to the speech enhancement module for increased performance.
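The specification does not prescribe particular thresholds or budgets; the following sketch merely illustrates the adaptive allocation with hypothetical values, including simple hysteresis so the budget does not toggle near the threshold:

```python
# Hypothetical latency budgets; real values would be tuned against the
# network's end-to-end latency headroom (e.g., smaller on 2G/3G, larger on LTE).
LOW_NOISE_BUDGET_MS = 5
HIGH_NOISE_BUDGET_MS = 30

def latency_budget_ms(noise_level_db: float, current_budget_ms: int,
                      up_db: float = -40.0, down_db: float = -46.0) -> int:
    """Allocate a larger look-ahead budget to the speech enhancer when the
    ambient noise estimate is high. The two thresholds (illustrative dBFS
    values) provide hysteresis so the allocation does not oscillate."""
    if noise_level_db > up_db:
        return HIGH_NOISE_BUDGET_MS
    if noise_level_db < down_db:
        return LOW_NOISE_BUDGET_MS
    return current_budget_ms  # inside the hysteresis band: keep current budget
```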
The present invention may be described herein in terms of functional block components and various processing steps. It should be appreciated that such functional blocks may be realized by any number of hardware components or software elements configured to perform the specified functions. For example, the present invention may employ various integrated circuit components, e.g., memory elements, digital signal processing elements, logic elements, look-up tables, and the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices. In addition, those skilled in the art will appreciate that the present invention may be practiced in conjunction with any number of data and voice transmission protocols, and that the system described herein is merely one exemplary application for the invention.
It should be appreciated that the particular implementations shown and described herein are illustrative of the invention and its best mode and are not intended to otherwise limit the scope of the present invention in any way. Indeed, for the sake of brevity, conventional techniques for signal processing, data transmission, signaling, packet-based transmission, network control, and other functional aspects of the systems (and components of the individual operating components of the systems) may not be described in detail herein, but are readily known by skilled practitioners in the relevant arts. Furthermore, the connecting lines shown in the various figures contained herein are intended to represent exemplary functional relationships and/or physical couplings between the various elements. It should be noted that many alternative or additional functional relationships or physical connections may be present in a practical communication system. It should be noted that the present invention is described in terms of a typical mobile phone system. However, the present invention can be used with any type of communication device including non-mobile phone systems, laptop computers, tablets, game systems, desktop computers, personal digital infotainment devices and the like. Indeed, the present invention can be used with any system that supports digital voice communications. Therefore, the use of cellular mobile phones as example implementations should not be construed to limit the scope and breadth of the present invention.
On the far-end phone, the reverse processing takes place. The radio signal containing the compressed speech is received by the far-end phone's antenna in the far-end mobile phone receiver 240. Next, the signal is processed by the receiver radio circuitry 241, followed by the channel decoder 242, to obtain the received compressed speech, referred to as speech packets or frames 246. Depending on the speech coding scheme used, one compressed speech packet typically represents 5-30 ms worth of speech signal.
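The relationship between packet duration and sample count is simple arithmetic; the 20 ms/8 kHz figures below match common narrowband framing (e.g., AMR-NB) but are not tied to any one codec:

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: float) -> int:
    """Number of PCM samples represented by one compressed speech packet."""
    return int(sample_rate_hz * frame_ms / 1000)

# 8 kHz narrowband speech, 20 ms frames -> 160 samples per packet
assert samples_per_frame(8000, 20) == 160
```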
Due to the never-ending evolution of wireless access technology, it is worth mentioning that the combination of the channel encoder 216 and transmitter radio circuitry 217, as well as the reverse processing of the receiver radio circuitry 241 and channel decoder 242, can be seen as a wireless modem (modulator-demodulator). Newer standards in use today, including LTE, WiMAX, WiFi and others, comprise wireless modems in configurations different from those described above and in FIG. 2.
Referring back now to FIG. 2.
Now referring to FIG. 3, at the beginning of processing frame N (or, more precisely, at the first speech sample in frame N) 331, the speech encoder collects a frame's worth of the near-end digital speech samples 303. Depending on the speech coding standard used, this sample collection time is equal to the processing frame size in time. When sample collection is complete for frame N, the encoding of frame N starts, as shown at 332. It should be noted that modern speech compression techniques benefit from small, but non-zero, so-called “look-ahead” latencies. Examples of such look-ahead latencies are the 5 ms LPC (linear predictive coding) look-ahead in the 3GPP AMR-NB standard and the 10 ms look-ahead in the EVRC (enhanced variable rate codec) standard.
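A sketch of frame collection with look-ahead follows; the 20 ms frame and 5 ms look-ahead match the AMR-NB figures cited above, while the helper itself is illustrative:

```python
import numpy as np

SAMPLE_RATE = 8000
FRAME_SAMPLES = 160      # 20 ms frame at 8 kHz
LOOKAHEAD_SAMPLES = 40   # 5 ms LPC look-ahead, as in AMR-NB

def frames_with_lookahead(pcm: np.ndarray):
    """Yield (frame, lookahead) pairs: the encoder analyzes the current
    frame plus a short window of future samples, which adds a fixed
    look-ahead latency before the frame can be encoded."""
    step = FRAME_SAMPLES
    for start in range(0, len(pcm) - step - LOOKAHEAD_SAMPLES + 1, step):
        frame = pcm[start:start + step]
        lookahead = pcm[start + step:start + step + LOOKAHEAD_SAMPLES]
        yield frame, lookahead
```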
The encoding process will take some time, as commercial implementations of the speech encoder employ either digital signal processors (DSPs), embedded circuitry, or other types of processors such as general-purpose programmable processors, all with finite processing capability. As such, any signal processing task takes a certain amount of time to execute, and this processing time adds to the latency. At the completion of the speech encoding process, the encoded speech packet 304 is ready for transmission via the wireless modem of the near-end mobile phone.
As previously stated, the encoded speech packet passes through a good number of steps before it is received at the far-end mobile phone. For simplicity, and without changing the scope of the present invention, the time these steps take can be grouped together and thought of as a single time period, referred to herein as the “transmission delay” 335. Once received, the speech decoder uses information contained in the received speech packet 354 to reconstruct the near-end speech 355, which also takes some non-zero processing time before the first speech sample 351 in frame N can be sent to the loudspeaker for output. The total end-to-end latency (or mouth-to-ear delay) is the time elapsed from the moment the first sample in frame N becomes available at the near-end mobile phone to the time when the first corresponding sample is played out at the far-end phone.
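These components simply add; a toy accounting with hypothetical per-stage values (actual figures vary by codec, implementation, and network):

```python
# Hypothetical per-stage delays (ms); real figures vary widely.
frame_collection_ms = 20   # buffering one frame of input speech
lookahead_ms        = 5    # encoder look-ahead
encoding_ms         = 10   # encoder processing time
transmission_ms     = 80   # "transmission delay": channel coding, radio, core network
decoding_ms         = 5    # decoder processing time

mouth_to_ear_ms = (frame_collection_ms + lookahead_ms + encoding_ms
                   + transmission_ms + decoding_ms)
print(mouth_to_ear_ms)  # 120 ms, within the ~300 ms bound of ITU-T G.114
```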
Because the end-to-end latencies in today's 2G or 3G wireless networks are all above 100 ms, it is highly desirable to not significantly increase that figure by the use of speech enhancement techniques. To that end, it is a common practice to limit the amount of processing time allocated for speech enhancement techniques. For example, the SMV and EVRC-B standards limit such techniques to approximately 10 ms or less.
Ambient noise is typically time-varying in nature. That is, ambient noise typically exhibits dramatic variations in level and spectral characteristics. When noise is at a low level, the need for noise reduction (or speech enhancement) is minimal. In addition, signal detection and analysis can be performed faster and much more reliably because the voice signal is cleaner to begin with. However, when the noise is at a high level, the signal-to-noise ratio (SNR) of the talker's speech is reduced, which degrades speech coding performance due to increased errors in parameter estimation.
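A common way to follow such a time-varying noise floor is a slow exponential average updated only during non-speech frames; the class below is an illustrative sketch (the smoothing constant is an assumption, not a standardized value):

```python
import numpy as np

class NoiseTracker:
    """Track the ambient noise floor with an exponential moving average
    that is updated only on frames classified as non-speech."""

    def __init__(self, alpha: float = 0.95):
        self.alpha = alpha          # smoothing factor (illustrative)
        self.noise_power = 1e-8     # small non-zero initial estimate

    def update(self, frame: np.ndarray, is_speech: bool) -> float:
        if not is_speech:
            p = np.mean(frame.astype(np.float64) ** 2)
            self.noise_power = self.alpha * self.noise_power + (1 - self.alpha) * p
        return self.noise_power
```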
Speech enhancement, and indeed speech processing in general, requires the determination of certain critical parameters of the speech on a frame-by-frame basis. Such critical parameters include voiced vs. unvoiced speech, the period of the fundamental frequency (or pitch) of the talker, the beginning and/or end of voiced speech, and the fine structure of the talker's spectrum. These and other critical determinations can be severely impacted by increased noise levels and the consequent reduction in SNR. The present invention improves the accuracy of such critical parameter determinations by providing additional observation of the speech signal, adaptively increasing the latency budget of the speech enhancement module during certain periods based on the detected ambient noise levels.
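For instance, the pitch period is commonly estimated from the autocorrelation peak; a minimal sketch (the lag range is an illustrative choice covering roughly 50-400 Hz at 8 kHz):

```python
import numpy as np

def pitch_period(frame: np.ndarray, min_lag: int = 20, max_lag: int = 160) -> int:
    """Estimate the pitch period (in samples) of a voiced frame by
    locating the autocorrelation peak over a plausible lag range;
    at 8 kHz, lags 20..160 cover roughly 400 Hz down to 50 Hz."""
    x = frame.astype(np.float64) - np.mean(frame)
    corr = np.array([np.dot(x[:-lag], x[lag:]) for lag in range(min_lag, max_lag)])
    return min_lag + int(np.argmax(corr))
```

Added noise flattens and distorts this autocorrelation peak, which is precisely the kind of estimation error the text describes.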
Referring now to FIG. 6, in one embodiment of the present invention the increased-latency speech enhancement techniques 656 are identical to the low-latency speech enhancement techniques 646. However, when a higher latency budget is allocated in accordance with the detected noise levels of the present invention, the increased-latency speech enhancement techniques 656 take advantage of the additional processing time by using more of the available speech samples, which results in much better and/or more reliable parameter determinations. In other embodiments of the present invention, the increased-latency speech enhancement techniques 656 comprise altered, additional, or entirely different signal processing techniques with more advanced and robust signal processing to take advantage of the additional speech samples available in high-noise conditions. For example, in one embodiment of the present invention, the high-latency speech enhancer 656 comprises a modified version of the standard low-latency speech enhancer 646, modified so that it can take advantage of the information contained in additional speech packets.
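The benefit of a longer observation window can be made concrete with the autocorrelation pitch estimator sketched above: in heavy noise, a longer window averages out more of the disturbance. A self-contained illustration on synthetic data (all figures illustrative):

```python
import numpy as np

def pitch_by_autocorr(x: np.ndarray, min_lag: int = 20, max_lag: int = 160) -> int:
    x = x - np.mean(x)
    scores = [np.dot(x[:-lag], x[lag:]) for lag in range(min_lag, max_lag)]
    return min_lag + int(np.argmax(scores))

rng = np.random.default_rng(1)
fs, f0 = 8000, 100                       # 100 Hz pitch -> true lag of 80 samples
t = np.arange(1600) / fs                 # 200 ms of signal
noisy = np.sin(2 * np.pi * f0 * t) + 1.2 * rng.standard_normal(t.size)

pitch_short = pitch_by_autocorr(noisy[:240])   # 30 ms window (low-latency budget)
pitch_long = pitch_by_autocorr(noisy[:800])    # 100 ms window (increased budget)
print(pitch_short, pitch_long)                 # the longer window is more often ~80
```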
In order to minimize unwanted audible impact to voice quality during a latency adjustment performed by the speech enhancers 645 and 655, the preferred method is to perform such latency adjustments during silence or unvoiced portions of speech. For example, silence or background noise periods may be indicated by a VAD/DTX (voice activity detection, discontinuous transmission) mode of the wireless system. This and other means of determining silence or background noise periods are well known to those skilled in the relevant arts. This aspect of the present invention is illustrated in the accompanying figures.
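A sketch of gating the buffer adjustment on VAD state follows: the processing buffer grows (padded with low-level noise matched to the background) or shrinks (dropping background samples) only while no speech is active; the names and the padding scheme are illustrative assumptions:

```python
import numpy as np

def adjust_latency(buffer: np.ndarray, target_len: int, is_speech: bool,
                   noise_std: float) -> np.ndarray:
    """Grow or shrink the processing buffer toward target_len, but only
    during silence, so the splice is inaudible. Growth is padded with
    low-level comfort noise matched to the background level."""
    if is_speech or len(buffer) == target_len:
        return buffer  # defer the adjustment until a silent period
    if len(buffer) < target_len:
        pad = noise_std * np.random.default_rng().standard_normal(
            target_len - len(buffer))
        return np.concatenate([buffer.astype(np.float64), pad])
    return buffer[:target_len]  # drop trailing background-noise samples
```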
It is noted that, while such adjustments to the latency during silence or unvoiced portions of the speech can be straightforward, and such methods are well known to persons skilled in the art, the generation of silence and especially of unvoiced speech samples should be performed in such a way as to minimize the impact on speech quality.
The present invention may be implemented using hardware, software or a combination thereof, and may be implemented in a computer system or other processing system. Computers and other processing systems come in many forms, including wireless handsets, portable music players, infotainment devices, tablets, laptop computers, desktop computers and the like. In fact, in one embodiment, the invention is directed toward a computer system capable of carrying out the functionality described herein. An example computer system 901 is shown in FIG. 9.
Computer system 901 also includes a main memory 906, preferably random access memory (RAM), and can also include a secondary memory 908. The secondary memory 908 can include, for example, a hard disk drive 910 and/or a removable storage drive 912, representing a magnetic disk or tape drive, an optical disk drive, etc. The removable storage drive 912 reads from and/or writes to a removable storage unit 914 in a well-known manner. Removable storage unit 914 represents magnetic or optical media, such as a disk or tape, etc., which is read by and written to by removable storage drive 912. As will be appreciated, the removable storage unit 914 includes a computer-usable storage medium having stored therein computer software and/or data.
In alternative embodiments, secondary memory 908 may include other similar means for allowing computer programs or other instructions to be loaded into computer system 901. Such means can include, for example, a removable storage unit 922 and an interface 920. Examples include a USB flash disk and interface, a program cartridge and cartridge interface (such as that found in video game devices), other types of removable memory chips and associated sockets, such as SD memory and the like, and other removable storage units 922 and interfaces 920 which allow software and data to be transferred from the removable storage unit 922 to computer system 901.
Computer system 901 can also include a communications interface 924. Communications interface 924 allows software and data to be transferred between computer system 901 and external devices. Examples of communications interface 924 can include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via communications interface 924 are in the form of signals, which can be electronic, electromagnetic, optical or other signals capable of being received by communications interface 924. These signals 926 are provided to communications interface 924 via a channel 928. This channel 928 carries signals 926 and can be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link such as WiFi or cellular, and other communications channels.
In this document, the terms “computer program medium” and “computer usable medium” are used to refer generally to media such as removable storage drive 912, a hard disk installed in hard disk drive 910, and signals 926. These computer program products are means for providing software or code to computer system 901.
Computer programs (also called computer control logic or code) are stored in main memory 906 and/or secondary memory 908. Computer programs can also be received via communications interface 924. Such computer programs, when executed, enable the computer system 901 to perform the features of the present invention as discussed herein. In particular, the computer programs, when executed, enable the processor 904 to perform the features of the present invention. Accordingly, such computer programs represent controllers of the computer system 901.
In an embodiment where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 901 using removable storage drive 912, hard drive 910 or communications interface 924. The control logic (software), when executed by the processor 904, causes the processor 904 to perform the functions of the invention as described herein.
In another embodiment, the invention is implemented primarily in hardware using, for example, hardware components such as application specific integrated circuits (ASICs). Implementation of the hardware state machine so as to perform the functions described herein will be apparent to persons skilled in the relevant art(s).
In yet another embodiment, the invention is implemented using a combination of both hardware and software.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
Number | Name | Date | Kind |
---|---|---|---|
8024192 | Zopf | Sep 2011 | B2 |
20030069018 | Matta | Apr 2003 | A1 |
20040010407 | Kovesi | Jan 2004 | A1 |
20060100867 | Lee | May 2006 | A1 |
20060265216 | Chen | Nov 2006 | A1 |
20060271373 | Khalil | Nov 2006 | A1 |
20080133242 | Sung | Jun 2008 | A1 |
20080243495 | Anandakumar | Oct 2008 | A1 |
20090070107 | Kawashima | Mar 2009 | A1 |
20090276212 | Khalil | Nov 2009 | A1 |
20110125505 | Vaillancourt | May 2011 | A1 |
20110196673 | Sharma | Aug 2011 | A1 |
20120123775 | Murgia | May 2012 | A1 |
20130166294 | Cox | Jun 2013 | A1 |
20140235192 | Purnhagen | Aug 2014 | A1 |
Number | Date | Country |
---|---|---|
61905674 | Nov 2013 | US |
 | Number | Date | Country |
---|---|---|---|
Parent | 14193606 | Feb 2014 | US |
Child | 14534531 | | US |