1. Field of the Invention
The present invention relates generally to using music detection to enhance speech communications. More particularly, the present invention relates to using music detection to enhance echo cancellation and speech coding.
2. Background Art
Conventional speech coding systems often employ voice activity detectors (“VADs”) to examine speech signals and differentiate between voice and background noise. However, conventional VADs often cannot differentiate music from background noise. As is known in the art, background noise signals are typically fairly stable as compared to voice signals. The frequency spectrum of voice signals (or unvoiced signals) changes rapidly. In contrast to voice signals, background noise signals exhibit the same or similar frequency for a relatively long period of time, and therefore exhibit heightened stability. Therefore, in conventional approaches, differentiating between voice signals and background noise signals is fairly simple and is based on signal stability. Unfortunately, music signals are also typically relatively stable for a number of frames (e.g. several hundred frames). For this reason, conventional VADs often fail to differentiate between background noise signals and music signals, and exhibit rapidly fluctuating outputs for music signals.
If a conventional VAD determines that its input signal does not represent a voice signal, it will often simply classify its input signal as background noise and the signal will be encoded accordingly. However, the input signal may in fact comprise music and not background noise, and encoding a music signal as background noise will result in a low perceptual quality, or in this case, poor quality music. Further, classifying the signal as background noise would also cause conventional echo cancellers to eliminate a music signal by attenuating the signal below the noise floor and replacing the music signal by comfort noise if the comfort noise option is enabled, or with silence if the comfort noise option is disabled.
Thus, there is need in the art for methods and systems that can efficiently classify signals as music signals, and utilize such classification to improve the perceptual quality of such signals.
The present invention is directed to using music detection to enhance echo cancellation and speech coding. According to one aspect of the present invention, a method of using music detection to enhance an operation of an echo canceller is provided, wherein the echo canceller includes an adaptive filter and a nonlinear processor. The method comprises receiving an input signal including an echo signal by the echo canceller from a near end device, filtering the input signal using the adaptive filter to eliminate linear components of the echo signal in the input signal and generate an error signal, analyzing the error signal using a music detector to determine existence of a music signal in the error signal, bypassing the nonlinear processor if the analyzing determines the music signal exists in the error signal, and eliminating nonlinear components of the echo signal from the error signal using the nonlinear processor if the analyzing determines the music signal does not exist in the error signal.
In a further aspect, the method further uses the music detection to enhance an operation of a speech encoder including a noise suppressor, wherein the method further comprises bypassing the noise suppressor if the analyzing determines the music signal exists in the error signal, and attenuating the error signal using the noise suppressor if the analyzing determines the music signal does not exist in the error signal.
In another aspect, the method further uses the music detection to enhance an operation of a speech encoder including a noise suppressor, wherein the method further comprises gradually reducing an attenuation gain of the noise suppressor to zero if the analyzing determines the music signal exists in the error signal, and attenuating the error signal using the noise suppressor if the analyzing determines the music signal does not exist in the error signal.
In yet another aspect, the method further uses the music detection to enhance an operation of a speech encoder including a pitch interpolation, wherein the method further comprises disabling the pitch interpolation if the analyzing determines the music signal exists in the error signal, transmitting information to a decoder to disable a pitch interpolation of the decoder if the analyzing determines the music signal exists in the error signal, and enabling the pitch interpolation if the analyzing determines the music signal does not exist in the error signal.
In an additional aspect, the method further uses the music detection to enhance an operation of a speech encoder including a pitch pre-processing, wherein the method further comprises disabling the pitch pre-processing if the analyzing determines the music signal exists in the error signal, and enabling the pitch pre-processing if the analyzing determines the music signal does not exist in the error signal.
In other aspects of the present invention, enhanced echo cancellers and speech encoders, and related computer readable medium including a computer software product executable by a processor to use music detection for enhancing operations of the echo cancellers and speech encoders are provided according to the aforementioned methods.
Other features and advantages of the present invention will become more readily apparent to those of ordinary skill in the art after reviewing the following detailed description and accompanying drawings.
The features and advantages of the present invention will become more readily apparent to those ordinarily skilled in the art after reviewing the following detailed description and accompanying drawings, wherein:
The present invention is directed to a low-complexity music detection algorithm and system. Although the invention is described with respect to specific embodiments, the principles of the invention, as defined by the claims appended herein, can obviously be applied beyond the specifically described embodiments of the invention described herein. Moreover, in the description of the present invention, certain details have been left out in order to not obscure the inventive aspects of the invention. The details left out are within the knowledge of a person of ordinary skill in the art.
The drawings in the present application and their accompanying detailed description are directed to merely example embodiments of the invention. To maintain brevity, other embodiments of the invention which use the principles of the present invention are not specifically described in the present application and are not specifically illustrated by the present drawings. It should be borne in mind that, unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals.
Subscribers use speech quality as the benchmark for assessing the overall quality of a telephone network. A key technology for providing high quality speech is echo cancellation. Echo canceller performance in a telephone network, whether a TDM or a packet telephony network, has a substantial impact on the overall voice quality. Effective removal of the hybrid and acoustic echo inherent in telephone networks is key to maintaining and improving perceived voice quality during a call.
Echoes occur in telephone networks due to impedance mismatches of network elements, acoustical coupling within telephone handsets, or room acoustic reflections when a speaker phone is used. Hybrid echo is the primary source of echo generated from the public-switched telephone network (PSTN). As shown in
As shown in
As further shown in
Double talk detector 210 controls the behavior of adaptive filter 220 during periods when Sin signal 202 from the near end reaches a certain level. Because echo canceller 200 is utilized to cancel an echo of Rin signal 234 from the far end, the presence of a speech signal from the near end would cause adaptive filter 220 to converge on a combination of the near end speech signal and Rin signal 234, which would lead to an inaccurate echo path model, i.e. incorrect adaptive filter 220 coefficients. Therefore, in order to cancel the echo signal, adaptive filter 220 should not train in the presence of the near end speech signal. To this end, echo canceller 200 must analyze the incoming signal and determine whether it is solely an echo signal of Rin signal 234 or also contains the speech of a near end talker. By convention, if two people are talking over a communication network or system, one person is referred to as the “near talker,” while the other person is referred to as the “far talker.” The combination of speech signals from the near end talker and the far end talker is referred to as “double talk.”
To determine whether Sin signal 202 contains double talk, double talk detector 210 estimates and compares the characteristics of Rin signal 234 and Sin signal 202. A primary purpose of double talk detector 210 is to prevent adaptive filter 220 from adapting when double talk is detected, or to adjust the degree of adaptation based on the confidence level of the double talk detection, as described in U.S. Pat. No. 6,804,203, entitled “Double Talk Detector for Echo Cancellation in a Speech Communication System”, which is hereby incorporated by reference in its entirety.
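By way of illustration only, the comparison of Rin and Sin characteristics can be sketched with the classic Geigel test, which is a textbook technique substituted here for clarity and is not the detector of the incorporated patent. The function name, the `threshold` value of 0.5 (roughly 6 dB of assumed hybrid loss), and the use of peak magnitude as the compared characteristic are all assumptions.

```python
def geigel_double_talk(rin_history, sin_sample, threshold=0.5):
    """Return True when the near-end sample is too loud to be echo alone.

    rin_history: recent far-end (Rin) samples spanning the echo path delay.
    sin_sample:  current near-end (Sin) sample.
    threshold:   assumed worst-case echo path gain (hypothetical value).
    """
    # Largest far-end magnitude that could currently be echoing back.
    peak_rin = max((abs(x) for x in rin_history), default=0.0)
    # An echo should not exceed threshold * peak_rin; anything louder
    # is attributed to near-end speech, i.e. double talk.
    return abs(sin_sample) > threshold * peak_rin


if __name__ == "__main__":
    far_end = [0.1, -0.8, 0.3]
    print(geigel_double_talk(far_end, 0.2))  # quiet near end: echo only
    print(geigel_double_talk(far_end, 0.6))  # loud near end: double talk
```

A hard decision is returned here for brevity; a detector that reports a confidence level, as the text contemplates, could instead return how far the near-end level exceeds the expected echo level.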
Echo canceller 200 utilizes adaptive filter 220 to model the echo path and its delay. In one embodiment, adaptive filter 220 uses a transversal filter with adjustable taps, where each tap receives a coefficient that specifies the magnitude of the corresponding output signal sample and each tap is spaced a sample time apart. The better the echo canceller can estimate what the echo signal will look like, the better it can eliminate the echo. To improve the performance of echo canceller 200, it may be desirable to vary the adaptation rate at which the transversal filter tap coefficients of adaptive filter 220 are adjusted. For instance, if double talk detector 210 denotes a high confidence level that the incoming signal is an echo signal, it is preferable for adaptive filter 220 to adapt quickly. On the other hand, if double talk detector 210 denotes a low confidence level that the incoming signal is an echo signal, i.e. it may include double talk, it is preferable to decline to adapt at all or to adapt very slowly. If there is an error in determining whether Sin signal 202 is an echo signal, a fast adaptation of adaptive filter 220 causes rapid divergence and a failure to eliminate the echo signal.
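As a minimal sketch of a transversal filter whose adaptation rate follows the double talk confidence, the following uses a normalized LMS (NLMS) update. NLMS is an assumption chosen for illustration; the text does not name an update rule. The names `nlms_step`, `confidence`, `mu_max`, and `eps` are hypothetical.

```python
def nlms_step(taps, rin_window, sin_sample, confidence, mu_max=0.5, eps=1e-8):
    """Perform one adaptation step of a transversal (FIR) echo model.

    taps:       current tap coefficients, one per sample of delay.
    rin_window: most recent far-end samples, newest first, len(taps) long.
    sin_sample: current near-end sample (echo plus any near-end speech).
    confidence: 0.0..1.0, confidence that sin_sample contains echo only.
    Returns (new_taps, error), where error is the echo-cancelled sample.
    """
    # Echo model output: weighted sum of the delayed far-end samples.
    echo_estimate = sum(w * x for w, x in zip(taps, rin_window))
    error = sin_sample - echo_estimate
    # Normalize by input energy so the step size is level independent.
    energy = sum(x * x for x in rin_window) + eps
    # High confidence gives fast adaptation; zero confidence freezes the taps.
    mu = mu_max * confidence
    new_taps = [w + mu * error * x / energy for w, x in zip(taps, rin_window)]
    return new_taps, error


if __name__ == "__main__":
    taps = [0.0, 0.0]
    taps, err = nlms_step(taps, [1.0, 0.0], 0.5, confidence=1.0, mu_max=1.0)
    print(taps, err)
```

Scaling the step size by the confidence value means a wrong "echo only" decision at low confidence causes only slow divergence, which is the behavior the paragraph above argues for.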
As shown in
It is known that the echo path includes nonlinear components that cannot be removed by adaptive filter 220 and, thus, after subtraction of echo model signal 222 from echo signal 217, there remains a residual echo, which must be eliminated by nonlinear processor (NLP) 230. As shown, NLP 230 receives the residual echo signal, or error signal 219, from error estimator 218 and generates Sout 220 for transmission to the far end. If error signal 219 is below a certain level, NLP 230 replaces the residual echo with comfort noise if the comfort noise option is enabled, or with silence if the comfort noise option is disabled.
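The NLP behavior just described, combined with the music bypass stated in the summary above, can be sketched as follows. This is an illustrative sketch only: the function name, the peak-level measure, `level_threshold`, `noise_amplitude`, and the uniform comfort noise are assumptions and are not taken from the specification.

```python
import random


def nonlinear_processor(error_frame, music_detected, level_threshold=0.01,
                        comfort_noise=True, noise_amplitude=0.001, rng=None):
    """Return the output frame for one frame of the error signal."""
    # Music bypass: pass the frame through untouched so that music is
    # not mistaken for residual echo and removed.
    if music_detected:
        return list(error_frame)
    # Assumed level measure: peak magnitude of the frame.
    peak = max((abs(s) for s in error_frame), default=0.0)
    if peak >= level_threshold:
        return list(error_frame)
    # Below the level: replace the residual echo.
    if comfort_noise:
        rng = rng or random.Random(0)
        return [rng.uniform(-noise_amplitude, noise_amplitude)
                for _ in error_frame]
    return [0.0] * len(error_frame)


if __name__ == "__main__":
    quiet = [0.001, -0.002, 0.0015]
    print(nonlinear_processor(quiet, music_detected=True))
    print(nonlinear_processor(quiet, music_detected=False, comfort_noise=False))
```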
With continued reference to
Noise suppressor 325 attenuates speech signal 305 in order to eliminate background noise and to provide the listener with a clear sensation of the environment. In one embodiment, noise suppressor 325 includes a channel gain calculation module (not shown), which receives music detect signal 312. Music detector signal 312 indicates to noise suppressor 325 whether music detector 310 has detected a music signal in speech signal 305. Music detector signal 312 is fed into the channel gain calculation module of noise suppressor 325 to compute the gain, so as to improve the speech quality. In some embodiments, noise suppressor 325 may be bypassed if music detector 310 detects a music signal in speech signal 305. In other embodiments, the channel gain calculation module may gradually bring the gain to 0 dB, i.e. no attenuation, to provide a smooth transition and avoid discontinuities in speech signal 305. However, if a music signal is not detected, noise suppressor 325 operates on speech signal 305.
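One possible way to bring the gain gradually to 0 dB is a fixed per-frame step, sketched below. The step size `step_db`, the function names, and the symmetric ramp back toward the suppression target once music is no longer detected are assumptions made for illustration.

```python
def next_gain_db(current_gain_db, target_gain_db, music_detected, step_db=1.0):
    """Return the suppressor gain for the next frame, in dB.

    Negative values attenuate; 0 dB means no attenuation.
    target_gain_db is the gain the suppressor would apply without music.
    """
    # With music present the goal is 0 dB, otherwise the normal target.
    goal = 0.0 if music_detected else target_gain_db
    # Move at most step_db per frame so the level never jumps.
    if current_gain_db < goal:
        return min(current_gain_db + step_db, goal)
    return max(current_gain_db - step_db, goal)


def apply_gain(frame, gain_db):
    """Scale a frame of samples by a gain expressed in dB."""
    scale = 10.0 ** (gain_db / 20.0)
    return [s * scale for s in frame]


if __name__ == "__main__":
    gain = -3.0
    for _ in range(4):
        gain = next_gain_db(gain, -12.0, music_detected=True)
        print(gain)
```

A single wideband gain is used here for brevity, whereas a channel gain calculation module would typically hold one such gain per frequency channel.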
Next, as pre-processed speech signal emerges from noise suppressor 325, speech signal coding module 330 starts the encoding process of the pre-processed speech signal at certain frame intervals, such as 20 ms frame intervals. At this stage, for each speech frame, several parameters are extracted from the pre-processed speech signal, such as spectrum and pitch estimate parameters, which may be used in the coding scheme, and other parameters, such as maximal sample in a frame, zero crossing rates, LPC gain or signal sharpness parameters, which may be used for classification and rate determination purposes.
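For illustration, two of the classification parameters named above, the maximal sample in a frame and the zero crossing rate, can be computed per 20 ms frame as sketched below. The 8 kHz sampling rate and all function names are assumptions, and each frame is assumed to hold at least two samples.

```python
def split_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Split a signal into consecutive full frames (160 samples at 8 kHz)."""
    frame_len = sample_rate_hz * frame_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def max_sample(frame):
    """Largest absolute sample value in the frame."""
    return max(abs(s) for s in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


if __name__ == "__main__":
    signal = [1.0, -1.0] * 200  # 400 samples, alternating sign
    for frame in split_frames(signal):
        print(len(frame), max_sample(frame), zero_crossing_rate(frame))
```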
As shown in
Referring to
Since in one embodiment, speech coding parameter P1, such as the pitch correlation (Rp), has already been calculated by the speech coder, such as the G.729 coder, the present scheme substantially reduces complexity and time by receiving speech coding parameter P1 from the speech coder and using the same to differentiate between background noise and music in a VAD module, such as VAD circuitry 140 or a VAD software module, for example.
In one embodiment, for a given speech frame under examination, if P1 is less than T1 (or in closer range of T1 than to T0) then P1 is indicative of background noise. If P1 is greater than T2 (or in closer range of T2 than T0) then P1 is indicative of music. However, if P1 falls in the range between T1 and T2 then additional computation is required to determine whether P1 is indicative of background noise or music. The flowchart of
In one embodiment, according to
At step 512, if P1 is less than T0, the no music frame counter (cnt_nomus) is incremented at step 513. If P1 is not less than T0 at step 512, the process proceeds to step 514, where, if P1 is greater than T0, the music frame counter (cnt_mus) is incremented.
At step 516, a check is made to determine if the predetermined number of speech frames have been processed. If there is another speech frame to be examined, the process loops back to step 512. However, if the predetermined number of speech frames have been processed the process proceeds to step 518.
At step 518, the value of the music frame counter is compared to the value of the no music frame counter. If the music frame counter is greater than the no music frame counter (or in one embodiment, it is greater than the no music frame counter by a threshold value W), then the process proceeds to step 520, where the frame is classified as music and the VAD is set to one to indicate the same. Otherwise, the process proceeds to step 522, where the frame is classified as background noise and the VAD is set to zero to indicate the same.
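Steps 512 through 522 can be sketched as the short routine below. The function name and argument names are hypothetical, and reading "greater by a threshold value W" as `cnt_mus > cnt_nomus + w` is an assumption; the counter names follow the text.

```python
def classify_window(p1_values, t0, w=0):
    """Classify a window of frames as music (1) or background noise (0).

    p1_values: speech coding parameter P1 (e.g. pitch correlation) per frame.
    t0:        decision threshold T0.
    w:         optional margin W by which music frames must outnumber
               no-music frames.
    """
    cnt_mus = 0
    cnt_nomus = 0
    for p1 in p1_values:       # step 516: repeat for each frame in the window
        if p1 < t0:            # step 512
            cnt_nomus += 1     # step 513
        elif p1 > t0:
            cnt_mus += 1       # step 514
    if cnt_mus > cnt_nomus + w:   # step 518
        return 1               # step 520: music, VAD set to one
    return 0                   # step 522: background noise, VAD set to zero


if __name__ == "__main__":
    print(classify_window([0.9, 0.8, 0.2], t0=0.5))
    print(classify_window([0.1, 0.2, 0.9], t0=0.5))
```

Because P1 is assumed to come from the speech coder, the only per-frame cost in this sketch is one comparison and one increment.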
In one embodiment, the VAD may have more than two output values. For example, in one embodiment, VAD may be set to “zero” to indicate background noise, “one” to indicate voice, and “two” to indicate music. Further, after the speech signal is classified as music and the speech frames are being coded accordingly, if a non-music speech frame is detected for a given period of time (or an extension period), such as a time period for processing 30 frames, the detection system continues to indicate that a music signal is being detected until it is confirmed that the music signal has ended in order to avoid glitches in coding. In another embodiment, two speech coding parameters, such as pitch correlation (Rp) and linear prediction coding (LPC) gain, can be utilized to differentiate music from background noise.
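The extension period can be sketched as a small hangover counter, shown below. The class name, the method name, and the choice to end the music state once exactly `extension_frames` consecutive non-music frames have been seen are assumptions; the default of 30 frames follows the example in the text.

```python
class MusicHangover:
    """Hold the music indication through short non-music gaps."""

    def __init__(self, extension_frames=30):
        self.extension_frames = extension_frames
        self.music = False
        self.non_music_run = 0  # consecutive non-music frames seen so far

    def update(self, frame_is_music):
        """Feed one per-frame decision and return the smoothed indication."""
        if frame_is_music:
            self.music = True
            self.non_music_run = 0
        elif self.music:
            self.non_music_run += 1
            # Only a full extension period of non-music frames confirms
            # that the music signal has ended.
            if self.non_music_run >= self.extension_frames:
                self.music = False
                self.non_music_run = 0
        return self.music


if __name__ == "__main__":
    hangover = MusicHangover(extension_frames=3)
    for decision in [True, False, False, True, False, False, False]:
        print(hangover.update(decision))
```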
Next, at step 606, noise suppressor 325 gradually brings the gain to 0 dB, i.e. no attenuation, to provide a smooth transition and avoid discontinuities in speech signal 305. In some embodiments, however, noise suppressor 325 may be bypassed at step 606 if music detector 310 detects a music signal in speech signal 305. At step 608, for a multi-rate coding algorithm, when music detector 310 detects a music signal in speech signal 305, rate selection 345 selects a high bit rate, such as the maximum available bit rate, in order to provide a high perceptual quality.
With continued reference to
From the above description of the invention it is manifest that various techniques can be used for implementing the concepts of the present invention without departing from its scope. Moreover, while the invention has been described with specific reference to certain embodiments, a person of ordinary skill in the art would recognize that changes can be made in form and detail without departing from the spirit and the scope of the invention. For example, it is contemplated that the circuitry disclosed herein can be implemented in software, or vice versa. The described embodiments are to be considered in all respects as illustrative and not restrictive. It should also be understood that the invention is not limited to the particular embodiments described herein, but is capable of many rearrangements, modifications, and substitutions without departing from the scope of the invention.
The present application is a Continuation-In-Part of U.S. patent application Ser. No. 10/981,022, filed Nov. 4, 2004 now U.S. Pat. No. 7,120,576, which claims priority to U.S. Provisional Application Ser. No. 60/588,445, filed Jul. 16, 2004, which are hereby incorporated by reference in their entirety.
Number | Date | Country |
---|---|---|
60588445 | Jul 2004 | US |

 | Number | Date | Country |
---|---|---|---|
Parent | 10981022 | Nov 2004 | US |
Child | 11084392 | | US |