This invention relates to a telephone employing circuitry for echo cancellation and noise reduction and, in particular, to such circuitry that includes a music detector.
As used herein, “telephone” is a generic term for a communication device that utilizes, directly or indirectly, a dial tone from a licensed service provider. As such, “telephone” includes desk telephones (see
While not universally followed, the prior art generally associates noise “suppression” with subtracting a signal from the signal of interest and associates noise “reduction” with attenuation or reduced gain. Noise reduction circuitry is generally part of a non-linear processor.
There are many sources of noise in a telephone system. Some noise is acoustic in origin while other noise is electronic, from the telephone network, for example. As used herein, “noise” refers to any unwanted sound, whether the unwanted sound is periodic, purely random, or somewhere in-between. As such, noise includes background music, voices of people other than the desired speaker, tire noise, wind noise, and so on. As thus broadly defined, noise could include an echo of the speaker's voice. However, echo cancellation is treated separately in a telephone.
There are two kinds of echoes in telephones, an acoustic echo from the path between an earphone or a speaker and a microphone and a line echo generated in the switched network for routing a call between stations. Echo cancellation involves subtracting a simulated echo from an input signal. The simulated echo is created by filtering an output signal with an adaptive filter. The adaptive filter is programmed to represent either the near-end path (speaker to microphone) or the far end path (line out to line in) to create the simulated echo.
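By way of illustration only, the subtraction of a simulated echo can be sketched as follows. The description does not specify how the adaptive filter is updated; the sketch assumes a normalized least-mean-squares (NLMS) update, and the function name `nlms_echo_cancel` and the parameters `taps`, `mu`, and `eps` are hypothetical.

```python
def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract a simulated echo of far_end from mic, sample by sample.

    Assumption: an NLMS update is used; the text only says the filter is adaptive.
    """
    w = [0.0] * taps        # adaptive filter coefficients (echo path estimate)
    history = [0.0] * taps  # most recent far-end samples, newest first
    out = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        # Simulated echo: the output signal filtered by the adaptive filter.
        echo_hat = sum(wi * hi for wi, hi in zip(w, history))
        e = d - echo_hat    # input signal with the simulated echo subtracted
        norm = sum(h * h for h in history) + eps
        w = [wi + mu * e * hi / norm for wi, hi in zip(w, history)]
        out.append(e)
    return out
```

When the input contains only an echo of the output signal, the residual returned by such a filter would be expected to decay toward zero as the coefficients converge on the echo path.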
Noise is subjective, somewhat like a weed. It depends upon what one wants or does not want. In this description, noise is unwanted sound from the perspective of a person trying to converse on a telephone. For example, in a vehicle, noise includes road noise, music from a radio, background conversation, and the sound from the speaker element in a hands-free kit. The desired signal is usually only the voice of the person speaking.
If there is a significant amount of background noise, it is usually desirable to reduce the background noise to improve intelligibility. On the other hand, a person may be at a musical concert and it may be desirable to allow the music to pass through the telephone network unaffected. To satisfy these contradictory conditions, one needs a special algorithm to distinguish between noise and music.
It is known in the art to distinguish music from speech; see, for example, Carey, Michael J. et al., Comparison of Features for Speech, Music Discrimination, IEEE publication 0-7803-5041-3/99 © 1999. It is also known to distinguish music, speech, and noise; see, for example, G. Lu & T. Hankinson, “A Technique towards Automatic Audio Classification and Retrieval,” 1998 Fourth Signal International Conference on Signal Processing Proceedings (ISCP-98), Beijing, China 1998. Spectral flatness measure (SFM) is known in the art; see, for example, U.S. Pat. No. 5,684,921 (Bayya et al.) and U.S. Pat. No. 6,477,489 (Lockwood et al.). As used herein, SFM is defined differently from these two patents, which define SFM differently from each other. The differences are in form, not substance.
One of the main challenges in distinguishing music from noise is that the envelopes of both types of signal are relatively constant. Most known voice activity detectors measure the energy content of the envelope, which means that a voice activity detector will detect music as noise and will cause the noise reduction circuitry to reduce the background music, distorting the signal. It will also cause the non-linear processor to suppress the residual echo, which will then insert the comfort noise after suppressing the residual echo. This insertion of comfort noise can annoy a listener because the music will become intermittent. A similar effect can occur in echo canceling systems.
Music is generally characterized by a finite amount of energy at all times, some music having a relatively constant envelope and some not. Most of the acoustic energy in music is below 8 kHz, although rock and hard rock are almost like white noise. The spectral content of music changes frequently, depending upon the rhythm of the music. Based on these characteristics, certain features are selected and several different algorithms are being investigated in the art for classifying sound. Examples are in the literature identified above.
Possible methods for classifying audio signals include envelope detection, linear prediction analysis, zero crossing detection, Bark band spectral analysis, auto-correlation, silence ratio, tracking spectral peaks, and differential spectrum (changes in spectral content from instant to instant). Silence ratio is really an amplitude comparison. A signal is divided into time segments. A signal having an amplitude less than a threshold is silence. The ratio is the number of silent segments divided by the total number of segments. Speech signals have a higher silence ratio than music. Noise and non-speech are problems, as is picking the correct time interval.
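The silence ratio described above can be sketched as follows. The sketch assumes that a segment is silent when its peak absolute amplitude is below the threshold; the function name `silence_ratio` and its parameters are hypothetical, and the segment length and threshold are left to the caller because, as noted, picking them is itself a problem.

```python
def silence_ratio(samples, segment_len, threshold):
    """Number of silent segments divided by the total number of segments.

    Assumption: a segment counts as silent when its peak absolute
    amplitude is below the threshold.
    """
    segments = [samples[i:i + segment_len]
                for i in range(0, len(samples) - segment_len + 1, segment_len)]
    if not segments:
        raise ValueError("signal is shorter than one segment")
    silent = sum(1 for seg in segments
                 if max(abs(s) for s in seg) < threshold)
    return silent / len(segments)
```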
Many of these methods are not robust enough to distinguish different genres of music unambiguously from noise. Some of the methods are not meant to be done in real time because of large computational requirements; e.g. requiring a wide data bus, large amounts of storage, or long execution time for analysis. Hence, it is desirable to provide a method that can unambiguously distinguish mainstream music genres from noise with small computational requirements.
In view of the foregoing, it is therefore an object of the invention to provide a method for unambiguously distinguishing mainstream music genres from noise.
Another object of the invention is to provide a method for unambiguously distinguishing mainstream music genres from noise while requiring little computational power.
A further object of the invention is to provide a method for unambiguously distinguishing mainstream music genres from noise in real time.
The foregoing objects are achieved in this invention in which spectral flatness is used to detect music and to distinguish music from noise. An audio signal is divided among exponentially related subband filters. The spectral flatness measure in each subband signal is determined and the measures are weighted and combined. The sum is compared with a threshold to determine the presence of music or noise. If music is detected, the noise estimation process in the noise reduction circuitry is turned off to avoid distorting the signal. If music is detected, residual echo suppression circuitry is also turned off to avoid inserting comfort noise.
A more complete understanding of the invention can be obtained by considering the following detailed description in conjunction with the accompanying drawings, in which:
Those of skill in the art recognize that, once an analog signal is converted to digital form, all subsequent operations can take place in one or more suitably programmed microprocessors. Reference to “signal,” for example, does not necessarily mean a hardware implementation or an analog signal. Data in memory, even a single bit, can be a signal. In other words, a block diagram can be interpreted as hardware, software, e.g. a flow chart or an algorithm, or a mixture of hardware and software. Programming a microprocessor is well within the ability of those of ordinary skill in the art, either individually or in groups.
This invention finds use in many applications where the electronics is essentially the same but the external appearance of the device may vary.
The various forms of telephone can all benefit from the invention.
A cellular telephone includes both audio frequency and radio frequency circuits. Duplexer 55 couples antenna 56 to receive processor 57. Duplexer 55 couples antenna 56 to power amplifier 58 and isolates receive processor 57 from the power amplifier during transmission. Transmit processor 59 modulates a radio frequency signal with an audio signal from circuit 54. In non-cellular applications, such as speakerphones, there are no radio frequency circuits and signal processor 54 may be simplified somewhat. Problems of echo cancellation and noise remain and are handled in audio processor 60. It is audio processor 60 that is modified to include the invention. How that modification takes place is more easily understood by considering the echo canceling and noise reduction portions of an audio processor in more detail.
A new voice signal entering microphone input 62 may or may not be accompanied by ambient noise or sounds from speaker output 68. The signals from input 62 are digitized in A/D converter 71 and coupled to summation network 72. There is, as yet, no signal from echo canceling circuit 73 and the data proceeds to non-linear processing circuit 74, which includes a music detector and other circuitry, such as a noise reduction circuit, a residual echo canceling circuit, and a center clipper.
The output from non-linear processing circuit 74 is coupled to summation circuit 76, where comfort noise 75 is optionally added to the signal. The signal is then converted back to analog form by D/A converter 77, amplified in amplifier 78, and coupled to line output 64. Circuit 73 reduces acoustic echo and circuit 81 reduces line echo as directed by control 80. The operation of these last two circuits is known per se in the art; e.g. as described in the above-identified text.
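The geometric mean of a set of non-negative values cannot exceed their arithmetic mean, so the ratio of the geometric mean to the arithmetic mean is at most one, with equality only when all of the values are equal. Equality, or perfect smoothness, is unattainable so, in practice, the ratio is always less than one.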
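The ratio of geometric mean to arithmetic mean can be sketched as follows. This is a minimal illustration, not the fixed-point implementation of the invention: the geometric mean is computed in the logarithm domain with ordinary floating-point arithmetic, and the function name `spectral_flatness` and the `floor` parameter are hypothetical.

```python
import math

def spectral_flatness(magnitudes, floor=1e-12):
    """Ratio of geometric mean to arithmetic mean of non-negative values.

    Assumption: values are floored at a small positive number so that
    the logarithm is always defined.
    """
    values = [max(m, floor) for m in magnitudes]
    log_geo_mean = sum(math.log(v) for v in values) / len(values)
    arith_mean = sum(values) / len(values)
    return math.exp(log_geo_mean) / arith_mean
```

A flat set of values gives a ratio near one, whereas a set dominated by a single large value gives a ratio much smaller than one.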
Because a geometric mean involves repeated multiplication, the precision of the root will be much less than the precision of the factors of the product if sixteen-bit precision is used. On the other hand, increasing the number of bits of precision can significantly slow the calculation. This dilemma is solved according to another aspect of the invention by computing the geometric mean, arithmetic mean, and their ratio using floating-point notation (mantissa and exponent) in a 16-bit, fixed-point processor, referred to herein as a pseudo floating-point operation. The exponent is stored in a 16-bit memory location. The performance of the pseudo floating-point operation is equal to or better than conventional floating-point performance using processors of the same precision, e.g. 16 bits. Using the pseudo floating-point operation, the system is able to detect the presence of music correctly even if the signal level is very small (less than −45 dBFS). The steps in
In general, in a musical piece, a singer is accompanied by musical instruments playing at different frequency ranges. Under these circumstances, a spectral flatness measure of the entire spectrum may not give a distinct, discriminating feature to distinguish the music from noise. In order to circumvent this problem, according to another aspect of the invention, the input signal is filtered to divide the signal into subbands. The subbands are preferably octaval and are individually weighted to give more emphasis to lower frequencies.
The following table shows the octave spacing used in one embodiment of the invention. The first subband is a whole octave. The remaining subbands are split octave. The subband spacing was determined empirically by performing Monte Carlo simulation on a large database consisting of two hundred fifty-two music files and one hundred eighty-nine noise files. In the Table, L refers to the bin number corresponding to the lower frequency boundary, H refers to the bin number corresponding to the higher frequency boundary and M is the number of spectral bins in each subband.
The spectral flatness measure (SFM) in each subband is calculated using the following formula.
Here, SFM(i) is the spectral flatness measure for the ith subband at time (n), L(i) and H(i) are the lower and higher spectral bin numbers for the ith subband, and M(i) is the number of bins in the ith subband.
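A per-subband calculation of this kind can be sketched as follows. The sketch assumes that the bin boundaries L(i) and H(i) are inclusive, so that M(i) = H(i) − L(i) + 1, and that the measure is the ratio of geometric mean to arithmetic mean of the spectral values in the subband. The function name `subband_sfm`, the `floor` parameter, and the band edges supplied by the caller are hypothetical and are not the values of the Table.

```python
import math

def subband_sfm(spectrum, bands, floor=1e-12):
    """Return one flatness value per (low_bin, high_bin) pair.

    Assumptions: bounds are inclusive, and flatness is the ratio of
    geometric mean to arithmetic mean of the values in the subband.
    """
    result = []
    for low, high in bands:
        values = [max(v, floor) for v in spectrum[low:high + 1]]
        count = len(values)  # M(i) = H(i) - L(i) + 1
        geo_mean = math.exp(sum(math.log(v) for v in values) / count)
        arith_mean = sum(values) / count
        result.append(geo_mean / arith_mean)
    return result
```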
One can distinguish music and speech from noise using any one of the many N-feature set classification algorithms, such as a k-nearest-neighbor classifier, on the data for subband SFM. However, a simpler classification scheme is used in the invention. According to another aspect of the invention, a single test statistic is derived from the individual subband SFMs. The test statistic is derived from an exponentially weighted sum of subband SFMs, as shown in the following equation.
α is the weighting factor, q is the number of subbands and SFM(i) is the SFM for the ith subband. The weighting is chosen to emphasize low frequencies, i.e. the contribution of individual SFMs gradually decreases as frequency increases. This is because music, speech, and noise share similar spectral characteristics at high frequencies. A weighting factor less than one (<1) suffices. A table could be used instead of calculating the weighting factor.
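An exponentially weighted sum of this kind can be sketched as follows. The sketch assumes that the ith subband, counted from the lowest frequency with i starting at one, is weighted by α raised to the power i; the starting exponent, the default value of `alpha`, and the function name `weighted_test_statistic` are assumptions made for illustration.

```python
def weighted_test_statistic(subband_sfms, alpha=0.8):
    """Sum of subband SFMs, lowest band first, weighted by alpha**i.

    Assumptions: i starts at 1, and alpha lies strictly between 0 and 1
    so that higher-frequency subbands contribute less.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return sum(alpha ** i * sfm
               for i, sfm in enumerate(subband_sfms, start=1))
```

Because the weights depend only on the subband index, they could equally be read from a precomputed table.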
The test statistic β is preferably median filtered to reduce spurious spikes in the SFM estimate. That is,
λ(n)=median{β(n),β(n−1), . . . β(n−p)}
where p is the size of the median filter. The test statistic is further smoothed by calculating a rolling average to reduce the variance of the statistic.
γ(n)=εγ(n−1)+(1−ε)λ(n)
where ε is the smoothing constant, γ(n) is the smoothed test statistic at time (n) and γ(n−1) is the smoothed test statistic at time (n−1).
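The median filter and rolling average can be sketched together as follows. The class name `StatisticSmoother`, the default values of `p` and `epsilon`, and the choice to seed γ with the first median value are assumptions made for illustration; the window holds the p+1 values β(n) through β(n−p).

```python
from collections import deque
from statistics import median

class StatisticSmoother:
    """Median filter followed by a first-order rolling average."""

    def __init__(self, p=4, epsilon=0.9):
        self.window = deque(maxlen=p + 1)  # beta(n), beta(n-1), ... beta(n-p)
        self.epsilon = epsilon
        self.gamma = None

    def update(self, beta):
        self.window.append(beta)
        lam = median(self.window)  # lambda(n)
        if self.gamma is None:
            # Assumption: seed the average with the first median value.
            self.gamma = lam
        else:
            self.gamma = (self.epsilon * self.gamma
                          + (1.0 - self.epsilon) * lam)
        return self.gamma
```

In this sketch, an isolated spike in β is removed by the median filter before it reaches the rolling average, while a sustained change passes through and is then smoothed.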
Finally, the smoothed test statistic is compared with a threshold to detect the presence of music. Specifically, if the smoothed test statistic is greater than the threshold η, then the spectrum is relatively flat and background noise is present and musicDetect goes to a logic “false” or, for positive logic, a “0” (zero). If the smoothed test statistic is not greater than the threshold η, then music is present and musicDetect is true or “1”. The musicDetect signal is used by control 80 (
The invention thus provides a method for unambiguously distinguishing mainstream music genres from noise. The method does so efficiently, requiring little computational power, in part, due to the use of a pseudo floating-point operation in a fixed-point processor, and does so in real time.
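The threshold comparison and its effect on the noise reduction and residual echo suppression circuitry, as described above, can be sketched as follows. The function names `music_detect` and `control_flags` and the dictionary keys are hypothetical, and the threshold η is supplied by the caller because no value is given in the description.

```python
def music_detect(gamma, eta):
    """Return True (music) unless the smoothed statistic exceeds eta.

    A statistic greater than eta indicates a relatively flat spectrum,
    i.e. background noise, so the result is False in that case.
    """
    return not gamma > eta

def control_flags(is_music):
    """Hypothetical control outputs derived from the musicDetect signal."""
    return {
        # Noise estimation is turned off during music to avoid distortion.
        "noise_estimation_enabled": not is_music,
        # Residual echo suppression is turned off to avoid comfort noise.
        "residual_echo_suppression_enabled": not is_music,
    }
```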
Having thus described the invention, it will be apparent to those of skill in the art that various modifications can be made within the scope of the invention. For example, circuits 72 and 76 (
Number | Name | Date | Kind |
---|---|---|---|
5394473 | Davidson | Feb 1995 | A |
5583961 | Pawlewski et al. | Dec 1996 | A |
5664052 | Nishiguchi et al. | Sep 1997 | A |
5684921 | Bayya et al. | Nov 1997 | A |
6477489 | Lockwood et al. | Nov 2002 | B1 |
6556682 | Gilloire et al. | Apr 2003 | B1 |
6658380 | Lockwood et al. | Dec 2003 | B1 |
7317958 | Freed et al. | Jan 2008 | B1 |
7379875 | Burges et al. | May 2008 | B2 |
7555117 | Suppappola et al. | Jun 2009 | B2 |
8032387 | Davidson et al. | Oct 2011 | B2 |
20040086109 | Takada | May 2004 | A1 |
20040138886 | Absar et al. | Jul 2004 | A1 |
20040267522 | Allamanche et al. | Dec 2004 | A1 |
20050108004 | Otani et al. | May 2005 | A1 |
20050165587 | Cheng et al. | Jul 2005 | A1 |
20060064299 | Uhle et al. | Mar 2006 | A1 |
20060074649 | Pachet et al. | Apr 2006 | A1 |
20060149532 | Boillot et al. | Jul 2006 | A1 |
20060229878 | Scheirer | Oct 2006 | A1 |
20070016414 | Mehrotra et al. | Jan 2007 | A1 |
20090192806 | Truman et al. | Jul 2009 | A1 |
Number | Date | Country | |
---|---|---|---|
20070136053 A1 | Jun 2007 | US |