The present invention relates to systems and methods for quality improvement in an electrically reproduced speech signal. More particularly, the present invention relates to a system and method for babble noise detection.
Telephones can be used in many different environments. There is always some background noise around the speaker (far end) as well as around the listener (near end). The type and the level of the background noise can vary from stationary office and car noise to more non-stationary street and cafeteria noise. Many speech processing algorithms try to emphasize the actual speech signal and on the other hand reduce the unwanted masking effect of background noise, in order to improve the perceived audio quality and intelligibility. For these speech enhancement algorithms it is useful to know what kind of noise is present at either end of the transmission link because different noise situations require different performance from the algorithms. It is difficult to classify noises exactly but usually it is enough to classify noise according to its level and degree of mobility.
Telephones are often used in noisy environments and there is always some background noise summed to the speech signal. Many of the speech enhancement algorithms try to improve the quality and intelligibility of the transmitted speech signal by amplifying the actual speech and attenuating the background noise. For detecting the time slots of the signal that really contain speech, algorithms called voice activity detection (VAD) have been developed. These voice activity detection algorithms often interpret speech-like noise, hum of voices, as speech as well, which leads to undesired situations where background noise is amplified. To prevent these situations, a babble noise detection procedure, which determines if the speech detected by VAD is actual speech or just background babble, is needed.
In addition to algorithms using VAD information, some other speech enhancement algorithms, such as artificial bandwidth expansion (ABE), benefit from the background noise classification information. This information about the background noise enables an optimal performance of the algorithm in different noise situations. Babble noise situations often contain other non-stationary noise as well, for example the tinkle of dishes in a cafeteria or the rustling of papers. Depending on the case, these sounds can also be included in the concept of babble noise, and in such situations it is desirable that the babble noise detector detect these sounds as well.
In “Noise Suppression with Synthesis Windowing and Pseudo Noise Injection,” A. Sugiyama, T. P. Hua, M. Kato, M. Serizawa, IEEE Proceedings of Acoustics, Speech, and Signal Processing, Volume: 1, 13-17 May 2002, babble noise was detected using zero-crossing information. The noise was considered babble noise if the average number of zero-crossings of a time domain signal exceeded a certain threshold.
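The zero-crossing criterion described above can be illustrated with a minimal sketch. The function names, the treatment of zero-valued samples as non-negative, and the threshold value are assumptions made for illustration only and are not taken from the cited paper.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples of one frame.

    Assumption: a sample equal to zero is treated as non-negative.
    """
    count = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count


def is_babble_by_zero_crossings(frames, threshold):
    """Flag babble when the average zero-crossing count exceeds the threshold.

    `threshold` is a hypothetical tuning parameter.
    """
    if not frames:
        return False
    average = sum(zero_crossings(f) for f in frames) / len(frames)
    return average > threshold
```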
Thus, there is a need for an improved technique for detecting babble noise. Further, there is a need to distinguish between speech and background noise. Even further, there is a need to combine results from separate detection algorithms for babble noise detection.
The present invention is directed to a method, device, system, and computer program product for detecting babble noise. Briefly, one exemplary embodiment relates to a method for detecting babble noise. The method includes receiving a frame of a communication signal including a speech signal; calculating a gradient index as a sum of magnitudes of gradients of speech signals from the received frame at each change of direction; and providing an indication that the frame contains babble noise if the gradient index, energy information, and background noise level exceed pre-determined thresholds.
Another exemplary embodiment relates to a device or module that detects babble noise in speech signals. The device includes an interface that communicates with a wireless network and programmed instructions stored in a memory and configured to detect babble noise based on a spectral distribution of noise.
Another exemplary embodiment relates to a device or module that detects babble noise in speech signals. The device includes an interface that sends and receives speech signals and programmed instructions stored in a memory and configured to detect babble noise based on a voice activity detector algorithm.
Yet another exemplary embodiment relates to a system for detecting babble noise. The system includes means for receiving a frame of a communication signal including a speech signal; means for calculating a gradient index as a sum of magnitudes of gradients of speech signals from the received frame at each change of direction; and means for providing an indication that the frame contains babble noise if the gradient index, energy information, and background noise level exceed pre-determined thresholds.
Yet another exemplary embodiment relates to a computer program product that detects babble noise. The computer program product includes computer code to calculate a gradient index as a sum of magnitudes of gradients of speech signals from a received frame at each change of direction; and provide an indication that the frame contains babble noise if the gradient index, energy information, and background noise level exceed pre-determined thresholds or a voice activity detector algorithm and sound level indicate babble noise.
Other principal features and advantages of the invention will become apparent to those skilled in the art upon review of the following drawings, the detailed description, and the appended claims.
Exemplary embodiments will hereafter be described with reference to the accompanying drawings.
Accordingly, babble noise can be better detected when a VAD based algorithm and a spectral distribution algorithm are combined or used separately in the situations which fit best to the particular algorithm chosen. In an exemplary embodiment, both of the algorithms process the input signal in 10 ms frames.
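As a hedged illustration of the 10 ms framing, the following sketch splits a sample sequence into non-overlapping frames. The 8000 Hz sampling rate, the function name, and the choice to discard a trailing partial frame are assumptions; the description specifies only the frame duration.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=10):
    """Split samples into consecutive non-overlapping frames of frame_ms.

    Assumption: 8000 Hz narrowband telephony; a trailing partial frame
    is discarded.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]
```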
In general, voice activity detection (VAD) algorithms often interpret speech-like noise, such as a hum of voices, as speech. The VAD based babble noise detection algorithm corrects those incorrect VAD decisions by monitoring the level of detected speech, since the level of hum is usually lower than the level of the actual speech. If the input signal level suddenly drops by more than a predetermined amount (such as 5 dB, 25 dB, 50 dB, etc.) from its long-term estimate, the assumption of the babble noise situation is made. The VAD based babble noise detection algorithm detects only babble noise that really is a hum of voices.
The spectral distribution algorithm is based on a feature vector and it follows the longer-term background noise conditions. It monitors only the characteristics of noise without taking into account the decision of VAD, i.e., the information whether the frame contains speech or not. The babble noise detection is based on features that reflect the spectral distribution of frequency components and, thus, distinguish between low frequency noise and babble noise that has more high frequency components. The spectral distribution based algorithm detects hum of voices as well as other non-stationary noise as babble noise.
Since these algorithms define and detect babble noise differently, in some cases it is advantageous to combine the information they can provide. How this is done depends on the definition of babble noise and the needed accuracy of babble noise detection. For example, the spectral distribution babble noise decision can be used to double-check the negative or positive babble noise decision made by the VAD based detection algorithm.
Babble noise detection based on spectral distribution of noise is based on three features: gradient index based feature, energy information based feature and background noise level estimate. The energy information, Ei, is defined as:
where s(n) is the time domain signal, E[s′nb] is the energy of the second derivative of the signal and E[snb] is the energy of the signal. For babble noise detection, the essential information is not the exact value of Ei, but how often its value is considerably high. Accordingly, the actual feature used in babble noise detection is not Ei but how often it exceeds a certain threshold. In addition, because the longer-term trend is of interest, the information whether the value of Ei is large or not is filtered. This is implemented so that if the value of the energy information is greater than a threshold value, then the input to the IIR filter is one; otherwise it is zero. The IIR filter is of the form:
where a is the attack or release constant depending on the direction of change of the energy information.
The energy information also has high values when the current speech sound has high-pass characteristics, such as /s/. In order to exclude these cases from the IIR filter input, the IIR-filtered energy information feature is updated only when the frame is not considered a possible sibilant (i.e., the gradient index is smaller than a predefined threshold).
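The energy information feature and its thresholded IIR smoothing can be sketched as follows. The sketch rests on several assumptions that are not confirmed by the description: Ei is taken to be the energy of the second difference of the frame divided by the energy of the frame, the filter is taken to be a first-order recursion y(n) = a·y(n−1) + (1−a)·x(n), and the attack and release constants, class name, and method names are illustrative.

```python
def energy(samples):
    """Sum of squared sample values."""
    return sum(x * x for x in samples)


def energy_information(frame):
    """Assumed form of Ei: second-difference energy over frame energy."""
    second_diff = [frame[n] - 2 * frame[n - 1] + frame[n - 2]
                   for n in range(2, len(frame))]
    e_signal = energy(frame)
    if e_signal == 0:
        return 0.0
    return energy(second_diff) / e_signal


class ThresholdedSmoother:
    """First-order IIR smoother fed with 1 or 0 (assumed filter form)."""

    def __init__(self, threshold, attack=0.9, release=0.99):
        self.threshold = threshold
        self.attack = attack      # hypothetical constant for a rising input
        self.release = release    # hypothetical constant for a falling input
        self.state = 0.0

    def update(self, value, possible_sibilant=False):
        # Frames flagged as possible sibilants leave the filter state untouched.
        if possible_sibilant:
            return self.state
        x = 1.0 if value > self.threshold else 0.0
        a = self.attack if x > self.state else self.release
        self.state = a * self.state + (1.0 - a) * x
        return self.state
```

The smoothed output stays between zero and one and rises as the threshold is exceeded more often, which matches the stated intent of tracking how often Ei is large.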
Gradient index is another feature used in babble noise detection. In babble noise detection, the gradient index is IIR filtered with the same kind of filter as was used for energy information feature. The background noise level estimation can be based on, for example, a method called minimum statistics.
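A minimal sketch of the gradient index, following the summary's wording that it is a sum of gradient magnitudes at each change of direction, is given below. The handling of zero gradients (they neither count as a direction nor reset the previous one) and the absence of any energy normalization are assumptions; published formulations of the gradient index may normalize by the frame energy.

```python
def gradient_index(frame):
    """Sum |s(k) - s(k-1)| over samples where the gradient changes sign.

    Assumption: flat segments (zero gradient) are skipped when
    determining the previous direction. No normalization is applied.
    """
    total = 0.0
    prev_sign = 0
    for k in range(1, len(frame)):
        grad = frame[k] - frame[k - 1]
        sign = (grad > 0) - (grad < 0)
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                total += abs(grad)
            prev_sign = sign
    return total
```

A smooth, slowly varying frame changes direction rarely and yields a small value, while a frame rich in high-frequency content changes direction often and yields a large one.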
If all three features (IIR-filtered energy information, IIR-filtered gradient index and background noise level estimate) exceed certain thresholds, then the frame is considered to contain babble noise. By requiring all three features to exceed certain thresholds, this embodiment of the invention can minimize the number of false positives (i.e., the number of times a frame is incorrectly considered to contain babble noise). In at least one embodiment, in order to make the babble noise detection algorithm more robust, fifteen consecutive stationary frames are used to make the final decision that the algorithm operates in stationary noise mode. The transition from stationary noise mode to babble noise mode, on the other hand, requires only one frame.
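The three-feature decision and the asymmetric mode switching can be sketched as a small state machine. The class name, argument names, and threshold values are hypothetical; only the requirement that all three features exceed their thresholds, the fifteen-frame requirement for returning to stationary mode, and the one-frame switch to babble mode come from the description.

```python
class SpectralBabbleDetector:
    """Tracks babble versus stationary noise mode frame by frame."""

    def __init__(self, energy_thr, gradient_thr, noise_thr, stationary_frames=15):
        self.energy_thr = energy_thr
        self.gradient_thr = gradient_thr
        self.noise_thr = noise_thr
        self.stationary_frames = stationary_frames
        self.babble_mode = False
        self.stationary_run = 0

    def update(self, energy_feature, gradient_feature, noise_level):
        """Return True while the detector is in babble noise mode."""
        frame_is_babble = (energy_feature > self.energy_thr
                           and gradient_feature > self.gradient_thr
                           and noise_level > self.noise_thr)
        if frame_is_babble:
            # A single babble frame is enough to enter babble mode.
            self.babble_mode = True
            self.stationary_run = 0
        else:
            # Leaving babble mode needs a run of consecutive stationary frames.
            self.stationary_run += 1
            if self.stationary_run >= self.stationary_frames:
                self.babble_mode = False
        return self.babble_mode
```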
Voice activity detector (VAD) algorithms are used to identify time instants when the signal contains speech instead of mere background noise. These algorithms often interpret speech-like noise as speech as well. However, the level of this kind of hum of voices is usually lower than the level of the actual speech. Using this assumption, it is possible to monitor the level of the input signal, interpreted as speech by the VAD, and compare it to its long-term estimate. If the input signal level suddenly drops by more than, for example, 15 dB from its long-term estimate, an assumption of the babble noise situation is made. During babble noise, the long-term speech estimate is kept intact.
If the level of the actual speech signal drops suddenly, the babble noise detection algorithm triggers falsely. This result would prevent the updating of the long-term speech level estimate. For these kinds of situations, the algorithm has a safety control, which is performed after 20-30 seconds. This safety control forces the update of the long-term estimate if the short-term estimate has not reached the long-term estimate for a given number of samples. The time period of 20-30 seconds is justified because it is roughly the typical maximum time a person keeps completely silent in a telephone conversation, and thus the long-term estimate should be updated more frequently than that.
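The level monitoring and the safety control described above can be sketched as follows. The class and parameter names, the exponential smoothing of the long-term estimate, the counting of the safety period in frames (2500 frames of 10 ms, i.e. 25 seconds), and the choice to reset the long-term estimate directly to the current level are all assumptions made for illustration.

```python
class VadBabbleDetector:
    """Flags babble when VAD-marked speech falls well below its long-term level."""

    def __init__(self, drop_db=15.0, safety_frames=2500, smoothing=0.99):
        self.drop_db = drop_db
        self.safety_frames = safety_frames  # assumed: 25 s of 10 ms frames
        self.smoothing = smoothing          # hypothetical smoothing constant
        self.long_term_db = None
        self.frames_in_babble = 0

    def update(self, level_db, vad_speech):
        """Process one frame level in dB; return True if babble is assumed."""
        if not vad_speech:
            return False
        if self.long_term_db is None:
            self.long_term_db = level_db
            return False
        if self.long_term_db - level_db > self.drop_db:
            self.frames_in_babble += 1
            if self.frames_in_babble >= self.safety_frames:
                # Safety control: force the long-term estimate to update.
                self.long_term_db = level_db
                self.frames_in_babble = 0
                return False
            # During babble the long-term estimate is kept intact.
            return True
        self.frames_in_babble = 0
        self.long_term_db = (self.smoothing * self.long_term_db
                             + (1.0 - self.smoothing) * level_db)
        return False
```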
These two separate babble noise detection algorithms both have their advantages and disadvantages. Fortunately, these algorithms usually fail in different situations. How the babble noise detection decisions of the algorithms should be combined depends on the situation, since the definition of babble noise is not exact and speech processing algorithms need the babble noise detection information for different reasons.
If the VAD based algorithm detects babble after a long non-babble period in block 74, the decision of the spectral distribution algorithm is checked in block 76 before making the final babble decision. If the spectral distribution algorithm gives a logical 1 as well, babble is detected; if not, there is a wait period in block 78 of a control safety time (e.g., 20-30 seconds). The long-term estimate is then updated in block 79 and the babble decision is made after that. This combination could be used, for example, if faulty babble noise detections are a problem. Occasions where quiet speech is falsely detected as babble noise would be prevented.
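One possible reading of this combination flow is sketched below as a single decision function. The function name, the string return values, and the frame-based safety period are hypothetical; the sketch only mirrors the order of checks described for blocks 74, 76, 78, and 79.

```python
def combine_decisions(vad_babble, spectral_babble, frames_waited, safety_frames=2500):
    """Return the next action for the combined detector.

    Assumptions: decisions arrive once per 10 ms frame and the safety
    period is counted in frames (2500 frames is 25 s).
    """
    if not vad_babble:
        return "no_babble"
    if spectral_babble:
        # Both algorithms agree, so babble is detected immediately.
        return "babble"
    if frames_waited < safety_frames:
        # The spectral algorithm disagrees: wait for the safety period.
        return "wait"
    # The safety period has elapsed: update the long-term estimate,
    # then make the babble decision again.
    return "update_long_term_then_decide"
```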
Advantageously, depending on the purpose of usage, only one of the algorithms or both of them can be used to detect babble noise. Further, combining the separate detection algorithms helps overcome their problems by using their strengths.
This detailed description outlines exemplary embodiments of a method, device, and system for babble noise detection. In the foregoing description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It is evident, however, to one skilled in the art that the exemplary embodiments may be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to facilitate description of the exemplary embodiments.
While the exemplary embodiments illustrated in the Figures and described above are presently preferred, it should be understood that these embodiments are offered by way of example only. Other embodiments may include, for example, different techniques for performing the same operations. The invention is not limited to a particular embodiment, but extends to various modifications, combinations, and permutations that nevertheless fall within the scope and spirit of the appended claims.
Number | Name | Date | Kind |
---|---|---|---|
5596676 | Swaminathan et al. | Jan 1997 | A |
6658380 | Lockwood et al. | Dec 2003 | B1 |
6671667 | Chandran et al. | Dec 2003 | B1 |
20020165713 | Skoglund et al. | Nov 2002 | A1 |
20020193130 | Yang et al. | Dec 2002 | A1 |
Number | Date | Country |
---|---|---|
WO 01 86633 | Nov 2001 | WO |
Entry |
---|
Srinivasan et al, “Voice Activity Detection for Cellular Networks,” IEEE Workshop on Speech Coding for Telecommunications, Oct. 13, 1993, pp. 85-86. |
Bou-Ghazale et al., “A Robust Endpoint Detection of Speech for Noisy Environments with Application to Automatic Speech Recognition”, Conexant Systems, Inc., pp. 3808-3811. |
Beritelli, “A Robust Voice Activity Detector for Wireless Communications Using Soft Computing,” IEEE 1998, pp. 1818-1829. |
Jax et al., Feature Selection for Improved Bandwidth Extension of Speech Signals (IND), ICASSP 2004, pp. I-697-I-700. |
Srinivasan et al., “Voice Activity Detection for Cellular Networks”, Department of Electrical and Computer Engineering, pp. 85-86. |
“Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms” ETSI ES 202 050 V1.1.3 (2003-11), pp. 1-45. |
Noise Suppression with Synthesis Windowing and Pseudo Noise Injection, Multimedia Research Laboratories, Sep. 2002, Sugiyama et al., I-545-I548, France. |
Number | Date | Country | |
---|---|---|---|
20050267745 A1 | Dec 2005 | US |