The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:
The present invention discloses an audio signal segmentation algorithm comprising the following steps. First, an audio signal is provided. Then, an audio activity detection (AAD) step is applied to divide the audio signal into at least one noise segment and at least one noisy audio segment. Then, in the audio feature extraction step, multiple audio features are extracted from the noisy audio segment frame by frame, using frames of fixed length. Afterwards, a smoothing step is applied to the audio features to raise the discrimination rate between the speech frames and the music frames. Then, a classifier is used to tell the speech frames and the music frames apart. Finally, frames of the same kind are merged according to the classification result, and the speech and music segments are thereby segmented.
In order to make the illustration of the present invention more explicit and complete, the following description is stated with reference to
Refer to
Then, in step 112, in the smoothing step, a convolution process is performed between the result obtained and a window (such as a rectangular window) to raise the discrimination rate for the following step. Then, in step 114, a classifier is used to tell the speech frames and the music frames apart. The speech frames and the music frames compose at least one speech segment and at least one music segment, respectively. Then, frames of the same kind are merged according to the classification result, and the speech and music segments are then segmented. Finally, the speech segment 116 and the music segment 118 are obtained. In the preferred embodiment of the present invention, the classifier is a KNN-based classifier; it classifies the signals into different types in a codebook and further determines whether the signals belong to speech or music. The following describes in detail the audio activity detection step used in the preferred embodiment of the present invention.
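The convolution-based smoothing of step 112 can be sketched as follows (a minimal illustration; the window length and the per-frame scores are assumptions, not values from the specification):

```python
import numpy as np

def smooth_scores(frame_scores, window_len=5):
    """Convolve per-frame feature values or scores with a rectangular
    window so that isolated outlier frames are pulled toward their
    neighbours before the classification step."""
    window = np.ones(window_len) / window_len   # rectangular window
    # mode="same" keeps one smoothed value per input frame
    return np.convolve(frame_scores, window, mode="same")
```

A lone outlier frame inside a run of consistent frames is averaged toward its neighbours, which is what raises the frame-level discrimination rate of the following classification step.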
Refer to
Then, in step 210, a comparison step is performed between the likelihood ratio and the noise threshold 212. If the likelihood ratio is smaller than the noise threshold, the bands belong to a noise frame 214, and if the likelihood ratio is greater than the noise threshold, the bands belong to a noisy audio frame 216. In the preferred embodiment of the present invention, the likelihood computation step and the comparison step are based on the equation:

log Λ = (1/L) Σ_{k=0}^{L−1} [ |X_k|²/λ_N(k) − log(|X_k|²/λ_N(k)) − 1 ]  ≷ η  (H1 if greater than η; H0 otherwise)

where Λ is the likelihood ratio, L is the number of bands, X_k denotes the kth Fourier coefficient of one of the frames, λ_N(k) denotes the variance of the kth Fourier coefficient of the noise, η is the noise threshold, H0 denotes the hypothesis that the frame is a noise frame, and H1 denotes the hypothesis that the frame is a noisy audio frame.
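The per-frame decision can be sketched as follows (a minimal sketch, assuming a frame of time-domain samples and a single averaged noise-variance estimate in place of the per-band λ_N(k); the function names are illustrative):

```python
import numpy as np

def log_likelihood_ratio(frame, noise_var):
    """Average log-likelihood ratio of one frame against a noise model.

    frame:     time-domain samples of one frame
    noise_var: estimate of the noise variance lambda_N(k); a single
               averaged value is used here for simplicity (assumption)
    """
    X = np.fft.rfft(frame)                 # X_k, the Fourier coefficients
    gamma = np.abs(X) ** 2 / noise_var     # a posteriori SNR per band
    # mean over the L bands of (gamma - log gamma - 1)
    return float(np.mean(gamma - np.log(gamma) - 1.0))

def classify_frame(frame, noise_var, eta):
    """H1 (noisy audio frame) if the ratio exceeds eta, else H0 (noise frame)."""
    return "H1" if log_likelihood_ratio(frame, noise_var) > eta else "H0"
```

A frame drawn from the noise alone keeps the per-band SNR near one, so the ratio stays small; a frame containing speech or music pushes the per-band SNR, and hence the ratio, upward past the threshold η.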
Then, a frame-merging process is performed in step 218. Frames that are too short and scattered are often meaningless, so the frame-merging process merges these small pieces into longer segments and further raises the discrimination accuracy of the subsequent steps. In the preferred embodiment of the present invention, the frames are merged by determining whether the distance between two adjacent detected frames is too small. If the distance is too small, the two frames are merged into the same frame; otherwise, they remain two different frames. In other words, when the distance between two adjacent noisy audio frames is smaller than a predetermined value, the two adjacent noisy audio frames are combined to compose the noisy audio segment 220. Refer to
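The merging rule of step 218 can be sketched as follows (the (start, end) segment representation and the gap value are assumptions for illustration):

```python
def merge_segments(segments, min_gap):
    """Merge adjacent detected frames/segments, given as (start, end)
    pairs sorted by start, whenever the distance between them is
    smaller than the predetermined value min_gap."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], end)   # gap too small: merge
        else:
            merged.append((start, end))         # gap large enough: keep separate
    return merged
```

Two detections separated by a short gap collapse into one noisy audio segment, while well-separated detections remain distinct segments.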
It is noted that the noise threshold η can be estimated as different values according to different environments, rather than being a fixed value, in order to make the audio signal segmentation algorithm of the present invention suitable for different environments. The following describes in detail the estimation of the noise threshold.
Refer to
In other words, the estimation of the noise threshold in the preferred embodiment of the present invention first extracts a noise segment from the initial part of the audio signal and then mixes the noise segment with prepared training data (a noiseless speech/music segment) at a certain predetermined signal-to-noise ratio. Since the training data is prepared in advance, the location of the voice in the training data is already known, so the signal-to-noise ratio between the training data and the noise segment can be adjusted. Generally, if the signal with the lowest SNR in the system is 5 dB, the SNR of the mixed audio segment can be set to 3 dB to estimate the threshold; it merely needs to be smaller than 5 dB. Then, the audio activity detection step is performed on the mixed audio segment. A Fourier transform is performed on the mixed audio segment frame by frame, with a frame length of 30 ms. Then, the likelihood ratio is computed, and an initial threshold of 0 is used for the decision. As long as the threshold can detect all of the voice parts in the training data, the threshold is raised in increments of 0.2 until the highest value that can still completely separate all the voice segments is obtained. There are t training data items, so this step is repeated t times; however, since each training item is short, it does not take much time. When all the training data is processed, t thresholds are obtained, and the smallest of these t thresholds is chosen as the threshold used in the system.
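The SNR mixing and the threshold sweep can be sketched as follows (the helper names are illustrative; the per-item likelihood-ratio arrays are assumed to come from running the AAD step on the known voice frames of each mixed training item):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech/noise mixture has the
    requested signal-to-noise ratio in dB."""
    ps = np.mean(speech ** 2)
    pn = np.mean(noise ** 2)
    scale = np.sqrt(ps / (pn * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def estimate_threshold(voice_llrs_per_item, step=0.2):
    """For each training item, raise the threshold from 0 in steps of
    `step` while it still detects every known voice frame; keep the
    highest such value, then take the minimum over all items."""
    per_item = []
    for llrs in voice_llrs_per_item:
        eta = 0.0
        while np.all(llrs > eta + step):   # next step would still catch all voice
            eta += step
        per_item.append(eta)
    return min(per_item)
```

Taking the minimum over the t per-item thresholds keeps the system conservative: the chosen η still separates the voice frames of every training item.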
The following describes in detail the audio feature extraction step used in the preferred embodiment of the present invention.
After the audio activity detection step is performed, the inputted audio signal is divided into a noise segment and a noisy audio segment. Then, the audio feature extraction step is performed on the noisy audio segment to obtain the audio features of the noisy audio segment. Three audio features are used to discriminate between the speech signals and the music signals in the preferred embodiment of the present invention. Each audio feature is defined over a length of about one second, and this one-second length is also the smallest unit of discrimination in the preferred embodiment of the present invention. These three audio features are the low short time energy rate (LSTER), the spectrum flux (SF), and the likelihood ratio crossing rate (LRCR). They are described as follows.
The audio feature of low short time energy rate: in a piece of audio signal, the frame-to-frame energy variation of the speech signal is larger than that of the music signal owing to pitch, so the speech signal and the music signal can be discriminated simply by calculating the ratio of low-energy frames.
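A minimal sketch of LSTER over a one-second group of frames (the "below half the average energy" criterion is a common LSTER definition and is assumed here):

```python
import numpy as np

def lster(frames):
    """Low short time energy rate: the fraction of frames in a ~1 s
    group whose short-time energy falls below half the average frame
    energy of the group (assumed definition)."""
    energy = np.mean(frames ** 2, axis=1)   # short-time energy per frame
    return float(np.mean(energy < 0.5 * np.mean(energy)))
```

Speech, with its energy dips between syllables, yields a higher LSTER than music, whose energy envelope is more uniform.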
The audio feature of spectrum flux: in a piece of audio segment, since the energy of the speech signal changes rapidly, the sum of the spectral distances between adjacent frames in the segment is larger for the speech signal. The frequency content of the music signal usually changes more slowly, so its sum of spectral distances between adjacent frames is smaller. Therefore, the spectrum flux can be used to discriminate between the speech signal and the music signal.
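A minimal sketch of the spectrum flux over a group of frames (the squared L2 distance between magnitude spectra is one common choice; the exact norm is an assumption):

```python
import numpy as np

def spectrum_flux(frames):
    """Spectrum flux: the summed spectral distance between adjacent
    frames, computed here as the squared L2 distance of magnitude
    spectra (assumed norm)."""
    mags = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum per frame
    diffs = np.diff(mags, axis=0)               # adjacent-frame differences
    return float(np.sum(diffs ** 2))
```

Identical adjacent spectra give zero flux; the rapidly changing spectra of speech give a larger value than the slowly evolving spectra of music.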
The audio feature of likelihood ratio crossing rate: the waveform of the likelihood ratio obtained in the AAD step can be used to tell the speech and the music apart by observing its damping characteristics. The speech signal has more low-energy frames than the music signal does. However, the speech signal and the music signal are not easily discriminated by calculating the energy in the time domain. Therefore, the audio feature of likelihood ratio crossing rate is derived in the frequency domain. The likelihood ratio waveform of each frame obtained in the AAD step is used, and the total number of times the waveform crosses two thresholds is calculated. Generally speaking, the crossing rate of speech is higher than that of music. The following describes in detail the audio feature extraction step for the likelihood ratio crossing rate used in the present invention.
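The crossing-rate computation can be sketched as follows (the two threshold levels here are illustrative values, not values from the specification):

```python
import numpy as np

def lrcr(llr, thresholds=(0.5, 1.5)):
    """Likelihood ratio crossing rate: the total number of times the
    per-frame likelihood-ratio waveform crosses each of two threshold
    levels (levels are illustrative assumptions)."""
    total = 0
    for t in thresholds:
        above = llr > t
        # a crossing is a change between consecutive above/below states
        total += int(np.sum(above[1:] != above[:-1]))
    return total
```

The likelihood-ratio waveform of speech oscillates across the thresholds more often than that of music, so speech yields the higher LRCR.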
Refer to
Refer to
After the smoothing step, a classifier is used to tell the speech frames and the music frames apart. Finally, frames of the same kind are merged according to the classification result, and the speech and music segments are then segmented. In the preferred embodiment of the present invention, the classifier is a KNN-based classifier that classifies the frames into the speech type and the music type: a signal belongs to the type (speech or music) that holds the majority among the nearest k training data in the codebook. In other embodiments of the present invention, other classifiers may also be used, such as a Gaussian mixture model (GMM) classifier, a hidden Markov model (HMM) classifier, and a multi-layer perceptron (MLP) classifier.
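The KNN decision can be sketched as follows (Euclidean distance and the toy one-dimensional codebook are assumptions for illustration; the real codebook would hold the multi-dimensional audio feature vectors described above):

```python
import numpy as np

def knn_classify(feature, codebook, labels, k=5):
    """Assign the type ('speech' or 'music') holding the majority
    among the k codebook entries nearest to the feature vector
    (Euclidean distance assumed)."""
    dists = np.linalg.norm(codebook - feature, axis=1)
    nearest = np.argsort(dists)[:k]         # indices of the k closest entries
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count) # majority vote
```

Using an odd k avoids ties between the two types in the vote.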
Refer to
According to the aforementioned description, one advantage of the present invention is that it provides an audio signal segmentation algorithm suitable for low-SNR environments, which works well in practical noisy environments.
According to the aforementioned description, another advantage of the present invention is that it provides an audio signal segmentation algorithm which can be integrated into multimedia content analysis, multimedia data compression and audio recognition applications, and can be used at the front end of an audio signal processing system to classify the signals, thereby enabling the system to discriminate and segment the speech and music signals.
According to the aforementioned description, yet another advantage of the present invention is that the present invention provides an audio signal segmentation algorithm which can be used as an IP to be supplied to multimedia system chips.
As is understood by a person skilled in the art, the foregoing preferred embodiments of the present invention are illustrative of the present invention rather than limiting of the present invention. It is intended to cover various modifications and similar arrangements included within the spirit and scope of the appended claims, the scope of which should be accorded the broadest interpretation so as to encompass all such modifications and similar structure.
Number | Date | Country | Kind |
---|---|---|---|
95118143 | May 2006 | TW | national |