The present invention relates to an audio event detection method and apparatus, and in particular to an audio event detection method and apparatus based on a long-term feature.
Today, the world is in an era of information explosion, and the amount of information grows at an exponential rate. The continuous development of multimedia technology and internet technology has significantly increased the need to automatically analyze and process large-scale multimedia data. However, video analysis involves a large amount of computation and consumes more resources, so audio analysis of multimedia data offers a considerable advantage.
In general, a video such as a sports game is relatively long, and the content that truly interests most sports fans often occupies only a small portion of the entire content. To find the interesting content, the user often has to go through the video from beginning to end, which costs time and labor. On the other hand, the more sports videos there are, the greater the demand for effective retrieval and management of sports video becomes. Therefore, a sports content retrieval system that helps the user retrieve the content he or she truly cares about can save a great deal of time.
In particular, automatic audio analysis of sports game programs has attracted increasing attention from researchers. For a sports game, extracting the highlight scenes in the video by detecting audio events such as applause, cheering and laughing makes it possible for the user to find the interesting segments more conveniently.
The extraction of audio events involves the following difficulties: first, in a sports game an audio event usually does not occur in isolation; instead, it is often accompanied by the presenter's speech and other sounds, which makes modeling the audio event difficult; second, in a sports game the spectral characteristics of an audio event are usually similar to those of the ambient noise, so more false alarms are generated during retrieval and the accuracy is relatively low.
In the article "Perceptual linear predictive (PLP) analysis of speech" by Hermansky H. (Journal of the Acoustical Society of America, 87:1738, 1990), the processing goes through two stages. In the first stage, manually tagged multimedia data are searched for audio relevant to a semantic tag, and in the second stage, the corresponding music features are trained on-line based on the audio search results for the semantic tag and are applied to the query of the audio content.
It can be seen from the above literature that the related art only analyzes and detects certain content of one or two types of sports games; this technique is highly specific and cannot be extended to detecting other types of content for extracting the content of a sports game. Moreover, as the types of sports games increase, the consumer is less and less likely to have enough time to view an entire game from beginning to end; therefore, sports fans desire an automatic sports-game content detection system that helps the user detect the content of interest quickly and conveniently. Since current image analysis technology is limited to scene analysis and the understanding of image content has not yet been well studied, this invention focuses on using voice signal processing technology to understand and analyze the content of sports games, to help sports fans extract interesting events and information, such as the detection of matches by type, highlight event detection, key person and team names, and the detection of the start and end time points of different matches.
In view of this, the present invention provides an audio event detection method and apparatus with robustness and high performance, wherein the audio events comprise applause, cheering and laughing. The method considers the continuity of the feature in the time domain and performs detection in combination with a slice-based long-term feature, so that the detection performance is increased significantly.
According to an aspect of the present invention, the present invention provides an audio event detection method based on a long-term feature, the method comprising the steps of: dividing an input audio stream into a series of slices; extracting a short-term feature and a long-term feature for each slice; and obtaining a classification result of the audio stream based on the short-term features and the long-term features.
According to the aspect of the present invention, the audio event detection method further comprises a step of obtaining an event detection result through a smoothing processing of the classification result.
According to the aspect of the present invention, the audio event detection method further comprises the step of calculating a Mean Super Vector feature based on the long-term feature, after extracting the short-term feature and the long-term feature.
According to the aspect of the present invention, the audio event detection method further comprises the step of reducing dimensions of the Mean Super Vector by using a dimension reduction algorithm to remove redundant information, after calculating the Mean Super Vector feature.
According to the aspect of the present invention, in the audio event detection method, the short-term feature is based on a frame and the long-term feature is based on the slice.
According to the aspect of the present invention, in the audio event detection method, the obtaining of the classification result comprises using a Support Vector Machine to classify the input audio stream.
According to the aspect of the present invention, in the audio event detection method, the frame-based short-term feature comprises at least one of: PLP, LPCC, LFCC, pitch, short-term energy, sub-band energy distribution, brightness and bandwidth.
According to the aspect of the present invention, in the audio event detection method, the slice-based long-term feature comprises at least one of: spectrum flux, long-term average spectrum and LPC entropy.
According to the aspect of the present invention, in the audio event detection method, the obtaining the event detection result through the smoothing processing comprises using a smoothing rule in the smoothing processing, and the smoothing rule is as follows:
if {s(n)==1 and s(n+1)!=1 and s(n+2)==1} then s(n+1)=1 (1)
if {s(n)==1 and s(n−1)!=1 and s(n+1)!=1} then s(n)=s(n−1) (2)
According to another aspect of the present invention, the present invention provides an audio event detection apparatus based on a long-term feature, the apparatus comprising: an audio stream dividing section for dividing the input audio stream into a series of slices; a feature extracting section for extracting short-term features and long-term features for each slice; and a classifying section for obtaining a classification result of the input audio stream based on the extracted short-term features and long-term features.
According to another aspect of the present invention, the present invention provides a computer program product for causing a computer to execute the steps of: dividing the input audio stream into a series of slices; extracting short-term features and long-term features for each slice; and obtaining a classification result of the input audio stream based on the short-term features and the long-term features.
In summary, the present invention divides the input audio stream into a series of slices, extracts the short-term features and the long-term features for each slice, averages the feature vectors of each slice to obtain the MSV (Mean Super Vector), applies a dimension reduction method, obtains the classification result using an SVM (Support Vector Machine) classifier, and obtains the final event detection result through smoothing. The experimental results show that the event detection can reach an F value of 86% on common TV programs.
Further scope of applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from the following detailed description.
The present invention will become more fully understood from the detailed description given hereinafter and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention and wherein:
The audio event detection method and apparatus based on the long-term feature according to the present invention are described with reference to the figures.
The audio event detection method based on the long-term feature further comprises a feature extracting step S120, in which the short-term features and the long-term features are extracted for each slice. According to one embodiment of the present invention, for each slice, two kinds of features, one based on the frame and one based on the slice, i.e., the frame feature and the slice feature, can be extracted to form the feature vector of the slice.
Here, the features based on the frame comprise at least one of the following features: PLP (Perceptual Linear Predictive coefficients), LPCC (Linear Predictive Cepstrum Coefficients), LFCC (Linear Frequency Cepstral Coefficients), pitch, STE (short-term energy), SBED (sub-band energy distribution), and BR and BW (brightness and bandwidth). The features based on the slice comprise at least one of the following features: SF (spectrum flux), LTAS (long-term average spectrum) and LPC entropy.
In particular, the PLP feature is a voice analysis technique based on three psychoacoustic concepts: the equal-loudness curve, the intensity-loudness power law, and critical-band spectral analysis; for the detailed algorithm, refer to Hynek Hermansky: "Perceptual Linear Predictive (PLP) analysis of speech", J. Acoust. Soc. Am. 87(4), April 1990. LPCC is a parametric feature based on the vocal tract, and LFCC is a parametric feature that takes the acoustic characteristics of the human ear into account; for the detailed computation method, refer to Jianchao YU, Ruilin ZHANG: "Speaker recognition based on LFCC and LPCC", Computer Engineering and Design, 2009, 30(5). There are some differences between LFCC and LPCC: for LFCC, the energy in the linear frequency domain is mapped onto the Mel spectrum, which better matches human hearing, in consideration of the perceptual characteristics of the human ear, whereas LPCC processes the frequencies with a series of linear triangular windows in the linear frequency domain instead of mapping onto the Mel spectrum.
The short-term energy, of one dimension, is extracted using formula (1); it describes the total spectral energy of one frame.
STE = log(∫_0^{ω0} |F(ω)|^2 dω)   (1)
Wherein ω0 is half of the sampling frequency of the audio, F(ω) is the fast Fourier transform coefficient, and |F(ω)|^2 is the energy at frequency ω. This feature can distinguish voice/music from noise relatively well.
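As an illustration, the short-term energy of formula (1) can be sketched in Python, assuming the integral is approximated by a discrete sum over one-sided FFT bins (the function name and the small floor added before the logarithm are our own):

```python
import numpy as np

def short_term_energy(frame):
    """Log of the total spectral energy of one frame (formula (1)).

    The integral of |F(w)|^2 up to half the sampling frequency is
    approximated by a discrete sum over the one-sided FFT bins.
    """
    spectrum = np.fft.rfft(frame)            # F(w) for 0 <= w <= sr/2
    energy = np.sum(np.abs(spectrum) ** 2)   # sum of |F(w)|^2
    return np.log(energy + 1e-12)            # floor avoids log(0) on silence
```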
If the spectrum is divided into several sub-bands, the sub-band energy distribution is defined as the ratio of the energy in a sub-band to the short-term energy of the frame, as expressed in formula (2).
Wherein Lj and Hj are the lower-limit and upper-limit frequencies of the jth sub-band, respectively.
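A sketch of this sub-band ratio, assuming the band edges Lj and Hj are given in Hz and the frame's total spectral energy (rather than its logarithm) is used as the denominator:

```python
import numpy as np

def sub_band_energy_distribution(frame, sr, bands):
    """Ratio of each sub-band's spectral energy to the frame's total
    spectral energy; `bands` lists (L_j, H_j) band edges in Hz."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = power.sum() + 1e-12              # total frame energy
    return np.array([power[(freqs >= lo) & (freqs < hi)].sum() / total
                     for lo, hi in bands])
```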
The brightness and the bandwidth are expressed by formulas (3) and (4) as follows:
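Formulas (3) and (4) are not reproduced in the text; the commonly used definitions, brightness as the spectral centroid and bandwidth as the energy-weighted spread around it, can be sketched as follows, on the assumption that the application follows them:

```python
import numpy as np

def brightness_bandwidth(frame, sr):
    """Brightness as the spectral centroid, and bandwidth as the
    energy-weighted spread of frequencies around that centroid."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = power.sum() + 1e-12
    centroid = (freqs * power).sum() / total                 # brightness
    spread = np.sqrt(((freqs - centroid) ** 2 * power).sum() / total)
    return centroid, spread                                  # bandwidth
```

For a pure tone the centroid sits at the tone's frequency and the spread is near zero.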
Next, the spectrum flux is used to represent the variation between the spectra of two consecutive frames; it is expressed as formula (5):
Wherein M is the number of frames in the slice, and K is the order of the FFT.
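Formula (5) itself is not reproduced in the text; assuming the usual definition as the normalized squared difference of consecutive log-magnitude spectra, a sketch is:

```python
import numpy as np

def spectrum_flux(frames, n_fft=512):
    """Mean squared difference between the log-magnitude spectra of
    consecutive frames of a slice; `frames` is an (M, frame_len) array."""
    logspec = np.log(np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) + 1e-12)
    diff = np.diff(logspec, axis=0)          # frame-to-frame spectral change
    M, K = logspec.shape
    return (diff ** 2).sum() / ((M - 1) * (K - 1))
```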
The long-term average spectrum is expressed as in the following formula (6).
Wherein PSDi is the power spectrum intensity of the ith frame, and L (25 in this application) is the number of frames segmented in the slice.
Wherein k is the frequency index, N is the order of the DFT (512 in this application), and t1 and t2 are the starting and ending times of the slice. Further, statistical values of the LTAS, such as the average value, minimum value, maximum value, mean square error, variation range and local peaks, are extracted as well.
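Assuming formula (6) averages the per-frame power spectra over the L frames of a slice, a sketch that also gathers the statistical values mentioned above is:

```python
import numpy as np

def long_term_average_spectrum(frames, n_fft=512):
    """Per-frequency power spectral density averaged over the L frames
    of a slice, plus the summary statistics used as slice features."""
    psd = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2  # PSD_i per frame
    ltas = psd.mean(axis=0)                                  # average over L frames
    stats = {"mean": ltas.mean(), "min": ltas.min(), "max": ltas.max(),
             "std": ltas.std(), "range": ltas.max() - ltas.min()}
    return ltas, stats
```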
Further, the LPC entropy is mainly used to describe the variation of the spectrum in the time domain, and is expressed as formula (7).
Wherein a(n,d) is the LPC coefficient, w is the length of the window, and D is the order of the LPC.
Therefore, in the above long-term feature extracting step S120, the voice signal is divided into a series of voice windows using a sliding window, and the frame features and slice features are extracted for each voice window and the frames therein, so as to obtain the MSV (Mean Super Vector) feature vector.
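A minimal sketch of forming the MSV, assuming it concatenates the slice-averaged frame features with the slice-level features (the exact composition is our assumption):

```python
import numpy as np

def mean_super_vector(frame_features, slice_features):
    """One MSV per slice: the frame-level feature vectors are averaged
    over the slice and the slice-level features are appended."""
    return np.concatenate([frame_features.mean(axis=0), slice_features])
```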
Please note that in this invention, the following processing can be performed with both or only one of the two kinds of features, i.e., the frame-based and the slice-based features.
Next, back to refer to
Finally, in the smoothing step S140, the final event detection result is obtained by smoothing. Here, the smoothing process is mainly used to remove erroneous classification results, including false alarms and incomplete detections. The smoothing rules are defined as follows:
if {s(n)==1 and s(n+1)!=1 and s(n+2)==1} then s(n+1)=1 (1)
if {s(n)==1 and s(n−1)!=1 and s(n+1)!=1} then s(n)=s(n−1) (2)
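The two rules above translate directly into code; a sketch, assuming s(n) is the per-slice class label and 1 marks the target event:

```python
def smooth_labels(labels, target=1):
    """Smoothing rules (1) and (2): fill a one-slice gap inside an
    event, then relabel an isolated one-slice event from its left
    neighbour; `labels` holds one class label per slice."""
    s = list(labels)
    for n in range(len(s) - 2):            # rule (1): 1 x 1 -> 1 1 1
        if s[n] == target and s[n + 1] != target and s[n + 2] == target:
            s[n + 1] = target
    for n in range(1, len(s) - 1):         # rule (2): isolated 1 takes left label
        if s[n] == target and s[n - 1] != target and s[n + 1] != target:
            s[n] = s[n - 1]
    return s
```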
The commonly used dimension reduction methods include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Independent Component Analysis (ICA), and so on.
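As an illustration, PCA-based dimension reduction of the MSV vectors can be sketched with a plain SVD (a generic implementation, not the application's specific choice of method or dimensionality):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project mean-centred feature vectors onto the top principal
    directions obtained from an SVD, discarding redundant dimensions."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T            # keep leading components
```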
Beside the above difference in
The audio stream to be processed is input to the audio stream dividing section 420 from the audio stream inputting section 410, and is divided into a series of slices to facilitate the extraction of the short-term feature and the long-term feature of each slice. Here, in order to divide the input audio signal, the voice signal can be divided into a series of voice windows using a sliding window, each voice window corresponding to a slice, so as to achieve the division. The audio stream dividing section 420 also inputs the division result to the feature extracting section 430 to extract the short-term features and the long-term features of each slice.
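The sliding-window division can be sketched as follows (the slice and hop lengths in samples are illustrative parameters, not values from the application):

```python
import numpy as np

def divide_into_slices(signal, slice_len, hop_len):
    """Divide a signal into overlapping slices with a sliding window;
    each window corresponds to one slice.  Assumes the signal is at
    least one slice long."""
    starts = range(0, len(signal) - slice_len + 1, hop_len)
    return np.array([signal[s:s + slice_len] for s in starts])
```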
In an embodiment of the invention, the feature extracting section 430 extracts at least the frame-based features and the slice-based features, i.e., the frame feature and the slice feature. Here, the frame feature comprises at least one of PLP, LPCC, LFCC, pitch, short-term energy, sub-band energy distribution, brightness, bandwidth, and so on. The slice feature comprises at least one of spectrum flux, long-term average spectrum, LPC entropy, and so on.
The LPCC computing section 520, the LFCC computing section 530 and the pitch computing section 540 are used for computing PLP, LPCC, LFCC and pitch according to conventional methods. As mentioned above, the details of the computation can be found in Hynek Hermansky ("Perceptual Linear Predictive (PLP) analysis of speech", J. Acoust. Soc. Am. 87(4), April 1990) and the work of Jianchao YU, Ruilin ZHANG, et al. ("Speaker recognition based on LFCC and LPCC", Computer Engineering and Design, 2009, 30(5)).
The short-term energy computing section 550 extracts the short-term energy describing the total spectral energy of one frame using formula (1). The sub-band energy distribution computing section 560 computes the sub-band energy distribution using formula (2). The brightness computing section 570 and the bandwidth computing section 580 compute the brightness and the bandwidth using formulas (3) and (4), respectively.
Next, the spectrum flux computing section 590 computes the spectrum flux using formula (5). The long-term average spectrum computing section 592 computes the long-term average spectrum using formula (6). The LPC entropy computing section 594 computes the LPC entropy using formula (7).
Back to
The smoothing section 450 obtains the final event detection result by smoothing. Here, the smoothing process is mainly used to remove erroneous classification results, including false alarms and incomplete detections.
The experimental results show that the event detection can reach an F value of 86% on general TV programs. Table 1 shows the content and length of the training data, and Table 2 shows the testing data.
As can be seen from Table 1 and Table 2, the data content comprises: Talking of News, Xiaocui Talking, Conference of Wulin, Serving for You, Common Time All over the World, Face to Face, Focus Talking, recordings, New Oriental Time, Story of Wealth, Archive of the People, programs for seniors, jokes, speaking, authentication, sports matches, etc. In these data, the training data and testing data are distributed 4:1 and do not overlap: four parts are used for training and one part for testing.
As the experimental setup, Table 3 shows the obtained number of dimensions of each feature in detail.
The experiment is mainly used to verify whether the detection performance improves after new features are added. Table 4 shows the detection performance of the above method of the present invention.
It can be seen from Table 4 that with the PLP feature alone, the Precision is 56.27%, the Recall is 63.44%, and the F value is 59.64%; after adding the STE and SBED features, the Precision increases to 78.14%, the Recall is 63.51%, and the F value is 70.07%; and so on. The classifier used here is an SVM, and the F value is defined by formula (8).
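Formula (8) is the standard F value, the harmonic mean of precision and recall; the figures quoted above can be reproduced as:

```python
def f_value(precision, recall):
    """F value (formula (8)): the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

For example, f_value(0.5627, 0.6344) gives approximately 0.5964, matching the 59.64% reported for the PLP-only feature set.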
Further, comparing the classification performance of the above SVM classifier with that of a GMM (Gaussian Mixture Model) classifier, it can be seen that the SVM classifier outperforms the GMM by about 5% on the same features. Table 5 shows the performance of the GMM.
Further, the processing procedures described in the embodiments of the present invention can be provided as a method comprising the sequence of procedures. Furthermore, the sequence of procedures can be provided as a program which causes a computer to execute it, and as a recording medium recording the program. A CD (compact disc), an MD (mini disc), a DVD (digital versatile disc), a memory card, a Blu-ray Disc (registered trademark) and so on can be used as this recording medium.
The embodiment of the invention being thus described, it will be obvious that the same may be varied in many ways. Such variations are not to be regarded as a departure from the spirit and scope of the invention, and all such modifications as would be obvious to those skilled in the art are intended to be included within the scope of the following claims.
Number | Date | Country | Kind |
---|---|---|---|
201010590438.1 | Dec 2010 | CN | national |