This application claims priority under 35 USC §119 to Finnish Patent Application No. 20045146 filed on Apr. 22, 2004.
The present invention relates to a method for detecting audio activity comprising forming or receiving samples of an audio signal, forming a feature vector from the samples of the audio signal, and projecting the feature vector by a discriminant vector to form a projection value for the feature vector. The invention also relates to a speech recognizer comprising a detector for detecting audio activity; a sampler for forming samples of an audio signal; a feature vector forming block to form a feature vector from the samples of the audio signal; and a discriminator for projecting the feature vector by a discriminant vector to form a projection value for the feature vector. The invention also relates to an electronic device comprising a sampler for forming or receiving samples of an audio signal; a feature vector forming block to form a feature vector from said samples of the audio signal; and a discriminator for projecting the feature vector by a discriminant vector to form a projection value for the feature vector. The invention also relates to a module for detection of audio activity comprising an input for receiving a projection value for a feature vector, which feature vector is formed from samples of an audio signal, and which projection value is formed by projecting the feature vector by a discriminant vector. The invention further relates to a computer program product comprising machine executable steps for detecting audio activity comprising forming or receiving samples of an audio signal; forming a feature vector from said samples of the audio signal; and projecting the feature vector by a discriminant vector to form a projection value for the feature vector.
The invention still further relates to a system comprising a sampler for forming or receiving samples of an audio signal; a feature vector forming block to form feature vectors from said samples of the audio signal; a discriminator for projecting the feature vector by a discriminant vector to form a projection value for the feature vector.
Detection of the Beginning of Audio Activity (BAA) and the End of Audio Activity (EAA) is an important feature in isolated word speech recognition systems, name dialling, SMS dictation for multimedia applications, general voice activity detection etc. The aim of BAA and EAA detection is to detect the time where the audio activity begins and ends as reliably and quickly as possible. When the BAA detection has been performed the recognizing system can start processing the detected audio signal. The processing can be ended after the EAA is detected. With reliable BAA and EAA detection unnecessary and costly computation done by the recognition system can be avoided. The recognition rate can also be improved since a noisy part possibly existing before the audio activity can be omitted.
Both the BAA and the EAA represent some kind of change in audio activity, wherein the term “change in audio activity” is used instead of BAA or EAA in some parts of this description.
Decoding in Automatic Speech Recognition (ASR) is a computationally expensive and time consuming task. It is useless to perform decoding for non-audio activity data and especially in noisy environments it can even cause performance degradation to the automatic audio activity recognition system. A simple but robust beginning of audio activity detection algorithm would be ideal for many automatic audio activity recognition tasks as listed above.
Many existing automatic audio activity recognition systems include a signal processing front-end that converts the audio activity waveform into feature parameters. One of the most used features is the Mel Frequency Cepstrum Coefficients (MFCC). Cepstrum is the Inverse Discrete Cosine Transform (IDCT) of the logarithm of the short-term power spectrum of the signal. One advantage of using such coefficients is that they reduce the dimension of an audio activity spectral vector.
In prior art systems there are also some other algorithm-related problems. For example, many algorithms usually work nicely in clean, noiseless environments, but if noise is present the algorithms can often fail even if the signal to noise ratio (SNR) of the audio activity signal is fairly high. The frequency spectrum coefficient (FCC) features and/or methods utilizing the energy of the signal that are commonly used in audio activity recognition clearly do not provide satisfactory features for beginning of audio activity detection. Additive noise is also difficult to compensate for.
Many different techniques have been developed for solving the BAA and EAA problem. For example, there exist many energy and zero crossing based methods in which the energy of the audio signal is measured and zero crossing points are detected. However, these methods often prove to be either unreliable, especially in noise, or unnecessarily complex. Often the BAA and EAA detection is obtained from an algorithm as a side effect. For example, the actual algorithm may be aimed at solving more general problems, like speech/non-speech detection or voice activity detection, which are not important features for the BAA or EAA detection. Stripping out the unnecessary parts does not lead to good performance, or may be impossible altogether.
One prior art method for speech/non-speech detection is disclosed in a publication “Robust speech/non-speech detection using LDA applied to MFCC”; Martin, A.; Charlet, D.; Mauuary, L.; Proceedings. (ICASSP '01). 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001, Volume: 1, 7-11 May 2001; Pages: 237-240 vol. 1. In this method a Linear Discriminant Analysis (LDA) is applied to MFCC.
However, there is room for improvement, as the feature vector normalization and the delta and delta-delta calculations introduce some delay to the decoding. With static MFCC features and without normalization there is no such delay, and much can be done within the delayed time window.
The invention tries to reduce the computational complexity of the recognition process for example for the SMS dictation task. Being able to discard non-audio activity data at the change in audio activity detection results in computational savings in decoding.
The present invention provides a way to utilize e.g. the Mel Frequency Cepstrum Coefficients feature more effectively in the audio activity detection. This is possible since MFCC calculation usually introduces some delay to the decoding because of dynamic coefficient calculations and feature vector normalization. Also the need for noise compensation can be avoided in a simple way, making the algorithm more robust against noise.
The invention is based on the idea that LDA is applied to MFCC features and that the threshold used in the determination of a change in audio activity is made adaptive. The adaptation of the threshold is based on calculating certain properties of feature vectors formed from the speech signal.
According to one aspect of the invention there is provided a method for detecting a change in audio activity comprising:
The method is primarily characterized in that the method further comprises
According to an example embodiment of the method of the present invention the BAA is detected when the first statistical value of projected values classified as audio activity is found among the projected values, and the EAA is detected when the first statistical value of projected values has remained below a threshold value for a predefined number of samples.
According to another aspect of the invention there is provided a speech recognizer comprising:
The speech recognizer is primarily characterized in that the speech recognizer further comprises
According to a third aspect of the invention there is provided an electronic device comprising:
The electronic device is primarily characterized in that the electronic device further comprises:
According to a fourth aspect of the invention there is provided a module for detection of audio activity comprising an input for receiving a projection value for a feature vector, which feature vector is formed from samples of an audio signal, and which projection value is formed by projecting the feature vector by a discriminant vector. The module is primarily characterized in that the module further comprises:
According to a fifth aspect of the invention there is provided a computer program product comprising machine executable steps for detecting audio activity comprising:
The computer program product is primarily characterized in that the method further comprises machine executable steps for:
According to a sixth aspect of the invention there is provided a system comprising:
The system is primarily characterized in that the system further comprises:
In an example embodiment of the present invention the statistical values are magnified, when necessary, by some value before performing the comparisons.
The present invention provides a less complicated method and system for audio activity detection compared to prior art. The performance of prior art systems is often insufficient if noise is present, at least compared to how much computation power they use.
a illustrates a sample of a speech signal,
b illustrates the behaviour of the mean LDA projection values in different environments,
a shows the main blocks of an audio activity detection according to the present invention,
b shows the beginning of audio activity block of
a shows the main blocks of EAA detection according to the present invention,
b shows the EAA block of
The invention will now be described in more detail with reference to the audio activity recognizer 1 presented in
The samples of the audio signal are input to the speech processor 1.4. In the speech processor 1.4 the samples are processed on a frame-by-frame basis, i.e. the samples of each frame are processed to perform feature extraction on the speech frame. The feature extraction forms a feature vector for each speech frame input to the speech recognizer 1. The coefficients of the feature vector relate to some sort of spectrally based features of the frame. The feature vectors are formed in a feature vector forming block 1.41 of the speech processor by using the samples of the audio signal. This block can be implemented e.g. as a set of filters, each having a certain bandwidth. Together the filters cover the whole bandwidth of the audio signal. The bandwidths of the filters may partly overlap with those of other filters. The outputs of the filters are transformed, such as discrete cosine transformed (DCT), wherein the result of the transformation is the feature vector. In this example embodiment of the present invention the feature vectors are 13-dimensional vectors, but it should be evident that the invention is not limited to such vectors only. In this example embodiment the feature vectors are Mel Frequency Cepstrum Coefficients.
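To illustrate the frame-by-frame feature extraction described above, the following is a minimal Python sketch of an MFCC-style front-end. It is not the implementation of the invention: the naive DFT, the crude rectangular partition of the spectrum (instead of a true mel-spaced triangular filterbank) and the frame length are simplifying assumptions for illustration only.

```python
import math

def dct(x):
    """DCT-II of a sequence (unnormalized), used to decorrelate log filterbank energies."""
    n = len(x)
    return [sum(x[k] * math.cos(math.pi * i * (k + 0.5) / n) for k in range(n))
            for i in range(n)]

def frame_features(frame, num_filters=20, num_coeffs=13):
    """Toy MFCC-style feature vector for one frame.

    The 'filterbank' here is a crude rectangular partition of the magnitude
    spectrum rather than a mel-spaced triangular bank -- an assumption made
    purely for illustration.
    """
    n = len(frame)
    # Magnitude spectrum via a naive DFT (first half of the bins).
    spectrum = []
    for k in range(n // 2):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        spectrum.append(math.hypot(re, im))
    # Partition the bins among the filters and take log energies.
    band = max(1, len(spectrum) // num_filters)
    energies = [sum(spectrum[i * band:(i + 1) * band]) for i in range(num_filters)]
    log_energies = [math.log(e + 1e-10) for e in energies]
    # DCT of the log energies; keep the first num_coeffs coefficients.
    return dct(log_energies)[:num_coeffs]
```

A 13-dimensional feature vector of this kind is what the projection step described next operates on.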
In the speech processor 1.4 the feature vectors are then projected to a one-dimensional line vector to represent each feature vector as a single value (block 502 in
The projected value is then used to determine which class the feature vector (i.e. the speech frame) belongs to. Block 501 in
In principle, it is now possible to set a threshold (THRESHOLD_DISTRIBUTIONS) based on these mean value distributions and state that BAA occurred if the LDA projection value is above the threshold. The selection of the discriminant vector can, inter alia, affect the way the result of the comparison should be interpreted.
The speech processor 1.4 also keeps track of the minimum (min) and maximum (max) mean values and the difference (diff) of the maximum and minimum mean values (block 506) while calculating the mean values. The minimum, maximum and the difference are used when determining the BAA for example as follows. After calculating the mean value, min, max and diff values for the current frame, the decision block 1.42 compares the maximum mean value to a predetermined high noise parameter (THRESHOLD_HIGH_NOISE) and if the maximum mean value is greater than the value of the high noise parameter the decision block 1.42 determines that the audio activity has already started or the noise level is high. The decision block 1.42 gives a signal indicative of beginning of audio activity to a speech decoder 1.43 of the speech recognizer 1. This signal triggers the speech decoding in the speech decoder 1.43. If, however, the maximum mean value is below the high noise parameter value, the decision block 1.42 compares the difference with a predetermined first min/max difference parameter value. This parameter is set so that if the difference between the maximum and minimum mean values is greater than the first min/max difference parameter value it is assumed that audio activity has already started. When the difference between the maximum and minimum mean values does not exceed the first min/max difference parameter value and the maximum mean value is less than the high noise parameter value, the decision block 1.42 compares the difference value with a predetermined second min/max difference parameter value. If the comparison indicates that the difference value is greater than the second min/max difference parameter value the decision block 1.42 compares the projection value of the current frame with a distribution threshold value. If the projection value of the current frame is greater than the distribution threshold value the decision block 1.42 determines that audio activity has started.
It should be evident that the comparison of the projection value of the current frame and the distribution threshold value can also be performed before the comparison of the second min/max parameter values and the difference value. Also the order of the other comparisons need not be the same as mentioned above. If none of the comparisons mentioned above produce a BAA indication the procedure will be repeated for the next frame, if not stopped for some other reason.
The decision criteria for the BAA triggering in the example embodiment of the invention described above can also be represented as the following pseudo code:
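The pseudo code itself is not reproduced here; as a non-authoritative reconstruction, the decision criteria described above can be sketched in Python as follows. The constant names and their values are hypothetical placeholders (the description refers to them only as the high noise parameter and the first and second min/max difference parameters).

```python
# Hypothetical threshold values -- the description does not fix these constants.
THRESHOLD_HIGH_NOISE = 2.0
THRESHOLD_MIN_MAX_DIFF_1 = 1.5   # first min/max difference parameter
THRESHOLD_MIN_MAX_DIFF_2 = 0.8   # second min/max difference parameter
THRESHOLD_DISTRIBUTIONS = 1.0

def baa_triggered(max_mean, min_mean, projection):
    """Sketch of the BAA decision criteria of block 1.42 for one frame."""
    diff = max_mean - min_mean
    if max_mean > THRESHOLD_HIGH_NOISE:
        # Audio activity has already started, or the noise level is high.
        return True
    if diff > THRESHOLD_MIN_MAX_DIFF_1:
        # Large min/max spread: activity is assumed to have started.
        return True
    if diff > THRESHOLD_MIN_MAX_DIFF_2 and projection > THRESHOLD_DISTRIBUTIONS:
        # Moderate spread combined with a high projection value for this frame.
        return True
    return False
```

As noted above, the order of the comparisons may be varied without departing from the idea.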
It is possible to use the same means for EAA detection. This is illustrated in
The decision criteria for EAA triggering in the example embodiment of the invention described above can also be represented as the following pseudo code:
Given that the BAA has already been detected or it is otherwise desired to start detecting the EAA:
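As a non-authoritative sketch of that pseudo code, the loop described below can be written in Python as follows. The threshold value and the number N of unbroken positive decisions are hypothetical, and the parameter naming assumes the EAA variant of the min/max difference threshold.

```python
INF = float("inf")
THRESHOLD_MIN_MAX_DIFF_EAA = 0.5  # hypothetical value of the third min/max difference parameter
N = 30                            # hypothetical number of unbroken positive EAA decisions

def detect_eaa(mean_lda_values, global_min):
    """Sketch of the EAA detection loop: returns the frame index at which
    EAA is declared, or None if the frames run out first."""
    cntr = 0        # counts consecutive non-audio-activity frames
    max_val = INF   # initialized to INF; tracks the smallest mean LDA value seen
    for i, mean_lda in enumerate(mean_lda_values):
        if mean_lda < max_val:
            max_val = mean_lda
        if max_val - global_min < THRESHOLD_MIN_MAX_DIFF_EAA:
            cntr += 1   # positive EAA decision for this frame
        else:
            cntr = 0    # the run of positive decisions is broken
        if cntr > N:
            return i    # EAA = True: audio activity has ended
    return None
```

The counter thus differentiates short pauses between words, which break the run, from a genuine end of audio activity.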
In the pseudo code the counter Cntr is cleared. The purpose of the counter Cntr is to count the number of non-audio activity frames in order to differentiate pauses between words from the EAA. The maximum parameter Max is set to a maximum value INF because the purpose of the maximum parameter is to find the smallest of the maximum mean LDA values (or the smallest maximum of some other statistical value). This smallest maximum value is then used in the min-max difference analysis. The pseudo code comprises a loop which is repeated for each frame until the condition to exit the loop is found. In the loop the mean LDA value of the current frame is compared with the value of the maximum parameter Max. If the mean LDA value is smaller than the value of Max, the mean LDA value is set as the new value of the maximum parameter Max. Then, the difference of the maximum Max and the minimum Min is compared with the third min/max difference parameter THRESHOLD_MIN_MAX_DIFF_EAA. In the calculation of the min/max difference the minimum value is the global minimum value (i.e. the smallest of the mean LDA values). If the difference is smaller than the value of the third min/max difference parameter THRESHOLD_MIN_MAX_DIFF_EAA the counter Cntr is increased. Otherwise the counter is cleared. At the end of the loop the value of the counter Cntr is compared with the predefined number N of unbroken positive EAA decisions. If the value of the counter Cntr is greater than the predefined number N it is determined that the audio activity has ended, the EAA parameter is set true and the loop is exited. Otherwise the loop is repeated for the next frame.
The frames are not necessarily examined continuously but in groups. For example, the speech recognizer 1 buffers forty-seven frames and after that begins the calculation and BAA detection. If no audio activity is detected on the buffered frames the speech recognizer 1 buffers the next forty-seven frames and repeats the calculation and BAA detection.
When the speech recognizer 1 has detected the BAA frame (i.e. the frame in which the BAA had the value true) it informs the speech decoder 1.43 of the frame number so that the speech decoder 1.43 can begin the decoding of the speech. It is also possible that the speech decoder 1.43 starts the decoding a predefined number of frames before the BAA frame. Similarly the speech decoder ends decoding a predefined number of frames after the EAA detection. It should be noted that these predefined numbers of frames are not necessarily fixed constants, but may vary e.g. according to a threshold or some other time-dependent variable.
It is also possible to use another statistical value instead of the mean value in the calculations described above.
It should be noted that the data used in training the discriminant vector in the case of
The discriminant vector corresponding to the
v=(−0.0205, −0.1355, −0.0292, −0.1206, 0.0060, 0.0863, −0.0407, −0.2307, −0.1286, −0.2852, −0.1591, −0.2092, −0.8581)T
the corresponding MFCC feature vector components being c1, c2, . . . , c12 and c0, respectively. If the component values of the vector v are interpreted as weights on how much each MFCC component (variable) contributes to the LDA projection, it can be seen that the energy term c0 clearly dominates. Indeed, it was noted during the development work that, in noise, the distributions tend to move (“slide”) in the direction of the speech distribution. This is illustrated in
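To make the projection step concrete, the following sketch computes the LDA projection value as a dot product with the discriminant vector v given above. Only the weights come from the description; the feature vector used in any example is hypothetical.

```python
# Discriminant vector v from the description above; component order c1..c12, c0,
# so the last weight (-0.8581) multiplies the energy term c0.
V = (-0.0205, -0.1355, -0.0292, -0.1206, 0.0060, 0.0863, -0.0407,
     -0.2307, -0.1286, -0.2852, -0.1591, -0.2092, -0.8581)

def lda_projection(feature_vector):
    """Project a 13-dimensional MFCC feature vector onto the discriminant
    vector: one dot product per frame yields the single projection value."""
    return sum(v * c for v, c in zip(V, feature_vector))
```

Because the weight on c0 dominates, the projection value is driven largely by the frame energy, which is consistent with the sliding behaviour in noise discussed here.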
Adding noise to the speech sample does not necessarily mean that the minimum and maximum values are shifted up (in case similar to the
In the following another embodiment of the operation of the decision block 1.42 will be described. After calculating the mean value, min, max and diff values for the current frame, the decision block 1.42 compares the maximum mean value to a predetermined high noise parameter in the same way as in the embodiment described above. If the maximum mean value is below the high noise parameter value, the decision block 1.42 examines the noise level of the signal. This can be performed, for example, by examining the minimum mean value. If the minimum mean value is relatively high it can be assumed that there is noise in the signal. The decision block 1.42 can compare the minimum mean value with a noise level parameter, and if the minimum mean value exceeds the noise level parameter the decision block 1.42 continues the operation as follows. The decision block 1.42 buffers the mean values of the frames under consideration, calculates the median of the mean values and compares it with the distribution threshold. If the median is greater than the distribution threshold, the mean values are magnified (multiplied) by some constant greater than one (e.g. by 2) before performing the comparisons as described above. Also some linear or non-linear functions could be considered here, so that the magnification is not based on a constant value but on a function. This way it is possible to set a common threshold (THRESHOLD_MIN_MAX_DIFF) for the differences between the min and max mean LDA values in both clean and noisy conditions. This can be seen from the results below.
In the clean case the differences between the non-speech and speech parts are quite obvious if the mean LDA projection values are checked. A threshold could be set on the mean LDA values so that the speech and non-speech parts are well separated. However, since there is shifting due to the noise, the difference between the minimum and maximum values is checked as a function of time. Because of the shrinking in noise, the mean LDA projection values are multiplied by some constant if the noise is considered to be substantial. For this reason the mean LDA values are buffered and their median is calculated when there is no delay compared to the decoding. The median value is compared against the threshold and the mean LDA values are magnified by some constant greater than one (e.g. two) if the threshold is exceeded (also some linear functions could be considered). This way it is possible to set a common threshold (THRESHOLD_MIN_MAX_DIFF) for the differences between the min and max mean LDA values in both clean and noisy conditions. This can be seen from the results below.
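A minimal sketch of this median-based magnification, assuming hypothetical values for the noise level parameter, the distribution threshold and the magnification constant:

```python
import statistics

THRESHOLD_DISTRIBUTIONS = 1.0  # hypothetical distribution threshold
MAGNIFICATION = 2.0            # constant greater than one, e.g. two
NOISE_LEVEL = 0.3              # hypothetical noise level parameter

def compensate_means(mean_values, min_mean):
    """Magnify buffered mean LDA values when noise is considered substantial,
    so that a common min/max-difference threshold works in clean and noisy cases."""
    if min_mean > NOISE_LEVEL:  # a relatively high minimum suggests noise
        if statistics.median(mean_values) > THRESHOLD_DISTRIBUTIONS:
            return [m * MAGNIFICATION for m in mean_values]
    return list(mean_values)
```

A function of the median, rather than a fixed constant, could be substituted for MAGNIFICATION as the description notes.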
In the following, some details about the tests performed with a speech recognizer 1 are described. For training the LDA projection vector some speech samples from a name database were used. Only clean training data was used. Multi-environment training could be tried, but the shifting phenomenon would still be present with high probability. Static MFCC feature vectors were obtained from the recognizer for each speech sample and read into a Matlab program. Based on the speech starting point (SSP) labels that were generated with the recognizer, the feature vectors were divided into two classes. The non-speech class was formed directly from the feature vectors that were observed before the point SSP−10. The reason for the minus ten is that the transition part could then be better discarded. For the speech class only 40 feature vectors per file, taken just after the point SSP+10, were qualified. The reason for this was that with this arrangement there would be practically only feature vectors corresponding to speech frames. Then LDA was performed on the feature vectors with the class information and the number of discriminant vectors was set to one. The discriminant vector was then used in the algorithm described in this application. The speech and non-speech distributions for the training data are close to those shown in
It is also possible to implement the present invention as a module which can be connected with an electronic device comprising a speech decoder, wherein the module produces a change in audio activity indication to the electronic device or directly to the speech decoder of the electronic device. One example embodiment of the module comprises the decision block 1.42 but also other constructions for the module are possible.
The electronic device in which the invention can also be implemented can be any electronic device in which the change in audio activity detection may be used. In
The present invention can be used in detection of changes in audio activity. For example, the invention can be used for BAA detection, for EAA detection or for both BAA and EAA detection.
It should be understood that the present invention is not limited solely to the above described embodiments but it can be modified within the scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
20045146 | Apr 2004 | FI | national |