The present invention relates to audio signal processing and in particular to a method for detecting whether a signal includes speech or music to select appropriate signal processing.
Speech enhancement has been a long-standing problem for broadcast content. Dialogue becomes harder to understand in noisy environments or when mixed with other sound effects. Any static post-processing (e.g., a fixed parametric equalizer) applied to the program material may improve the intelligibility of the dialogue but may introduce undesirable artifacts into the non-speech portions. Known methods of classifying signal content as speech or music have not provided adequate accuracy.
The present invention addresses the above and other needs by providing a speech/music discrimination method which evaluates the standard deviation of the separations between envelope peaks, a loudness ratio, and a smoothed energy difference. The envelope is searched for peaks above a threshold. The standard deviation of the separations between peaks is calculated. A lower standard deviation is indicative of speech; a higher standard deviation is indicative of non-speech. The ratio between minimum and maximum loudness in recent input signal data frames is calculated. If this ratio corresponds to the dynamic range characteristic of speech, it is another indication that the input signal is speech content. Smoothed energies of the frames from the left and right input channels are computed and compared. Similar (e.g., highly correlated) left and right channel smoothed energies are indicative of speech. Dissimilar (e.g., uncorrelated) left and right channel smoothed energies are indicative of non-speech material. The results of the three tests are compared to make a speech/music decision.
In accordance with one aspect of the invention, there is provided a method for classifying signal content as speech or non-speech in real time. The classification can be used with other post processing enhancement algorithms enabling selective enhancement of speech content, including (but not limited to) frequency-based equalization.
In accordance with another aspect of the invention, there is provided a method for classifying signal content as speech or non-speech in real time by evaluating the standard deviation of the separations between envelope peaks. Frames of N samples of an input signal are constructed. The left and right channels of the input signal are band limited. A high-frequency roll-off point (e.g., 4 kHz) is determined by the highest meaningful frequencies of human speech. The low-end roll-off is significantly higher than the fundamental (lowest) frequencies of human speech, but is low enough to capture important vocal cues. The band limited left and right channels are used as the two inputs to a Least Mean Squares (LMS) filter. The LMS filter (with appropriate step size and filter order parameters) has two outputs: the correlated content of the left and right channels, and an error signal. The absolute values of the correlated content are taken and normalized by the loudness of the LMS filter's output frame to construct an envelope, where the loudness of a frame is the energy within the frame of data, weighted by a perceptual loudness filter. The envelope is searched for peaks above a specified threshold. The standard deviation of the separations between peaks is calculated. A decrease in this standard deviation is indicative of speech, whereas a higher standard deviation is indicative of non-speech material.
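By way of illustration only, the LMS filtering, envelope construction, and peak-separation steps described above may be sketched in Python as follows. This is a minimal sketch under stated assumptions and not a definitive implementation: the function names, the choice of the left channel as the LMS reference input, and the convention of returning `None` when too few peaks are found are all hypothetical, and the step size, filter order, and threshold are left to the caller because the description does not fix their values.

```python
import statistics


def lms_correlated(left, right, order, mu):
    """Adapt an FIR filter so that the filtered `left` channel tracks `right`.

    Returns (correlated, error): the filter output and the residual.
    Treating `left` as the reference input is an assumption of this sketch.
    """
    w = [0.0] * order
    correlated, error = [], []
    for n in range(len(left)):
        # Most recent `order` reference samples, newest first, zero-padded.
        x = [left[n - k] if n - k >= 0 else 0.0 for k in range(order)]
        y = sum(wi * xi for wi, xi in zip(w, x))
        e = right[n] - y
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]  # standard LMS update
        correlated.append(y)
        error.append(e)
    return correlated, error


def envelope(correlated, loudness):
    """Rectify the correlated content and normalize by the frame loudness."""
    if loudness <= 0.0:
        return [0.0] * len(correlated)
    return [abs(v) / loudness for v in correlated]


def peak_separation_std(env, threshold):
    """Standard deviation of the gaps (in samples) between peaks above threshold.

    Returns None when fewer than two gaps are available (hypothetical convention).
    """
    peaks = [
        i for i in range(1, len(env) - 1)
        if env[i] > threshold and env[i] > env[i - 1] and env[i] >= env[i + 1]
    ]
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    if len(gaps) < 2:
        return None
    return statistics.pstdev(gaps)
```

Evenly spaced peaks yield a standard deviation near zero, which under the description above points toward speech, while irregularly spaced peaks yield a larger value.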
In accordance with yet another aspect of the invention, there is provided a method for classifying signal content as speech or non-speech in real time based on loudness ratios. The energy (RMS value) of each frame of the LMS filtered data is calculated, weighted by a perceptual loudness filter to obtain a measure of the loudness perceived by a typical human listener, and stored in a buffer. The buffer contains the M most recent energy calculations (the length M of the buffer is dictated by the longest gap between syllables in speech). The ratio between the maximum and minimum values in the buffer is calculated. If this ratio corresponds to the dynamic range characteristic of speech, it is another indication that the input signal is speech content.
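The buffered loudness-ratio test may be sketched, by way of example only, as follows. The class name `LoudnessRatioTracker`, the small `floor` used to avoid division by zero, and the `low`/`high` bounds of the speech range are hypothetical; the description does not give numeric values for M or for the dynamic range characteristic of speech. The perceptual weighting is omitted here for brevity, so the sketch operates on plain RMS values.

```python
from collections import deque
import math


class LoudnessRatioTracker:
    """Keeps the M most recent frame energies and reports their max/min ratio."""

    def __init__(self, m, floor=1e-12):
        self.buffer = deque(maxlen=m)  # oldest value drops out automatically
        self.floor = floor             # hypothetical guard against silent frames

    def update(self, frame):
        rms = math.sqrt(sum(v * v for v in frame) / len(frame))
        self.buffer.append(max(rms, self.floor))
        return max(self.buffer) / min(self.buffer)


def in_speech_range(ratio, low, high):
    """True when the ratio falls inside the caller-supplied speech range."""
    return low <= ratio <= high
```

A `deque` with `maxlen=m` is one simple way to hold the M most recent values; any ring buffer would serve equally well.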
In accordance with still another aspect of the invention, there is provided a method for classifying signal content as speech or non-speech in real time based on the smoothed energy difference between input channels. Smoothed energies of the frames from the left and right input channels are computed and compared. Similar (e.g., highly correlated) left and right channel smoothed energies are indicative of speech. Dissimilar (e.g., uncorrelated) left and right channel smoothed energies are indicative of non-speech material.
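One possible way to compute and compare the smoothed channel energies is sketched below. The description does not specify the smoother or the comparison metric, so the one-pole (exponential) smoother, the relative-difference measure, and the parameters `alpha` and `max_rel_diff` are assumptions made purely for illustration.

```python
class ChannelEnergyComparator:
    """Smooths per-frame left/right energies and compares them."""

    def __init__(self, alpha, max_rel_diff):
        self.alpha = alpha                # smoothing factor in [0, 1); assumed
        self.max_rel_diff = max_rel_diff  # hypothetical similarity threshold
        self.left = 0.0
        self.right = 0.0

    @staticmethod
    def _energy(frame):
        return sum(v * v for v in frame) / len(frame)

    def update(self, left_frame, right_frame):
        """Returns (relative_difference, similar_flag) for the latest frame."""
        a = self.alpha
        self.left = a * self.left + (1.0 - a) * self._energy(left_frame)
        self.right = a * self.right + (1.0 - a) * self._energy(right_frame)
        total = self.left + self.right
        rel = abs(self.left - self.right) / total if total > 0.0 else 0.0
        return rel, rel <= self.max_rel_diff
```

With this measure, identical channels give a relative difference of 0 and a signal present in only one channel gives 1.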
The above and other aspects, features and advantages of the present invention will be more apparent from the following more particular description thereof, presented in conjunction with the following drawings wherein:
Corresponding reference characters indicate corresponding components throughout the several views of the drawings.
The following description is of the best mode presently contemplated for carrying out the invention. This description is not to be taken in a limiting sense, but is made merely for the purpose of describing one or more preferred embodiments of the invention. The scope of the invention should be determined with reference to the claims.
Where the terms “about” or “generally” are associated with an element of the invention, it is intended to describe a feature's appearance to the human eye or human perception, and not a precise measurement.
A method for classifying speech/music content of a signal according to the present invention is shown in
The correlated data frames 20 are further provided to a loudness ratio calculation 30. The energy of each correlated data frame 20 from the LMS filter 18 is calculated and weighted with a perceptual loudness filter, namely the Revised Low-frequency B (RLB) weighting curve of the International Telecommunication Union (ITU) standard ITU-R BS.1770-2. The ratio between the maximum and minimum values in a buffer of recent weighted energies is calculated. If the ratio corresponds to the dynamic range characteristic of speech, it is another indication that the input signal is speech content, and a corresponding loudness ratio flag or score 32 is produced.
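The RLB weighting may be realized as a second-order high-pass (biquad) filter applied before the energy is measured. The sketch below is illustrative only: the class name `RlbWeighting` is hypothetical, and the coefficients are those commonly quoted for the RLB stage at a 48 kHz sampling rate, which should be checked against the standard before use. Other sampling rates require different coefficients.

```python
# RLB high-pass coefficients as commonly quoted for 48 kHz sampling
# (assumption: verify against ITU-R BS.1770 before relying on them).
B0, B1, B2 = 1.0, -2.0, 1.0
A1, A2 = -1.99004745483398, 0.99007225036621


class RlbWeighting:
    """Direct-form I biquad whose state persists across frames."""

    def __init__(self):
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0

    def process(self, frame):
        out = []
        for x in frame:
            y = (B0 * x + B1 * self.x1 + B2 * self.x2
                 - A1 * self.y1 - A2 * self.y2)
            self.x2, self.x1 = self.x1, x
            self.y2, self.y1 = self.y1, y
            out.append(y)
        return out

    def energy(self, frame):
        """Mean-square energy of the weighted frame."""
        weighted = self.process(frame)
        return sum(v * v for v in weighted) / len(weighted)
```

Because the filter is a high-pass, a constant (DC) input decays to zero at the output, while content well above the low-frequency roll-off passes essentially unchanged.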
The input signal 12 is further provided to a left-right energy calculation 34 to produce channel energies 36. The channel energies 36 are smoothed by smoother 38 to produce smoothed energies 40 of the frames from the left and right input channels, which are then compared. The smoothed left and right channel energies 40 may be compared by comparator 42 to provide a speech/non-speech flag 43, or the smoothed energies 40 of the left and right channels may be provided as a signal 43 for use in the weighted decision process. Similar (e.g., highly correlated) left and right channel smoothed energies are indicative of speech. Dissimilar (e.g., uncorrelated) left and right channel smoothed energies are indicative of non-speech material, and a left-right channel energy flag or score 43 is produced.
While processing steps such as the comparator 42 are shown as separate steps, those skilled in the art will recognize that reallocation of the processing steps is within the scope of the present invention. For example, the step of comparing the left and right channel energies described in the comparator 42 can be reallocated to the decision block 44.
The peak separation flag or score 28, the loudness ratio flag or score 32, and the left-right channel energy flag or score 43 are provided to a decision block 44 where a speech versus music decision 45 is made for each frame of input data 12. The speech versus music decision 45 is provided to signal processing 46 which also receives the input signal 12. The signal processing 46 applies processing to the input signal 12 based on the speech versus music decision 45 to produce a processed signal 47. For example, speech-specific frequency-based equalization may be applied when the speech versus music decision 45 indicates that the input signal 12 includes speech. An example of speech-specific frequency-based equalization is a parametric EQ filter with variable gain at a fixed frequency. When the decision block 44 outputs a speech flag 45 set to TRUE, the parametric EQ filter may be enabled to enhance the intelligibility of speech. The decision flag could also be combined with other dynamic processing techniques such as compressors and limiters.
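For illustration, the decision block and the gated equalization may be sketched as follows. The weighted-average rule, the default weights, and the 0.5 threshold in `decide_speech` are hypothetical, as the description does not state how the three scores are combined. The peaking filter uses the widely published Audio EQ Cookbook biquad formulas as a stand-in for the parametric EQ; the center frequency, gain, and Q are caller-supplied assumptions.

```python
import math


def decide_speech(peak_score, loudness_score, energy_score,
                  weights=(1.0, 1.0, 1.0), threshold=0.5):
    """Weighted average of three scores in [0, 1], compared to a threshold."""
    scores = (peak_score, loudness_score, energy_score)
    total = sum(w * s for w, s in zip(weights, scores))
    return total / sum(weights) >= threshold


def peaking_eq_coeffs(fs, f0, gain_db, q):
    """Peaking-EQ biquad coefficients (Audio EQ Cookbook), normalized by a0."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha / a_lin
    b = ((1.0 + alpha * a_lin) / a0,
         -2.0 * math.cos(w0) / a0,
         (1.0 - alpha * a_lin) / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha / a_lin) / a0)
    return b, a


def biquad(samples, b, a):
    """Direct-form I biquad; filter state starts at zero on every call."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def process_frame(frame, is_speech, b, a):
    """Apply the EQ only when the frame was classified as speech."""
    return biquad(frame, b, a) if is_speech else list(frame)
```

A practical implementation would likely carry the filter state across frames and cross-fade when the flag changes, to avoid audible discontinuities; this sketch resets the state on each call for simplicity.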
The processed signal 47 is provided to a transducer 48 (for example an audio speaker) which produces sound waves 49.
The input signal 12 is broken into frames of N samples and the frames are processed by a band-pass filter 14 producing band limited signal frames 16. A high-frequency roll-off point (e.g., 4 kHz) is determined by the highest meaningful frequencies of human speech. The low-end roll-off is significantly higher than the fundamental (lowest) frequencies of human speech, but is low enough to capture important vocal cues.
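The framing and band limiting may be sketched as below, by way of example only. The description does not specify the filter type, so a Hamming-windowed sinc FIR band-pass is assumed here; the function names, the tap count, and any low cutoff value passed by the caller are hypothetical. For simplicity the sketch filters the whole signal and then splits it into frames, which is equivalent to per-frame filtering with the filter state carried across frames.

```python
import math


def split_frames(signal, n):
    """Split a signal into consecutive full frames of n samples."""
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def bandpass_fir(low_hz, high_hz, fs, taps=101):
    """Windowed-sinc band-pass: low-pass(high_hz) minus low-pass(low_hz)."""
    if taps % 2 == 0:
        raise ValueError("taps must be odd")
    mid = (taps - 1) // 2

    def lowpass(fc):
        h = []
        for n in range(taps):
            k = n - mid
            if k == 0:
                ideal = 2.0 * fc / fs
            else:
                ideal = math.sin(2.0 * math.pi * fc * k / fs) / (math.pi * k)
            window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (taps - 1))
            h.append(ideal * window)  # Hamming-windowed ideal response
        return h

    return [hi - lo for hi, lo in zip(lowpass(high_hz), lowpass(low_hz))]


def fir_filter(signal, h):
    """Causal FIR convolution; output has the same length as the input."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k in range(min(len(h), n + 1)):
            acc += h[k] * signal[n - k]
        out.append(acc)
    return out
```

For example, `split_frames(fir_filter(channel, bandpass_fir(300.0, 4000.0, 16000.0)), 512)` would yield band limited frames of 512 samples, where the 300 Hz low cutoff, the 16 kHz sampling rate, and the frame length are illustrative values only.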
The LMS filter 18 of the method for classifying speech/music content of a signal is shown in
The method for obtaining a standard deviation of correlated left and right channel content is described in more detail in
A method for calculating a ratio between maximum and minimum values in recent data buffers is described in
A method for computing and comparing smoothed energies of the frames from the left and right input channels is described in
The method 44 for making a speech/music classification based on the peak separation flag or score 28, the loudness ratio flag or score 32, and the left-right channel energy flag or score 43, is shown in
While the invention herein disclosed has been described by means of specific embodiments and applications thereof, numerous modifications and variations could be made thereto by those skilled in the art without departing from the scope of the invention set forth in the claims.