1. Field of the Invention
This invention relates generally to audio files that have been processed using compression algorithms, and, more particularly, to a technique for the automatic classification of the compressed audio file contents.
2. Background of the Invention
With advances in auditory masking theory, quantization techniques, and data compression techniques, lossy compression of audio files has become the processing method of choice for the storage and streaming of audio files. Compression schemes with various degrees of complexity, compression ratios, and quality have evolved. The availability of these compression schemes has driven, and been driven by, the internet and portable audio devices. Several large databases of compressed audio music files exist on the internet (e.g., from online stores). On a smaller scale, compressed audio music files are present on computers and portable devices around the globe. While classification schemes exist for MIDI music files and speech files, few schemes address the problem of identification and retrieval of audio content from compressed music database files. One attempt at classification of compressed audio files is the MPEG-7 standard. This standard is directed to providing a set of low level and high level descriptors that can facilitate content indexing and retrieval.
In the past, the centroid and energy levels of the frequency-domain data of MPEG (Moving Picture Experts Group) encoded files, along with nearest neighbor classifiers, have been used as descriptors. This system has been further enhanced by including a framework for discrimination of compressed audio files based on semi-automatic methods, the system including the ability of the user to add more audio features. In addition, a classification for MPEG-1 audio and television broadcasts using class-based segmentation (i.e., silence, speech, music, applause) has been proposed. A similar proposal compares GMM (Gaussian Mixture Models) and tree-based VQ (Vector Quantization) descriptors for classifying MPEG encoded data.
The data in the compressed audio files are in the form of frequency magnitudes. The entire range of frequencies audible to the human ear is divided into sub-bands, and the data in the compressed file is divided accordingly. Specifically, in the MP3 format, the data is divided into 32 sub-bands. (In this format, each sub-band can be further divided into 18 frequency bands referred to as split sub-bands.) Each sub-band can be treated according to its masking capabilities. (Masking capability is the ability of a particular frame of audio data to mask the audio noise resulting from compression of the data. For example, a signal can be encoded with 8 bits instead of 16 bits, at the cost of additional noise.) Audio compression algorithms also provide flags for the detection of attacks in a music piece. Because an energy calculation is already performed in the encoder, the flagging of attacks can be used as an indication of rhythm, e.g., drum beats. Drum beats form the background music in most titles in music databases, and most audiences tend to identify the characteristics of drum beats as rhythm. Because rhythm plays an important role in identifying any music, the way in which compression algorithms flag attacks is important. In present encoders, including MP3, pre-echo conditions (i.e., conditions resulting from analyzing the audio in fixed blocks rather than as a long stream) are handled by switching to a shorter window than the one that would otherwise be used. In some encoders, such as ATRAC (Adaptive Transform Acoustic Coding), pre-echo is handled by gain control in the time domain. In AAC (Advanced Audio Coding) encoders, both methods are used.
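The sub-band layout described above can be illustrated with a minimal Python sketch. It assumes that one block of MP3 frequency-domain data is available as a flat list of 32 × 18 = 576 magnitudes ordered from the lowest to the highest frequency; the function names and the use of mean squared magnitude as the power measure are illustrative assumptions, not part of the described encoder.

```python
SUBBANDS = 32            # sub-bands in the MP3 format (from the text)
SPLITS_PER_SUBBAND = 18  # split sub-bands per sub-band (from the text)


def split_into_subbands(coeffs):
    """Group 576 frequency magnitudes into 32 sub-bands of 18 split sub-bands.

    Assumes `coeffs` is ordered from lowest to highest frequency.
    """
    expected = SUBBANDS * SPLITS_PER_SUBBAND
    if len(coeffs) != expected:
        raise ValueError(f"expected {expected} coefficients, got {len(coeffs)}")
    return [coeffs[i:i + SPLITS_PER_SUBBAND]
            for i in range(0, expected, SPLITS_PER_SUBBAND)]


def subband_power(subband):
    """Mean squared magnitude of one sub-band (an assumed power measure)."""
    return sum(x * x for x in subband) / len(subband)
```

With this layout, the first list returned by the hypothetical `split_into_subbands` holds the 18 split sub-bands of the lowest-frequency sub-band.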
The techniques implemented and proposed for classifying compressed audio files in the related art have a variety of shortcomings. The computational complexity is high in most of the schemes of the related art. Therefore, these schemes may be applicable only to music file servers and not to generic internet applications. The schemes typically are not directly applicable to compressed audio files. In addition, most of the schemes decode the compressed data back to the time domain and apply techniques that have been proven in the time domain. Thus, these schemes do not take advantage of the features and parameters already available in the compressed files. In the schemes that do make use of data in the compressed format, the frequency data alone is used and not the information available as side-information descriptors. The use of side-information descriptors eliminates a large amount of computation.
A need has therefore been felt for apparatus and an associated method having the feature that the identification and classification of compressed audio files can be implemented. It would be a further feature of the apparatus and associated method to provide for the classification and identification of compressed audio files in a relatively short period of time. It would be a still further feature of the apparatus and associated method to provide for the classification and identification of compressed audio files at least partially using parameters generated as a result of compressing the audio file. It would be a still further feature of the apparatus and associated method to generate parameters describing a compressed audio file. It would be a more particular feature of the apparatus and associated method to compare a compressed reference audio file with at least one other compressed audio file. It would be yet another particular feature of the present invention to compare parameters generated from a first compressed audio file with parameters from a second compressed audio file.
The aforementioned and other features are accomplished, according to the present invention, by classifying each audio file by means of a group of parameters. The original audio file is divided into frames and each frame is compressed by means of a psycho-acoustic algorithm, the resulting frames being in the frequency domain. The resulting frames are divided into frequency sub-bands. For each sub-band, a parameter identifying the average spectral power over all the frames is generated. The set of parameters for all of the sub-bands can be used to classify the audio file and to compare the audio file with other audio files. To improve the effectiveness of the parameters, the sub-bands can be further divided into split sub-bands. In addition, because the auditory response is more sensitive at lower frequencies, the split sub-band spectral power for at least one of the lowest order sub-bands can be separately used as parameters. These parameters can be used in conjunction with corresponding parameters for a second audio file to determine the similarity between the audio files by taking the difference between the parameters. The process can be further refined by incorporating weighting factors in the calculation. The psycho-acoustic compression typically generates side-information relating to the rhythm of a musical audio file. This side-information can be used in determining the similarity between two files.
Other features and advantages of the present invention will be more clearly understood upon reading the following description and the accompanying drawings and the claims.
1. Detailed Description of the Figures
= 0, if e > Δ/2
= 0, if e > Δ/2
The coefficients a and b have been calculated empirically, with a > b to account for the greater importance accorded by the human auditory system to lower frequency sounds.
The parameters used in the foregoing pseudo code are illustrated in
2. Operation of the Preferred Embodiment
The present invention can be understood as follows. An audio file is divided into frames in the time domain. Each frame is compressed according to a psycho-acoustic algorithm. The compressed file is then divided into sub-bands and each sub-band is further divided into split sub-bands. The power in each sub-band is averaged over all of the frames. The average power for each sub-band is then a parameter against which a corresponding parameter for a separate file can be compared. The parameters for all of the sub-bands are compared by determining a difference between the corresponding parameters. The accumulated difference between the parameters determines a measure of the similarity of the two audio files.
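The basic procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each frame is represented as a list of sub-bands, each sub-band is a list of frequency magnitudes, power is taken as the mean squared magnitude, and the difference between parameters is accumulated as a sum of absolute differences. The function names are hypothetical, and the source does not specify the exact form of the difference.

```python
def average_subband_power(frames):
    """Return one parameter per sub-band: its power averaged over all frames.

    `frames` is a list of frames; each frame is a list of sub-bands;
    each sub-band is a list of frequency magnitudes (assumed layout).
    """
    if not frames:
        raise ValueError("need at least one frame")
    n_bands = len(frames[0])
    totals = [0.0] * n_bands
    for frame in frames:
        for b, band in enumerate(frame):
            # Mean squared magnitude as the sub-band power (an assumption).
            totals[b] += sum(x * x for x in band) / len(band)
    return [t / len(frames) for t in totals]


def accumulated_difference(params_a, params_b):
    """Sum of absolute differences between corresponding parameters.

    A smaller value indicates more similar files. The absolute value is an
    assumed choice of difference measure.
    """
    if len(params_a) != len(params_b):
        raise ValueError("parameter sets must have the same length")
    return sum(abs(p - q) for p, q in zip(params_a, params_b))
```

Two files would then be compared by calling `average_subband_power` on each file's frames and passing the two results to `accumulated_difference`.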
The foregoing procedure can be refined to provide a more accurate comparison of two files. Because the ear is more sensitive to the lower frequency components of the audio file, the difference between the powers in the individual split sub-bands of the first two sub-bands is determined rather than the difference between the average powers in those sub-bands. Thus, greater weight is given to the power in the first two sub-bands. Similarly, empirical weighting factors can be incorporated in the comparison to refine the technique further.
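One possible form of the refined comparison is sketched below in Python. The sketch assumes that each file is described by frame-averaged powers for every split sub-band, grouped by sub-band. How the weighting factors enter the calculation is an assumption made for illustration, as are the placeholder values `a=2.0` and `b=1.0`; the text states only that the factors are empirical and that a > b.

```python
LOW_BANDS = 2  # sub-bands compared at split sub-band resolution (from the text)


def refined_difference(split_a, split_b, a=2.0, b=1.0):
    """Weighted difference between two files' split sub-band powers.

    `split_a` and `split_b` are lists of sub-bands; each sub-band is a list of
    frame-averaged split sub-band powers. The weights `a` and `b` are
    hypothetical placeholder values, with a > b favouring low frequencies.
    """
    if a <= b:
        raise ValueError("expected a > b")
    # First two sub-bands: compare every split sub-band individually.
    low = 0.0
    for band_a, band_b in zip(split_a[:LOW_BANDS], split_b[:LOW_BANDS]):
        low += sum(abs(p - q) for p, q in zip(band_a, band_b))
    # Remaining sub-bands: compare only the average power of each sub-band.
    high = 0.0
    for band_a, band_b in zip(split_a[LOW_BANDS:], split_b[LOW_BANDS:]):
        mean_a = sum(band_a) / len(band_a)
        mean_b = sum(band_b) / len(band_b)
        high += abs(mean_a - mean_b)
    return a * low + b * high
```

Comparing split sub-bands individually distinguishes two low-frequency sub-bands that have the same average power but distribute that power differently, which a comparison of averages alone would miss.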
In the psycho-acoustic compression, certain parameters referred to as attack parameters and related to the rhythm of the audio file are identified and included in side-information. These attack parameters can also be used to determine a relationship between two audio files.
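As a hedged illustration of how attack parameters might be compared, the Python sketch below assumes that the side-information has already been reduced to one boolean attack flag per frame (for example, whether the encoder switched to a short window for that frame). The attack rate and the way two rates are compared are hypothetical measures chosen for this sketch; the source does not specify how the attack parameters are used.

```python
def attack_rate(attack_flags):
    """Fraction of frames flagged as containing an attack."""
    if not attack_flags:
        raise ValueError("need at least one frame")
    return sum(1 for flag in attack_flags if flag) / len(attack_flags)


def attack_difference(flags_a, flags_b):
    """Absolute difference between the attack rates of two files.

    A hypothetical rhythm-related measure: 0.0 means the two files flag
    attacks equally often, 1.0 means one always does and the other never does.
    """
    return abs(attack_rate(flags_a) - attack_rate(flags_b))
```

Because the flags are read from side-information, a measure of this kind requires no additional energy calculation on the frequency data.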
One application of the present invention can be the search for similar audio files such as song files. In this situation, the parameters of the reference audio file are generated. Then the parameters of stored (and compressed) audio files are generated for comparison. However, stored audio files not only are compressed using a psycho-acoustic algorithm, but may also be compressed a second time to reduce the storage space required for the audio file. As will be clear, prior to determination of the parameters, the stored audio file must have the second compression removed.
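A search of this kind can be sketched in Python as a ranking of stored files by their accumulated parameter difference from the reference file. The sketch assumes that the parameters of every stored file have already been generated (with any second compression removed beforehand) and are held in a dictionary keyed by a file name; the function name, the dictionary layout, and the use of absolute differences are illustrative assumptions.

```python
def rank_by_similarity(reference, library):
    """Return stored file names ordered from most to least similar.

    `reference` is the parameter list of the reference file; `library` maps
    each stored file name to its parameter list (assumed precomputed).
    """
    def distance(params):
        if len(params) != len(reference):
            raise ValueError("parameter sets must have the same length")
        # Accumulated absolute difference (an assumed difference measure).
        return sum(abs(p - q) for p, q in zip(reference, params))

    return sorted(library, key=lambda name: distance(library[name]))
```

The first name in the returned list is the stored file whose parameters lie closest to those of the reference file.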
The result of using the present invention to characterize and classify audio files in pop, rock, classical, and jazz categories is shown in
While the invention has been described with respect to the embodiments set forth above, the invention is not necessarily limited to these embodiments. Accordingly, other embodiments, variations, and improvements not described herein are not necessarily excluded from the scope of the invention, the scope of the invention being defined by the following claims.
Number | Date | Country | |
---|---|---|---|
20040215447 A1 | Oct 2004 | US |