This application claims the benefit of U.S. patent application Ser. No. 11/123,474, filed on May 5, 2005, as well as U.S. Provisional Patent Application No. 60/569,423, filed on May 7, 2004, and German Patent Application No. 10 2004 022 660.1, filed on May 7, 2004, which applications are incorporated herein by reference in their entirety.
1. Field of the Invention
The present invention relates to analyzing information signals, such as audio signals, and in particular to analyzing information signals consisting of a superposition of partial signals, it being possible for a partial signal to stem from an individual source or a group of individual sources.
2. Description of Prior Art
Ongoing development of digital distribution media for multi-media contents has led to a large variety of data being offered. This variety has long exceeded the limits of manageability for human users. Thus, descriptions of the contents of the data by means of metadata become more and more important. In principle, the goal is to make it possible to search not only text files, but also e.g. music files, video files or other information signal files, with the same convenience as is offered by common text databases. One approach in this context is the known MPEG-7 standard.
In particular in analyzing audio signals, i.e. signals including music and/or voice, extracting fingerprints is very important.
What is also envisaged is to “enrich” audio data with meta-data so as to retrieve metadata on the basis of a fingerprint, e.g. for a piece of music. The “fingerprint” is to provide a sufficient amount of relevant information, on the one hand, and is to be as short and concise as possible, on the other hand. “Fingerprint” thus designates a compressed information signal which is generated from a music signal and does not contain the metadata but serves to make reference to the metadata, e.g. by searching in a database, e.g. in a system for identifying audio material (“audioID”).
Normally, music data consists of the superposition of partial signals from individual sources. While in pop music, there are typically relatively few individual sources, i.e. the singer, the guitar, the bass guitar, the drums and a keyboard, the number of sources may become very large for an orchestra piece. An orchestra piece and a piece of pop music, for example, consist of a superposition of the tones emitted by the individual instruments. Thus, an orchestra piece, or any piece of music, represents a superposition of partial signals from individual sources, the partial signals being the tones generated by the individual instruments of the orchestra and/or pop music formation, and the individual instruments being individual sources.
Alternatively, even groups of original sources may be regarded as individual sources, so that one signal may be assigned at least two individual sources.
An analysis of a general information signal will be presented below, by way of example only, with reference to an orchestra signal. Analysis of an orchestra signal may be performed in a variety of ways. For example, there may be a desire to recognize the individual instruments and to extract the individual signals of the instruments from the overall signal, and to possibly translate them into musical notation, in which case the musical notation would act as “metadata”. Other possibilities of analysis are to extract a dominant rhythm, it being easier to extract rhythms on the basis of the percussion instruments than on the basis of instruments which produce pitched tones, also referred to as harmonically sustained instruments. While percussion instruments typically include kettledrums, drums, rattles or other percussion instruments, the harmonically sustained instruments include all other instruments, such as violins, wind instruments, etc.
In addition, percussion instruments include all those acoustic or synthetic sound producers which contribute to the rhythm section on account of their sound properties (e.g. rhythm guitar).
Thus, it would be desirable, for example for rhythm extraction in a piece of music, to extract only percussive portions from the entire piece of music, and to then perform rhythm detection on the basis of these percussive portions without “interfering with” the rhythm detection by signals coming from the harmonically sustained instruments.
On the other hand, any analysis pursuing the goal of extracting metadata which requires exclusively information about the harmonically sustained instruments (e.g. a harmonic or melodic analysis) will benefit from upstream separation and further processing of the harmonically sustained portions.
Very recently, there have been reports, in this context, about the utilization of blind source separation (BSS) and independent component analysis (ICA) techniques for signal processing and signal analysis. Fields of applications are, in particular, biomedical technology, communication technology, artificial intelligence and image processing.
Generally, the term BSS includes techniques for separating signals from a mix of signals with a minimum of previous experience with or knowledge of the nature of signals and the mixing process. ICA is a method based on the assumption that the sources underlying a mix are statistically independent of each other at least to a certain degree. In addition, the mixing process is assumed to be invariable in time, and the number of the mixed signals is assumed to be no smaller than the number of the source signals underlying the mix.
Independent subspace analysis (ISA) represents an expansion of ICA. With ISA, the components are subdivided into independent subspaces, the components of which need not be statistically independent. By transforming the music signal, a multi-dimensional representation of the mixed signal is determined, and the latter assumption for the ICA is met. In the last few years, various methods of calculating the independent components have been developed. What follows is relevant literature also dealing, in part, with analyzing audio signals:
In [1], a method of separating individual sources of mono audio signals is presented. [2] gives an application for a subdivision into individual tracks and, subsequently, rhythm analysis. In [3], a component analysis is performed to achieve a subdivision into percussive and non-percussive sounds of a polyphonic piece. In [4], independent component analysis (ICA) is applied to amplitude bases obtained from a spectrogram representation of a drum track by means of generally calculated frequency bases. This is performed for transcription purposes. In [5], this method is expanded to include polyphonic pieces of music.
The first above-mentioned publication by Casey will be presented below as an example of the prior art. Said publication describes a method of separating mixed audio sources by the technique of independent subspace analysis. This involves splitting up an audio signal into individual component signals using BSS techniques. To determine which of the individual component signals belong to a multi-component subspace, grouping is performed such that the components' mutual similarity is represented by a so-called ixegram. The ixegram is referred to as a cross-entropy matrix of the independent components. It is calculated by examining all individual component signals, in pairs, in a correlation calculation to find a measure of the mutual similarity of two components. Thus, exhaustive pair-wise similarity calculations are performed across all component signals, so that what results is a similarity matrix in which all component signals are plotted along a y axis, and in which all component signals are also plotted along the x axis. This two-dimensional array provides, for each component signal, a measure of similarity with every other component signal. The ixegram, i.e. the two-dimensional matrix, is now used to perform clustering, for which purpose grouping is performed using a cluster algorithm on the basis of dyadic data. To perform optimum partitioning of the ixegram into k categories, a cost function is defined which measures the compactness within a cluster and determines the homogeneity between clusters. The cost function is minimized, so that what eventually results is an allocation of individual components to individual subspaces. If this is applied to a signal which represents a speaker against the continual roaring of a waterfall, what results as the subspace is the speaker, the reconstructed information signal of the speaker subspace exhibiting significant attenuation of the roaring of the waterfall.
What is disadvantageous about the concepts described is the fact that it is very likely that the signal portions of one source will come to lie on different component signals. This is the reason why, as has been described above, a complex and computing-time-intensive similarity calculation is performed among all component signals to obtain the two-dimensional similarity matrix, on the basis of which a classification of component signals into subspaces will eventually be performed by means of a cost function to be minimized.
What is also disadvantageous is the fact that in the case where there are several individual sources, i.e. where the output signal is not known upfront, a similarity distribution will indeed be obtained after a lengthy calculation, but the similarity distribution itself does not give an actual idea of the actual audio scene. Thus, the observer knows merely that certain component signals are similar to one another with regard to the minimized cost function. However, he/she does not know which information is contained in the subspaces eventually obtained, and/or which original individual source or which group of individual sources is represented by a subspace.
Independent subspace analysis (ISA) may therefore be exploited to decompose a time-frequency representation, i.e. a spectrogram, of an audio signal into independent component spectra. To this end, the above-described prior methods rely either on a computationally intensive determination of frequency and amplitude bases from the entire spectrogram, or on frequency bases defined upfront. Such frequency bases, or profile spectra, defined upfront amount, for example, to assuming that a piece is very likely to feature a trumpet, and then using an exemplary spectrum of a trumpet for signal analysis.
This procedure has the disadvantage that one has to know all featured instruments upfront, which, in principle, already runs counter to automated processing. A further disadvantage is that, if one wants to operate in a meticulous manner, there are, for example, not only trumpets, but many different kinds of trumpets, all of which differ in terms of their qualities of sound, or timbres, and thus in their spectra. If the approach were to employ all types of exemplary spectra for component analysis, the method would again become very time-consuming and expensive and would exhibit a very high degree of redundancy, since typically not all feasible different kinds of trumpets will feature in one piece, but only trumpets of one single kind, i.e. with one single profile spectrum, or perhaps with very few different timbres, i.e. with few profile spectra. The problem gets worse when it comes to different notes of a trumpet, especially as each tone comprises a spread or contracted profile spectrum, depending on the pitch. Taking this into account also involves a huge computational expenditure.
On the other hand, decomposition on the basis of ISA concepts becomes extremely computationally intensive and susceptible to interference if the entire spectrogram is used. It shall be pointed out that a spectrogram typically consists of a series of individual spectra, a hopping time period being defined between the individual spectra, and a spectrum representing a specific number of samples, so that a spectrum has a specific time duration, i.e. a block of samples of the signal, associated with it. Typically, the duration represented by the block of samples from which a spectrum is calculated is considerably longer than the hopping time so as to obtain a satisfactory spectrogram with regard to the frequency resolution required and with regard to the time resolution required. However, on the other hand it may be seen that this spectrogram representation is extraordinarily redundant. If one considers the case, for example, that a hopping time duration amounts to 10 ms and that a spectrum is based on a block of samples having a time duration of, e.g., 100 ms, every sample will come up in 10 consecutive spectra. The redundancy thus created may cause the requirements in terms of computing time to reach astronomical heights especially if a relatively large number of instruments are searched for.
In addition, the approach of working on the basis of the entire spectrogram is disadvantageous for such cases where not all sources contained are to be extracted from a signal, but where, for example, only sources of a specific kind, i.e. sources having a specific characteristic, are to be extracted. Such a characteristic may relate to percussive sources, i.e. percussion instruments, or to so-called pitched instruments, also referred to as harmonically sustained instruments, which are the typical melody instruments, such as trumpet, violin, etc. A method operating on the basis of all these sources will then be too time-consuming and expensive and, moreover, not robust enough if, for example, only some sources, i.e. those sources which are to meet a specific characteristic, are to be extracted. In this case, individual spectra of the spectrogram in which such sources do not occur, or occur only to a very small extent, will corrupt, or “blur”, the overall result, since these spectra of the spectrogram are self-evidently included in the eventual component analysis calculation just as much as the significant spectra.
It is an object of the present invention to provide a robust and computing-time-efficient concept for analyzing an information signal.
In accordance with a first aspect, the invention provides a device for analyzing an information signal, having:
In accordance with a second aspect, the invention provides a method for analyzing an information signal, the method including the steps of:
In accordance with a third aspect, the invention provides a computer program having a program code for performing the method for analyzing an information signal, the method including the steps of:
The present invention is based on the findings that robust and efficient information-signal analysis is achieved by initially extracting, from the entire information signal and/or from the spectrogram of the information signal, significant short-time spectra or spectra derived from significant short-time spectra, such as difference spectra etc., the short-time spectra extracted being those short-time spectra which come closer to a specific characteristic than other short-time spectra of the information signal.
What is preferably extracted are short-time spectra which have percussive portions, and consequently, short-time spectra which have harmonic portions will not be extracted. In this case, the specific characteristic is a percussive, or drum, characteristic.
The short-time spectra extracted, or spectra derived from the short-time spectra extracted, are then fed to a means for decomposing the short-time spectra into component-signal spectra, a component-signal spectrum representing a profile spectrum of a tone source which generates a tone corresponding to the characteristic sought for, and another component-signal spectrum representing another profile spectrum of a tone source which generates a tone also corresponding to the characteristic sought for.
Eventually, an amplitude envelope is calculated over time on the basis of the profile spectra of the tone sources, the profile spectra determined as well as the original short-time spectra being used for calculating the amplitude envelope over time, so that for each point in time, at which a short-time spectrum was taken, an amplitude value is obtained as well.
The information thus obtained, i.e. the various profile spectra as well as the amplitude envelopes for the profile spectra, thus provides a comprehensive description of the music and/or information signal with regard to the specified characteristic with regard to which the extraction has been performed. This information may already be sufficient for performing a transcription, i.e. for initially establishing, with concepts of feature extraction and segmenting, which instrument “belongs to” a profile spectrum and which rhythms are at hand, i.e. which rise and fall events indicate notes of this instrument that are played at specific points in time.
The present invention is advantageous in that rather than the entire spectrogram, only extracted short-time spectra are used for calculating the component analysis, i.e. for decomposing, so that the calculation of the independent subspace analysis (ISA) is performed only using a subset of all spectra, so that computing requirements are lowered. In addition, the robustness with regard to finding specific sources is also increased, particularly as other short-time spectra which do not meet the specified characteristic are not present in the component analysis and therefore do not represent any interference and/or “blurring” of the actual spectra.
In addition, the inventive concept is advantageous in that the profile spectra are determined directly from the signal without this resulting in the problems of the ready-made profile spectra, which again would lead to either inaccurate results or to increased computational expenditure.
Preferably, the inventive concept is employed for detecting and classifying percussive, non-harmonic instruments in polyphonic audio signals, so as to obtain both profile spectra and amplitude envelopes for the individual profile spectra.
Preferred embodiments of the present invention will be explained below in detail with regard to the accompanying figures, wherein:
a shows an example of an amplitude envelope for a percussive source;
b shows an example of a profile spectrum for a percussive source;
a shows an example of an amplitude envelope for a harmonically sustained instrument; and
b shows an example of a profile spectrum for a harmonically sustained instrument.
The extracted spectra, i.e. the original short-time spectra or the short-time spectra derived from the original short-time spectra, for example by differentiating, differentiating and rectifying, or by means of other operations, are fed to means 18 for decomposing the extracted short-time spectra into component signal spectra, one component signal spectrum representing a profile spectrum of a tone source which generates a tone corresponding to the characteristic sought for, and another profile spectrum representing another tone source which generates a tone also corresponding to the characteristic sought for.
The profile spectra are eventually fed to means 20 for calculating an amplitude envelope for the one tone source, the amplitude envelope indicating how the profile spectra of a tone source change over time and, in particular, how the intensity, or weighting, of a profile spectrum changes over time. Means 20 is configured to function on the basis of the sequence of short-time spectra, on the one hand, and on the basis of the profile spectra, on the other hand, as may be seen from
With reference to
As may be seen from
Optionally, it is preferred to use the phase information, which is provided from block 12 to block 16c via phase line 13, as an indicator for the reliability of the maxima found. The spectra for which the maximum searcher detects a maximum in the detection function are used as X̂t and represent the short-time spectra extracted.
In block 18a, a principal component analysis (PCA) is performed. For this purpose, a sought-for number of components d is initially specified. Thereafter, PCA is performed in accordance with a suitable method, such as singular value decomposition or eigenvalue decomposition, across the columns of matrix X̂t.
X̃ = X̂t · T
The transformation matrix T causes a dimension reduction with regard to X̃, which results in a reduction of the number of columns of this matrix. In addition, a decorrelation and variance normalization are achieved. In block 18b, a non-negative independent component analysis is then performed. For this purpose, the method of non-negative independent component analysis shown in [6] is applied to X̃ to calculate a separation matrix A. In accordance with the equation below, X̃ is decomposed into independent components.
F = A · X̃
Independent components F are interpreted as static spectral profiles, or profile spectra, of the sound sources present. In a block 20, the amplitude basis, or amplitude envelope E, is then extracted for the individual tone sources in accordance with the following equation.
E = F · X
The amplitude basis is interpreted as a set of time-variable amplitude envelopes of the corresponding spectral profiles.
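By way of illustration only, the data flow through blocks 18 and 20 just described may be sketched with placeholder matrices in Python/NumPy. All dimensions, variable names and the column-wise arrangement of spectra below are assumptions chosen merely to make the matrix shapes explicit; they are not values prescribed by the description.

```python
import numpy as np

# Illustrative dimensions (assumptions): n frequency bins, m frames of the full
# spectrogram, m_t extracted short-time spectra, d sought-for components.
n, m, m_t, d = 2049, 2400, 480, 10

X     = np.abs(np.random.randn(n, m))     # full magnitude spectrogram (one spectrum per column)
X_hat = np.abs(np.random.randn(n, m_t))   # extracted (difference) short-time spectra X̂t
T     = np.random.randn(m_t, d)           # transformation matrix from PCA (block 18a)
A     = np.random.randn(d, d)             # separation matrix from non-negative ICA (block 18b)

X_tilde = X_hat @ T        # X̃ = X̂t · T : dimension-reduced, decorrelated components
F       = X_tilde @ A.T    # F = A · X̃ (column form): one profile spectrum per column
E       = F.T @ X          # E = F · X : one amplitude envelope per profile spectrum
print(F.shape, E.shape)    # (n, d) and (d, m)
```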
In accordance with the invention, the spectral profile is obtained from the music signal itself. Hereby, the computational complexity is reduced in comparison with the previous methods, and increased robustness towards stationary signal portions, i.e. signal portions due to harmonically sustained instruments, is achieved.
In a block 22, a feature extraction and a classification operation are then performed. In particular, the components are divided into two subsets, i.e. initially into a subset having the property “non-percussive”, i.e. harmonic, as it were, and into another, percussive subset. In addition, the components having the property “percussive/dissonant” are classified further into various classes of instruments.
For classification into the two subsets, the features of percussivity, or spectral dissonance, are used.
The following features are employed for classifying instruments:
Classification may be performed into the following classes of instruments, for example:
For increasing the robustness of the inventive concept even further, a decision for using percussion onsets and/or an acceptance of percussive maxima may be performed in a block 24. Thus, maxima with a transient rise in the amplitude envelope above a variable threshold value are considered percussive events, whereas maxima with a transient rise below the variable threshold value are discarded, or recognized as artifacts and ignored. The variable threshold value preferably varies with the overall amplitude in a relatively large range around the maximum. Output is performed in a suitable form which associates the point of time of percussive events with a class of instruments, an intensity and, possibly, further information such as, for example, note and/or rhythm information in a MIDI format.
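A minimal sketch of such a decision stage is given below, assuming that the variable threshold is formed as a multiple of the mean envelope amplitude in a relatively large window around each candidate maximum; the window length and the factor are illustrative assumptions, since the text does not prescribe a concrete formula.

```python
import numpy as np

def accept_percussive_onsets(envelope, candidate_frames, win=200, factor=1.5):
    """Keep only candidate maxima whose transient rise exceeds a variable threshold.

    The threshold follows the overall amplitude in a relatively large range
    around the maximum; 'win' (in frames) and 'factor' are illustrative values.
    """
    accepted = []
    for t in candidate_frames:
        lo, hi = max(0, t - win), min(len(envelope), t + win)
        local_level = float(np.mean(envelope[lo:hi]))      # overall amplitude around the maximum
        rise = envelope[t] - envelope[max(0, t - 1)]       # transient rise at the candidate
        if rise > factor * local_level:
            accepted.append(t)                             # accepted as a percussive event
        # otherwise the maximum is treated as an artifact and ignored
    return accepted
```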
It shall be pointed out here that means 16 for extracting significant short-time spectra may be configured to perform this extraction using actual short-time spectra such as are obtained, for example, with a short-time Fourier transform. In particular with the example of application of the present invention, wherein the specific characteristic is the percussive characteristic, it is preferred not to extract actual short-time spectra but short-time spectra from a differentiated spectrogram, i.e. from difference spectra. The differentiation as is shown in block 16a in
In addition, it is preferred to perform PCA 18a and non-negative ICA 18b, i.e., more generally speaking, the decomposition operations for decomposing the extracted short-time spectra in block 18 of
In addition, it shall be pointed out that means 18 for decomposing, which performs a PCA 18a with a subsequent non-negative ICA (18b), in any case performs a weighted linear combination of the extracted spectra provided to it for determining a profile spectrum. This means that specific weighting factors calculated by the individual methods are applied to the spectra extracted, i.e. that the spectra extracted are linearly combined, for example by subtraction or addition. Therefore, one can observe, at least partially, the effect that in decomposing the short-time spectra extracted, means 18 may have a functionality which counteracts the differentiation, so that the profile spectra determined for the tone sources are not differentiated profile spectra, but the actual profile spectra. In any case, it has been found that using differentiated spectra, i.e. difference spectra from a difference spectrogram, in combination with a decomposition algorithm which is based on a weighted linear combination of the individual spectra extracted leads, in means 18, to profile spectra of high quality and high selectivity for the individual tone sources.
If, on the other hand, only stationary portions were processed further, i.e. if the specific characteristic is not a percussive, but a harmonic characteristic, it is preferred to achieve pre-processing of the spectrogram by integration, i.e. by summing up, so as to reinforce the stationary portions as compared to the transient portions. In this case, too, it is preferred to calculate the profile spectra for the individual—in this case harmonic—tone sources using the sum spectra, i.e. the integrated spectrogram.
Individual functionalities of the inventive concept will be presented in more detail below. In a preferred embodiment of the present invention, typical digital audio signals are initially pre-processed by means 8. In addition, it is preferred to use, as the PCM audio signal input into pre-processing means 8, mono files having a width of 16 bits per sample at a sampling frequency of 44.1 kHz. These audio signals, i.e. this stream of audio samples, which may also be a stream of video samples and may generally be a stream of information samples, are fed to pre-processing means 8 so as to perform pre-processing within the time domain using a software-based emulation of an acoustic-effect device often referred to as an “exciter”. With this concept, the pre-processing stage 8 amplifies the high-frequency portion of the audio signal. This is achieved by performing a non-linear distortion of a high-pass filtered version of the signal, and by adding the result of the distortion to the original signal. It turns out that this pre-processing is particularly favorable when there are hi-hats to be evaluated, or idiophones with a similarly high pitch and low intensity. Their energetic weight in relation to the overall music signal is increased by this step, whereas most harmonically sustained instruments and percussion instruments having lower tones are not negatively affected.
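An illustrative sketch of such an exciter-style pre-processing stage is shown below; the cutoff frequency, the drive of the non-linearity and the mixing factor are assumptions of this sketch and are not specified in the text.

```python
import numpy as np
from scipy.signal import butter, lfilter

def exciter_preprocess(x, fs=44100, cutoff_hz=4000.0, drive=5.0, mix=0.5):
    """Software 'exciter': non-linearly distort a high-pass filtered copy of
    the signal and add the result back, emphasizing hi-hats and similar
    idiophones of high pitch and low intensity.

    cutoff_hz, drive and mix are illustrative parameters only.
    """
    b, a = butter(2, cutoff_hz / (fs / 2), btype="high")  # high-pass filter
    hp = lfilter(b, a, x)
    distorted = np.tanh(drive * hp)    # non-linear distortion of the high-pass part
    return x + mix * distorted         # add the distorted portion to the original signal
```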
Another positive side effect is the fact that MP3-encoded and -decoded files, which have inherently been low-pass filtered by that coding process, again obtain high-frequency information.
A spectral representation of the pre-processed time signal is then obtained using the time/frequency means 12, which preferably performs a short-time Fourier transform (STFT).
To implement the time/frequency means, a relatively large block size of preferably 4096 values and a high degree of overlap are preferred. What is initially required is a good spectral resolution for the low-frequency range, i.e. for the lower spectral coefficients. In addition, the temporal resolution is increased to a desired accuracy by choosing a small hop size, i.e. a small hop interval between adjacent blocks. In the preferred embodiment, as has already been explained, 4096 samples per block are subject to a short-time Fourier transform, which corresponds to a temporal block duration of 92 ms. This means that each sample comes up in more than 9 consecutive short-time spectra.
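A minimal STFT sketch with these parameters follows. The block size of 4096 samples at 44.1 kHz is taken from the text; the hop of 441 samples (about 10 ms) is an assumption consistent with the roughly nine-fold overlap mentioned, not an explicitly stated value, and the frame-per-row array layout is a convenience of this sketch.

```python
import numpy as np

def stft_magnitude_phase(x, block=4096, hop=441):
    """Short-time Fourier transform with a large block (about 92 ms at 44.1 kHz)
    and a small hop (about 10 ms) for high temporal resolution.

    Returns the magnitude spectrogram and the phase, each with one short-time
    spectrum per row (frames x bins); the text keeps spectra as columns instead.
    """
    window = np.hanning(block)
    frames = []
    for start in range(0, len(x) - block + 1, hop):
        frames.append(np.fft.rfft(window * x[start:start + block]))
    S = np.array(frames)              # shape: (m frames, n frequency bins)
    return np.abs(S), np.angle(S)     # magnitude spectrogram X and phase information
```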
Means 12 is configured to obtain an amplitude spectrum X. The phase information may also be calculated, and, as will be explained in more detail below, may be used in the extreme-value searcher, or maximum searcher, 16c.
The amplitude spectrum X now possesses n frequency bins, or frequency coefficients, and m columns, or frames, i.e. individual short-time spectra. The time-variable changes of each spectral coefficient are differentiated across all frames, or individual spectra, specifically by differentiator 16a, to diminish the influence of harmonically sustained tone sources and to simplify subsequent detection of transients. The differentiation, which preferably comprises the formation of a difference between two consecutive short-time spectra of the sequence, may also include certain normalizations.
It shall be pointed out that differentiation may lead to negative values, so that half-wave rectification is performed in a block 16b to eliminate this effect. Alternatively, the negative signs could simply be reversed, which is not preferred, however, with a view to the subsequent decomposition into components.
Because of the rectifier 16b, a non-negative difference spectrogram is thus obtained which is fed to maximum searcher 16c.
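The combination of differentiator 16a and rectifier 16b may be sketched as follows; the frame-per-row orientation is a convenience of this sketch.

```python
import numpy as np

def differentiate_and_rectify(X):
    """Differentiate each spectral coefficient over time (difference between
    consecutive short-time spectra) and half-wave rectify the result, yielding
    a non-negative difference spectrogram.

    X is assumed to be a magnitude spectrogram with one short-time spectrum
    per row (frames x bins).
    """
    D = np.diff(X, axis=0, prepend=X[:1])   # frame-to-frame difference per frequency bin
    return np.maximum(D, 0.0)               # half-wave rectification: keep positive rises only
```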
Maximum searcher 16c performs an event detection which will be dealt with below. The detection of local extreme values, and preferably of local maxima associated with transient onset events in the music signal, is performed by initially defining a time tolerance which separates two consecutive drum onsets. In the preferred embodiment, a time period of 68 ms is used as a constant value derived from the time resolution and from knowledge about the music signal. In particular, this value determines the number of frames, or individual spectra, or differentiated individual spectra, which must occur at least between two consecutive onsets. Use of this minimum distance is also supported by the consideration that at a very high upper tempo limit of 250 bpm, a sixteenth note lasts 60 ms.
To be able to perform an automated maximum search, a detection function, on the basis of which the maximum search may be performed, is derived from the differentiated and rectified spectrum, i.e. from the sequence of rectified difference short-time spectra. In order to obtain a value of this function for each point in time, a sum is simply determined across all frequency coefficients, or spectral bins. To smooth the resulting one-dimensional function over time, it is convolved with a suitable Hann window, so that a relatively smooth function e is obtained. To obtain the positions t of the maxima, a sliding window having the tolerance length is “pushed” across the entire function e, so that one maximum may be obtained per step.
The reliability of the search for maxima is improved by the fact that preferably only those maxima are retained which remain maxima within the window for more than one step, since they are very likely to be the interesting peaks. Thus it is preferred to use those maxima which represent a maximum for a predetermined threshold number of steps, i.e., for example, three steps, the threshold ultimately depending on the ratio of the block duration and the hop size. This reflects the fact that a maximum, if it really is a significant maximum, must remain a maximum for a certain number of steps, i.e., eventually, for a certain number of overlapping spectra, if one considers that, with the numerical values given above, each sample is contained in at least 9 consecutive short-time spectra.
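The detection function and the sliding-window maximum search may be sketched as follows. The 68 ms tolerance and the persistence count of three steps are taken from the text; the hop of 10 ms and the length of the smoothing Hann window are assumptions of this sketch.

```python
import numpy as np

def detect_onset_frames(X_hat, hop_s=0.01, tolerance_s=0.068, persistence=3):
    """Peak picking on the rectified difference spectrogram X_hat (frames x bins).

    The detection function is the sum over all frequency bins per frame,
    smoothed by convolution with a Hann window. A sliding window of the
    tolerance length yields one maximum per step, and a maximum is kept only
    if it remains the window maximum for several consecutive steps.
    """
    e = X_hat.sum(axis=1)                                   # sum across all spectral bins
    hann = np.hanning(9)                                    # smoothing window (length assumed)
    e = np.convolve(e, hann / hann.sum(), mode="same")      # smoothed detection function e
    tol = max(1, int(round(tolerance_s / hop_s)))           # minimum onset distance in frames

    counts = {}
    for start in range(0, len(e) - tol + 1):                # slide the tolerance-length window
        t = start + int(np.argmax(e[start:start + tol]))    # one maximum per step
        counts[t] = counts.get(t, 0) + 1
    # keep maxima that stay the window maximum for at least 'persistence' steps
    return sorted(t for t, c in counts.items() if c >= persistence)
```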
In the preferred embodiment of the present invention, the “unwrapped” phase information of the original spectrogram is used as a reliability function, as is depicted by the phase arrow. It has turned out that a significant, positively directed phase shift needs to occur at an estimated onset time t, which prevents small ripples from being erroneously regarded as onsets.
In accordance with the invention, a small portion of the difference spectrogram, specifically a short-time spectrum formed by differentiation, is extracted and fed to the subsequent decomposition means.
Subsequently, the functionality of means 18a for performing a principal component analysis will be addressed. From the steps described in the above paragraph, the information about the times of occurrence t and the spectral compositions of the onsets, i.e. the extracted short-time spectra X̂t, is thus derived. With real music signals, one typically finds a large number of transient events within the duration of the piece of music. Even with a simple example of a piece having a tempo of 120 beats per minute (bpm), it turns out that 480 events may occur in a four-minute extract, provided that only quarter notes occur. With the goal of finding only a few significant subspaces, or profile spectra, principal component analysis (PCA) is applied to X̂t, i.e. to the short-time spectra extracted or to short-time spectra derived from the short-time spectra extracted.
Using this known technique it is possible to reduce the entire set of short-time spectra collected to a limited number of decorrelated principal components, which results in a good representation of the original data with a small reconstruction error. To this end, an eigenvalue decomposition (EVD) of the covariance matrix of the data set is calculated. From the set of eigenvectors, those eigenvectors having the d largest eigenvalues are selected so as to provide the coefficients for the linear combination of the original vectors in accordance with the following equation:
X̃ = X̂t · T
Therefore, T describes a transformation matrix which is actually composed of a subset of the eigenvectors. In addition, the reciprocal values of the eigenvalues are used as scaling factors, which not only leads to a decorrelation, but also provides variance normalization, which in turn results in a whitening effect. Alternatively, a singular value decomposition (SVD) of X̂t may also be used. It has been found that SVD is equivalent to PCA with EVD. The whitened components X̃ are subsequently fed into ICA stage 18b, which will be dealt with below.
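A sketch of this PCA step, in the column convention of the text (X̂t collects the extracted difference spectra as columns, i.e. the transpose of the frame-per-row arrays used in the earlier sketches), is given below. The text speaks of reciprocal eigenvalues as scaling factors; the sketch uses the customary 1/sqrt(eigenvalue) whitening scaling, and the mean removal used for the covariance estimate is an added assumption.

```python
import numpy as np

def pca_reduce(X_hat_t, d):
    """PCA across the columns of X_hat_t (n bins x m extracted spectra): an EVD
    of the covariance matrix is computed, the eigenvectors with the d largest
    eigenvalues form the transformation matrix T, and scaling of the
    eigenvectors provides decorrelation and variance normalization (whitening).
    """
    Xc = X_hat_t - X_hat_t.mean(axis=0, keepdims=True)      # centre each onset spectrum (column)
    C = (Xc.T @ Xc) / (Xc.shape[0] - 1)                     # m x m covariance of the columns
    eigvals, eigvecs = np.linalg.eigh(C)                    # EVD, eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:d]                     # the d largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[idx], 1e-12))        # whitening scale per component
    T = eigvecs[:, idx] / scale                             # transformation matrix T (m x d)
    return X_hat_t @ T, T                                   # X̃ = X̂t · T (n x d), and T
```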
Generally speaking, independent component analysis (ICA) is a technique used to decompose a set of linearly mixed signals into their original sources, or component signals. One requirement placed upon optimum behavior of the algorithm is the sources' statistical independence. Preferably, non-negative ICA is used, which is based on the intuitive concept of optimizing a cost function describing the non-negativity of the components. This cost function is related to a reconstruction error introduced by axis-pair rotations of two or more variables in the positive quadrant of the joint probability density function (PDF). The assumptions for this model imply that the original source signals are positive and, at zero, have a PDF different from zero, and that they are linearly independent up to a certain degree. The first condition is always satisfied, since the vectors subject to ICA result from the differentiated and half-wave rectified version X̂ of the original spectrogram X, which version thus will never include values smaller than zero, but will certainly include values equaling zero. The second limitation is taken into account if the spectra collected at times of onset are regarded as linear combinations of a small set of original source spectra characterizing the instruments in question. Of course, this is a rather rough approximation, which, however, proves to be sufficient in most cases.
In addition, use is made of the fact that the onset spectra, particularly the spectra of actual percussion instruments, have no time-variant structures, i.e. are not subject to any changes here with regard to their spectral compositions. It may thus be assumed that there are characteristic properties which are characteristic of spectral profiles of percussive tones and which allow the whitened components X̃ to be separated into their potential source spectra, or profile spectra, F in accordance with the following equation.
F = A · X̃
A designates a d×d de-mixing matrix determined by the ICA process which actually separates the individual components X̃. The sources F are also referred to as profile spectra in this document. Each profile spectrum has n frequency bins, just like a spectrum of the original spectrogram, but is identical for all times, except for amplitude normalization, i.e. the amplitude envelope. This means that such a profile spectrum only contains that spectral information which is related to an onset spectrum of an instrument. In order to preferably circumvent arbitrary scaling of the components introduced by PCA and ICA, a transformation matrix R is used in accordance with the following equation:
R = T · Aᵀ
Normalizing R with its absolute maximum value results in weighting coefficients in a range from −1 to +1, so that spectral profiles extracted using the following equation
F = X̂t · R
have values in the range of the original spectrogram. Further normalization is achieved by dividing each spectral profile by its L2 norm.
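The de-mixing and re-scaling steps may be sketched as follows. The text prescribes the non-negative ICA of [6]; since that algorithm is not reproduced here, scikit-learn's FastICA is used purely as an illustrative stand-in for obtaining the d×d de-mixing matrix A, and the function name is mine.

```python
import numpy as np
from sklearn.decomposition import FastICA

def extract_profile_spectra(X_hat_t, X_tilde, T):
    """From the whitened components X̃ (n bins x d), estimate a d x d de-mixing
    matrix A, form R = T · Aᵀ, normalize R by its absolute maximum so that the
    weights lie in [-1, 1], compute F = X̂t · R, and divide each profile
    spectrum by its L2 norm.

    FastICA is only a stand-in for the non-negative ICA of [6].
    """
    d = X_tilde.shape[1]
    ica = FastICA(n_components=d, whiten=False, max_iter=1000)
    ica.fit(X_tilde)                         # the n frequency bins act as observations
    A = ica.components_                      # d x d de-mixing matrix
    R = T @ A.T                              # m x d weighting of the extracted spectra
    R = R / np.max(np.abs(R))                # weighting coefficients in the range -1 ... +1
    F = X_hat_t @ R                          # profile spectra in the range of the original spectrogram
    F = F / np.linalg.norm(F, axis=0, keepdims=True)   # L2-normalize each profile spectrum
    return F, A
```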
As has already been set forth above, the assumption of independence and the assumption of invariance are not always satisfied one hundred percent for given short-time spectra. Therefore, it comes as no surprise that the spectral profiles obtained after de-mixing still exhibit certain dependencies. However, this should not be regarded as defective behavior. Tests conducted with spectral profiles of individual percussive tones have revealed that there is also a large amount of dependence between the onset spectra of different percussive instruments. One possibility of measuring the degree of mutual overlap and similarity along the frequency axis is to conduct crosstalk measurements. For purposes of illustration, the spectral profiles obtained from the ICA process may be regarded as transfer functions of highly frequency-selective parts of a filter bank, it being possible for overlapping passbands to lead to crosstalk in the outputs of the filter bank channels. The crosstalk measure between two spectral profiles is calculated in accordance with the following equation:
In the above equation, i ranges from 1 to d, j ranges from 1 to d, and j is different from i. In fact, this value is related to the well-known cross-correlation coefficient, but the latter uses a different normalization.
On the basis of the profile spectra determined, an amplitude-envelope determination is now performed in block 20 of
E = F · X
As the second information source, the differentiated version of the amplitude envelopes may also be determined, in accordance with the following equation, from the difference spectrogram:
Ê = F · X̂
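Both envelope computations amount to simple matrix products, as the following sketch shows; the column-wise layout of F and of the spectrograms matches the convention of the earlier sketches and is an assumption of this illustration.

```python
import numpy as np

def amplitude_envelopes(F, X, X_hat):
    """Amplitude envelopes per profile spectrum: E = F · X from the original
    magnitude spectrogram and Ê = F · X̂ from the rectified difference
    spectrogram (spectra as columns, n bins x m frames).

    F holds one L2-normalized profile spectrum per column (n bins x d).
    """
    E = F.T @ X          # (d x m): time-variable weighting of each profile spectrum
    E_hat = F.T @ X_hat  # (d x m): differentiated counterpart, used for the percussivity feature
    return E, E_hat
```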
What is essential about this concept is that no further ICA calculation is performed with the amplitude envelopes. Instead, the inventive concept provides highly specialized spectral profiles which come very close to the spectra of those instruments which actually come up in the signal. Nevertheless, it is only in specific cases that the extracted amplitude envelopes are fine detection functions with sharp peaks, e.g. for dance-oriented music with highly dominant percussive rhythm portions. The amplitude envelopes often contain relatively small peaks and plateaus which may be due to the above-mentioned crosstalk effects.
A more detailed implementation of means 22 for feature extraction and classification will be pointed out below. It is well-known that the actual number of components is initially unknown for real music signals. In this context, “components” signify both the spectral profiles and the corresponding amplitude envelopes. If the number d of components extracted is too low, artifacts of the non-considered components are very likely to come up in other components. If, on the other hand, too many components are extracted, the most prominent components are divided up into several components. Unfortunately, this division may occur even with the right number of components and may occasionally complicate detection of the real components.
To overcome this problem, a maximum number d of components is specified in the PCA or ICA process. Subsequently, the components extracted are classified using a set of spectral-based and time-based features. Classification is to provide two kinds of information. Initially, those components which are detected, with a high degree of certainty, as non-percussive are to be eliminated from the further procedure. In addition, the remaining components are to be assigned to predefined classes of instruments.
A suitable measure for differentiating between the amplitude envelopes is given by the percussivity mentioned in the third specialist publication. Here, use is made of a modified version wherein the correlation coefficient between corresponding amplitude envelopes in Ê and E is used. The degree of correlation between both vectors tends to be small if the characteristic plateaus related to harmonically sustained tones come up in the non-differentiated amplitude envelopes E. The latter are very likely to disappear in the differentiated version Ê. Both vectors are much more similar in the case of transient amplitude envelopes stemming from percussive tones. For this purpose, reference shall be made to
Thus, the amplitude envelopes may be used for classification and/or feature extraction equally well as the profile spectra, explained below, which clearly differ in the case of a percussive source (
Thus, a spectral-based measure, i.e. a measure derived from the profile spectra (e.g.
Assigning spectral profiles to pre-defined classes of percussive instruments is performed by a simple k-nearest-neighbor classifier which uses spectral profiles of individual instruments as a training database. The distance function is calculated from at least one correlation coefficient between a query profile and a database profile. In order to verify the classification in cases of low reliability, i.e. at low correlation coefficients, or to verify multiple occurrences of the same instruments, additional features are extracted which provide detailed information about the form of the spectral profile. These features include the individual features already mentioned above.
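A sketch of the modified percussivity feature and of a correlation-based k-nearest-neighbor classifier follows; the choice of k=3, the majority vote and the use of 1 minus the correlation coefficient as the distance are assumptions of this sketch, since the text only states that the distance is derived from at least one correlation coefficient.

```python
import numpy as np

def percussivity(E, E_hat):
    """Modified percussivity feature: correlation coefficient between each
    amplitude envelope in E (rows, d x m) and its counterpart in Ê; transient
    (percussive) components score high, sustained ones low."""
    return np.array([np.corrcoef(E[i], E_hat[i])[0, 1] for i in range(E.shape[0])])

def knn_classify_profile(query, train_profiles, train_labels, k=3):
    """k-nearest-neighbor classification of a profile spectrum (e.g. one column
    of F) against a training database of single-instrument profile spectra;
    the distance is 1 minus the correlation coefficient."""
    dists = np.array([1.0 - np.corrcoef(query, p)[0, 1] for p in train_profiles])
    nearest = np.argsort(dists)[:k]
    labels = [train_labels[i] for i in nearest]
    return max(set(labels), key=labels.count)   # majority vote among the k neighbors
```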
In the following, the functionality of the decider 24 in
In accordance with the invention, automatic detection, and preferably also automatic classification, of non-pitched percussive instruments in real polyphonic music signals is thus achieved, the starting basis for this being the profile spectra, on the one hand, and the amplitude envelope, on the other hand. In addition, the rhythmic information of a piece of music may also be easily extracted from the percussive instruments, which in turn is likely to lead to a favorable note-to-note transcription.
Depending on the circumstances, the inventive method for analyzing an information signal may be implemented in hardware or in software. Implementation may occur on a digital storage medium, in particular a disc or CD with electronically readable control signals which can interact with a programmable computer system such that the method is performed. Generally, the invention thus also consists in a computer program product with a program code, stored on a machine-readable carrier, for performing the method, when the computer program product runs on a computer. In other words, the invention may thus be realized as a computer program having a program code for performing the method, when the computer program runs on a computer.
While this invention has been described in terms of several preferred embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.