This Utility Patent Application claims the benefit of the filing date of German Application No. DE 10 2004 028 694.9 filed Jun. 14, 2004, and International Application No. PCT/EP2005/004518 filed Apr. 27, 2005, both of which are herein incorporated by reference.
The present invention relates to information signal processing and particularly to audio signal processing for the purpose of polyphonic music analysis or polyphonic music transcription.
The variety of musical presentations and the range of musical tastes among audiences have grown equally in the last few years. In particular, interest in music is growing in the population due to the rapid advances in storing and distributing pieces of music. Digital storage has made it possible to copy pieces of music as often as one likes without loss in quality. The most prominent example of this is the CD, which has almost completely superseded records. Recently, DVDs have also become increasingly popular, since they enable the presentation not only of stereo music but also of multi-channel music, for example in the known 5.1 surround format.
Previously, the main focus was on improving the sound quality and the distribution methods. But the increasing expansion of the Internet and digital broadcasting has been accompanied by new demands for a pre-filtering of the large amounts of music data available to individual people. In this connection, the metadata concept, i.e. providing data about music data, reaches a new dimension. While descriptive data previously have been generated manually and added to the corresponding piece of music, automatic means to objectively analyze the content of a piece of music are being developed. Standardization methods in this field are known by the keyword “MPEG-7”.
Thus, achievements of this music analysis are to be seen in an efficient music summary or in a format-independent association of metadata with pieces of music. An objective of the automatic generation of metadata also consists in the ability to extract features from the original content that are related to the user's taste in music. For example, it is known to use extracted features of pieces of music to train a music provision system so that it categorizes incoming music into different musical genres.
In order to specify the musical content in a manageable and yet searchable manner, i.e. in order to provide data that can be read and interpreted both by humans and by machines, reference has to be made to semantically meaningful properties of the audio signal. Such properties are the tone of instruments, the melody contained in a piece, the tempo, the rhythm, or the harmony of a piece, for example. In this connection, the harmony feature is of special significance, since it serves as an indicator of the mood of a musical passage. A listener perceives a piece differently in emotional terms depending on whether it is dissonant or harmonic, or whether it is written in a major key or in a minor key. At the same time, the harmony gives hints as to the structural diversity of the available music material, for example whether there are quick and unusual chord changes, or whether there are repetitive properties in the chord structure.
The automatic expansion of polyphonic notes to full chords is known from musical tone synthesis. Modern synthesizers and keyboards are capable of automatically accompanying a player by analyzing their playing in real time and by generating a bass accompaniment, for example. The rules employed by such synthesizers or keyboards may also be applied to notes recovered from polyphonic music, even if not all notes can be recovered yet due to technical imperfections, in order to finally find dominant chords in an examined piece of music.
Thus, it is one object to analyze pieces of music that are not already present in musical notation or as a MIDI file, but in the form of their acoustic/electric waveforms, in order to extract individual notes from the examined piece of music on the basis of the waveform present in the time domain. The objective hereof lies in the melodic transcription of polyphonic music, i.e. ultimately the generation of a complete musical notation from a time domain representation of the music, which ultimately is a series of samples, as it is stored on a CD, for example, or is present in an mp3 file in compressed/encoded manner, for example.
A musical notation of a piece of music may in a way be considered a frequency domain representation, since the piece of music is not given by a waveform in the time domain but by a series of notes or chords, i.e. several concurrent notes, which is written in the frequency domain, with the note lines here being the frequency range scale.
At the same time, however, a musical notation also includes time information, in that a note is to be played either longer or shorter according to its symbol. The musical notation therefore does not place much importance on a pure frequency domain representation, i.e. the representation of an amplitude at a specific frequency, even though amplitude information is also given. This amplitude information is, however, not specified exactly, but only generally, as an indication of whether a portion of the piece of music, i.e. some bars or notes of a musical notation, for example, is to be played loudly (forte) or quietly (piano).
In classical music, in particular, but also in modern music, it can be assumed that, apart from percussive portions, all notes/tones lie in a predefined note raster. Thus, in a correctly played piece of music not all frequencies can be present, but only the frequencies permitted by the musical notation. In the western note scale, one octave is divided into twelve halftones. These twelve halftones are, however, not arranged at a constant spacing with reference to the frequency. Instead, in the tempered tuning, as it is known from the “Well-Tempered Clavier” by Johann Sebastian Bach, for example, a sequence of tones is employed which is such that the “quality” or the “Q factor” is constant for each tone. This means that a frequency value divided by the bandwidth associated with this frequency value is constant for every tone. Tones with low frequencies have small bandwidths, whereas tones with high frequencies have large bandwidths.
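The constant Q property of the tempered scale can be checked numerically. The following is a minimal sketch, assuming an equal-tempered scale in which adjacent halftones differ by a frequency ratio of 2^(1/12), and assuming that the bandwidth of a tone is taken as the spacing to the next higher tone. The function names are hypothetical and serve only for illustration.

```python
import math

# Assumption: equal temperament, adjacent halftones differ by the ratio 2^(1/12).
SEMITONE_RATIO = 2.0 ** (1.0 / 12.0)


def tempered_scale(f_start, count):
    """Return `count` frequencies spaced one halftone apart, starting at f_start."""
    return [f_start * SEMITONE_RATIO ** k for k in range(count)]


def q_factors(freqs):
    """Q = frequency / bandwidth; the bandwidth is assumed to be the gap to the next tone."""
    return [f / (f_next - f) for f, f_next in zip(freqs, freqs[1:])]


freqs = tempered_scale(196.0, 13)  # one octave upward from G3 (196 Hz)
bandwidths = [upper - lower for lower, upper in zip(freqs, freqs[1:])]
qs = q_factors(freqs)

for f, bw, q in zip(freqs, bandwidths, qs):
    print(f"{f:8.2f} Hz  bandwidth {bw:6.2f} Hz  Q {q:6.3f}")
```

Under these assumptions, the bandwidth grows with frequency while the ratio of frequency to bandwidth stays the same for every tone.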
This “geometric” notes classification is exemplarily illustrated in
These spectral coefficients also referred to as variable spectral coefficients in the classification shown in the left half of
In the constant spectral coefficients, the spacing between two spectral coefficients at the lower end of the spectrum to the upper end of the spectrum is always the same. For illustration purposes, the twelve tones in
From the above discussion, it becomes obvious that constant spectral coefficients, as they are provided by a Fourier transform, for example, are in contrast at least with the western sense of music.
But since a transcription is to be created from a piece of music, as a first step to a harmony analysis, often no Fourier transform but a so-called constant Q transform is employed, i.e. a transform taking into account that the quality of each variable spectral coefficient is identical. This leads to the fact that the transform is supposed to provide a frequency raster, which is no constant frequency raster, as it is shown on the right in
In the technical publication “Calculation of a Constant Q Spectral Transform”, Judith C. Brown, Journal of the Acoustical Society of America, 89 (1), pages 425-432, January 1991, a time-frequency conversion is shown, which takes into account that the scale of western music is based on a geometric spectral coefficient spacing. Such a constant Q transform may be derived from a Fourier transform, in which the logarithm is taken of the frequency axis. This “pattern” in the frequency domain is the same for all music signals with harmonic frequency components. But differences manifest themselves in the amplitudes of the components in spite of their relatively fixed positions. These amplitude differences give the tone its tone color, for example.
When the frequency axis is illustrated logarithmically, it turns out that the mapping of constant spectral coefficients to variable spectral coefficients provides too little information at low frequencies and too much information at high frequencies. The discrete short-time Fourier transform gives a constant resolution for every frequency bin, which is inversely proportional to the temporal window size. This means that a window with 1,024 samples at a sampling rate of 32,000 samples per second has a resolution of 31.3 Hz. At the lower end of a violin, for example, i.e. at the frequency G3 of 196 Hz, this resolution is 16% of the frequency. This is much greater than the 6% frequency separation of two adjacent notes in the same tuning. At the upper end of a piano, the frequency of C8 is 4186 Hz, wherein the FFT resolution of 31.3 Hz leads to a resolution value of 0.7% of the center frequency. Thus, far too many frequency coefficients are calculated by the FFT at this point in the frequency range. Mathematically, the constant Q transform is represented as follows:
In this equation x[n] is the n-th sample of a digitized time function to be analyzed. The digital frequency is 2πk/N. The period in samples is N/k, and the number of analyzed cycles is equal to k. Here, W[n] indicates the window shape. The window function has the same shape for each component. Its length is, however, determined by N[k], so that it is a function of k and n.
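The resolution figures quoted above for the short-time Fourier transform can be reproduced with a few lines of arithmetic. This is a sketch only; the variable names are hypothetical, and the 6% figure is assumed to correspond to the halftone ratio 2^(1/12) of an equal-tempered scale.

```python
# Values taken from the text: 1,024 samples at 32,000 samples per second.
sampling_rate = 32000.0
window_length = 1024

# Constant bin resolution of the short-time Fourier transform.
resolution_hz = sampling_rate / window_length

# Assumption: adjacent notes differ by the equal-tempered halftone ratio.
semitone_separation = 2.0 ** (1.0 / 12.0) - 1.0

rel_g3 = resolution_hz / 196.0    # relative resolution at G3
rel_c8 = resolution_hz / 4186.0   # relative resolution at C8

print(f"resolution: {resolution_hz:.2f} Hz")
print(f"halftone separation: {100 * semitone_separation:.1f} %")
print(f"relative resolution at G3: {100 * rel_g3:.1f} %")
print(f"relative resolution at C8: {100 * rel_c8:.2f} %")
```

The relative resolution at G3 is coarser than one halftone, whereas at C8 it is much finer than one halftone, which is the mismatch described in the text.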
In the technical publication “An Efficient Algorithm for the Calculation of a Constant Q Transform”, Judith C. Brown et al., Journal of the Acoustical Society of America, 92 (5), pages 2698-2701, November 1992, an efficient algorithm for calculating the previously described transform is given. At first a discrete Fourier transform is determined, which is then converted to a constant Q transform, wherein Q is the ratio of center frequency to the bandwidth. To this end, so-called kernels are calculated, which then are applied to each consecutive DFT. Thus, each component of the constant Q transform can be calculated with a few multiplications. A spectral kernel is the discrete Fourier transform of a temporal kernel, wherein a temporal kernel is given as follows:
As window w[n,k], a Hamming window according to the following definition is used:
$$w[n, k_{cq}] = \alpha - (1 - \alpha)\cos\left(\frac{2\pi n}{N[k_{cq}]}\right)$$
In this equation, α equals 25/46.
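The window definition above translates directly into code. The following sketch evaluates it for a single window length; the function name `hamming_window` and the choice of 16 samples are assumptions made for illustration.

```python
import math

ALPHA = 25.0 / 46.0  # value of alpha given in the text


def hamming_window(length):
    """Evaluate w[n] = alpha - (1 - alpha) * cos(2*pi*n / length) for n = 0 .. length - 1."""
    return [ALPHA - (1.0 - ALPHA) * math.cos(2.0 * math.pi * n / length)
            for n in range(length)]


window = hamming_window(16)  # hypothetical window length N[k_cq] = 16
print([round(w, 4) for w in window])
```

With this value of alpha the window starts at 2α − 1 = 4/46 and reaches 1 in the middle of the window.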
In F. J. Harris, “High-Resolution Spectral Analysis with Arbitrary Spectral Centers and Arbitrary Spectral Resolutions”, “Comput. Electr. Eng. 3”, pages 171-191, 1976, a transform with bounded Q value is used, which may also serve for music analysis. Here, a fast Fourier transform (FFT) is calculated first, and all frequency values except those of the topmost octave are discarded. The signal is then filtered and downsampled by a factor of 2, and a further FFT with the same number of points as before is calculated, which leads to twice the previous resolution. Of this result, again only the second-highest octave is retained. This procedure is repeated until the lowest octave is reached. The advantage of this method is that the efficiency of the FFT is maintained, and that at the same time a variable frequency resolution and a variable time resolution are obtained, so that the obtained information can be optimized with respect to both frequency and time.
It is disadvantageous in this concept that, when a larger tone space is to be calculated, a large number of Fourier transforms nevertheless has to be calculated, wherein between successive Fourier transforms windowing (filtering) has to be performed anew and downsampling has to be done as well. This in turn means that very many temporal samples are needed for the lowest octave, whereas very few temporal samples are needed for the topmost octave. Thus, if one wishes to calculate a complete analysis, the entire pyramid, so to speak, has to be calculated through for every (small) number of samples for the topmost octave. Since most results of each FFT are “thrown away” in this method, and since a rather significant number of overlaps with respect to the lower octaves is required in the temporal “pyramid”, this method is extremely computation-intensive, in spite of using the indeed efficient FFT. In other words, for each octave an FFT of its own has to be calculated to obtain a complete spectrum. If one wishes to analyze a time signal completely, i.e. for example every 8 milliseconds or every 16 milliseconds, and if for example 6 octaves are to be calculated, as many as 96 (!) FFTs will be required for an excerpt of a piece of 128 milliseconds.
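The count of 96 FFTs follows from simple arithmetic. The sketch below assumes, as stated in the text, that one FFT is calculated per octave at every analysis instant; the function name is hypothetical.

```python
def ffts_for_octave_pyramid(excerpt_ms, hop_ms, octaves):
    """Number of FFTs when every analysis instant needs one FFT per octave."""
    analysis_instants = excerpt_ms // hop_ms
    return analysis_instants * octaves


# 128 ms excerpt, 6 octaves, analyzed every 8 ms or every 16 ms.
ffts_every_8_ms = ffts_for_octave_pyramid(128, 8, 6)
ffts_every_16_ms = ffts_for_octave_pyramid(128, 16, 6)
print(ffts_every_8_ms, ffts_every_16_ms)
```

The figure of 96 quoted in the text corresponds to the analysis every 8 milliseconds.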
One embodiment of the present invention provides a more efficient concept for converting an audio signal to a spectral representation with variable spectral coefficients.
In accordance with a first aspect, the present invention provides an apparatus for converting an information signal, which is given as a series of samples, to a spectral representation with variable spectral coefficients, with a frequency value and a bandwidth being associated with a variable spectral coefficient, and with a frequency spacing of the variable spectral coefficients being variable, having: a window filter for windowing the information signal to obtain a windowed block of the information signal having a length in time; a converter for converting the windowed block of samples to a spectral representation having a set of information signal spectral coefficients; a provider for providing a first set of complex base function coefficients, a second set of complex base function coefficients and a third set of complex base function coefficients, wherein the base function coefficients of the first set represent a result of a first windowing and transform of a first base function, which has a frequency corresponding to a first frequency value of a first variable spectral coefficient, wherein the base function coefficients of the second set represent a result of a second windowing and transform of a second base function, which has a frequency corresponding to a second frequency value of a second variable spectral coefficient, and wherein the base function coefficients of the third set represent a result of a third windowing and transform of the second base function, which has the second frequency value, wherein the first windowing, the second windowing and the third windowing differ in that a window length of a window in the first windowing differs from a window length of a window in the second and the third windowing, and that a window position of the second window and of the third window differ with reference to the second base function; and a weighter for weighting the set of information signal spectral coefficients with the first set of base function 
coefficients, in order to calculate the first variable spectral coefficient, for weighting the set of information signal spectral coefficients with the second set of base function coefficients, in order to obtain the second variable spectral coefficient for a first portion of the windowed block of the information signal, and for weighting the set of information signal spectral coefficients with the third set of base function coefficients, in order to obtain the second variable spectral coefficient for a second portion of the windowed block of the information signal, which is different from the first portion of the windowed block of the information signal.
In accordance with a second aspect, the present invention provides an apparatus for providing sets of base function coefficients, having: a provider for providing a time representation of a first and a second base function, wherein the first base function has a first frequency value, and wherein the second base function has a second frequency value, which is higher than the first frequency value; a window filter for windowing the first base function with a first window and for windowing the second base function with a second window and a third window, wherein the third window relates to a portion of the second base function later in time than the second window; and a transformer for transforming a result of a windowing of the first base function with the first window, in order to obtain a first set of base function coefficients, for transforming a result of a windowing of the second base function with the second window, in order to obtain a second set of base function coefficients, and for transforming a result of a windowing of the second base function with the third window, in order to obtain a third set of base function coefficients.
In accordance with a third aspect, the present invention provides a method of converting an information signal, which is given as a series of samples, to a spectral representation with variable spectral coefficients, with a frequency value and a bandwidth being associated with a variable spectral coefficient, and with a frequency spacing of the variable spectral coefficients being variable, with the steps of: windowing the information signal to obtain a windowed block of the information signal having a length in time; converting the windowed block of samples to a spectral representation having a set of information signal spectral coefficients; providing a first set of complex base function coefficients, a second set of complex base function coefficients and a third set of complex base function coefficients, wherein the base function coefficients of the first set represent a result of a first windowing and transform of a first base function, which has a frequency corresponding to a first frequency value of a first variable spectral coefficient, wherein the base function coefficients of the second set represent a result of a second windowing and transform of a second base function, which has a frequency corresponding to a second frequency value of a second variable spectral coefficient, and wherein the base function coefficients of the third set represent a result of a third windowing and transform of the second base function, which has the second frequency value, wherein the first windowing, the second windowing and the third windowing differ in that a window length of a window in the first windowing differs from a window length of a window in the second and the third windowing, and that a window position of the second window and of the third window differ with reference to the second base function; and weighting the set of information signal spectral coefficients with the first set of base function coefficients, in order to calculate the first variable spectral 
coefficient, weighting the set of information signal spectral coefficients with the second set of base function coefficients, in order to obtain the second variable spectral coefficient for a first portion of the windowed block of the information signal, and weighting the set of information signal spectral coefficients with the third set of base function coefficients, in order to obtain the second variable spectral coefficient for a second portion of the windowed block of the information signal, which is different from the first portion of the windowed block of the information signal.
In accordance with a fourth aspect, the present invention provides a method of providing sets of base function coefficients, with the steps of: providing a time representation of a first and a second base function, wherein the first base function has a first frequency value, and wherein the second base function has a second frequency value, which is higher than the first frequency value; windowing the first base function with a first window and windowing the second base function with a second window and a third window, wherein the third window relates to a portion of the second base function later in time than the second window; and transforming a result of a windowing of the first base function with the first window, in order to obtain a first set of base function coefficients, transforming a result of a windowing of the second base function with the second window, in order to obtain a second set of base function coefficients, and transforming a result of a windowing of the second base function with the third window, in order to obtain a third set of base function coefficients.
In accordance with a fifth aspect, the present invention provides a computer program with a program code for performing, when the computer program is executed on a computer, a method of converting an information signal, which is given as a series of samples, to a spectral representation with variable spectral coefficients, with a frequency value and a bandwidth being associated with a variable spectral coefficient, and with a frequency spacing of the variable spectral coefficients being variable, with the steps of: windowing the information signal to obtain a windowed block of the information signal having a length in time; converting the windowed block of samples to a spectral representation having a set of information signal spectral coefficients; providing a first set of complex base function coefficients, a second set of complex base function coefficients and a third set of complex base function coefficients, wherein the base function coefficients of the first set represent a result of a first windowing and transform of a first base function, which has a frequency corresponding to a first frequency value of a first variable spectral coefficient, wherein the base function coefficients of the second set represent a result of a second windowing and transform of a second base function, which has a frequency corresponding to a second frequency value of a second variable spectral coefficient, and wherein the base function coefficients of the third set represent a result of a third windowing and transform of the second base function, which has the second frequency value, wherein the first windowing, the second windowing and the third windowing differ in that a window length of a window in the first windowing differs from a window length of a window in the second and the third windowing, and that a window position of the second window and of the third window differ with reference to the second base function; and weighting the set of information signal spectral 
coefficients with the first set of base function coefficients, in order to calculate the first variable spectral coefficient, weighting the set of information signal spectral coefficients with the second set of base function coefficients, in order to obtain the second variable spectral coefficient for a first portion of the windowed block of the information signal, and weighting the set of information signal spectral coefficients with the third set of base function coefficients, in order to obtain the second variable spectral coefficient for a second portion of the windowed block of the information signal, which is different from the first portion of the windowed block of the information signal.
In accordance with a sixth aspect, the present invention provides a computer program with a program code for performing, when the computer program is executed on a computer, a method of providing sets of base function coefficients, with the steps of: providing a time representation of a first and a second base function, wherein the first base function has a first frequency value, and wherein the second base function has a second frequency value, which is higher than the first frequency value; windowing the first base function with a first window and windowing the second base function with a second window and a third window, wherein the third window relates to a portion of the second base function later in time than the second window; and transforming a result of a windowing of the first base function with the first window, in order to obtain a first set of base function coefficients, transforming a result of a windowing of the second base function with the second window, in order to obtain a second set of base function coefficients, and transforming a result of a windowing of the second base function with the third window, in order to obtain a third set of base function coefficients.
The present invention is based on the finding that a transform to a spectral representation with variable spectral coefficients may be understood as a correlation of the music signal with the sought frequency raster in which the variable spectral coefficients lie. A correlation of a signal with a frequency raster may be understood as a search for what proportion of the audio signal lies in the frequency band associated with a variable spectral coefficient. A correlation of the audio signal with a sine tone, as an example of a base function, yields the content of the audio signal at the frequency of the base tone. The conversion to a variable spectral representation hence may be achieved by correlating the audio signal with base functions, with each base function being a time representation of a variable spectral coefficient in the variable spectral representation. If this correlation is understood as a convolution, it may be understood as a convolution of the audio signal with every single base function.
According to the invention, this calculation is, however, not performed in the time domain but in the frequency domain. To this end, the audio signal itself is at first windowed to obtain a windowed block of the audio signal, wherein the windowed block of the audio signal has a predetermined temporal length. Thereafter, the windowed block of samples is converted to a spectral representation comprising a set of spectral coefficients, which preferably are constant spectral coefficients, as they are obtained by a preferably employed computation-efficient FFT, for example. This single calculated FFT spectrum of the audio signal is now subjected to a correlation with base functions, the base functions having different frequency values. For example, if variable spectral coefficients are sought at 46.0 Hz and 48.74 Hz, one base function is a sine function at 46.0 Hz and the other base function is a sine function at 48.74 Hz. Both base functions start with a defined phase with respect to each other and preferably with the same phase. Both base functions are then windowed and transformed, with the window length used for a base function setting the bandwidth that the corresponding variable spectral coefficient has in the final variable spectral representation. The spectral coefficients obtained from a base function are also referred to as a set of base function coefficients. The convolution in the time domain for correlation purposes is simply performed by a multiplication of the FFT spectrum by the base function coefficients in the frequency domain. This multiplication by the base function coefficients results in a value whose amplitude shows how much signal energy is contained in the audio signal at the frequency of the base function, with the frequency value of the variable spectral coefficient obtained therewith being given by the frequency value of the base function.
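The correlation idea can be illustrated in the time domain before any transform is involved. The following is a minimal sketch, assuming a complex exponential as the base function (so that the result does not depend on the phase of the tone), a sampling rate of 8,000 Hz and a quarter-second excerpt; all names and parameter values are hypothetical.

```python
import cmath
import math


def correlate_with_tone(samples, freq_hz, sampling_rate):
    """Normalized magnitude of the correlation of `samples` with a complex tone at freq_hz."""
    acc = 0j
    for n, x in enumerate(samples):
        acc += x * cmath.exp(-2j * math.pi * freq_hz * n / sampling_rate)
    return abs(acc) / len(samples)


FS = 8000.0          # assumed sampling rate
NUM_SAMPLES = 2000   # 0.25 s, i.e. 49 full cycles of the 196 Hz tone

# Test signal: a pure sine tone at G3 (196 Hz).
signal = [math.sin(2.0 * math.pi * 196.0 * n / FS) for n in range(NUM_SAMPLES)]

at_tone = correlate_with_tone(signal, 196.0, FS)
one_halftone_up = correlate_with_tone(signal, 196.0 * 2.0 ** (1.0 / 12.0), FS)
print(at_tone, one_halftone_up)
```

The correlation is large for the base function whose frequency matches the tone and small for the neighbouring halftone, which is the sense in which the correlation measures the content of the signal at the frequency of the base function.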
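That the weighting in the frequency domain yields the same value as the correlation in the time domain can be sketched as follows. The example assumes a naive DFT in place of an FFT, a rectangular window spanning the whole block, complex exponential base functions, a sampling rate of 250 Hz and a block of 128 samples; the coefficients are conjugated and scaled by the block length so that Parseval's relation applies. All names and parameter values are hypothetical.

```python
import cmath
import math


def dft(seq):
    """Naive discrete Fourier transform (O(N^2)); stands in for an FFT here."""
    size = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * math.pi * m * t / size) for t in range(size))
            for m in range(size)]


def windowed_base_function(freq_hz, fs, block_len):
    """Complex tone at freq_hz under a rectangular window spanning the whole block."""
    return [cmath.exp(2j * math.pi * freq_hz * t / fs) / block_len
            for t in range(block_len)]


def base_function_coefficients(freq_hz, fs, block_len):
    """Transform of the windowed base function, conjugated and scaled by 1/N."""
    kernel_spectrum = dft(windowed_base_function(freq_hz, fs, block_len))
    return [c.conjugate() / block_len for c in kernel_spectrum]


def weight(spectrum, coefficients):
    """Weight the signal spectrum with one set of base function coefficients."""
    return sum(s * c for s, c in zip(spectrum, coefficients))


FS = 250.0  # assumed sampling rate
N = 128     # assumed block length
block = [math.sin(2.0 * math.pi * 46.0 * t / FS) for t in range(N)]
spectrum = dft(block)  # calculated once and reused for every base function

coeff_46 = abs(weight(spectrum, base_function_coefficients(46.0, FS, N)))
coeff_4874 = abs(weight(spectrum, base_function_coefficients(48.74, FS, N)))

# Reference: the same correlation evaluated directly in the time domain.
time_domain_46 = abs(sum(x * k.conjugate()
                         for x, k in zip(block, windowed_base_function(46.0, FS, N))))
print(coeff_46, coeff_4874, time_domain_46)
```

In this sketch the spectrum of the block is calculated only once, and each further variable spectral coefficient costs one weighted sum over that spectrum.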
As has been set forth, the window for windowing the base function, in order to obtain the base function coefficients, sets the bandwidth of the variable spectral coefficients. For higher variable frequency values, i.e. for higher musical tones, the bandwidth does not have to be as small as for low tones any more. For this reason, the set of base function coefficients for a higher tone is obtained by the base function being windowed with a shorter window and then transformed to obtain the base function coefficients for the higher tone. The variable spectral coefficient for this higher tone is then again obtained by weighting the original FFT spectrum with the set of base function coefficients.
The invention exploits the fact that, for higher tones, the window for a base function having a higher frequency is shorter than the window for a base function having a lower frequency. Within the same windowed block, a temporally later portion of the audio signal can therefore also be analyzed, namely the portion following the window with which the second base function (representing a higher tone than the first base function) was windowed at first. To this end, the same second base function (for the higher tone) is windowed with a window lying temporally after the window with which the second base function was windowed at first. The base function coefficients obtained thereby are then used to weight the same Fourier spectrum, in order to obtain a variable spectral coefficient having the same frequency as the variable spectral coefficient just calculated, but describing the content of the audio signal at the sought frequency in the region following in time the region evaluated previously. According to the invention, this is achieved by using complex base function coefficients, which result from windowing and transforming the base function. In this way, individual audio signal regions within the window are taken into account, wherein the originally calculated audio signal spectrum preferably also is a complex spectrum.
In a preferred embodiment of the present invention, the window length of a window for determining the base function coefficients for a lower frequency value is chosen as an integer multiple of the window length for windowing a base function for a higher tone, wherein the integer multiple preferably is a multiple of 2. With this, all sets of base function coefficients may efficiently be sorted into a matrix, so that the transformation from the constant spectral representation to the variable spectral representation may be performed as a simple matrix-vector multiplication, which is extraordinarily efficient to execute, wherein the vector is the result of the constant spectral transform of the audio signal, and wherein the matrix includes one set of base function coefficients in each row.
It should be pointed out in particular that the matrix is very thinly (sparsely) populated, since, in the ideal case, a set of base function coefficients has only a single base function coefficient, namely at the frequency of the sought tone. In practice, however, further coefficients arise, because the windows for windowing a base function typically do not have sufficient resolution to resolve the frequency value of a variable spectral coefficient exactly. Furthermore, windowing the base function without regard to its phase generates additional spectral lines, which is attributable to the fact that a base function enters the window with a certain phase and exits the window with a certain phase. Moreover, the preferably used rectangular windowing, which is numerically very efficient because no weighting as with other windows has to be performed, leads to artifacts in the form of additional spectral lines next to the actual spectral line at the frequency of the base function.
Depending on the implementation, the base function coefficients may be calculated directly. It is, however, preferred to calculate the base function coefficients off-line, i.e. once for a certain temporal length of the base function window or for a certain sampling rate, and to store them in a matrix, wherein this weighting matrix may then be held in a working memory of a processor when calculating the variable spectral representation or when “transforming” the constant spectral representation to the variable spectral representation.
In a preferred embodiment, the number of base function coefficients in a set of base function coefficients is limited. Here, it is preferred to use just as many base function coefficients in weighting the constant spectrum that the base function coefficients used carry a certain percentage of the overall energy contained in a window for windowing a base function. If this percentage is set closer to 100%, the spectral analysis becomes more accurate. But if this percentage is set further away from 100%, the number of base function coefficients necessary for weighting is reduced, which results in a more efficient and quicker weighting. Thus, the matrix of the base function coefficients inherently is a thinly populated matrix, wherein this matrix may be “thinned” further by setting the percentage further away from 100%, so that algorithms for handling very thinly populated matrices may preferably be employed for a very efficient calculation. One preferred value is that the base function coefficients employed for weighting together include 90% of the energy contained in an entire window for windowing a base function.
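The use of two window positions for the same higher tone can be sketched with a block in which the tone sounds only in the second half. The example assumes a naive DFT in place of an FFT, rectangular windows of half the block length, a complex exponential base function, and hypothetical values for sampling rate, block length and tone frequency.

```python
import cmath
import math


def dft(seq):
    """Naive discrete Fourier transform (O(N^2)); stands in for an FFT here."""
    size = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * math.pi * m * t / size) for t in range(size))
            for m in range(size)]


def coefficients(freq_hz, fs, block_len, win_start, win_len):
    """Complex base function coefficients for a rectangular window starting at win_start."""
    kernel = [0j] * block_len
    for t in range(win_start, win_start + win_len):
        kernel[t] = cmath.exp(2j * math.pi * freq_hz * t / fs) / win_len
    return [c.conjugate() / block_len for c in dft(kernel)]


FS = 256.0     # assumed sampling rate
N = 128        # assumed block length
HALF = N // 2  # window length for the higher tone

# The 64 Hz tone is present only in the second half of the block.
block = [0.0] * HALF + [math.sin(2.0 * math.pi * 64.0 * t / FS) for t in range(HALF, N)]
spectrum = dft(block)  # one spectrum for the whole block

second_set = coefficients(64.0, FS, N, 0, HALF)    # earlier window position
third_set = coefficients(64.0, FS, N, HALF, HALF)  # later window position

first_portion = abs(sum(s * c for s, c in zip(spectrum, second_set)))
second_portion = abs(sum(s * c for s, c in zip(spectrum, third_set)))
print(first_portion, second_portion)
```

Both values are obtained from the same spectrum of the whole block. Only the complex coefficient sets differ, and they carry the information about which portion of the block is evaluated.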
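The arrangement of the coefficient sets in a matrix can be sketched as follows. The example assumes three tones one octave apart whose window lengths halve from tone to tone, rectangular windows, a naive DFT in place of an FFT, and a dense matrix stored as a list of rows; all names and values are hypothetical.

```python
import cmath
import math


def dft(seq):
    """Naive discrete Fourier transform (O(N^2)); stands in for an FFT here."""
    size = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * math.pi * m * t / size) for t in range(size))
            for m in range(size)]


def coefficient_row(freq_hz, fs, block_len, win_start, win_len):
    """One matrix row: the base function coefficients for one tone and one window position."""
    kernel = [0j] * block_len
    for t in range(win_start, win_start + win_len):
        kernel[t] = cmath.exp(2j * math.pi * freq_hz * t / fs) / win_len
    return [c.conjugate() / block_len for c in dft(kernel)]


def build_matrix(tones, fs, block_len):
    """Stack one row per tone and window position; shorter windows yield more rows."""
    rows = []
    for freq_hz, win_len in tones:
        for win_start in range(0, block_len, win_len):
            rows.append(coefficient_row(freq_hz, fs, block_len, win_start, win_len))
    return rows


def mat_vec(matrix, vector):
    """Plain matrix-vector multiplication."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


FS = 256.0  # assumed sampling rate
N = 64      # assumed block length
TONES = [(16.0, 64), (32.0, 32), (64.0, 16)]  # (frequency in Hz, window length in samples)

matrix = build_matrix(TONES, FS, N)  # may be calculated once and stored
block = [math.sin(2.0 * math.pi * 32.0 * t / FS) for t in range(N)]
variable_spectrum = [abs(v) for v in mat_vec(matrix, dft(block))]
print([round(v, 3) for v in variable_spectrum])
```

The matrix has one row for the lowest tone, two for the middle tone and four for the highest tone, so that a single multiplication delivers the coefficients for all tones and all window positions of the block.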
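Limiting a set of base function coefficients to a given share of the energy can be sketched as follows. The example assumes that the coefficients with the largest magnitudes are kept first, that the truncated set is stored as a dictionary from frequency index to coefficient, and that a naive DFT stands in for an FFT; the selection rule, the names and the parameter values are assumptions made for illustration.

```python
import cmath
import math


def dft(seq):
    """Naive discrete Fourier transform (O(N^2)); stands in for an FFT here."""
    size = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * math.pi * m * t / size) for t in range(size))
            for m in range(size)]


def base_function_coefficients(freq_hz, fs, block_len):
    """Coefficients of a complex tone under a rectangular window spanning the block."""
    kernel = [cmath.exp(2j * math.pi * freq_hz * t / fs) / block_len
              for t in range(block_len)]
    return [c.conjugate() / block_len for c in dft(kernel)]


def truncate_to_energy(coefficients, fraction):
    """Keep the largest coefficients until `fraction` of the total energy is covered."""
    energies = [abs(c) ** 2 for c in coefficients]
    total = sum(energies)
    order = sorted(range(len(coefficients)), key=lambda i: energies[i], reverse=True)
    kept = {}
    covered = 0.0
    for i in order:
        if covered >= fraction * total:
            break
        kept[i] = coefficients[i]
        covered += energies[i]
    return kept, covered / total


FS = 250.0  # assumed sampling rate
N = 128     # assumed block length
full_set = base_function_coefficients(46.0, FS, N)
sparse_set, covered_fraction = truncate_to_energy(full_set, 0.90)

block = [math.sin(2.0 * math.pi * 46.0 * t / FS) for t in range(N)]
spectrum = dft(block)
full_value = abs(sum(s * c for s, c in zip(spectrum, full_set)))
sparse_value = abs(sum(spectrum[i] * c for i, c in sparse_set.items()))
print(len(sparse_set), covered_fraction, full_value, sparse_value)
```

In this sketch only a small number of the 128 coefficients has to be multiplied per variable spectral coefficient, at the cost of a somewhat smaller result than with the full set.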
The accompanying drawings are included to provide a further understanding of the present invention and are incorporated in and constitute a part of this specification. The drawings illustrate the embodiments of the present invention and together with the description serve to explain the principles of the invention. Other embodiments of the present invention and many of the intended advantages of the present invention will be readily appreciated as they become better understood by reference to the following detailed description. The elements of the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding similar parts.
These and other objects and features of the present invention will become clear from the following description taken in conjunction with the accompanying drawings, in which:
In the following Detailed Description, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. In this regard, directional terminology, such as “top,” “bottom,” “front,” “back,” “leading,” “trailing,” etc., is used with reference to the orientation of the Figure(s) being described. Because components of embodiments of the present invention can be positioned in a number of different orientations, the directional terminology is used for purposes of illustration and is in no way limiting. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.
The windowed block of samples is supplied to a means 12 for converting the windowed block to a spectral representation, which has a set of complex spectral coefficients, wherein for efficiency reasons a conversion rule providing a set of complex constant spectral coefficients is preferred, wherein the frequency values of these constant spectral coefficients have a constant bandwidth and/or a constant frequency spacing.
The apparatus according to the invention further includes a means 14 for providing the sets of base function coefficients. The means 14 preferably is formed as a lookup table, in which a matrix is filed, wherein the matrix coefficients can be referenced by their line/column position of the lookup table. In particular, the means 14 for providing is formed to provide at least a first set of base function coefficients, a second set of base function coefficients and a third set of base function coefficients, wherein the base function coefficients according to the invention are complex base function coefficients. In particular, a first set of base function coefficients represents a result of a first windowing and a first transform of a first base function. The first base function has a frequency corresponding to a first frequency value of a first variable spectral coefficient. As will be explained later with reference to
The base function coefficients of the second set of base function coefficients are a result of a second windowing and a second transform of a second base function. The second base function is, for example, a sine function with a frequency of 277 Hz, when reference is again made to
The third set of base function coefficients in turn represents a result of a third windowing and transform of the second base function, i.e. the base function that is a sine signal at a frequency of 277 Hz, for example.
The first, the second and the third windowing differ in that a window length in the first windowing is different as compared with a window length in the second windowing and in the third windowing, wherein, in the example shown in
According to the invention, the window positions of the windows in the second and in the third windowing also are different from each other, so that the third window provides a temporally later portion of the second base function than the second window for windowing the second base function. Thus, in the embodiment shown in
The apparatus according to the invention, as it is illustrated in
Because the audio spectrum preferably is a complex spectrum, i.e. includes phase information of the spectral values, and because the base function coefficients are also complex coefficients including phase information of the base function within the window for calculating the base function coefficients, it is achieved according to the invention that the second variable spectral coefficient is calculated with higher time resolution than the first variable spectral coefficient. In other words, from one and the same complex audio spectrum, a first (low) temporal resolution is obtained for the lowest variable spectral coefficient, whereas for the second variable spectral coefficient two values that are successive in time are already obtained, so that the second variable spectral coefficient is obtained with a second (high) temporal resolution.
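By way of illustration only, the effect described in this paragraph can be checked numerically. The following is a minimal sketch and not the implementation of the embodiment: it uses a naive DFT on a 64-sample frame, two hypothetical half-length rectangular base function windows for the same frequency, and it assumes that the weighting is the inner product of the audio spectrum with the complex-conjugated base function coefficients, divided by the frame length. By Parseval's theorem this equals correlating the audio samples with the windowed base function in the time domain. The names `dft`, `base_coeffs` and `weight`, and the values of `N` and `BIN`, are illustrative assumptions.

```python
import cmath
import math

N = 64    # frame length in samples (assumed, kept small for a naive DFT)
BIN = 8   # base function frequency expressed in DFT bins (assumed)

def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def base_coeffs(start, length):
    """Spectrum of a complex base function cut out by a rectangular window.

    The base function has phase 0 at sample 0 of the frame; samples outside
    [start, start + length) are zero.
    """
    windowed = [cmath.exp(2j * math.pi * BIN * t / N)
                if start <= t < start + length else 0j
                for t in range(N)]
    return dft(windowed)

def weight(spectrum, coeffs):
    """Weight the audio spectrum with conjugated base function coefficients."""
    return sum(s * c.conjugate() for s, c in zip(spectrum, coeffs)) / len(spectrum)

# Audio frame: silence in the first half, a tone at the base frequency afterwards.
audio = [math.cos(2 * math.pi * BIN * t / N) if t >= N // 2 else 0.0
         for t in range(N)]
spectrum = dft(audio)  # a single transform of the whole frame

early = weight(spectrum, base_coeffs(0, N // 2))       # earlier window
late = weight(spectrum, base_coeffs(N // 2, N // 2))   # later window
```

In this sketch the value for the earlier window vanishes while the value for the later window does not, although both are derived from the same spectrum of the whole frame.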
Furthermore, because the second window and the third window for windowing the second base function are shorter, i.e. have a shorter window length, than the first window for windowing the first base function, the frequency resolution of the second variable spectral coefficient will be lower, i.e. its bandwidth will be greater, both at the earlier and at the later point in time, than that of the first variable spectral coefficient, so that the first and the second variable spectral coefficient have a variable resolution.
Subsequently, with reference to
Furthermore, in the middle diagram,
In
Since the second base function window and the third base function window are of equal length, they provide a second and a third set of base function coefficients that have the same spectral resolution. This resolution is lower than the resolution of the first set of base function coefficients, but higher than the resolution of e.g. the k-th set of base function coefficients, which is obtained by windowing the n-th base function with the window 35a in
Subsequently, with reference to
Preferably, all base functions, i.e. all sine functions with frequencies from 46 Hz to 7040 Hz, start with the phase 0 at one and the same reference point for the base functions, which lies at 0 ms in the embodiment shown in
The variable spectral coefficients for the frequencies from 46 Hz to 124 Hz, which represent the first eighteen halftones, therefore apply to a time region of the audio signal from 0 ms to 256 ms, since the 0-th base function window preferably coincides with the audio window. The variable spectral coefficients for the frequency values 131 Hz to 262 Hz refer to a range of the audio signal from 64 ms to 192 ms.
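The frequency values quoted here follow from the halftone scale. A brief sketch is given below, assuming an equal-tempered scale of 88 halftones anchored at the topmost tone of 7040 Hz, as stated for the preferred embodiment; the function name and the band slices are hypothetical.

```python
def halftone_frequencies(top_hz=7040.0, count=88):
    """Equal-tempered frequencies, lowest first, ending at top_hz."""
    return [top_hz * 2.0 ** ((i - (count - 1)) / 12.0) for i in range(count)]

freqs = halftone_frequencies()
lowest_band = freqs[:18]   # the first eighteen halftones (256 ms window)
next_band = freqs[18:31]   # the thirteen halftones of the following band
```

Under these assumptions the lowest value lies at about 46.25 Hz, and the following band runs from about 131 Hz to about 262 Hz.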
Because the second base function window 40 and the third base function window 41 are only half as long as the first base function window 42, one variable spectral coefficient for the time portion from 64 ms to 128 ms and a second variable spectral coefficient for the portion from 128 ms to 192 ms result for each of the frequencies from 277 Hz to 523 Hz.
For each of the frequency values 554 Hz to 1046 Hz, four variable spectral coefficients result in turn, wherein the first variable spectral coefficient for e.g. the frequency of 554 Hz refers to the portion of the audio signal between 64 ms and 96 ms. The second variable spectral coefficient, which goes back to the next window 49, refers to the excerpt between 96 ms and 128 ms of the original audio signal. The further variable spectral coefficients, e.g. for the frequency value 1108 Hz, result for the corresponding later excerpts in an analogous manner.
For a group of e.g. the topmost 21 halftones, which cover the frequencies between 2216 Hz and 7040 Hz, it is preferred to take windows with a window length of 8 ms each, so that 16 such short windows 48 fit in a long first base function window 42.
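The window arrangement described in the preceding paragraphs can be tabulated with a few lines of code. This sketch assumes that every band above the lowest one tiles the 128 ms span from 64 ms to 192 ms without gaps. The 16 ms window length is inferred from the halving pattern and is not stated explicitly in the text, and the function name is hypothetical.

```python
def base_windows(length_ms, span_start_ms=64, span_ms=128):
    """Return (start, end) times in ms of equal windows tiling the span."""
    count = span_ms // length_ms
    return [(span_start_ms + i * length_ms, span_start_ms + (i + 1) * length_ms)
            for i in range(count)]

# The lowest band uses a single 256 ms window coinciding with the audio window.
layout = {256: [(0, 256)]}
for length in (128, 64, 32, 16, 8):   # 16 ms is an inferred, assumed value
    layout[length] = base_windows(length)
```

For a window length of 8 ms this yields 16 windows, the last of which ends at 192 ms.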
It is to be pointed out that the base function coefficients obtained by the window arrangement, as it is schematically shown in
In other words, the variable spectral coefficients, which go back to a base function window that is longer than another window, are “reused” for the spectrums resulting due to shorter base function windows. With reference to
This “recycling” of variable spectral coefficients due to longer base function windows does, however, correspond to the natural laws of time/frequency resolution, because—stated simply—a period of a signal with low frequency is longer than a period of a signal with high frequency.
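A minimal sketch of such a reuse is given below. It assumes that a band delivers 1, 2, 4, 8 or 16 values for the analyzed 128 ms region and that a value from a longer window is simply held for all 8 ms slots it covers; the function name and the toy values are hypothetical.

```python
def hold_to_slots(values, slots=16):
    """Repeat each value so that a band with few values fills all slots."""
    if slots % len(values) != 0:
        raise ValueError("number of values must divide the number of slots")
    repeat = slots // len(values)
    return [v for v in values for _ in range(repeat)]

# Toy magnitudes for one frequency per band (window lengths 128, 64 and 8 ms).
bands = {
    "128 ms": [0.9],
    "64 ms": [0.2, 0.7],
    "8 ms": [float(i) for i in range(16)],
}
# One row per band, one column per 8 ms slot, i.e. 16 complete spectra.
matrix = {name: hold_to_slots(vals) for name, vals in bands.items()}
```

Reading the rows column by column then gives one complete spectrum for each of the 16 shortest windows.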
The inventive concept thus provides, using only a single FFT as well as a single multiplication with a pre-stored, very thinly populated matrix, 16 variable spectrums, with each spectrum having a length of 8 ms, such that with this a complete—gap-free—region of the audio signal with a length of 128 ms is analyzed with high time resolution and high frequency resolution. For the same example, the bounded Q analysis mentioned at the beginning would require 96 (!) complete Fourier transforms.
It is to be pointed out that the base function window does not necessarily have to be offset with respect to all other base function windows. Instead, the window beginning of the 0-th base function window could also be aligned with the window beginning of the first base function window, etc. In this case, it would furthermore be preferred to mirror the entire window arrangement at a vertical line starting with the tone at 131 Hz, so that the first base function window 42 would have a downstream further base function window of equal length, while now four base function windows of equal length would be in the line with the base function windows 40 and 41.
The arrangement of the upper base function windows in centered manner above the lower base function window shown in
Subsequently, with reference to
Typically, the result of the transform in the block 62 will be a spectrum having few prominent lines and many minor lines, wherein the few prominent lines are to be attributed to the fact that the frequency value of a variable spectral coefficient will not necessarily match the resolution achieved by the transform 62. Furthermore, coefficients are also generated due to the fact that the base functions do not necessarily have to enter the window with the phase 0 and not necessarily have to exit the window with the phase 0. Moreover, the windowing itself also leads to artifacts, which are, however, uncritical. Furthermore, some compensation of the artifacts exists when the same window shape is employed as audio window and as base function window. It has turned out that the simplest window to be handled numerically, i.e. the rectangular window, has provided the best results according to the invention.
So as to have defined conditions, a selection is then performed among a set of base function coefficients. To this end, the spectrum is fed to a means 63 for squaring each spectral value, i.e. each base function coefficient, and for summing the squared base function coefficients in order to obtain a measure for the overall energy. Thereupon, the spectrum is fed to a means 64 for arranging the spectral coefficients according to their size and for summing them, starting from the greatest and proceeding toward the smallest value, wherein this summing is continued until a predetermined energy threshold in percent is reached. Only the spectral values that have been summed continue to be used as base function coefficients, whereas the spectral values that have not taken part in the summing are set to 0 in a defined manner, in order to further thin out the coefficient matrix, which will be described later. Thereupon, the summed spectral coefficients, i.e. the spectral coefficients that have taken part in the summing and have contributed to the 90% measure of energy, are fed to a means 65 for scaling, such that in the end the base function coefficients in each set of base function coefficients together have the same energy. This offsets the fact that a base function brings substantially more energy into a long window than into a short window. So as to obtain no artifacts from this, the energy of each set of base function coefficients is made equal within a predetermined deviation threshold of e.g. 50%, and preferably 5%.
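The selection and scaling steps of the means 63 to 65 can be sketched as follows. This is an illustration under assumptions rather than the implementation of the embodiment: the 90% threshold is the preferred value named in the text, the target energy of 1.0 is an arbitrary choice, and the function name is hypothetical.

```python
import math

def select_and_scale(coeffs, threshold=0.9, target_energy=1.0):
    """Keep the largest coefficients carrying `threshold` of the energy.

    All other coefficients are set to 0; the kept ones are scaled so that
    their summed squared magnitude equals `target_energy`.
    """
    energies = [abs(c) ** 2 for c in coeffs]
    total = sum(energies)                      # overall energy (means 63)
    if total == 0:
        return [0j] * len(coeffs)
    order = sorted(range(len(coeffs)), key=lambda i: energies[i], reverse=True)
    kept, acc = [], 0.0
    for i in order:                            # greatest to smallest (means 64)
        if acc >= threshold * total:
            break
        kept.append(i)
        acc += energies[i]
    scale = math.sqrt(target_energy / acc)     # equal energy per set (means 65)
    out = [0j] * len(coeffs)
    for i in kept:
        out[i] = coeffs[i] * scale
    return out

sparse = select_and_scale([0.1, 3 + 0j, 0.2j, 1.0, 0.05])
```

In this toy example two of the five coefficients survive the selection, and after scaling they together carry the target energy.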
Hereupon, the scaled base function coefficients having “survived” the selection step in block 64 are fed to a means 66 for entering into the coefficient matrix, which is finally stored preferably in a lookup table (LUT) by a means 67. In
In the embodiment shown in
At this point, it is to be pointed out that the crosses in
It is to be pointed out once again that the crosses in
The inventive concept concerns a range of 88 halftones more specifically between 46.3 Hz (F1 Sharp) and 7040 Hz (A8) with window sizes from 256 ms to 8 ms. For the lowest frequencies, as it has been illustrated, a temporally overlapped analysis window of 50% is used, with which a maximum frame increment of 128 ms for the system results. This property of course generates more output values for higher frequencies, when the samples of the input signal are analyzed without gaps. A practical solution for this mismatch is a sample and hold automatism, which is used for the lower frequency output values, whereby the matrix representation (
In particular, the inventive concept is characterized by the fact that the computationally more efficient rectangular windows are employed, instead of the more intensive Hamming windows. Furthermore, in a preferred embodiment of the present invention, a complete analysis is achieved at a 50% overlap, wherein particularly the inventive matrix structure illustrated on the basis of
The inventive concept is characterized by a block-wise constant window length, and thus by a quality factor, which varies within a band (of
At this point, it is to be pointed out that the examination time window, i.e. the audio signal window, refers to a signal portion of the time signal to be analyzed. This time signal is multiplied by a rectangular window of 256 ms width in the time domain and transformed to the frequency domain by FFT, where then the exact analysis takes place using the CQT coefficients or base function coefficients. The rectangular window is moved on by 50% of its width each, i.e. 128 ms, before the next FFT is calculated. Each sample in the time domain thus enters the FFT twice. The width of the rectangular window is determined by the intended high resolution at these frequencies. Since the demands on the frequency resolution decrease, however, toward higher frequencies, a smaller window width also is sufficient there.
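The framing described in this paragraph can be sketched as follows. A sampling rate of 16000 Hz is assumed purely for illustration, so that a 256 ms window holds 4096 samples; the function name is hypothetical.

```python
def frame_starts(num_samples, sample_rate=16000, window_ms=256, overlap=0.5):
    """Start indices of rectangular audio windows advanced by the hop size."""
    window = sample_rate * window_ms // 1000   # 4096 samples at 16 kHz
    hop = int(window * (1.0 - overlap))        # 2048 samples, i.e. 128 ms
    return window, hop, list(range(0, num_samples - window + 1, hop))

window, hop, starts = frame_starts(num_samples=4 * 4096)
# Count how many windows contain a sample away from the signal edges.
sample = 5000
hits = sum(1 for s in starts if s <= sample < s + window)
```

With these values a sample away from the signal edges falls into exactly two windows, which corresponds to each sample entering the FFT twice.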
The modified CQT at this point takes advantage of the phase information of the coefficients, in order to enable more accurate location of the spectral components within the audio window. In other words, for rectangular windows a different number of values results depending on the frequency range: exactly one value for the lowest frequency range, wherein each sample is used twice here owing to the 50% overlap, and also exactly one value for the next higher range, wherein only the half of the samples centered around the window center is used. For the next higher range, exactly two values result, wherein only the second or the third quarter of the samples is used for each, etc. It is preferred to illustrate the overall result of the transform in matrix form. Since there is a different number of values for the same analysis part depending on the frequency range, which is the feature of the present invention with respect to the high time resolution, a repetition or “recycling” of the values from the lower frequency ranges is performed in order to indicate a complete spectrum for every smallest window.
With respect to the selection of the base function coefficients, it is to be pointed out that, starting from the highest values per line, i.e. per analysis bin, the coefficients are squared and summed until the threshold of 90% of the greatest square sum occurring in the entire matrix or matrix line is reached. All other coefficients of each line are set to 0. The retained coefficients are then normalized line by line to achieve uniform weighting of the lines.
A preferred application of the inventively generated variable spectral representation lies in music analysis and particularly in transcription, i.e. note finding, or in key recognition or chord detection, or generally wherever a frequency analysis with variable bandwidth for the spectral coefficients is required. Further fields of application therefore lie in the transform of, generally speaking, information signals, which may be video signals, but also temporal measurement values or temporal simulation courses of an electric or electronic parameter, the frequency representation of which with high time and high frequency resolution is of interest.
Finally, it is to be pointed out that the inventive concept may be implemented as hardware, software or as a mixture of hardware and software. The present invention thus also relates to a computer program with a machine-readable code by which one of the methods according to the invention is executed when the computer program is executed on a computer.
While this invention has been described in terms of several preferred embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific embodiments shown and described without departing from the scope of the present invention. This application is intended to cover any adaptations or variations of the specific embodiments discussed herein. Therefore, it is intended that this invention be limited only by the claims and the equivalents thereof.
Number | Date | Country | Kind |
---|---|---|---|
10 2004 028 694 | Jun 2004 | DE | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2005/004518 | 4/27/2005 | WO | 00 | 4/8/2008 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2005/122135 | 12/22/2005 | WO | A |
Number | Date | Country | |
---|---|---|---|
20090100990 A1 | Apr 2009 | US |