This application claims priority from German Patent Application No. 102004049457.6, which was filed on 11 Oct. 2004, and German Patent Application No. 102004049517.3, which was filed on 11 Oct. 2004, and are incorporated herein by reference in its entirety.
The present invention relates to the extraction of a melody underlying an audio signal. Such an extraction may for example be used in order to obtain a transcribed illustration or musical representation of a melody underlying a monophonic or polyphonic audio signal which may also be present in an analog form or in a digital sampled form. Melody extractions thus enable for example the generation of ring tones for mobile telephones from any audio signal, like e.g. singing, humming, whistling or the like.
For some years already, signal tones of mobile telephones have not only served for signalizing a call anymore. The same rather became an entertainment factor with growing melodic capabilities of the mobile devices and a status symbol among adolescents.
Earlier mobile telephones partially offered the possibility to compose monophonic ring tones at the device itself. This was complicated, however, and often frustrating for users with a little knowledge regarding music and was unsatisfactory with regard to the results. Therefore, this possibility or functionality, respectively, has largely disappeared from new telephones.
In particular modern telephones, which allow polyphonic signalizing melodies or ring tones, respectively, offer such an abundance of combinations, so that an independent composition of a melody on such a mobile device is hardly possible anymore. At most, ready-made melody and accompaniment patterns may be newly combined in order to thus enable independent ring tones in a restricted way.
Such a combination possibility of ready-made melody and accompaniment patterns is for example implemented in the telephone Sony-Ericsson T610. In addition to that, the user is, however, dependent on buying commercially available, ready-made ring tones.
It would be desirable, to be able to provide an intuitively operable interface for generating a suitable signalizing melody to the user that does not assume a high musical education but is suitable for a conversion of own polyphonic melodies.
In most keyboards today, a functionality exists known as a so called accompanying automatics, to automatically accompany a melody when the chords to be used are predetermined. Apart from the fact that such keyboards provide no possibility to transmit the melody provided with an accompaniment via an interface to a computer and have it converted into a suitable mobile telephone format in order to be able to use the same as ring tones in a mobile telephone, the use of a keyboard for generating own polyphonic signalizing melodies for mobile telephones is not an option for most users as same are not able to operate this musical instrument.
DE 102004010878.1 with the title “Vorrichtung und Verfahren zum Liefern einer Signalisierungs-Melodie”, whose applicant is the same as the applicant of the present invention and which was filed at the German Patent and Trademark Office on Mar. 5, 2004, describes a method using which with the help of a java applet and a server software monophonic and polyphonic ring tones may be generated and sent to a mobile device. The approaches for extracting the melody from audio signals proposed there are very prone to errors or only useable in a limited way, however. Among others it is proposed there to obtain a melody of an audio signal by extracting characteristic features from the audio signal in order to compare the same with corresponding features of pre-stored melodies and to then select that one among the pre-stored melodies as the generated melody for which the best match results. This approach, however, inherently restricts the melody recognition to the pre-stored set of melodies.
DE 102004033867.1 with the title “Verfahren und Vorrichtung zur rhythmischen Aufbereitung von Audiosignalen” and DE 102004033829.9 with the title “Verfahren und Vorrichtung zur Erzeugung einer polyphonen Melodie” which were filed at the same day at the German Patent and Trademark Office, are also directed to the generation of melodies from audio signals, do not consider the actual melody recognition in detail, however, but rather the subsequent process of deriving an accompaniment from the melody together with a rhythmic and harmony-depending processing of the melody.
Bello, J. P., Towards the Automated Analysis of Simple Polyphonic Music: A Knowledge-based Approach, University of London, Diss., January 2003 for example treats the possibilities of melody recognition, wherein different types of the recognition of the initial point of time of notes are described either based on the local energy in the time signal or on an analysis in the frequency domain. Apart from that, different methods for a melody line recognition are described. The common thing about these proceedings is that they are complicated in that the finally obtained melody is obtained via detours by the fact that initially in the time/spectral representation of the audio signal several trajectories are processed or traced, respectively, and that only among those trajectories finally the selection of the melody line or the melody, respectively, is made.
Also in Martin, K. D., A Blackboard System for Automatic Transcription of Simple Polyphonic Music, M.I.T. Media Laboratory Perceptual Computing Section Technical Report No. 385, 1996, a possibility for an automatic transcription is described, wherein the same is also based on the evaluation of several harmonic traces in a time/frequency representation of the audio signal or the spectrogram of the audio signal, respectively.
In Klapuri, A. P.: Signal Processing Methods for the Automatic Transcription of Music, Tampere University of Technology, Summary Diss., December 2003, and Klapuri, A. P., Signal Processing Methods for the Automatic Transcription of Music, Tampere University of Technology, Diss., December 2003, A. P. Klapuri, “Number Theoretical Means of Resolving a Mixture of several Harmonic Sounds”. In Proceedings European Signal Processing Conference, Rhodos, Greece, 1998, A. P. Klapuri, “Sound Onset Detection by Applying Psychoacoustic Knowledge”, in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, Ariz., 1999, A. P. Klapuri, “Multipitch Estimation and sound separation by the Spectral Smoothness Principle”, in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, Utah, 2001, Klapuri A. P. and Astola J. T., “Efficient Calculation of a Physiologically-motivated Representation for Sound”, in Proceedings 14th IEEE International Conference on Digital Signal Processing, Santorin, Greece, 2002, A. P. Klapuri, “Multiple Fundamental Frequency Estimation based on Harmonicity and Spectral Smoothness”, IEEE Trans. Speech and Audio Proc., 11(6), pp. 804-816, 2003, Klapuri A. P., Eronen A. J. and Astola J. T., “Automatic Estimation of the Meter of Acoustic Musical Signals”, Tempere University of Technology Institute of Signal Processing, Report 1-2004, Tampere, Finland, 2004, ISSN: 1459:4595, ISBN: 952-15-1149-4, different methods regarding the automatic transcription of music are described.
With regard to the basic research in the field of the extraction of a main melody as a special case of polyphonic transcription, further Bauman, U.: Ein Verfahren zur Erkennung und Trennung multipler akustischer Objekte, Diss., Lehrstuhl für Mensch-Maschine-Kommunikation, Technische Universit{hacek over (a)}t München, 1995, is to be noted.
The above-mentioned different approaches for melody recognition or automatic transcription, respectively, present special requirements for the input signal. For example, they only admit piano music or only a certain number of instruments and exclude percussive instruments or the like.
The hitherto most practicable approach for current modern and popular music is the approach of Goto, as it is for example described in Goto, M.: A Robust Predominant-FO Estimation Method for Real-time Detection of Melody and Bass Lines in CD Recordings, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. II-757-760, June 2000. The goal in this method is the extraction of a dominant melody and bass line, wherein the detour for line finding again takes place via the selection among several trajectories, i.e. using so called “agents”. Therefore, the method is expensive.
Melody detection is also treated by Paiva R. P. et al.: A Methodology for Detection of Melody in Polyphonic Musical Signals, 116th AES Convention, Berlin, May 2004. Also there the proposal is made to take the path of a trajectory tracing in the time/spectral representation. The document also relates to the segmentation of the individual trajectories until the same are post-processed to a note sequence.
It would be desirable to have a method for melody extraction or automatic transcription, respectively, which is more robust and reliably functions for a wider plurality of different audio signals. Such a robust system may lead to high time and cost savings in “Query by Huming”-systems, i.e. in systems in which a user is able to find songs in a data base by humming them in, as an automatic transcription for the reference files of the system data base would be possible. A robustly functioning transcription might also find use as a receiving front-end. It would further be possible to use an automatic transcription as a supplement to an audio ID system, i.e. a system which recognizes audio files at a fingerprint contained within the same, as when not recognized by the audio ID system, like e.g. due to a missing fingerprint, the automatic transcription might be used as an alternative in order to evaluate an incoming audio file.
A stably functioning automatic transcription would further provide a manufacturing of similarity relations in connection with other musical features, like e.g. key, harmony and rhythm, like e.g. for a “recommendation-engine”. In musical science, a stabile automatic transcription might provide new views and lead to a review of opinions with regard to older music. Also for maintaining the copyrights by an objective comparison of pieces of music, an automatic transcription which is stabile in its application might be used.
In summary, the application of the melody recognition or autotranscription, respectively, is not restricted to the above-mentioned generation of ring tones for mobile telephones, but may in general serve as a support for musicians and those interested in music.
It is the object of the present invention to provide a more stable scheme for melody recognition or one working correctly for a wider plurality of audio signals, respectively.
In accordance with a first aspect, the present invention provides a device for the extraction of a melody underlying an audio signal having means for providing a time/spectral representation of the audio signal; means for scaling the time/spectral representation using curves of equal volume reflecting the human volume perception in order to obtain a perception-related time/spectral representation; and the determinater for determining a melody of the audio signal on the basis of the perception-related time/spectral representation.
In accordance with a second aspect, the present invention provides a method for extracting a melody underlying an audio signal, having the steps of providing a time/spectral representation of the audio signal; scaling the time/spectral representation using curves of equal volume reflecting human volume perception in order to obtain a perception-related time/spectral representation; and on the basis of the perception-related time/spectral representation, determining a melody of the audio signal.
In accordance with a third aspect, the present invention provides a device for extracting a melody underlying an audio signal, having means for providing a time/spectral representation of the audio signal, wherein means for providing is implemented such that it provides a time/spectral representation which comprises a spectral band with a sequence of spectral values for each of a plurality of spectral components, and that the time/spectral representation comprises a spectral value for every time section of a sequence of time sections of the audio signal in each spectral band; means for determining, on the basis of the time/spectral representation of the audio signal, a melody line of the audio signal by a unique association of exactly the one spectral component to each time section for which the time/spectral representation or a version of the time/spectral representation derived from the same is at maximum; and means for determining the melody of the audio signal based on the melody line.
In accordance with a fourth aspect, the present invention provides a method for extracting a melody underlying an audio signal, having the steps of providing a time/spectral representation of the audio signal such that a time/spectral representation is provided which comprises a spectral band with a sequence of spectral values for each of a plurality of spectral components, and that the time/spectral representation comprises a spectral value for each time section of a sequence of time sections of the audio signal in each spectral band; on the basis of the time/spectral representation of the audio signal, determining a melody line of the audio signal by a unique allocation of exactly the one spectral component to each time section for which the time/spectral representation at the respective time section has the highest spectral value; and based on the melody line, determining the melody of the audio signal.
In accordance with a fifth aspect, the present invention provides a computer program having a program code for performing one of the above mentioned methods when the computer program runs on a computer.
It is the finding of the present invention that the melody extraction or the automatic transcription may be made clearly more stable and if applicable even less expensive if the assumption is considered sufficiently that the main melody is that part of a piece of music that man perceives the loudest and most concise. With regard to this, according to the first aspect of the present invention, the time/spectral representation or the spectrogram, respectively, of an interesting audio signal is scaled using the curves of equal volume that reflect human volume perception, in order to determine the melody of the audio signal on the basis of the resulting perception-related time/spectral representation, and according to the second aspect of the present invention, in the determination of the melody of the audio signal first of all a melody line extending through the time/spectral representation is determined, by the fact that exactly one spectral component or one frequency bin of the time/spectral representation is associated with every time section or frame, respectively—in a unique way—i.e., according to a special embodiment, the one that leads to the sound result with the maximum intensity.
According to a preferred embodiment of the present invention, the above indicated statement of musicology, that the main melody is the portion of a piece of music that man perceives the loudest and most concise, is considered with regard to two aspects. According to this embodiment of the first aspect, in determining the melody of the audio signal at first a melody line is determined extending through the time/spectral representation, by the fact that exactly one spectral component or one frequency bin of the time/spectral representation is associated with every time section or frame, respectively—in a unique way—i.e. the one that leads to the sound result with the maximum intensity. In particular, according to this embodiment the spectrogram of the audio signal is first of all logarithmized, so that the logarithmized spectral values indicate the sonic pressure level. Subsequently, the logarithmized spectral values of the logarithmized spectrogram are mapped to perception-related spectral values depending on their respective value and the spectral component to which they belong. Here, functions are used that represent the curves of equal volume as a sonic pressure depending on spectral components or depending on the frequency, respectively, and which are associated with different volumes, and according to this embodiment according to the second aspect, the time/spectral representation or the spectrogram, respectively, of an interesting audio signal is scaled using the curves of equal volume reflected by human volume perception in order to determine the melody of the audio signal on the basis of the resulting perception-related time/spectral representation. In more detail, according to this embodiment, the spectrogram of the audio signal is first logarithmized so that the logarithmized spectral values indicate the sonic pressure level. Subsequently, the logarithmized spectral values of the logarithmized spectrogram are mapped to perception-related spectral values depending on their respective value and on the spectral component to which they belong. In doing so, functions are used that represent the curves of equal volume as a sonic pressure depending on spectral components or depending on the frequency, respectively, and are associated with different volumes. The perception-related spectrum is again delogarithmized in order to generate a time/sound spectrum from the result by forming sums of delogarithmized perception-related spectral values per frame for predetermined spectral components. These sums include the delogarithmized perception-related spectral value at the respective spectral component and the delogarithmized perception-related spectral value at the spectral components that form an overtone for the respective spectral component. The thus obtained time/sound spectrum represents a version of the time/spectral representation which is derived from the same.
In the following, preferred embodiments of the present invention are explained in more detail with reference to the accompanying drawings, in which:
a and b show schematical illustrations of a section from the amplitude course of the spectrogram of an audio signal along a segment for illustrating the functioning of the tone separation according to
With reference to the following description of the figures it is noted, that there the present invention is explained merely exemplary with regard to a special case of application, i.e. the generation of a polyphonic ring melody from an audio signal. It is explicitly noted at this point, however, that the present invention is of course not restricted to this case of application, but that an inventive melody extraction or automatic transcription, respectively, may also find use somewhere else, like e.g. for facilitating the search in a database, the mere recognition of pieces of music, enabling the maintaining of the copyright by an objective comparison of pieces of music or the like, or, however, for a mere transcription of audio signals, in order to be able to indicate the transcription result to a musician.
The device of
As the setup of the device 300 of
The extraction means 304 is implemented to subject the audio signal received at the input 302 to a note extraction or recognition, respectively, in order to obtain a note sequence from the audio signal. In the present embodiment, the note sequence 318 passing on the extraction means 304 to the rhythm means 306 is present in a form in which for every note n a note initial time, tn, which for example indicates the tone or note beginning, respectively, in seconds, a tone or note duration, respectively, τn, indicating the note duration of the note for example in seconds, a quantized note or tone pitch, i.e. C, F sharp or the like, for example as an MIDI note, a volume Ln of the note and an exact frequency fn of the tone or the note, respectively, contained in the note sequence, wherein n is to represent an index for the respective note in the note sequence increasing with the order of subsequent notes or indicating the position of the respective notes in the note sequence, respectively.
The melody recognition or audio transcription, respectively, performed by means 304 for generating the note sequence 318, is later explained in more detail with reference to
The note sequence 318 still represents the melody as it was illustrated by the audio signal 302. The note sequence 318 is then supplied to the rhythm means 306. The rhythm means 306 is implemented in order to analyze the supplied note sequence in order to determine a length of time, an upbeat, i.e. a time raster, for the note sequence and thus adapt the individual notes of the note sequence to suitable time-quantified lengths, like e.g. whole notes, half notes, crotchets, quavers etc. for the certain time and to adjust the note beginnings of the notes to the time raster. The note sequence output by the rhythm means 306 thus represents a rhythmically rendered note sequence 324.
At the rhythmically rendered note sequence 324 the key means 308 performs a key determination and if applicable a key correction. In particular, means 308 determines based on the note sequence 324 a main key or a key, respectively, of the user melody represented by the note sequence 324 or the audio signal 302, respectively, including the mode, i.e. major or minor, of the piece which was for example sung. After that, the same recognizes further tones or notes, respectively, in the note sequence 114 not contained in the scale and corrects the same in order to come to a harmonically sounding final result, i.e. a rhythmically rendered and key-corrected note sequence 700 which is passed on to the harmony means 310 and represents a key-corrected form of the melody requested by the user.
The functioning of means 324 with regard to the key determination may be implemented in different ways. The key determination may for example be performed in the way described in the article Krumhansl, Carol L.: Cognitive Foundations of Musical Pitch, Oxford University Press, 1990, or in the article Temperley, David: The cognition of basical musical structures, The MIT Press, 2001.
The harmony means 310 is implemented to receive the note sequence 700 from means 308 and to find a suitable accompaniment for the melody represented by this note sequence 700. For this purpose, means 310 acts or operates bar-wise, respectively. In particular, means 310 is operable at every bar as it is determined by the time raster determined by the rhythm means 306, such that it creates a statistic about the tones or tone pitches, respectively, of the notes Tn occurring in the respective time. The statistics of the occurring tones is then compared to the possible chords of the scale of the main key, as it was determined by the key means 308. Means 310 selects in particular that chord among the possible chords whose tones match best with the tones in the respective time, as it is indicated by the statistics. This way, means 310 determines the one chord for every time which best suits the tones or notes, respectively, in the respective time, which were for example sung. In other words, means 310 associates chord stages of the basic key to the times found by means 306, depending on the mode, so that a chord progression forms over the course of the melody. At the output of means 310, apart from the rhythmically rendered and key-corrected note sequence including NL, the same further outputs a chord stage indication for every time to the synthesis means 312.
The synthesis means 312 uses, for performing the synthesis, i.e. for an artificial generation of the finally resulting polyphonic melody, style information, which may be input by a user, as is indicated by case 702. For example, by the style information, a user may select from four different styles or musical directions, respectively, in which this polyphonic melody may be generated, i.e. pop, techno, latin or reggae. For each of these styles either one or several accompaniment patterns are deposited in the synthesis means 312. For generating the accompaniment the synthesis means 312 now uses the accompaniment pattern(s) indicated by the style information 702. For generating the accompaniment the synthesis means 312 strings together the accompaniment patterns per bar. If the chord for a time determined by means 310 is the chord version, in which an accompaniment pattern is already present, then the synthesis means 312 simply selects the corresponding accompaniment pattern for the current style for this time for the accompaniment. If, however, for a certain time the chord determined by means 310 is not the one in which an accompaniment pattern is deposited in means 312, then the synthesis means 312 shifts the notes of the accompaniment pattern by the corresponding number of semitones or changes the third and changes the sixth and fifth by a semitone in case of another mode, i.e. by shifting upwards by a semitone in case of a major chord and the other way in case of a minor chord.
Further, the synthesis means 312 instruments the melody represented by the note sequence 700 passed on from the harmony means 310 to the synthesis means 312 in order to obtain a main melody and finally combines accompaniment and main melody to a polyphonic melody which it outputs presently exemplarily in the form of an MIDI file at the output 304.
The key means 308 is further implemented to save the note sequence 700 in the melody storage 314 under a provisioning identification number. If the user is not satisfied with the result of the polyphonic melody at the output 304, he may input the provisioning identification number together with new style information again into the device of
In the following, with reference to
After step 752, means 304 determines a weighted amplitude spectrum or a perception-related spectrogram, respectively, in a step 754. The exact proceeding for determining the perception-related spectrogram is explained in more detail in the following with reference to
The processing 756 following step 754 among other uses the perception-related spectrogram obtained from step 754 in order to finally obtain the melody of the output signal in the form of a melody line organized in note segments, i.e. in a form in which groups of subsequent frames have respectively the same associated tone pitch, wherein these groups are spaced from each other in time over one or several frames, do not overlap and therefore correspond to note segments of a monophonic melody.
In
The following substeps 760 and 762 are provided in order to segment the continuous melody line to thus result in individual notes. In
The result of the processing 756 is processed in step 764 in order to generate a sequence of notes from the melody line segments, wherein to each note an initial note point of time, a note duration, a quantized tone pitch, an exact tone pitch, etc., is associated.
After the functioning of the extraction means 304 of
In the first two steps 750 and 752
The frequency analysis 752 may then for example be performed using a warped filter bank and an FFT (fast Fourier transformation). In particular, in the frequency analysis 752 the sequence of audio values is first of all windowed with a window length of 512 samples, wherein a hop size of 128 samples is used, i.e. the windowing is repeated every 128 samples. Together with the sample rate of 16 kHz and the quantizing resolution of 16 bits, those parameters represent a good compromise between time and frequency resolution. With these exemplary settings, one time section or one frame, respectively, corresponds to a time period of 8 milliseconds.
The warped filter bank is used according to a special embodiment for the frequency range up to approximately 1,550 Hz. This is required in order to obtain a sufficiently good resolution for deep frequencies. For a good semitone resolution sufficient frequency bands should be available. With a lambda value from −0.85 at a 16 kHz sample rate on a frequency of 100 Hz approximately two to four frequency bands correspond to one semitone. For low frequencies, each frequency band may be associated with one semitone. For the frequency range up to 8 kHz then the FFT is used. The frequency resolution of the FFT is sufficient for a good semitone representation from about 1,550 Hz. Here, approximately two to six frequency bands correspond to a semitone.
In the implementation described above as an example, the transient performance of the warped filter bank is to be noted. Preferably, due to this a temporal synchronization is performed in the combination of the two transformations. The first 16 frames of the filter bank output are for example discarded, just like the last 16 frames of the output spectrum FFT are not considered. In a suitable interpretation the amplitude level is identical at filter bank and FFT and need not be adjusted.
According to a special embodiment, the amplitudes in the spectrum of
For correcting the faulty amplitudes the effect of crosstalk may be used. At maximum two neighboring frequency bands in each direction are effected by these faults. According to one embodiment, for this reason in the spectrogram of
The above embodiment for a signal analysis from a combination of a warped filter bank and an FFT enables an audition-adapted frequency resolution and the presence of a sufficient number of frequency bins per semitone. For more details regarding the implementation reference is made to the dissertation of Claas Derboven with the title “Implementierung und Untersuchung eines Verfahrens zur Erkennung von Klangobjekten aus polyphonen Audiosignalen”, developed at the Technical University of Ilmenau in 2003, and to the dissertation of Olaf Schleusing with the title “Untersuchung von Frequenzbereichstransformationen zur Metadatenextraktion aus Audiosignalen”, developed at the Technical University of Ilmenau in 2002.
As mentioned above, the analysis result of the frequency analysis 752 is a matrix or a field, respectively, of spectral values. These spectral values represent the volume by the amplitude. The human volume perception has, however, a logarithmic division. It is therefore sensible to adjust the amplitude spectrum to this division. This is performed in a logarithmizing 770 following step 752. In the logarithmizing 770 all spectral values are logarithmized to the level of the sonic pressure level, which corresponds to the logarithmic volume perception of man. In particular, in the logarithmizing 770 to the spectral value p in the spectrogram, as it is obtained from the frequency analysis 752, p is mapped to a sonic pressure level value or a logarithmized spectral value L by
wherein p0 here indicates the sonic reference pressure, i.e. the volume level that has the smallest perceptible sonic pressure at 1,000 Hz.
Within the logarithmizing 770 this reference value has to be determined first. While in the analog signal analysis as a reference value the smallest perceptible sonic pressure p0 is used, this regularity may not easily be transferred to the digital signal processing. For determining the reference value, according to one embodiment, for this purpose a sample audio signal is used, as it is illustrated in
In
The volume evaluation of humans is frequency-depending. Thus, the logarithmized spectrum, as it results from the logarithmizing 770, is to be evaluated in a subsequent step 772 in order to obtain an adjustment to this frequency-depending evaluation of man. For this purpose, curves of equal volume 774 are used. The evaluation 772 is required in particular in order to adjust the different amplitude evaluation of musical sounds across the frequency scale to human perception, as according to human perception the amplitude values of lower frequencies have a lower evaluation than amplitudes of higher frequencies.
For the curves 774 of equal volume, presently as an example the curve characteristic from DIN 45630 page 2, Deutsches Institut für Normung e.V., Grundlagen der Schallmessung, Normalkurven gleicher Lautstärke, 1967, was used. The graph course is shown in
Preferably, the curves of equal volume 774 are present in an analytical form in means 204, wherein it would also be possible, of course, to provide a look-up table that associates a volume level value to every pair of frequency bin and sonic pressure level quantization value. For the volume curve with the lowest volume level, for example the formula
may be used. Between this curve shape and the audibility threshold according to German industrial standard, however, deviations are present in the low- and high-frequency value range. For adjusting, the functional parameters of the idle audibility threshold may be changed according to the above equation in order to correspond to the shape of the lowest volume curve of the above-mentioned German industrial standard of
Based on the curves 774 of the same volume, means 304 in step 772 maps every logarithmized spectral value, i.e. every value in the array of
The result of this proceeding for the case of the logarithmized spectrogram of
The above-described steps 770-774 represent possible substeps of step 754 from
The method of
In step 776 now for all possible keynote frequencies the intensities in the spectrogram of the audio signal are added at the respective keynote and its overtones. In doing so, however, a weighting of the individual intensity values is performed, as due to several simultaneously occurring sounds in a piece of music there is the possibility that the keynote of a sound is masked by an overtone of another sound having a lower-frequency keynote. In addition, also overtones of a sound may be masked by overtones of another sound.
In order to determine the tones of a sound belonging together anyway, in step 776 a tone model is used based on the principle of the model of Mosataka Goto and adjusted to the spectral resolution of the frequency analysis 752, wherein the tone model of Goto in Goto, M.: A Robust Predominant-F0 Estimation Method for Real-time Detection of Melody and Bass Lines, in CD Recordings, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, Turkey, 2000, is described.
Based on the possible basic frequency of a sound, by the harmonic raster 778 for each frequency band or frequency bin, respectively, the overtone frequencies belonging to it are associated. According to a preferred embodiment, overtones for basic frequencies are searched in only one particular frequency bin range, like e.g. from 80 Hz-4,100 Hz, and harmonics are only considered to the 15th order. In doing so, the overtones of different sounds may be associated with the tone model of several basic frequencies. By this effect, the amplitude ratio of a searched sound may be changed substantially. In order to weaken this effect, the amplitudes of the partial tones are evaluated with a halved Gaussian filter. The basic tone here receives the highest valency. Any following partial tones receive a lower weighting according to their order, wherein the weighting for example decreases in a Gauss-shape with an increasing order. Thus, an overtone amplitude of another sound masking the actual overtone has no special effect on the overall result of a searched voice. As the frequency resolution of the spectrum for higher frequencies decreases, not for every overtone of a higher order a bin with the corresponding frequency exists. Due to the crosstalk to the adjacent bins of the frequency environment of the searched overtone, using a Gaussian filter the amplitude of the searched overtone may be reproduced relatively well across the closest frequency bands. Overtone frequencies or the intensities at the same, respectively, do therefore not have to be determined in units of frequency bins, but also an interpolation may be used in order to exactly determine the intensity value at the overtone frequency.
The summation across the intensity values is, however, not performed directly at the perception-related spectrum of step 772. Rather, initially in step 776 the perception-related spectrum of
Next, in a step 780, a preliminary determination of a potential melody line is performed. The melody line corresponds to a function over time, i.e. to a function that associates exactly one frequency band or one frequency bin, respectively, to each frame. In other words, the melody line determined in step 780 defines a trace along the definition range of the sound spectrogram or the matrix, respectively, of step 776, wherein the trace along the frequency axis never overlaps or is ambiguous, respectively.
The determination is performed in step 780 such that for each frame over the complete frequency range of the sound spectrogram the maximum amplitude is determined, i.e. the highest summation value. The result, i.e. the melody line, mainly corresponds to the basic course of the melody of the music title underlying the audio signal 302.
The evaluation of the spectrogram with the curves of equal volume in step 772 and the search for the sonic result with the maximum intensity in step 780 support the statement of musical science that the main melody is the portion of a music title that man perceives the loudest and the most concise.
The above described steps 776 to 780 present possible substeps of step 758 of
In the potential melody line of step 780 segments are located that do not belong to the melody. In melody pauses or between melody notes, dominant segments, like e.g. from the bass course or other accompaniment instruments may be found. These melody pauses have to be removed by the later steps in
After the determination of the potential melody line in step 780, in a step 782 first of all a general segmentation 782 is performed which cares for parts of the potential melody line to be removed that may not belong to the actual melody line prima facie. In
At the melody line of
The general segmentation 782 starts in a step 786 with the filtering of the melody line 784 in the frequency/time range of an illustration in which the melody line 784, as shown in
The step 786 is now provided to remove minor outliers or artifacts, respectively, in the melody line.
In step 786 for this reason from the pixel array of
In the filtering in step 786 first of all—as already mentioned—for every pixel 790 the binary value of the same and the binary value of the neighboring pixels is summed. This is for example illustrated as an example in
This second pixel image is then subjected to a mapping pixel-by-pixel, wherein in the pixel image all sum values of 0 or 1 are mapped to zero and all sum values larger than or equal to 2 are mapped to 1. The result of this mapping is illustrated in
The result of the multiplication for the section of
For illustrating the effect of the filtering 786,
After step 786, within the scope of the general segmentation 782 a step 796 follows, in which parts of the melody line 784 are removed by the fact that those parts of the melody line are neglected which are not located within a predetermined frequency range. In other words, in step 796 the value range of the melody line function of step 780 is restricted to the predetermined frequency range. Again in other words, in step 796 all pixels of the melody matrix of
For illustrating step 796, in
After step 796, in a step 804 a removal of sections of the melody line 802 having an amplitude that is too small is performed, wherein the extraction means 304 hereby goes back to the logarithmic spectrum of
After step 804 in a step 806 an elimination of those sections of the remaining melody line follows at which the course of the melody line changes erratically in the frequency direction in order to only shortly show a more or less continuous melody course. In order to explain this, reference is made to
The melody line, as it resulted from step 804, is exemplarily shown in
For performing the steps 806, means 304 now scans the melody line frame-by-frame for example from front to back. In doing so, means 304 checks for each frame whether between this frame and the following frame a frequency jump larger than the semitone distance HT takes place. If this is the case, means 302 marks those frames. In
After step 806, the processing within the scope of the general segmentation 782 proceeds to step 810, where means 304 divides the remaining residuals of the former potential melody line of step 780 into a sequence of segments. In the division into segments, all elements in the melody matrix are united to one segment or one trajectory, respectively, which are directly adjacent. In order to illustrate this,
The section from the melody line 812 shown in
The segments found this way are numbered so that a sequence of segments results.
The result of the general segmentation 782 is consequently a sequence of melody segments, wherein each melody segment covers a sequence of directly neighboring frames. Within each segment, the melody line jumps from frame to frame by at most a predetermined number of frequency bins, in the preceding embodiment by at most one frequency bin.
After the general segmentation 782, means 304 continues with the melody extraction in step 816. The step 816 serves for closing the gap between neighboring segments in order to address the case that due to for example percussive events in the melody line determination in step 780 inadvertently other sound portions were recognized and filtered out in the general segmentation 782. The gap-closing 816 is explained in more detail with reference to
As the gap-closing 816 again uses the semitone vector, in the following at first with reference to
The gap-closing is based on this division of the frequency axis f into semitone areas, as it is explained in the following with reference to
In the present exemplary case with the above indicated preferred sample frequencies, etc., p is preferably 4. In the present case, the gap 832 is therefore not smaller than four frames, whereupon the processing proceeds with step 834 in order to check whether the gap 832 is equal to or less than q frames, wherein q is preferably 15. This is presently the case, which is why the processing proceeds with step 836, where it is checked whether the segment ends of the reference segment 812a and the follower segment 812b which are facing each other, i.e. the end of the segment 812a and the beginning of the follower segment 812b, are located in one single or in adjacent semitone areas. In
For this case of the positive examination in step 836, the processing within the scope of gap-closing proceeds with step 840, where it is checked which amplitude difference in the perception-related spectrum of step 772 is present at the positions of the end of the reference segment 812a and the beginning of the follower segment 812b. In other words, means 304 looks up the respective perception-related spectral values at the positions of the end of the segment 812a and the beginning of the segments 812b in step 840, in the perception-related spectrum of step 772, and determines the absolute value of the difference of the two spectral values. Further, means 304 determines in step 840, whether the difference is greater than a predetermined threshold value r, wherein the same is preferably 20-40% and more preferably 30% of the perception-related spectral value at the end of the reference segment 812a.
If the determination in step 840 provides a positive result, the gap-closing proceeds with step 842. There, means 304 determines a gap-closing line 844 in the melody matrix directly combining the end of the reference segment 812a and the beginning of the follower segment 812b. The gap-closing line is preferably straight, as it is also shown in
Along this connecting line, means 304 then determines the corresponding perception-related spectral values from the perception-related spectrum of step 772, by looking up at the respective tupels of frequency bin and frame of the gap-closing line 844 in the perception-related spectrum. Via these perception-related spectral values along the gap-closing line, means 304 determines the average value and compares the same to the corresponding average values of the perception-related spectral values along the reference element 812a and the follower segment 812b within the scope of step 842. If both result in comparisons, that the average value for the gap-closing line is greater than or equal to the average value of the reference or follower segments 812a or 812b, respectively, then the gap 832 is closed in a step 846, i.e. by entering the gap-closing line 844 in the melody matrix or setting the corresponding matrix elements of the same to 1, respectively. At the same time, in step 846 the list of segments is changed in order to unite the segments 812a and 812b to one common segment, whereupon the gap closing for the reference segment and the follower segment is completed.
A gap closing along the gap-closing line 844 also results when it results in step 830 that the gap 832 is less than 4 frames long. In this case, in a step 848 the gap 832 is closed, i.e. like in the case of step 846 along a direct and preferably straight gap-closing line 844 connecting the facing ends of the segments 812a-812b, whereupon the gap closing for both segments is completed and proceeds with the following segment, if present. Although this is not shown in
If one of the steps 834, 836, 840 or 842 leads to a negative examination result, the gap closing for the reference segment 812a is completed and is again performed for the follower segment 812b.
The result of the gap closing 816 is therefore possibly a shortened list of segments or a melody line, respectively, comprising gap-closing lines in some places in the melody matrix, if applicable. As it resulted from the preceding discussion, in a gap smaller than 4 frames a connection between neighboring segments in the same or the adjacent semitone area is always provided.
A harmony mapping 850 follows upon the gap closing 816 which is provided to remove errors in the melody line which resulted through the fact that in the determination of the potential melody line 780 by mistake the wrong tonic or keynote of a sound was determined. In particular, the harmony mapping 850 operates segment by segment, in order to shift individual segments of the melody line resulting after the gap closing 816 by an octave, a fifth or a major third, as it is described in more detail in the following. As the following description will show, the conditions for this are strict in order not shift a segment erroneously in the frequency by mistake. The harmony mapping 850 is described in more detail in the following with reference to
As already mentioned, the harmony mapping 850 is performed in segments.
The segment 852b located between the segments 852a and 852c seems to be cut out of the melody line course, as it would result through the segments 852a and 852c. In particular, in the present case the segment 852b exemplarily connects to the reference element 852a without a frame gap, as it is indicated by a dashed line 854. In the same way, exemplarily the time area covered by the segment 852 should directly abut on the time area covered by the segment 852c, as it is indicated by a dashed line 856.
In
As it may be seen from
The harmony mapping 850 begins with the determination of a melody centre line using a average value filter in a step 860. In particular, step 860 includes the calculation of a sliding average value of the melody course 852 with a certain number of frames across the segments in the direction of time t, wherein the window length is for example 80-120 and preferably 100 frames with the frame length of 8 ms mentioned above as an example, i.e. correspondingly different number of frames with another frame length. In more detail, for the determination of the melody center line, a window of the length of 100 frames is shifted along the time axis t in frames. In doing so, all frequency bins associated with frames within the filter window by the melody line 852 are averaged, and this average value for the frame is entered into the middle of the filter window, whereby after a repetition for subsequent frames in the case of
In a subsequent step 864 means 304 checks whether the reference segment 852a directly abuts on the following segment 852b along the time axis t. If this is not the case, the processing is performed again (866) using the following segment as the reference segment.
In the present case of
In a step 870, means 304 then looks up in the spectrum evaluated with curves of equal volume or the perception-related spectrum of step 772, respectively, in order to obtain the respective minimum perception-related spectral value along the reference segment 852a and the line of the octave, fifth and/or third 858a-d. In the exemplary case of
These minimum values are used in the subsequent step 872 in order to select one or none of the shifting lines of the octave, fifth and/or third 858a-d that depends on whether the minimum value determined for the respective lines of octave, fifth and/or third comprises a predetermined relation to the minimum value of the reference segment. In particular, an octave line 858b is selected from lines 858a-d, if the minimum value is smaller than the minimum value for the reference segment 852a by at most 30%. A line of the fifth 858d is selected if the minimum value determined for the same is at most 2.5% smaller than the minimum value of the reference segment 852a. One of the lines of the third 858c is used if the corresponding minimum value for this line is at least 10% greater than the minimum value for the reference segment 852a.
The above-mentioned values which were used as criterions for selecting from lines 858a-858b may of course be varied, although the same provided very good results for pieces of pop music. In addition, it is not necessarily required to determine the minimum values for the reference segment or the individual lines 858a-d, respectively, but for example also the individual average values may be used.
The advantage of the difference of the criterions for the individual lines is that by this a probability may be considered that in the melody line determination 780 erroneously a jump of the octave, fifth or third has occurred, or that such a hop was in fact desired in the melody, respectively.
In a subsequent step 874, means 304 shifts the segment 852b to the selected line 858a-858d, as far as one such line was selected in step 872, provided that the shifting points into the direction of the melody center line 862, that is from the point of view of the follower segment 852b. In the exemplary case of
After the harmony mapping 850, in a step 876 a vibrato recognition and a vibrato balance or equalization takes place whose functioning is explained in more detail with reference to
The step 876 is performed in segments for each segment 878 in the melody line, as it results after the harmony mapping 850. In
In a following step 884 it is checked whether the extremes 882 are arranged such that in the time direction neighboring local extremes 882 are arranged at frequency bins comprising a frequency separation larger than or smaller than or equal to a predetermined number of bins, i.e. for example 15 to 25 but preferably 22 bins in the implementation of the frequency analysis described with reference to
In a subsequent step 888 means 304 examines whether between the neighboring extremes 882 the time distance is always smaller than or equal to a predetermined number of time frames, wherein the predetermined number is for example 21.
If the examination in step 888 is positive, as it is the case in the example of
In other words, the vibrato recognition and the vibrato balance according to
After the vibrato recognition in step 876, in step 898 a statistical correction is performed which also considers the observation that in a melody short and extreme tone pitch fluctuations are not to be expected. The statistical correction according to 898 is explained in more detail with reference to
After that, a second window not shown in
After step 898 a semitone mapping 912 follows. The semitone mapping is performed frame-by-frame, wherein for this the semitone vector of step 818 is used defining the semitone frequencies. The semitone mapping 912 functions such that for each frame at which the melody line which resulted from step 898 is presented, it is examined, in which one of the semitone areas the frequency bin is present, in which one the melody line passes the respective frame or to which frequency bin the melody line function maps the respective frame, respectively. The melody line is then changed such that in the respective frame the melody line is changed to the frequency value corresponding to the semitone frequency of the semitone arrange in which the frequency bin was present which the melody line passed.
Instead of the semitone mapping or quantization frame-by-frame, respectively, also a semitone quantization segment-by-segment may be performed, for example by the fact that only the frequency average value per segment is associated with one of the semitone areas and thus to the corresponding semitone area frequency in the above-described way, which is then used over the whole time length of the corresponding segment as the frequency.
The steps 782, 816, 818, 850, 876, 898 and 912 consequently correspond to step 760 in
After the semitone mapping 912 an onset recognition and correction that takes place for every segment is performed in step 914. The same is explained in more detail with reference to
It is the aim of the onset recognition and correction 914 to correct or specify, respectively, the individual segments of the melody line resulting by the semitone mapping 912 in more detail with regard to their initial points of time, wherein the segments correspond more and more to the individual notes of the searched melody. To this end, again use is made of the incoming audio signal 302 or the one provided in step 750, respectively, as it is described in more detail in the following.
In a step 916, first of all the audio signal 302 is filtered with a band pass filter corresponding to the semitone frequency to which the respective reference segment in step 912 was quantized, or with a band pass filter, respectively, comprising cut-off frequencies between which the quantized semitone frequency of the respective segment is present. Preferably, the band pass filter is used as one that comprises cut-off frequencies corresponding to the semitone cut-off frequencies fu and f0 of the semitone area in which the considered segment is located. Again preferably, as the band pass filter an IIR band pass filter is used with the cut-off frequencies fu and f0 associated with the respective semitone area as filter cut-off frequencies or a Butterworth band pass filter whose transmission function is shown in
Subsequently, in a step 918 a two-way rectification of the audio signal filtered in step 916 is performed, whereupon in a step 920 the time signal obtained in step 918 is interpolated and the interpolated time signal is folded with a hamming filter, whereby an envelope of the two-way rectified or the filtered audio signal, respectively, is determined.
The steps 916-920 are illustrated again with reference to
The steps 916-920 only represent a possibility for generating the envelope 924 and may of course be varied. Anyway, envelopes 924 for the audio signal are generated for all those semitone frequencies or semitone areas, respectively, in which segments or note segments, respectively, of the current melody line are arranged. For each such envelope 924 then the following steps of
First of all, in a step 926 potential initial points of time are determined, that is as the locations of the local maximum increase of the envelope 924. In other words, inflection points in the envelope 924 are determined in step 926. The points of time of the inflection points are illustrated with vertical lines 928 in the case of
For the following evaluation of the determined potential initial points of time or potential slopes, respectively, a downsampling to the time resolution of the preprocessing is performed, if applicable within the scope of step 926, not shown in
In a step 928 it is examined now, whether it holds true for a potential initial point of time that the same lies before the segment beginning of the segment corresponding to the same. If this is the case, the processing proceeds with step 930. Otherwise, i.e. when the potential initial point of time is behind the existing segment beginning, step 928 is repeated for a next potential initial point of time or step 926 for a next envelope which was determined for another semitone area, or the onset recognition and correction performed segment-by-segment is performed for a next segment.
In step 930 it is checked whether the potential initial point of time is more than x frames before the beginning of the corresponding segment, wherein x is for example between 8 and including 12 and preferably 10 with a frame length of 8 ms, wherein the values for other frame lengths would have to be changed correspondingly. If this is not the case, i.e. if the potential initial point of time or the determined initial point of time, respectively, is up to 10 frames before the interesting segment, in a step 932 the gap between the potential initial point of time and the previous segment beginning is closed or the previous segment beginning is corrected to the potential initial point of time, respectively. To this end, if applicable, the previous segment is correspondingly shortened or its segment end is changed to the frame before the potential initial point of time, respectively. In other words, step 932 includes an elongation of the reference segment in forward direction up to the potential initial point of time and a possible shortening of the length of the previous segment at the end of the same in order to prevent an overlapping of the two segments.
If, however, the examination in step 930 indicates that the potential initial point of time is closer than x frames in front of the beginning of the corresponding segments, then it is checked in a step 934 whether the step 934 is run for the first time for this potential initial point of time. If this is not the case, the processing ends here for this potential initial point of time and the corresponding segment and the processing of the onset recognition proceeds with step 928 for a further potential initial point of time or with step 926 for a further envelope.
Otherwise, however, in a step 936 the previous segment beginning of the interesting segment is virtually shifted forward. To this end, the perception-related spectral values which are located at the virtually shifted initial points of time of the segment are looked up in the perception-related spectrum. If the decrease of these perception-related spectral values in the perception-related spectrum exceeds a certain value, then the frame at which this exceeding took place is temporarily used as a segment beginning of the reference segment and step 930 is again repeated. If then the potential initial point of time is not more than x frames in front of the beginning determined in step 936 of the corresponding segment anymore, the gap in step 932 is also closed, as it was described above.
The effect of the onset recognition and correction 914 consequently consists in the fact that individual segments are changed in the current melody line with regard to their temporal extension, i.e. elongated to the front or shortened at the back, respectively.
After step 914 then a length segmentation 938 follows. In the length segmentation 938, all segments of the melody line which now occur as horizontal lines in the melody matrix due to the semitone mapping 912 which lie on the semitone frequencies, are scanned through, and those segments are removed from the melody line which are smaller than a predetermined length. For example, segments are removed which are less than 10-14 frames and preferably 12 frames and less long—again assuming as above a frame length of 8 ms or a corresponding adjustment of the numbers of frames. 12 frames at a time resolution or frame length, respectively, of 8 milliseconds correspond to 96 milliseconds, which is less than about a 1/64 note.
The steps 914 and 938 consequently correspond to step 762 of
The melody line held in step 938 then consists of a slightly reduced number of segments which comprise exactly the same semitone frequency across a certain number of subsequent frames. These segments may uniquely be associated to note segments. This melody line is then, in a step 940 which corresponds to the above-described step 764 of
The midi output 914 through means 304 then results in the note sequence, based on which the rhythm means 306 performs the operations described above.
The preceding description with regard to
Up to step 782 the proceeding according to
In contrast to the proceeding according to
As it may be seen, the amplitude of the keynote of the reference segment 952 obtained in the general segmentation 782 is continuously above the exemplary value. Only the above arranged overtones show an interruption about in the middle of the segment. The continuity of the keynote caused that the segment did not break down into two notes in the general segmentation 782, although probably about in the middle of the segment 952 a note boundary or interface exists. Errors of this kind predominantly only occur with monophonic music, which is why the tone separation is only performed in the case of
In the following now the tone separation 950 is explained in more detail with reference to
In a following step 962, thereupon in the amplitude course with the greatest dynamic those locations are identified as potential separation locations at which a local amplitude minimum falls under a predetermined threshold value. This is illustrated in
In a step 968 then among the possibly several separation locations the ones are sorted out that lie in a boundary area 970 around the segment beginning 972 or within a boundary area 974 around the segment end 976. For the remaining potential separation locations, in a step 978 the difference between the amplitude minimum at the minimum 964 and the average value of the amplitudes of the local maxima 980 or 982, respectively, neighboring the minimum 964, is formed in the amplitude course 960. The difference is illustrated in
In a subsequent step 986 it is checked whether the difference 984 is larger than a predetermined threshold value. If this is not the case, the tone separation for this potential separation location and if applicable for the regarded segment 960 ends. Otherwise, in a step 988 the reference segment is separated into two segments at the potential separation location or the minimum 964, respectively, wherein the one extends from the segment beginning 972 to the frame of the minimum 964 and the other extends between the frame of the minimum 964 or the subsequent frame, respectively, and the segment end 976. The list of segments is correspondingly extended. A different possibility of separation 988 is to provide a gap between the two newly generated segments. For example in the area, in which the amplitude course 960 is below the threshold value—in
A further problem which mainly occurs with monophonic music is that the individual notes are subject to frequency fluctuations that make a subsequent segmentation more difficult. Because of this, after the tone separation 950 in step 992 a tone smoothing is performed which is explained in more detail with reference to
The purpose of the tone smoothing is to select the one among the frequency bins between which the segment 994 fluctuates which is to be constantly associated to the segment 994 for all frames.
The tone smoothing begins in a step 996 with the initialization of a counter variable i to 1. In a subsequent step 998 a counter value z is initialized to 1. This counter variable i has the meaning of the numbering of the frames of the segment 994 from left to right in
In a step 1000 now the counter value z is accumulated to a sum for the frequency bin of the i-th frame of the segment. For each frequency bin in which the segment 994 fluctuates to and fro, a sum or an accumulation value exists, respectively. The counter value may here be weighted according to a varying embodiment, like e.g. with a factor f(i), wherein f(i) is a function continuously increasing with i, in order to thus weight the portions to be summed up at the end of a segment more strongly, as the voice is already better assimilated to the tone, for example, compared to the transient process and the beginning of a note. Below the horizontal time axis in square brackets in
In a step 1002 it is examined whether the i-th frame is the last frame of the segment 994. If this is not the case, then in a step 1004 the counter variable i is incremented, i.e. a skip to the next frame is performed. In a subsequent step 1006 it is examined whether the segment 994 in the current frame, i.e. the i-th frame, is located in the same frequency bin, as where it was located in the (i−1)-th frame. If this is the case, in a step 1008 the counter variable z is imcremented, whereupon the processing again continues at step 1000. If the segment 994 in the i-th frame and the (i−1)-th frame is not in the same frequency bin, however, the processing continues with the initialization of the counter variable z to 1 in step 998.
If it is finally determined in step 1002 that the i-th frame is the last frame of the segment 994, then for each frequency bin in which the segment 994 is located a sum results, illustrated in
In a step 1012 upon the determination of the last frame in step 1002 the one frequency bin is selected for which the accumulated sum 1010 is largest. In the exemplary case of
In other words, the tone smoothing consequently serves for compensating the start of singing and launch of singing of tones starting from lower or higher frequencies and facilitates this by determining a value across the temporal course of a tone which corresponds to the frequency of the steady-state tone. For the determination of the frequency value from the oscillating signal all elements of a frequency band are counted up, whereupon all counted-up elements of a frequency band located at the note sequence are added up. Then, the tone is plotted in the frequency band with the highest sum over the time of the note sequence.
After the tone smoothing 992, subsequently a statistical correction 916 is performed, wherein the performance of the statistical correction corresponds to that of
The steps 950, 992, 1016, 1018 and 1020 consequently correspond to step 760 of
After the semitone mapping 1018 an onset recognition 1022 follows which basically corresponds to the one of
After the onset recognition 1022 an offset recognition and correction 1024 follows which is explained in more detail with reference to
In a step 1026 similar to step 916, first of all the audio signal is filtered with a band pass filter corresponding to the semitone frequency of the reference segment, whereupon in a step 1028 corresponding to step 918 the filtered audio signal is two-way rectified. Further, in step 1028 again an interpretation of the rectified time signal is performed. This proceeding is sufficient for the case of offset recognition and correction in order to approximately determine an envelope, whereby the complicated step 920 of the onset recognition may be omitted.
In a step 1034 now, in the time section 1036 corresponding to a reference segment a maximum of the interpolated time signal 1030 is determined, i.e. in particular the value of the interpolated time signal 1030 at the maximum 1040. In a step 1042, thereupon a potential note end point of time is determined as the point of time at which the rectified audio signal has fallen in time after the maximum 1040 to a predetermined percentage of the value at the maximum 1040, wherein the percentage in step 1042 is preferably 15%. The potential note end is illustrated in
In a subsequent step 1046 it is then examined whether the potential note end 1044 is temporally after the segment end 1048. If this is not the case, as it is exemplarily shown in
If the examination in step 1050 is negative, however, no offset correction takes place and the step 1034 and the following steps are repeated for another reference segment of the same semitone frequency, or it is proceeded with step 1026 for other semitone frequencies.
After the offset recognition 1024, in step 1052 a length segmentation 1052 corresponding to the step 938 of
With reference to the preceding description of
Basically it would of course also be possible to omit the steps 770-774 or only the steps 772 and 774, which will, however, lead to a deterioration of the melody line determination in step 780 and therefore to a deterioration of the overall result of the melody extraction method.
In the basic frequency determination 776 a tone model of Goto was used. Other tone models or other weightings of the overtone portions, respectively, would also be possible, however, and could for example be adjusted to the origin or the source of the audio signal, respectively, as far as the same is known, like e.g. when in the embodiment of the ring tone generation the user is determined to hum.
With regard to the determination of the potential melody line in step 780 it is noted that according to the abovementioned statement of musical science for each frame only the basic frequency of the loudest sound portion was selected, that it is further possible, however, to restrict the selection not only to a unique selection of the largest proportion for each frame. Just like it is for example the case in Paiva, the determination of the potential melody line 780 might comprise the association of several frequency bins to one single frame. Subsequently, a finding of several trajectories may be performed. This means the allowance of a selection of several basic frequencies or several sounds for each frame. The subsequent segmentation would then of course partially have to be performed differently and in particular the subsequent segmentation would be somewhat more expensive, as several trajectories or segments, respectively, would have to be considered and found. Conversely, in this case some of the above-mentioned steps or substeps could be taken over in the segmentation also for this case of the determination of trajectories which may overlap in time. In particular steps 786, 796 and 804 of the general segmentation may easily be also be transferred to this case. The step 806 could be transferred to the case that the melody line consists of trajectories overlapping in time, if this step took place after the identification of the trajectories. The identification of trajectories could take place similar to step 810, wherein, however, modifications would have to be performed such that also several trajectories overlapping in time may be traced. Also the gap-closing could be performed in a similar way for such trajectories between which no time gap exists. Also the harmony mapping could be performed between two trajectories directly subsequent in time. The vibrato recognition or the vibrato compensation, respectively, could easily be applied to one single trajectory just like to the above-mentioned non-overlapping melody line segments. Also the onset recognition and correction could also be applied with trajectories. The same holds true for the tone separation and the tone smoothing as well as for the offset recognition and correction and for the statistical correction and the length segmentation. The admission of the temporal overlapping of trajectories of the melody line in the determination in step 780 at least required, however, that before the actual note sequence output the temporal overlapping of trajectories has to be removed at some time. The advantage of the determination of the potential melody line in the above-described way with reference to
The above-described implementation of the general segmentation does not have to comprise all substeps 786, 796, 804 and 806, but may also include a selection from the same.
In gap closing, in steps 840 and 842 the perception-related spectrum was used. Basically it is possible, however, to also use the logarithmized spectrum or the spectrogram directly obtained from the frequency analysis in these steps, wherein, however, the use of the perception-related spectrum in these steps resulted in the best result with regard to melody extraction. Similar things hold true for step 870 of harmony mapping.
With regard to the harmony mapping it is noted that it might be provided there, when shifting 868 the follower segment, to perform the shifting only in the direction of the melody center line, so the second condition in step 874 may be omitted. With reference to step 872 it is noted that a non-ambiguity in the selection of the different lines of the octave, fifth and/or third may be achieved by the fact that among the same a priority ranking list is generated, like e.g. octave line before line of fifth before line of third, and among lines of the same line type (line of octave, fifth or third), the one which is closer to the original position of the follower segment.
With regard to the onset recognition and the offset recognition it is noted that the determination of the envelope or the interpolated time signal used instead in offset recognition, respectively, might also be performed differently. It is only essential, that in the onset and offset recognition the audio signal is used that is filtered with a band pass filter with a transmission characteristic around the respective semitone frequency in order to recognize the initial point of time of the note from the increase of the envelope of the thus formed filtered signal or the end point of time of the note using the decrease of the envelope.
With regard to the flow charts among
In particular it is noted, that depending on the conditions, the inventive scheme may also be implemented in software. The implementation may be performed on a digital storage medium, in particular a floppy disc or a CD with electronically readable control signals which may cooperate with a programmable computer system such that the corresponding method is performed. In general, the invention thus also consists in a computer program product having a program code stored on a machine readable carrier for performing the inventive method, when the computer program product runs on a computer. In other words, the invention may thus be realized as a computer program with a program code for performing the method when the computer program runs on a computer.
While this invention has been described in terms of several preferred embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
10 2004 049 457.6 | Oct 2004 | DE | national |