The present invention relates generally to audio or speech processing and, in particular, to segmenting a humming signal into musical notes.
Multimedia content has become extremely popular over recent years. The popularity of such multimedia content is mainly due to the convenience of transferring and storing such content. This convenience is made possible by the wide availability of audio formats, such as the MP3 format, which are very compact, and an increase of media bandwidth to the home, such as broadband Internet. Also, the emergence of 3G wireless devices assists in the convenient distribution of multimedia content.
With such a large amount of multimedia content being available to users, an increasing need exists for an effective searching mechanism for multimedia content. One possible way of searching is “retrieval by humming”, whereby a user searches for a desired musical piece by humming the melody of that desired musical piece to a system. The system in response then outputs to the user information about the musical piece associated with the hummed melody.
Humming is defined herein as singing a melody of a song without expressing the actual words or lyrics of that song.
Besides multimedia retrieval purposes, transcribing of melodies that are in acoustic waveforms, such as a humming signal, into written representation, for example musical notes, is very useful as well. Songwriters can compose tunes without a need for instruments, or students can practice by humming on their own.
As a result, effective processing of humming signals into musical notes is desirable. The musical notes should contain information such as the pitch, the start time and the duration of the respective notes.
In order to effectively process such a humming signal, two distinct steps are required. The first step is the segmentation of the acoustic wave representing the humming signal into notes, thereby determining the start time and duration of each note, and the second step is the detection of the pitch of each segment (or note). The segmentation of the acoustic wave is not as straightforward as it may appear, as there is difficulty in defining the boundary of each note in an acoustic wave. Also, there is considerable controversy over exactly what pitch is.
In the case where the note is made up from a single frequency the frequency of the note is also the pitch. However, a musical note, especially when produced by a human vocal system, is made up from more than one frequency. Accordingly, pitch generally refers to the fundamental frequency of a note.
In most prior art, it is assumed that each note will have a peak in amplitude/power or will be separated by a reasonable amount of silence, and these aspects are used for the segmentation of the acoustic signal. In reality the segmentation of the acoustic signal is considerably more complex.
For example, as is described in U.S. Pat. No. 5,874,686 issued on Feb. 23, 1999, after the peak energy levels of the signal are isolated and tracked, autocorrelation is performed on the signal around those peaks to detect the pitch of each note. In order to improve the performance, speed and robustness of the pitch-tracking algorithm, a cubic-spline wavelet transform (or other suitable wavelet transform) is used.
U.S. Pat. No. 5,038,658 issued on Aug. 13, 1991 discloses segmentation based on both power and pitch information. The final note boundaries are determined without being influenced by fluctuations in acoustic signals or abrupt intrusions of outside sounds.
In the method disclosed in International publication No. WO2004034375, the humming signal is subjected to a process of segmentation based on amplitude gradient that comprises the steps of subjecting the signal to a process of envelope detection, followed by a process of differentiation to calculate a gradient function. This gradient function is then used to determine the note boundaries.
Segmentation may also be done by differentiating the characteristics between the onset/offset (unvoiced) and steady state (voiced) portions of the note. A known technique for performing voiced/unvoiced discrimination, taken from the field of speech recognition, relies on the estimation of the Root Mean Square (RMS) power and the Zero Crossing Rate.
Yet another method used for segmenting an acoustic signal is by first grouping a data sample stream of the acoustic signal into frames, with each frame including a predetermined number of data samples. It is usual for the frames to have some degree of overlap of samples. A spectral transformation, such as the Fast Fourier Transform (FFT), is performed on each frame, and a fundamental frequency obtained. This creates a frequency distribution over the frames. Segmentation is then performed by tracking clusters of similar frequencies. Energy or power information is often also used for analysing the signal to identify repeated or glissando notes within each group of frames having a similar frequency distribution.
The prior art methods described above lead to inaccuracies in the segmentation of humming signals, and inaccuracy in the segmentation directly leads to poor results in overall transcription of the humming signal into musical notes.
Tracking frequency changes alone cannot accurately segment notes because, in practice, fast repeating or glissando notes will exist within the humming signal. As a result, pauses in-between these notes cannot be identified easily. Furthermore, a person creating the humming signal is generally unable to maintain a pitch. This results in pitch changes within a single note, which may in turn be misinterpreted as a note change.
The use of an energy or power distribution to segment the humming signal into notes, whether the distribution results from average energy over frames or amplitude/power over samples, also has associated difficulties. For example, the difference in energy level between the high-energy and low-energy notes is often large. Accordingly, using a global threshold to threshold the energy distribution is not possible. An adaptive threshold is required, which in turn requires significant processing time because the value of the adaptive threshold is difficult to calculate. This is particularly true for acoustic signals derived from a male as there is generally no specific pattern in the change in the energy or power information. Hummed songs have fluctuations in relation to the pattern of change. In addition, the sound to be transcribed also often contains abrupt sounds, such as outside noises. In these circumstances, a simple segmentation of sound based on change in the power information would not necessarily lead to a good segmentation of individual sounds.
Furthermore, if the person humming does not pause adequately when humming a string of the same notes, the transcription system might interpret the string of the same notes as a single note. The task also becomes increasingly difficult in the presence of expressive variations and the physical limitation of the human vocal system.
It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.
According to a first aspect of the present invention there is provided a method for segmenting a data sample stream of a humming signal into musical notes, said method comprising the steps of:
grouping said data sample stream into frames of data samples;
processing each frame of data samples to derive a frequency distribution for each of said frames;
processing said frequency distributions of said frames to derive a Harmonic Product Energy (HPE) distribution;
segmenting said HPE distribution to obtain boundaries of musical notes.
According to another aspect of the present invention, there is provided an apparatus for implementing the aforementioned method.
According to yet another aspect of the present invention there is provided a computer program product including a computer readable medium having recorded thereon a computer program for implementing the method described above.
Other aspects of the invention are also disclosed.
One or more embodiments of the present invention will now be described with reference to the drawings, in which:
Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.
Overview
For reasons explained in the “Background” section, the use of an energy or power distribution to segment a humming signal into musical notes leads to inaccuracies in the segmentation. Therefore, a parameter other than energy or power is required which provides a distribution over time that takes a specific pattern in relation to the onset and offset of a note, regardless of different melodies or persons humming. One such possible parameter is the timbre of the humming signal. Timbre is mainly determined by the harmonic content of the humming signal, and the dynamic characteristics of the signal, such as vibrato and the attack-decay envelope of the sound.
The inventors have observed that as a humming signal transits from an intended note to another, its timbre changes at the boundary. This is true even for fast repeating or glissando notes. Since the perception of timbre results from the human ear detecting harmonics, the inventors have realised that extracting information about harmonics for use during segmentation would be useful. The manner in which this is done is described in detail below.
The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer from the computer readable medium, and then executed by the computer. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer preferably effects an advantageous apparatus for transcribing the data sample stream 101 of a humming signal into musical notes.
Computer Implementation
The computer system 200 is formed by a computer module 201, input devices such as a keyboard 202, a mouse 203 and a microphone 216, and output devices including a display device 214. The computer module 201 typically includes at least one processor unit 205, and a memory unit 206, for example formed from semiconductor random access memory (RAM) and read only memory (ROM). The module 201 also includes a number of input/output (I/O) interfaces including a video interface 207 that couples to the video display 214, an I/O interface 213 for the keyboard 202 and mouse 203, and an audio interface 208 for the microphone 216.
A storage device 209 is provided and typically includes a hard disk drive 210 and a floppy disk drive 211. A CD-ROM drive 212 is typically provided as a non-volatile source of data. The components 205 to 213 of the computer module 201, typically communicate via an interconnected bus 204 and in a manner which results in a conventional mode of operation of the computer system 200 known to those in the relevant art.
Typically, the application program is resident on the hard disk drive 210 and read and controlled in its execution by the processor 205. In some instances, the application program may be supplied to the user encoded on a CD-ROM or floppy disk and read via the corresponding drive 212 or 211, or alternatively may be read by the user from a network (not illustrated) via a modem device (not illustrated). Still further, the software can also be loaded into the computer system 200 from other computer readable media. The term “computer readable medium” as used herein refers to any storage or transmission medium that participates in providing instructions and/or data to the computer system 200 for execution and/or processing.
The method 100 of transcribing the data sample stream 101 of a humming signal into musical notes may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions thereof.
Methodology
The data sample stream 101 of the humming signal may be formed by the audio interface 208 on receipt of a humming sound through the microphone 216. Alternatively the humming signal may previously have been converted and stored as the data sample stream 101, which is then directly retrievable from the storage device 209 or the CD-ROM 212.
Referring again to
Referring to
In sub-step 320 that follows, the data samples of each frame are spectrally transformed, for example using the Fast Fourier Transform (FFT), to obtain a frequency spectral representation of the data samples of each frame. The spectral representation is expressed using the decibel (dB) scale which, because of its logarithmic nature, shows spectral peaks within the spectral representation more clearly. Step 105 terminates after sub-step 320.
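The framing and spectral transformation described above can be sketched as follows. This is a minimal illustration only, not the implementation of the embodiment: the function names, the frame length, the hop size and the dB floor are assumptions, and a naive discrete Fourier transform stands in for the FFT so that the example needs no external library.

```python
import cmath
import math

def split_into_frames(samples, frame_len, hop):
    """Group samples into frames; hop < frame_len gives overlapping frames.

    frame_len and hop are illustrative parameters, not values from the text.
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def db_spectrum(frame, floor_db=-120.0):
    """Magnitude spectrum of one frame in dB for bins 0..N/2 (naive DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mag = abs(acc)
        # Clamp very small magnitudes to an assumed floor to avoid log(0).
        level = 20.0 * math.log10(mag) if mag > 0.0 else floor_db
        spectrum.append(max(level, floor_db))
    return spectrum

# Example: a pure tone that completes 4 cycles within a 32-sample frame.
tone = [math.sin(2.0 * math.pi * 4 * t / 32) for t in range(32)]
spectrum = db_spectrum(tone)
peak_bin = spectrum.index(max(spectrum))
```

On the logarithmic scale the single tone stands far above the clamped floor of the remaining bins, which is the property the noise and tonal-peak tests of step 110 rely on.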
Referring again to
Step 110 starts by analysing each frame y in order to determine whether that frame y contains noise, and hence may be termed a noise frame. A noise frame is defined here as a frame y that contains no tonal components. Accordingly, as shown in
In sub-step 402 the processor 205 then determines whether the average frame energy Eav of that frame y is less than a predetermined threshold T0. If it is determined that the average frame energy Eav is not less than the threshold T0, then step 110 proceeds to sub-step 403 where the number n of frequency samples in frame y having a magnitude that exceeds a threshold T1 is determined. The threshold T1 is set as a predetermined ratio of the maximum magnitude within the spectral representation of the frame y, with the predetermined ratio preferably being set as 32.5 dB. In sub-step 404 the processor 205 then determines whether the number n is greater than a predetermined threshold T2.
If it is determined in sub-step 404 that the number n is not greater than the threshold T2, then the frame y is considered not to be a noise frame and step 110 proceeds to find the tonal components in that frame y.
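The noise-frame test of sub-steps 402 to 404 may be illustrated by the following sketch. It assumes that the average frame energy Eav has already been computed and is passed in, and it interprets the threshold T1 as lying 32.5 dB below the maximum magnitude of the frame; the function name and the values used for T0 and T2 are hypothetical.

```python
def is_noise_frame(spectrum_db, e_av, t0, t2, ratio_db=32.5):
    """Return True if the frame is treated as a noise frame.

    spectrum_db: spectral magnitudes of the frame in dB.
    e_av: average frame energy (assumed to be computed elsewhere).
    t0, t2: thresholds T0 and T2 (values are assumptions of this sketch).
    ratio_db: T1 is assumed to sit ratio_db below the spectral maximum.
    """
    if e_av < t0:                       # sub-step 402: too little energy
        return True
    t1 = max(spectrum_db) - ratio_db    # threshold T1 relative to the maximum
    n = sum(1 for x in spectrum_db if x > t1)   # sub-step 403
    return n > t2                       # sub-step 404: too many strong bins

peaked = [0.0, 0.0, 60.0, 0.0, 0.0]     # one dominant component
flat = [50.0] * 5                       # energy spread over all bins
```

A frame with one dominant component has few bins above T1 and passes the test, whereas a flat spectrum has many bins above T1 and is classed as noise.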
Accordingly, step 110 continues to sub-step 407 where all the local maxima with magnitude greater than the threshold T1 within the spectral representation are located. A frequency component b constitutes a local maximum if it has magnitude X(b) that is greater than that of its immediately left neighbour frequency component b−1 and that is not less than that of its immediately right neighbour frequency component b+1, hence:
X(b)>X(b−1) AND X(b)>=X(b+1) (1)
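Condition (1), combined with the requirement that the magnitude exceed the threshold T1, can be written as a short sketch. The function name is hypothetical, and the first and last frequency components are skipped because they lack one neighbour, which is an assumption of this example.

```python
def local_maxima(spectrum_db, t1):
    """Indices b with X(b) > T1, X(b) > X(b-1) and X(b) >= X(b+1)."""
    return [b for b in range(1, len(spectrum_db) - 1)
            if spectrum_db[b] > t1
            and spectrum_db[b] > spectrum_db[b - 1]
            and spectrum_db[b] >= spectrum_db[b + 1]]

example = [0, 5, 5, 1, 7, 2]
maxima = local_maxima(example, t1=0)
```

Because the left comparison is strict and the right comparison is not, a plateau such as the two equal values in the example yields only its leftmost component as a local maximum.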
Next, in sub-step 408, the local maxima are further processed in order to locate all the tonal peaks from the local maxima. A local maximum has to meet a set of criteria before being designated as a tonal peak. Firstly, the energy X(k) of a local maximum k has to exceed the energy of both the 2nd left neighbour frequency component and the 2nd right neighbour frequency component by at least S1 dB. Secondly, the energy X(k) has to exceed the energy of both the 3rd left neighbour frequency component and the 3rd right neighbour frequency component by at least S2 dB, and so on until the 6th left and 6th right neighbour frequency components are considered. Hence:
X(k)−X(k−2)>=S1 AND X(k)−X(k+2)>=S1
X(k)−X(k−3)>=S2 AND X(k)−X(k+3)>=S2
X(k)−X(k−4)>=S3 AND X(k)−X(k+4)>=S3
X(k)−X(k−5)>=S4 AND X(k)−X(k+5)>=S4
X(k)−X(k−6)>=S5 AND X(k)−X(k+6)>=S5 (2)
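The criteria of condition (2) might be checked as in the sketch below. The values chosen for S1 to S5 are placeholders, since none are given here, and the rule that a local maximum lying within 6 components of either end of the spectrum is not tested is an assumption made to keep the example simple.

```python
def tonal_peaks(spectrum_db, maxima, s):
    """Keep the local maxima that satisfy condition (2).

    s: sequence (S1, S2, S3, S4, S5) in dB, applied to the neighbours at
       offsets 2, 3, 4, 5 and 6 respectively (values are assumptions).
    """
    peaks = []
    for k in maxima:
        # Assumption: skip maxima whose 6th neighbour would be out of range.
        if k - 6 < 0 or k + 6 >= len(spectrum_db):
            continue
        if all(spectrum_db[k] - spectrum_db[k - d] >= s_d
               and spectrum_db[k] - spectrum_db[k + d] >= s_d
               for d, s_d in zip(range(2, 7), s)):
            peaks.append(k)
    return peaks

narrow = [0] * 6 + [10, 40, 10] + [0] * 6    # sharp peak at index 7
broad = [0] * 5 + [35, 38, 40, 38, 35] + [0] * 5   # wide bump at index 7
thresholds = (20, 20, 20, 20, 20)            # hypothetical S1..S5
```

The sharp peak is retained, while the wide bump fails the first criterion because its 2nd neighbours are only 5 dB below the maximum.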
After all the tonal peaks are located in sub-step 408, harmonically related tonal peaks are grouped together in sub-step 409. Sub-step 410 then calculates a Harmonic Product Energy (HPE) h(f) of each group by adding the energies X(b) (in dB) of all the harmonics in each group as follows:
h(fm)=X(fm)+X(a·fm)+X(b·fm)+ . . . (3)
where fm is the fundamental frequency corresponding to the harmonic group m, X(f) is the energy, in dB, associated with a frequency f in the spectrum, m is the number of harmonic groups in the frame, a is the ratio of the frequency of the second tonal peak (if it exists) of the harmonic group to the fundamental frequency of that harmonic group, b is the ratio of the frequency of the third tonal peak (if it exists) of the harmonic group to the fundamental frequency of that harmonic group, etc. It is noted that ‘addition’ in the logarithmic scale is equivalent to ‘multiplication’ in the non-logarithmic scale.
The group with the largest HPE h(f) is chosen as the dominant harmonic group for the frame y under consideration. Accordingly, in sub-step 411, the HPE H(y) attributed to frame y is then the HPE of the dominant harmonic group as follows:
H(y)=max{h(f1), h(f2), . . . , h(fm)} (4)
A fundamental frequency F(y) of that frame y is set in sub-step 412 to the fundamental frequency of the dominant harmonic group.
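Sub-steps 410 to 412 can be sketched as below. The grouping of harmonically related tonal peaks (sub-step 409) is assumed to have been done already and is supplied as a mapping from the frequency bin of each fundamental to the bins of all tonal peaks in its group; that data layout, the function name and the use of bin indices in place of frequencies in Hz are assumptions of this example.

```python
def dominant_harmonic_group(spectrum_db, groups):
    """Return (H, F): the HPE and fundamental bin of the dominant group.

    groups: dict mapping the fundamental bin of each harmonic group to the
            list of bins of its tonal peaks, fundamental included.
    An empty mapping yields (0.0, 0), mirroring the values given to frames
    without tonal components (an assumption of this sketch).
    """
    if not groups:
        return 0.0, 0
    # Adding dB values corresponds to multiplying the linear energies.
    hpe = {f0: sum(spectrum_db[b] for b in bins)
           for f0, bins in groups.items()}
    f_dominant = max(hpe, key=hpe.get)
    return hpe[f_dominant], f_dominant

frame_db = [0, 0, 30, 0, 25, 0, 20, 0, 0, 10]
frame_groups = {2: [2, 4, 6], 9: [9]}    # two hypothetical harmonic groups
h_y, f_y = dominant_harmonic_group(frame_db, frame_groups)
```

Here the group with fundamental bin 2 and harmonics at bins 4 and 6 sums to 75 dB and is selected over the isolated peak at bin 9.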
Referring again to sub-steps 402 and 404, if it is determined in sub-step 402 that the average frame energy Eav is less than the threshold T0, or in sub-step 404 that the number n is greater than the threshold T2, then the signal in that frame y is considered to have no tonal components and the frame is regarded as a noise frame. In that case step 110 continues to sub-step 405 where the fundamental frequency F(y) of that frame y is set to 0. Also, the HPE H(y) of that frame y is set to 0.
From sub-step 405 or sub-step 412 the control within step 110 then passes to sub-step 416 where it is determined whether the frame y just processed was the last frame in the data stream. In the case where more frames remain for processing, then control within step 110 returns to sub-step 401 from where the next frame is processed. Alternatively step 110 terminates.
The output from step 110 is thus the HPE H(y) for each frame y and the fundamental frequency F(y) of that frame y. An HPE distribution and a fundamental frequency distribution over the frames are thus produced.
In other words, for each frame in the data sample stream all the harmonics corresponding to a fundamental frequency, if such harmonics exist, are multiplied together to form a HPE distribution over the frames. The HPE distribution not only contains information about timbre of the humming signal, but also contains information about the average magnitude of the fundamental frequency of the dominant harmonic group at each frame instant. Furthermore, the HPE distribution excludes the energy of components that are not relevant to the fundamental frequency at each frame instant, such as is the case with noise. As a result, the HPE distribution shows the boundaries of notes much more clearly than just an average energy or amplitude distribution.
Referring again to the method 100 shown in
Long pauses in the humming signal will typically be represented as noise frames. In step 110 noise frames have been allocated an HPE H(y) value of 0. On the other hand, a distinct pause is typically shown in the HPE distribution as a large dip when compared with the HPE H(y) of the 2 notes separated by the dip. Accordingly, the notes that are separated by either a long pause, or a distinct pause, are segmented in step 115 by performing a simple global threshold filtering on the HPE distribution.
In sub-step 603 the frames y at which the HPE distribution crosses the threshold T4 from below are labelled as being an ‘onset’ of blocks. Similarly, the frames y at which the HPE distribution crosses the threshold T4 from above are labelled as being an ‘offset’ of blocks.
Sub-step 604 then uses the onset and offset frames to obtain the boundary frames of all blocks in the HPE distribution before step 115 terminates.
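The global threshold filtering of step 115 may be sketched as follows. The threshold T4 is taken as a given parameter, and the conventions that a block starts at the first frame above T4, ends at the last frame above T4, and is closed at the final frame if the distribution never falls back below T4 are assumptions of this example.

```python
def find_blocks(hpe, t4):
    """Return (first_frame, last_frame) pairs of blocks where H(y) > T4."""
    blocks = []
    start = None
    for y, value in enumerate(hpe):
        above = value > t4
        if above and start is None:
            start = y                        # onset: crossing from below
        elif not above and start is not None:
            blocks.append((start, y - 1))    # offset: crossing from above
            start = None
    if start is not None:                    # distribution ends above T4
        blocks.append((start, len(hpe) - 1))
    return blocks

hpe_distribution = [0, 0, 50, 60, 55, 0, 0, 70, 65, 0]
blocks = find_blocks(hpe_distribution, t4=20)
```

Each resulting block may still contain several notes separated by short pauses, which is why the blocks are examined further in step 120.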
In practice, few persons humming will deliberately pause for a long time in-between every note. This is especially true when a fast tempo melody is intended. Fast repeating and glissando notes are very common, with the pause in-between fast repeating and glissando notes typically being very short in time and often not detectable in an average energy distribution. However, in the HPE distribution, such short pauses are reflected as clear minima. Typically, these clear minima have a very steep gradient compared to the peaks on either side of those minima. Accordingly, step 120 operates by scanning through the HPE distribution of each block in order to locate short pauses, which are characterised by minima having steep gradients.
Sub-step 702 then determines whether any local minima exist in the block. In the case where local minima exist in the block, step 120 continues by processing each local minimum in turn. Step 120 continues in sub-step 704 where the minimum distance V of the local minimum from either the left boundary BL or the right boundary BR of the block is determined. The left boundary BL is defined as either the starting frame of the block, or the end frame of a previous segmented note within the block. The right boundary BR is defined as the end frame of the block.
In sub-step 706 it is then determined whether the minimum distance V is less than 4 frames. If it is determined that the minimum distance V is less than 4 frames then the local minimum is rejected as being associated with a short pause in sub-step 707. In other words, sub-step 706 sets the minimum number of frames of any note to be 3 frames. If the minimum distance V is 3 frames or less, then the number of frames bounded between the local minimum and the boundary would then be 2 or less.
If it is determined in sub-step 706 that the minimum distance V is greater than or equal to 4 frames then, in sub-step 708, a nearest left local maximum ML and a nearest right local maximum MR to the local minimum under consideration are located. A frame is designated as being a local maximum if the value of its HPE H(y) is greater than that of its preceding frame (y−1), and greater than or equal to that of its succeeding frame (y+1). In searching for the local maxima on either side of the local minimum, the search excludes the frames directly next to the local minimum as it is not desired for the local maxima to be too close to a local minimum corresponding to a short pause.
If it is determined in sub-step 709 that the distance of the nearest left local maximum ML from the local minimum is less than 3 frames then, in sub-step 710, a second nearest left local maximum to the local minimum is located, and used as the left local maximum ML instead. It is then determined in sub-step 711 whether the distance of the second left local maximum ML from the local minimum is less than 4 frames.
If it is determined that the distance of the second left local maximum ML from the local minimum is less than 4 frames, then the local minimum is rejected as being associated with a short pause in sub-step 715. This is because a local minimum that has too many local maxima within a short distance away from it is very often caused by unstable humming or by noise, rather than being a pause itself.
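The search of sub-step 708 can be illustrated with the sketch below, which operates on the HPE values of a single block. The function names are hypothetical, and returning None when no local maximum is found on a side is an assumption of this example.

```python
def is_local_max(hpe, y):
    """H(y) > H(y-1) and H(y) >= H(y+1), for frames with both neighbours."""
    return 0 < y < len(hpe) - 1 and hpe[y] > hpe[y - 1] and hpe[y] >= hpe[y + 1]

def nearest_maxima(hpe, y_min):
    """Nearest left and right local maxima around the local minimum y_min.

    The frames directly next to y_min are excluded from the search.
    Returns (left, right); either may be None if no maximum is found.
    """
    left = next((y for y in range(y_min - 2, 0, -1)
                 if is_local_max(hpe, y)), None)
    right = next((y for y in range(y_min + 2, len(hpe) - 1)
                  if is_local_max(hpe, y)), None)
    return left, right

block_hpe = [10, 40, 60, 50, 20, 45, 55, 65, 30]   # local minimum at frame 4
ml, mr = nearest_maxima(block_hpe, 4)
```

In the example the left maximum lies 2 frames from the minimum and the right maximum 3 frames from it; a maximum in a frame directly adjacent to the minimum would not be returned.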
Alternatively, if it is determined in sub-step 709 that the distance of the nearest left local maximum from the local minimum is at least 3 frames, or in sub-step 711 that the distance of the second left local maximum from the local minimum is at least 4 frames, then step 120 continues in sub-step 712 where a HPE ratio RL between the left local maximum ML and the local minimum, as well as a HPE ratio RR between the right local maximum MR and the local minimum, are calculated. Since the HPE values are all in the dB scale, the ratios RL and RR are calculated through logarithmic subtraction.
It is then determined in sub-step 713 whether the ratios RL and RR are both smaller than thresholds E11 and E12 respectively. It is observed that the ratio RR is usually larger in value than the ratio RL. Again, this may be explained by the fact that the person humming often hums notes using syllables, such as “da” or “ta”. As a result, the threshold E12 used to test the ratio RR is set to a value slightly larger than the threshold E11 used for the ratio RL.
If it is determined that both the ratios RL and RR are smaller than thresholds E11 and E12 respectively then, in sub-step 714 the local minimum is accepted as being associated with a short pause. Alternatively, the local minimum is rejected as being associated with a short pause in sub-step 715.
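The ratio test of sub-steps 712 to 714 might look like the sketch below. The direction of the subtraction is not spelled out above, so this example assumes that each ratio is formed as the HPE of the local minimum minus the HPE of the local maximum, both in dB, which makes a deep dip strongly negative and therefore smaller than the threshold. The function name and the threshold values are likewise hypothetical.

```python
def is_short_pause(h_min, h_left_max, h_right_max, e11, e12):
    """Accept a local minimum as a short pause if both ratios are small.

    All HPE values are in dB, so each ratio is a subtraction.
    Assumption: ratio = H(minimum) - H(maximum), negative for a dip.
    e11, e12: thresholds E11 and E12 for RL and RR (values assumed).
    """
    r_left = h_min - h_left_max      # RL
    r_right = h_min - h_right_max    # RR
    return r_left < e11 and r_right < e12

deep_dip = is_short_pause(20.0, 60.0, 65.0, e11=-10.0, e12=-8.0)
shallow_dip = is_short_pause(55.0, 60.0, 65.0, e11=-10.0, e12=-8.0)
```

Under this convention a dip of 40 dB or more on both sides is accepted, whereas a dip of only 5 dB on the left is rejected as ordinary fluctuation within a note.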
From any of sub-steps 707, 714 or 715 the processing in step 120 then continues to sub-step 705 where it is determined whether the local minimum just processed is the last local minimum within the block under consideration. In the case where more local minima remain for processing, then step 120 returns to sub-step 704 from where the next local minimum is processed to determine whether that local minimum is associated with a short pause.
If it is determined in sub-step 705 that all the local minima within the current block have been processed, or in sub-step 702 that the current block has no local minima, then processing continues in sub-step 703 where the boundaries of all notes in the block are obtained. In the cases where there were no local minima within the block, or where all the local minima were rejected as being associated with a short pause, the whole block represents a single note. In such cases sub-step 703 designates the boundaries of the block as that of the single note.
In the case where at least one local minimum that is associated with a short pause has been found, the first local minimum of the block constitutes the end of the first note in the block. The frame that comes after this local minimum is then the start of the second note in the block. The boundaries of all the notes in the block are obtained in a similar manner.
Step 120 then ends for the current block. If more blocks remain then step 120 is repeated in its entirety for all the remaining blocks. Hence, following step 120 the boundaries of all the notes in the humming signal are obtained.
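The derivation of note boundaries in sub-step 703 can be sketched as follows, assuming the frames of the accepted local minima are already known. The function name and the representation of a note as a pair of first and last frame are assumptions of this example.

```python
def split_block(block_start, block_end, pause_frames):
    """Split one block into notes at the accepted short-pause minima.

    Each accepted minimum is the last frame of a note, and the frame after
    it is the first frame of the next note. With no accepted minima the
    whole block is returned as a single note.
    """
    notes = []
    start = block_start
    for p in sorted(pause_frames):
        notes.append((start, p))
        start = p + 1
    notes.append((start, block_end))
    return notes

notes = split_block(10, 30, [17, 24])
```

Applying this to every block found in step 115 gives the boundaries of all the notes in the humming signal.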
Referring again to
It is observed that the start and end of notes are most prone to octave errors. The start of each note being prone to octave error could be caused by overemphasis of an unvoiced section at the start of each note. Since it is impossible for the person humming to change pitch drastically within a 2-frame interval, sub-step 901 simply checks whether the first frame of the note has a fundamental frequency F(y) higher by a predetermined threshold than that of the second frame. In the preferred implementation the predetermined threshold used is 6 semitones. Similarly, sub-step 901 also determines whether the last frame of the note has a fundamental frequency F(y) higher by the same predetermined threshold than that of the second last frame. Sub-step 902 then removes the frames with octave errors from the note.
Next, in sub-step 904 it is determined whether the number of frames in the note is less than 5. If the number of frames in the note is greater than or equal to 5, then step 125 continues in sub-step 905 where a predetermined percentage of frames are discarded from each end of the sorted list. Preferably the predetermined percentage is set to be 20%. For example, if there are 10 frames in the note, the 2 frames that have the highest fundamental frequencies and the 2 frames that have the lowest fundamental frequencies are discarded. In the case where the number of frames in the note is less than 5, no frames are discarded since the number of frames left after such a discard will then be less than 3.
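The octave-error check of sub-steps 901 and 902 may be sketched as below. The interval between two frequencies is measured as 12·log2 of their ratio, which is the standard definition of a semitone distance in equal temperament; the function names, the treatment of a difference exactly equal to the threshold as an error, and the assumption that every frame of the note has a positive fundamental frequency are choices made for this example.

```python
import math

def semitones_above(f_a, f_b):
    """Interval from f_b up to f_a in semitones (positive if f_a > f_b)."""
    return 12.0 * math.log2(f_a / f_b)

def strip_octave_errors(f0, threshold=6.0):
    """Drop the first and/or last frame if it lies too far above its neighbour.

    f0: fundamental frequencies F(y) of the frames of one note, all > 0.
    threshold: permitted difference in semitones (6 in the preferred
               implementation described above).
    """
    frames = list(f0)
    if len(frames) >= 2 and semitones_above(frames[0], frames[1]) >= threshold:
        frames = frames[1:]       # octave error at the start of the note
    if len(frames) >= 2 and semitones_above(frames[-1], frames[-2]) >= threshold:
        frames = frames[:-1]      # octave error at the end of the note
    return frames

cleaned = strip_octave_errors([440.0, 220.0, 221.0, 219.0, 440.0])
```

In the example both end frames sit about an octave (12 semitones) above their neighbours and are removed, leaving the three stable frames.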
It is noted that sub-step 905 discards the frames having the highest and lowest fundamental frequencies, irrespective of where such frames are located. As explained above, the starts and ends of notes are typically unstable. Accordingly, it is typical that most of the discarded frames are located at the start or end of the note.
Sub-step 906 then calculates the average of the fundamental frequencies Fav of the frames remaining in the note. Finally, in sub-step 907, the final pitch of the note under consideration is given the value of the average fundamental frequency Fav.
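Sub-steps 904 to 907 amount to a trimmed mean of the fundamental frequencies of the note, which can be sketched as follows. Rounding the number of discarded frames down to a whole number, and the function name, are assumptions of this example; the note is assumed to contain at least one frame.

```python
def note_pitch(f0, percent=20):
    """Average F(y) after discarding the extremes of the sorted list.

    f0: fundamental frequencies of the frames remaining in the note.
    percent: share of frames discarded from EACH end when the note has at
             least 5 frames (rounded down, an assumption of this sketch).
    """
    ordered = sorted(f0)
    n = len(ordered)
    if n >= 5:
        k = n * percent // 100
        ordered = ordered[k:n - k]
    return sum(ordered) / len(ordered)

pitch = note_pitch([100, 200, 201, 199, 200, 200, 202, 198, 200, 400])
```

With 10 frames the two lowest values (100 and 198) and the two highest (202 and 400) are discarded, so the outliers at 100 and 400 do not influence the final pitch.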
As set out in detail above, the method 100 converts the data stream obtained from human humming into musical notes. The segmentation which uses the HPE is an important part of the method 100, as the use of the HPE allows the method 100 to go beyond prior art methods which use traditional segmentation methods that rely on amplitude or average energy. When amplitude or average energy is used, only pauses that are either long enough or have a substantial dip in energy can be detected. The method 100 thus allows a user to hum naturally without consciously trying to deliberately pause between notes, which may not be easy for some users with little musical background. The post-processing performed in step 125 also allows the system 200 to tolerate a user's failure to maintain a constant pitch within a single note. The increased accuracy and robustness in segmentation of notes achieved through method 100 hence brings about an increase in accuracy and robustness in overall transcription of a humming signal into musical notes.
The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/SG2005/000183 | 6/7/2005 | WO | 00 | 1/12/2009 |