This application is related to and claims the benefit of U.S. Patent Application No. 60/985,181 filed Nov. 2, 2007. The related application is incorporated by reference.
This application is related to the U.S. patent application Ser. No. 12/263,812 entitled “PITCH SELECTION MODULES IN A SYSTEM FOR AUTOMATIC TRANSCRIPTION OF SUNG OR HUMMED MELODIES,” by the same inventors, which is being filed simultaneously with this application.
This application is related to the U.S. patent application Ser. No. 12/263,827 entitled “VOICING DETECTION MODULES IN A SYSTEM FOR AUTOMATIC TRANSCRIPTION OF SUNG OR HUMMED MELODIES,” by the same inventors, which is being filed simultaneously with this application.
The technology disclosed relates to audio signal processing. It includes a series of modules that individually are useful to solve audio signal processing problems. Among the problems addressed are buzz removal, selecting a pitch candidate among pitch candidates based on local continuity of pitch and regional octave consistency, making small adjustments in pitch, ensuring that a selected pitch is consistent with harmonic peaks, determining whether a given frame or region of frames includes harmonic, voiced signal, extracting harmonics from voice signals, and detecting vibrato. One environment in which these modules are useful is transcribing singing or humming into a symbolic melody. Another environment that would usefully employ some of these modules is speech processing. Some of the modules, such as buzz removal, are useful in many other environments as well.
Pitch selection, given candidate pitches for a frame of sound, is widely recognized as useful. See, e.g., Kwon, Y. H., D. J. Park and B. C. Ihm. “Simplified Pitch Detection Algorithm of Mixed Speech Signals.” Proceedings of the 2000 IEEE International Symposium on Circuits and Systems (ISCAS 2000), Geneva, Switzerland, May 28-31, 2000; Marley, J. System and Method for Sound Recognition with Feature Selection Synchronized to Voice Pitch. U.S. Pat. No. 4,783,807, Nov. 8, 1988.
Classification of an audio signal into voiced, unvoiced and silence has been recognized as useful for a preliminary acoustic segmentation of speech and song. Qi, Y., and B. R. Hunt. “Voiced-Unvoiced-Silence Classifications of Speech using Hybrid Features and a Network Classifier.” IEEE Transactions on Speech and Audio Processing 1.2 (1993): 250-255. This segmentation is useful in both music transcription and speech recognition.
Vibrato is an ornamentation of vocal, string and other harmonic sound that varies in pitch (and sometimes in volume) about a selected pitch. Sundberg, J. “Acoustic and Psychoacoustic Aspects of Vocal Vibrato.” Speech, Music and Hearing: Quarterly Progress and Status Report 35.2-3 (1994): 45-67. Retrieved Nov. 1, 2008 <http://www.speech.kth.se/prod/publications/files/qpsr/1994/1994_35_2-3_045-068.pdf>. Vibrato suppression is also useful for acoustic segmentation and for pitch detection. Collins, N. “Using a Pitch Detector for Onset Detection.” 6th International Conference on Music Information Retrieval. London, Sep. 11-15, 2005. In many environments, flagging of frames that contain vibrato helps in pitch selection, for instance, to distinguish between a singer's ornamentation of pitch and alternation between pitches, such as trills.
Therefore, an opportunity arises to improve on the efficiency and accuracy of pitch selection, voicing detection and vibrato suppression or flagging. Improved audio processing components should lead to improved music transcription, voice recognition and acoustic processing.
The technology disclosed relates to audio signal processing. It includes a series of modules that individually are useful to solve audio signal processing problems. Among the problems addressed are buzz removal, selecting a pitch candidate among pitch candidates based on local continuity of pitch and regional octave consistency, making small adjustments in pitch, ensuring that a selected pitch is consistent with harmonic peaks, determining whether a given frame or region of frames includes harmonic, voiced signal, extracting harmonics from voice signals, and detecting vibrato. One environment in which these modules are useful is transcribing singing or humming into a symbolic melody. Another environment that would usefully employ some of these modules is speech processing. Some of the modules, such as buzz removal, are useful in many other environments as well.
Particular aspects of the present invention are described in the claims, specification and drawings.
The following detailed description is made with reference to the figures. Preferred embodiments are described to illustrate the present invention, not to limit its scope, which is defined by the claims. Those of ordinary skill in the art will recognize a variety of equivalent variations on the description that follows.
Overview
It is often useful to have a symbolic representation or transcription of a sung or hummed melody. We presently describe several components that may be used in such a transcription system.
These inputs can, alternatively, be derived using known techniques from raw audio input data, such as a digitally sampled audio signal.
To process these data into a symbolic representation of the sung or hummed melody, the system 210 may utilize some or all of the modules depicted in
The output from these modules may include:
There are also “internal” input and outputs which are provided from one module or sub-module to another, as described below. We will also reference “user”-defined parameters in our descriptions. In this context, the user refers to the person configuring the system.
The audio processing modules described can be implemented on a variety of hardware, as depicted in
Module to Identify and Remove Buzz
The debuzzing module includes two sub-modules: first means for identifying the frequencies and magnitudes of buzz components in a file, and second means for removing buzz from a data set given these frequencies and magnitudes.
Buzz Detection and Characterization Sub-Module
The means for detecting and characterizing buzz can use as input, or generate from more basic inputs, the following:
The module seeks to find narrowband noise that occurs consistently during the quieter parts of the file. Narrow band noise tends to characterize buzzing, and detecting it will facilitate its removal, which in turn aids in music transcription, speech processing or other audio processing.
First, the module determines a noise floor by analyzing a histogram over the values of energy_raw_c. This histogram is as follows:
The module analyzes the histogram, identifying two or more highest peaks. The peak corresponding to louder data is presumed to match the voiced speech, and the next peak corresponding to quieter data is presumed to correspond to silence or background noise. A threshold is set halfway between the quieter data peak and the valley between the two peaks. Data quieter than this value in SPL is assumed to correspond to background noise.
If two peaks are not found, the bottom decile of energy_raw_c frames is assumed to contain background noise. We refer to these quiet frames as low energy frames.
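The noise-floor logic above can be sketched in Python. This is a minimal illustration rather than the module itself; the bin count, smoothing kernel, and peak-picking rule are assumptions:

```python
import numpy as np

def low_energy_frames(energy_raw_c, n_bins=50):
    """Flag frames presumed to contain background noise.

    Builds a smoothed histogram of per-frame energy, finds the two
    tallest peaks (voiced speech vs. background), and sets a threshold
    halfway between the quieter peak and the valley between the two
    peaks. Falls back to the bottom decile if two peaks are not found.
    """
    energy = np.asarray(energy_raw_c, dtype=float)
    counts, edges = np.histogram(energy, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2.0

    kernel = np.array([1.0, 2.0, 1.0]) / 4.0   # light smoothing (assumed)
    smoothed = np.convolve(counts, kernel, mode="same")
    peaks = [i for i in range(1, n_bins - 1)
             if smoothed[i] > smoothed[i - 1] and smoothed[i] >= smoothed[i + 1]]

    if len(peaks) >= 2:
        # Two tallest peaks; the lower-energy one is presumed background.
        quiet_peak, loud_peak = sorted(sorted(peaks, key=lambda i: smoothed[i])[-2:])
        valley = quiet_peak + int(np.argmin(smoothed[quiet_peak:loud_peak + 1]))
        threshold = (centers[quiet_peak] + centers[valley]) / 2.0
    else:
        # Fallback: the bottom decile of frame energies.
        threshold = np.percentile(energy, 10)

    return energy <= threshold
```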
Next, the module works with the peak data from the low energy frames. It will analyze these data to see if certain peak frequencies occur much more often than would be expected from truly random noise. To do so, another histogram may be employed. As above, it uses a variable number of bins, based on the number of frequencies, and also employs smoothing. An iterative histogram peak-picking process follows as described:
At this stage, the buzz detection algorithm has a set of frequency ranges in which buzz is believed to occur. These ranges correspond to the set of lower and upper bounds on frequency surrounding the histogram peaks. Now, the goal is to determine how loud the buzz was in each of these frequency zones. The system now goes back through all of the peak data in the low energy frames, and finds all of the peaks in each such frequency zone. For each zone, the median value is calculated. The median value is chosen to limit the impact of outliers and scaling (dB, squared magnitude, or other).
Now, the system has obtained estimates of the likely bandwidth and loudness of each buzz component. It will use this data in the second and final part of debuzzing: buzz removal.
Buzz Removal Sub-Module
The means for buzz removal is to remove data from the spectrogram peaks and magnitudes whose frequency and magnitude, when compared to the buzz frequency and magnitude identified by the previous buzz module, appears to be buzz. It also includes safeguards to protect against overly aggressive buzz removal. This module may be applied to any set of spectrogram peak data, including the peaks corresponding to the harmonics of one pitch candidate in each frame.
The basic rule is that in order to be eliminated as buzz, a peak must fall within the frequency zone identified as buzz. Additionally, other rules apply. A given peak's magnitude must be no louder than 3 dB above the detected median buzz magnitude for that zone. Also, when peaks within a buzz zone occur over several consecutive frames, the module may perform the following signal trajectory test. A peak may not be part of a string of peaks in the buzz zone that over time has frequencies that continue in the same direction (midi up or midi down) for 5 or more consecutive frames. More generally, a predetermined signal trajectory threshold may be applied. This was done because quiet voiced speech will often continue in a single-direction trajectory for several frames, while buzz data only very rarely does so.
When buzz is eliminated, the magnitude of a peak is set to zero, though its frequency and, optionally, magnitude data may be preserved to create a record of those peaks that were declared buzz by this module.
Module to Select or Create One Pitch Per Frame
In pitch transcription systems, it is typical to generate one or more pitch estimates per frame or time segment. Although there may be multiple estimates per frame, generally only one may be considered perceptually “correct” by a well-trained human listener. Additionally, octave errors (choosing a pitch double or half the correct frequency) are common in transcription systems, due to the typically harmonic structure of voice and musical instruments. As a result, one or even all of the pitch estimates for a given frame may initially be incorrect. Therefore, an algorithm that can correctly choose between several pitch candidates in a frame and correctly specify the octave of the pitch (even if that pitch or octave was not initially present among the estimates) is fundamentally useful.
The present pitch selection module 230 uses information across multiple frames to choose pitches and octaves from an initial set of frame-level estimates and supporting data. The module first groups similar “agreeing” estimates across neighboring frames into “streaks.” Once the streaks are complete, the module goes streak by streak and chooses an octave for the pitches contained.
The present module may use as input, or create from more basic inputs, the following:
The output will include a single pitch estimate per frame.
To choose a pitch in each frame, the present module will consider the score of the candidates and the continuity of each candidate with previous candidates. In addition, the current module will choose an octave for the pitch estimates.
The module begins by converting all pitch candidates to a pitch class value, though the pitch including octave information is also retained. Pitch class is an octave-independent number between 0 inclusive and 12 exclusive that represents the note name of a pitch on a Western musical scale. A different number of pitches might be found in an octave of another music scale. Zero corresponds to the pitch C, one corresponds to C sharp, and so on. Of course, zero could be chosen to correspond to a different pitch. Modulo designation of pitches is done under the assumption, generally valid in pitch detection systems, that the pitch class of the pitch candidate is more reliable than the octave information of the pitch candidate. After choosing a pitch class with confidence, we can turn to correctly choosing the octave.
Pitch Streak Creation
Considering the pitch class data, the means for pitch streak creation finds streaks of pitch candidates that are close in pitch class to one another over a number of consecutive frames. In one embodiment:
Then we would initialize pitch streaks 1, 2, and 3:
In this case, the pitch class distances are:
Now that streaks have been formed, and because multiple streaks are likely to occur in each frame, the next tasks are to pick some streaks as higher priority than others, and to choose an octave for the streaks. To do this, we call on a piece of module input data we mentioned above: the list of preliminary scores for the pitch candidates. For the purposes of the current algorithm, we will not be concerned with the scores themselves, so much as which of the pitch candidates in each frame had the best preliminary score; we call this the “first place score.” To establish an ordered list of high priority streaks, then, the streaks are first sorted by the number of first place scores their candidates contained. Streaks with fewer than two first place scores are not considered high priority, regardless of length. Next, we consider the sum of the energy_raw_c values for the frames each streak spanned. Any streak in the bottom 10% of all values for this measure (though this value may be user-adjusted) is removed from the high priority list. By taking these two steps in assigning streak priority, we greatly enhance the reliability of frequency estimates. This is because longer, louder streaks with more high-reliability (first place) candidates tend to contain valid pitch data as opposed to pitch data generated during silence or non-pitched speech or background sounds.
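The streak prioritization just described might be sketched as follows; the data-structure choices (streak identifiers keyed into dicts) are assumptions for illustration:

```python
import numpy as np

def prioritize_streaks(streaks, first_place_counts, streak_energies,
                       energy_cutoff_pct=10.0):
    """Return high priority streaks, ordered by first-place score count.

    streaks: list of streak ids. first_place_counts / streak_energies map a
    streak id to its number of first place scores and its summed
    energy_raw_c. Streaks with fewer than two first place scores, or in the
    bottom energy_cutoff_pct percent by energy, are excluded.
    """
    energy_floor = np.percentile(
        [streak_energies[s] for s in streaks], energy_cutoff_pct)
    high = [s for s in streaks
            if first_place_counts[s] >= 2
            and streak_energies[s] >= energy_floor]
    return sorted(high, key=lambda s: first_place_counts[s], reverse=True)
```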
At this point, we have an ordered list of high priority streaks. The module will now choose an octave for the pitch classes in each high priority streak starting with the highest priority one and proceeding until all high priority streaks are finished. When there is more than one streak present in a given frame, the higher priority streak is the one whose pitch and octave value is written to the output. Nonetheless, all values in a streak may contribute to the octave decision process described below, even though their pitch values may not be written to the output.
Octave Selection
To choose the octave, the means for octave selection employs a variable length buffer that starts at length 15 frames (which in our system represents about 750 ms), a length we term the “octave window length.” Other octave window lengths may be chosen, such as 500 ms or 1000 ms, and may be expanded indefinitely, one octave window length at a time, as long as certain conditions described below are met. Once the conditions for expanding the octave buffer are no longer met, or whenever the entire streak is less than the octave window length and thus is the entire buffer, an octave is chosen for the elements in the buffer as follows:
We now choose octaves for these pitch classes closest to the midi reference value of 66.0, or in other words, midi pitches between 60.0 and 72.0, with ties broken by choosing the closer one to 66.0. This leaves us with the following time-ordered sequence:
62.0
62.2
62.5
64.0
63.2
63.9
64.4
64.8
65.0
66.0
66.4
67.1
67.2
67.5
67.7
We note that a new candidate pitch was created for the fourth frame from the end, because the only candidate pitch for that frame was not in the chosen octave.
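The casting of pitch classes into the chosen octave band can be sketched as below. The tie-break toward the lower octave on an exact tie is one possible reading of the tie rule above:

```python
def cast_to_octave(pitch_class, reference_midi=66.0):
    """Cast a pitch class into the octave-wide band around reference_midi.

    Returns the midi pitch (pitch_class + 12 * k) closest to the reference;
    for a reference of 66.0 this yields midi pitches between 60.0 and 72.0.
    On an exact tie the lower octave is chosen here.
    """
    candidates = [pitch_class + 12.0 * k for k in range(0, 11)]
    return min(candidates, key=lambda m: (abs(m - reference_midi), m))
```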
Recall that the buffer may be expanded to greater than a single octave window length under certain conditions. We now describe the module's approach to creating, expanding, terminating, and processing buffers.
There are two issues with buffer termination at the end of a streak. First, if a buffer needs to be extended but the streak does not have enough frames left to form a complete octave window length window, the window simply extends only to the end of the streak. Second, a buffer that reaches the end of a streak without otherwise being terminated is automatically terminated there.
At this point, we have chosen octaves for all candidates in each streak. We have also chosen a streak to own most frames, using the high priority streaks. Now, only two tasks remain. First, we choose pitches for those frames in which no high priority streak exists. Second, when there is more than one candidate in a streak, we must choose only one, as we want to assign only one pitch value per frame.
The first task is simple. Because there is very little contextual information for frames that do not participate in any high priority streak, we simply take the first place candidate for such frames, including whatever octave was chosen for it.
Viterbi Algorithm for Smoothness
For the second task, the module employs the Viterbi algorithm to choose the smoothest trajectory through the streak's candidates in the first order sense. That is, the optimization criterion is that the differences between consecutive midi pitches be minimized. Only a single frame window and the first order difference were used, though longer windows or higher order differences could be used.
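A minimal sketch of such a Viterbi pass follows, using only the first-order pitch difference as the transition cost (the candidate scores the module also has available are omitted here):

```python
import numpy as np

def smoothest_path(candidates):
    """Choose one pitch candidate per frame so that the summed first-order
    difference |p[t] - p[t-1]| along the path is minimized.

    candidates: list of lists; candidates[t] holds the midi pitch
    candidates in frame t. Returns one pitch per frame.
    """
    n = len(candidates)
    # cost[t][j]: best cumulative cost ending at candidate j of frame t.
    cost = [[0.0] * len(c) for c in candidates]
    back = [[0] * len(c) for c in candidates]

    for t in range(1, n):
        for j, p in enumerate(candidates[t]):
            steps = [cost[t - 1][i] + abs(p - q)
                     for i, q in enumerate(candidates[t - 1])]
            back[t][j] = int(np.argmin(steps))
            cost[t][j] = min(steps)

    # Backtrack from the cheapest terminal candidate.
    j = int(np.argmin(cost[-1]))
    path = [0] * n
    for t in range(n - 1, -1, -1):
        path[t] = candidates[t][j]
        j = back[t][j]
    return path
```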
Once this has been done, the module has assigned a pitch value for each frame and thus the desired output has been obtained.
Modules to Make Small Adjustments in Pitch
In music transcription systems, the preliminary pitch estimate or estimates for each frame may be only approximate. This may occur when a quicker but less accurate algorithm is chosen to perform early estimates of pitch. Even if a rigorous algorithm chooses continuous zones of pitch or octaves, the pitch estimates themselves could be slightly off.
One factor leading to errors in pitch estimates made in an FFT-based system is FFT bias. This bias, also referred to as crosstalk, directly affects the information in the FFT most often used to compute pitch estimates. It affects the frequencies and magnitudes of the harmonic peaks.
Various approaches to minimizing this problem have various drawbacks. One idea is to use as long an FFT analysis window as possible, such as 50 ms or more. Doing so leads to narrower window transforms and therefore less interference between harmonics. However, windows longer than 50 ms may cover enough time that the pitch changes during the window. When this happens, the harmonics become spread out and interfere with one another. Using 50 ms windows is a good compromise, but if the FFT used must also produce data to be used by a speech processing system, then 50 ms is usually too long a window; 25 ms is better. A 25 ms window will unfortunately often have a significant amount of bias, as wide window transforms lead to interfering or unresolved harmonic peaks. Increasing the sampling frequency may ameliorate some issues with a 25 ms window.
Below, we describe a method to minimize the effect of FFT bias on pitch estimates by analyzing the frequencies and magnitudes of the detected harmonics and using them to adjust the estimated pitch. We also describe a method to use a possibly adjusted pitch value to ensure inclusion of all relevant harmonics.
Before making small adjustments, the audio signal processing has produced pitch estimates for each frame in the system. These pitch estimates have been chosen from pitch candidates based on their smoothness relative to each other and consistency of octave.
While these pitch estimates are directly useful for estimating sung pitches, we also use them in a less direct manner. We can use frame-level pitch data to assist indirectly in estimating whether or not voicing (or pitched data) occurs, and where syllable onsets occur. Specifically, we can use the pitch to identify those peaks from the spectrogram representation that correspond to harmonic components of the signal in each frame. For example, if a frame had a pitch estimate of 105 Hz, then we might look for peaks that have frequencies at 105, 210, 315, 420, etc. Hz. Similarly, we might work backwards if we already have harmonic peak data: if we have peaks at 110, 220, 330, and 440 Hz but a frame pitch estimate of 108 Hz, we might assume that our estimate was too low and should be 110 Hz. In either case, the goal is to ensure that we have found the harmonic peaks in the frame and that the pitch for the frame accurately represents them.
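The peak-to-harmonic matching described above might be sketched as follows; the relative tolerance is an assumed parameter, not a value from the description:

```python
def match_harmonics(pitch_hz, peak_freqs, max_harmonic=8, tol=0.03):
    """Associate spectrogram peaks with harmonics of an estimated pitch.

    For each harmonic index n, take the peak closest to n * pitch_hz if it
    lies within the relative tolerance tol (an assumed value). Returns a
    dict {harmonic_index: peak_freq}.
    """
    matched = {}
    for n in range(1, max_harmonic + 1):
        ideal = n * pitch_hz
        best = min(peak_freqs, key=lambda f: abs(f - ideal), default=None)
        if best is not None and abs(best - ideal) <= tol * ideal:
            matched[n] = best
    return matched
```

Working backwards as in the 108 Hz example, the matched peaks (110, 220, 330, 440 Hz) would then suggest raising the estimate toward 110 Hz.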
The next two modules perform these tasks. The frame pitch correction module uses existing harmonic peak estimates to refine our pitch estimate for each frame. This is only relevant for those frames in which we already have harmonic peak data. We recall that in the current system, a single pitch estimate for each frame—and supporting data—had already been generated by a previous system. Specifically, in addition to the pitch estimate, we have matrices encoding the magnitude and frequency of harmonic peaks that contributed to the estimate of frequency for each frame. (A similarly sized matrix would contain the magnitude values of the peaks.) Therefore, we may use these matrices in a module that refines pitch estimates. Once the pitch estimate is refined, we may go back and check that we included all harmonic peaks.
For other frames where the pitch has been changed significantly by the above processing, we do not yet have harmonic data. In those frames, we must start by looking for harmonic peaks. Then, we may use those peaks to correct the pitch, before once more checking for all harmonic peaks.
We see, then, that these two modules could be used in indefinite iteration. However, we found that a single iteration of the form described above was sufficient.
Module to Make Small Adjustments in the Pitch Based on Frequencies of Harmonic Peaks
The means for making small adjustments or “micro-corrections” to frame pitch estimates operates on one frame at a time. It can use the following as input, or derive these from more basic inputs:
The module outputs the following:
The basic idea in this module is to use the frequencies of the first few harmonics to refine the pitch estimate in a given frame. For example, if the first four harmonics had respective frequencies of 210, 420, 630, and 840 Hz, the harmonic model indicates that the fundamental frequency is 210 Hz, even if the original pitch estimate for the frame were different, say 220 Hz. This is an idealized example, and often the first four harmonics will not agree so consistently, or even suggest that the pitch should be changed in the same direction. However, the evidence is often very strong in one direction, and disagreements sometimes come only from spurious harmonic peaks.
It is important to make small adjustments such as this to the pitch, because at higher harmonics the effect of a small pitch error can be very significant and result in failing to identify a high-frequency spectrogram peak as harmonic. For example, the tenth harmonic for a 210 Hz fundamental is 2100 Hz, but the tenth harmonic for a 220 Hz fundamental is 2200 Hz. When seeking a tenth harmonic to track (which is done in the next module), the module is unlikely to find the correct tenth harmonic when it is looking in a location 100 Hz from the correct value.
Next, we again normalize relative to the original pitch estimate by dividing the normalized difference values (−10, −10, −10, and −10) by the pitch estimate. In this example, that generates −0.0455, −0.0455, −0.0455, and −0.0455. We may write:
A value which is off by more than the original pitch will have a value above 1.0; however, in such a case it is possible the harmonic used was actually assigned the wrong harmonic number. For example, if the fifth harmonic were 220 Hz too high relative to the estimated pitch, it is possible that it was misidentified as the fifth harmonic and should have been the sixth harmonic. The module does not attempt to reassign peaks to different harmonics, as that task is outside its scope. Instead, it deletes from consideration any harmonic whose normalized distance is greater than 0.5. For obvious reasons, the module also ignores any “empty” harmonics where no peak is present in harmonic_frequencies. Also, the module uses only the second through eighth harmonics (though other upper harmonic limits may be used), to try to limit cases where misidentification of the harmonic index could occur.
At this point, we have a buffer of normalized eligible errors of frequency represented by our harmonic_frequencies relative to our original estimated pitch. Next, the module will use this buffer to adjust the estimated pitch.
First, the module determines whether positive or negative signs are more common in the buffer of pitch_normalized_harmonic_differences, where we recall that a positive sign indicates the harmonic_frequencies are too high given the estimated pitch. Consider an example where the pitch estimate is 200 Hz, and the second through eighth harmonic_frequencies, normalized_relative_error, and pitch_normalized_harmonic_differences respectively are:
In this case, five of the seven values are negative.
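The construction of this buffer and the split of its values by sign can be sketched as follows, using illustrative numbers rather than the example's table; the dict-based representation of harmonic_frequencies is an assumption:

```python
def harmonic_pitch_votes(pitch_hz, harmonic_frequencies,
                         first=2, last=8, max_norm=0.5):
    """Build the buffer of pitch_normalized_harmonic_differences and split
    it by sign. harmonic_frequencies: {harmonic_index: measured_peak_freq};
    "empty" harmonics are simply absent from the dict.
    """
    buffer = {}
    for n in range(first, last + 1):
        f = harmonic_frequencies.get(n)
        if f is None:
            continue  # ignore "empty" harmonics
        # Difference to the ideal harmonic, normalized by harmonic number
        # (the "normalized relative error", in Hz)...
        normalized_relative_error = (f - n * pitch_hz) / n
        # ...then normalized again by the pitch estimate (dimensionless).
        diff = normalized_relative_error / pitch_hz
        if abs(diff) <= max_norm:  # drop likely misindexed harmonics
            buffer[n] = diff
    neg = [d for d in buffer.values() if d < 0]
    pos = [d for d in buffer.values() if d >= 0]
    return buffer, pos, neg
```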
Then the module checks that there are at least two values remaining in the buffer (in the example there are), that we have at least 60 percent of the total number of values before the minority values were deleted (we have 5 of 7 or 71% so we do), and that the pitch estimate is below a fixed ceiling. Let's say the pitch estimate had been 200 Hz and that the ceiling was 250 Hz, so the test is passed. (This last step is done because pitch correction above a certain value tended to do more harm than good.) If any condition is not met, no correction is performed. If all of these conditions are met, the module continues.
Next, the distance from each buffer element to the median is calculated, as is the maximum spread in the buffer. In the example, the median is −0.05, the distances are 0.01, 0.01, 0, 0.01, and 0.04, and the maximum spread is 0.05. Then, any values that are further from the median than a fixed fraction of the maximum spread (here, half) are deleted. In this case, half the maximum spread is 0.05/2 or 0.025. Only one distance, 0.04, is greater than that, so it is discarded. This leaves us with four example values:
The logic here is that certain harmonic_frequencies values could be spurious even if their corresponding buffer values voted with the majority in choosing the direction to change the estimated pitch. At this point, we may have deleted a significant number of the values from the original buffer. Before continuing, we confirm that we still have at least 40 percent of the number of values before we deleted the minority votes. If this condition is not met, no correction is performed. In the example, we have four of the original seven values, or 57%. Therefore, we continue on with pitch correction.
If the condition is met, the module continues, using the remaining values in the buffer to inform the correction process. For this step, the normalized buffer values in Hz (the “normalized relative error”) are used. In the example, we recall that the third, fourth, sixth, and seventh harmonics' data are being used, corresponding to:
The module takes a median value of the normalized relative error. In the example, the median here is −11 Hz. If this median represents less than a fixed “maximum adjustment” when divided by the original pitch estimate in Hz, the module continues, by adding the median value (positive or negative) to the original pitch estimate in Hz. (Otherwise, as above, no pitch change is made.) In the example, the original pitch estimate is 200 Hz, so an 11 Hz downward adjustment represents a 5.5% change. A typical value for the change ceiling would be 0.75 midi, or about 4.4%. Therefore, in the example, the change would be ignored.
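The median-based adjustment and its ceiling check can be sketched as below; the 0.75 midi ceiling is converted to a relative frequency ratio (about 4.4%) as described:

```python
from statistics import median

def apply_micro_correction(pitch_hz, normalized_relative_errors,
                           max_adjust_midi=0.75):
    """Add the median surviving per-harmonic error (in Hz) to the pitch,
    unless the relative change exceeds the adjustment ceiling.
    """
    if not normalized_relative_errors:
        return pitch_hz
    med = median(normalized_relative_errors)        # Hz
    max_relative = 2 ** (max_adjust_midi / 12) - 1  # 0.75 midi is about 4.4%
    if abs(med) / pitch_hz >= max_relative:
        return pitch_hz                             # change too large; ignore
    return pitch_hz + med
```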
Before keeping a change, the module performs three tests to see if the old pitch was better than the new one. To perform these tests, first the module needs to collect some data. First, as above, ideal harmonic frequencies are calculated, this time for the new pitch. Next, as above, the new ideal harmonic frequencies are subtracted from harmonic_frequencies. We term these differences the “Hz to ideal frequencies.” These values are divided by the new pitch estimate, to obtain the “relative harmonic frequency error,” a quantity that will be used at the end of the module. Now, using these data, four quantities are obtained:
In this respect the harmonic_frequencies values are treated as ground truth with the new and old pitch estimates treated as testable quantities.
Given the data for the number of close harmonic peaks and the number of close lower harmonic peaks for both the old and new pitch estimates, the module will perform three tests to determine if the old pitch or new pitch is a better match. If any of the three tests passes, the old pitch is judged to have been a better match. The three tests are as follows:
If any of the three tests passes, the module reverts to the old (original estimated) pitch. Before concluding, this module checks to see if any of the relative harmonic frequency errors (calculated above for the new pitch, if the pitch was changed) exceed the maximum acceptable relative frequency error. For all harmonic_frequencies values that exceed this, the module forces the harmonic_frequencies and corresponding harmonic_magnitudes values to zero.
Module to Include More Harmonic Peaks
As mentioned above, the module to ensure inclusion of harmonic peaks works in tandem with the module to make small adjustments in frequency. While the frequency adjusting module takes harmonic peaks as ground truth and uses them to adjust the pitch, the current module takes the pitch as ground truth and uses it to find and confirm harmonic peaks.
The basic goal is to determine which, if any, of the spectrogram peaks for a given frame correspond to the harmonic peaks for a given pitch. To qualify, a given peak must meet three criteria:
If a peak passes all three tests, it is judged to be dominant and harmonic. It may be evident from the above criteria that a peak could be dominant (pass the second test) but not pass the other two. In this case, it is judged to be non-harmonic and dominant. Peaks which pass other combinations of tests (this includes all non-dominant peaks) are not of interest to this module, or to the system in general.
This module works on one frame at a time, and takes the following as input:
The module outputs one or more of the following:
Because of the optional harmonic peaks in the input, the module operates in a non-trivial fashion. We presently describe the method employed. It involves a large loop over each harmonic that could occur for a given pitch, starting at the fundamental (first harmonic) and continuing until either the highest chosen harmonic (chosen to be 45, though the number is user-specified) or a maximum frequency (chosen to be 5000 Hz, though the number is user-specified). For each harmonic the module does the following:
At this point, we have recorded either 0 or 1 dominant peak for each harmonic. If the peak was judged to be harmonic, its magnitude is recorded in the dominant harmonic array; if it was judged to be non-harmonic, its magnitude is recorded in the dominant non-harmonic array. In either case, its frequency is recorded in the single frequency array, so at most one peak is recorded per harmonic, be it in one magnitude array or the other.
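The recording loop can be sketched as follows. The closeness criterion here is a single assumed tolerance standing in for the module's three tests, and the peak representation (frequency, magnitude, dominance flag) is illustrative:

```python
def record_harmonic_peaks(pitch_hz, peaks, max_harmonic=45,
                          max_freq_hz=5000.0, rel_tol=0.03):
    """Record at most one dominant peak per harmonic of pitch_hz.

    peaks: list of (freq_hz, magnitude, is_dominant) tuples. Peaks close
    enough to the ideal harmonic frequency go in the harmonic magnitude
    array; other dominant peaks go in the non-harmonic array; the
    frequency is recorded either way.
    """
    harm_mag = [0.0] * (max_harmonic + 1)     # dominant harmonic magnitudes
    nonharm_mag = [0.0] * (max_harmonic + 1)  # dominant non-harmonic magnitudes
    freq = [0.0] * (max_harmonic + 1)         # frequency of whichever was recorded

    for n in range(1, max_harmonic + 1):
        ideal = n * pitch_hz
        if ideal > max_freq_hz:
            break
        # Dominant peaks within half a pitch period of the ideal frequency.
        dominant = [(f, m) for f, m, d in peaks
                    if d and abs(f - ideal) < 0.5 * pitch_hz]
        if not dominant:
            continue
        f, m = min(dominant, key=lambda fm: abs(fm[0] - ideal))
        if abs(f - ideal) <= rel_tol * ideal:
            harm_mag[n] = m      # judged dominant and harmonic
        else:
            nonharm_mag[n] = m   # dominant but non-harmonic
        freq[n] = f
    return harm_mag, nonharm_mag, freq
```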
Generalization of Pitch Assignment
One should not lose sight of the general principles underlying pitch class selection, octave selection, small adjustments to pitch and finding additional harmonics. These general principles can be applied in a wide variety of embodiments, including methods, devices, devices functionally described as means plus functions and computer readable storage media that include program instructions to carry out methods or to configure computing devices to carry out methods.
One embodiment related to pitch detection is a method of selecting a pitch class for a frame. The frame appears among a sequence of frames that represent an audio signal. The method includes processing electronically a sequence of frames that include at least one pitch estimate per frame. The method includes transforming the pitch estimates for the frames into pitch class estimates that assign equal pitch classes to pitches in different octaves that have equal positions within their respective octaves. While the example given above generally refers to a Western division of an octave into 12 notes, transforming the pitch estimates can be applied to any division of an octave into notes or even to a continuous span of an octave. The method includes constructing at least one pitch class consistency streak including pitch classes selected from consecutive frames that have pitch class estimates within a predetermined pitch margin of one another. Data regarding pitch content of the frames, based on at least the pitch classes in the pitch class consistency streak, are output for further processing.
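A minimal sketch of the modulo transformation and streak construction described above, assuming MIDI pitch estimates and a hypothetical half-semitone margin:

```python
def pitch_class_streaks(pitch_estimates_midi, margin=0.5):
    """Group consecutive frames into pitch class consistency streaks.

    Pitch estimates (in MIDI units) are reduced modulo 12 so that equal
    positions within different octaves map to the same pitch class. A
    streak grows while successive pitch classes stay within `margin` of
    one another; the margin value here is an illustrative assumption.
    """
    if not pitch_estimates_midi:
        return [], []
    classes = [p % 12.0 for p in pitch_estimates_midi]
    streaks, current = [], [0]
    for i in range(1, len(classes)):
        # circular distance between pitch classes (the octave wraps around)
        d = abs(classes[i] - classes[i - 1])
        if min(d, 12.0 - d) <= margin:
            current.append(i)
        else:
            streaks.append(current)
            current = [i]
    streaks.append(current)
    return classes, streaks
```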
Building on pitch class selection is octave selection. A method for octave selection assigns a pitch class to an octave. This begins by defining an octave selection buffer that includes all or some of the frames in the pitch class consistency streak. It includes selecting an octave-wide band based on analysis of pitches in the frames of the octave selection buffer. Then, the estimated pitch classes of the frames are converted into pitches by casting them into the octave-wide band, producing selected pitches. One aspect of selecting pitches via octave selection is that a smoother may be applied to the selected pitches.
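The casting step might be sketched as follows, assuming the octave-wide band is specified by its lower MIDI bound (how that band is chosen from the octave selection buffer is a separate analysis):

```python
def cast_into_octave_band(pitch_classes, band_low_midi):
    """Cast pitch classes into an octave-wide band.

    Each pitch class (0 to 12) is shifted up by whole octaves until it
    lands in [band_low_midi, band_low_midi + 12), producing a selected
    pitch in MIDI units. The band representation is an assumption.
    """
    out = []
    for pc in pitch_classes:
        p = pc % 12.0
        while p < band_low_midi:
            p += 12.0   # shift by octaves until inside the band
        out.append(p)
    return out
```

For example, with a band starting at MIDI 57 (A3), the pitch class 0 (C) is cast to MIDI 60 (middle C), the only C inside that octave-wide band.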
Building on pitch selection is small pitch adjustment. A method for making small adjustments in pitch applies a correction of no more than a predetermined adjustment band, thereby allowing the selected pitch to rise or fall by just a small amount. The method includes electronically processing spectrogram data for the frame. The method determines whether the selected pitch for the frame will be adjusted within the predetermined adjustment band to increase consistency between a selected pitch and frequencies of harmonic peaks in the spectrogram data. This small pitch adjustment has wide application and need not depend on the preceding steps of pitch class selection or octave assignment.
The small pitch adjustment can be combined with using the adjusted selected pitch to search the spectrogram data for the frame to find any additional harmonic peaks in the spectrogram data that were missed in earlier processing. When additional peaks are found, all of the relevant harmonic peaks are used to repeat the small pitch adjustment described in the previous paragraph.
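A sketch of the limited adjustment, assuming the fundamental implied by each harmonic peak is averaged and the correction is clamped to the adjustment band (both the averaging and the band size are illustrative assumptions):

```python
def adjust_pitch(selected_hz, harmonic_peaks, band_midi=0.5):
    """Nudge a selected pitch toward the first harmonic implied by its
    harmonic peaks, limited to a predetermined adjustment band.

    harmonic_peaks: list of (harmonic_number, freq_hz) pairs. The
    adjustment is capped at +/- band_midi semitones.
    """
    if not harmonic_peaks:
        return selected_hz
    # fundamental implied by each peak is f_k / k; average the implications
    implied = sum(f / k for k, f in harmonic_peaks) / len(harmonic_peaks)
    # limit the correction to the adjustment band, in frequency ratio terms
    ratio = implied / selected_hz
    limit = 2.0 ** (band_midi / 12.0)
    ratio = max(1.0 / limit, min(limit, ratio))
    return selected_hz * ratio
```

After the adjustment, a search for newly relevant harmonic peaks could supply additional (harmonic_number, freq_hz) pairs, and the same function would be run again, as the paragraph above describes.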
A series of electronic signal processing components can be described, which follow from the methods disclosed. One electronic signal processing component is for selecting a pitch in frames that represent an audio signal. It includes an input port adapted to receive a stream of data frames including at least one pitch estimate per frame. It includes a modulo conversion component coupled to the input port that assigns equal pitch classes to pitches in different octaves that have equal positions within their respective octaves. A streak constructor component, coupled to receive the assigned pitch classes from the modulo conversion component, assigns the frames to one or more pitch class consistency streaks of consecutive frames that have pitch class estimates within a predetermined pitch margin of one another. An output port is coupled to the streak constructor. It outputs data regarding pitch content of the frames based on at least the pitch classes in the pitch class consistency streak.
Another electronic signal processing component for octave selection is coupled between the streak constructor and the output port. It includes an octave selection buffer that receives all or some of the frames in the pitch class consistency streak and octave assignment logic. The logic selects an octave-wide band based on analysis of pitches in the frames in the octave selection buffer. The logic assigns the estimated pitch classes in the frames in the octave selection buffer to pitches in the octave-wide band, producing selected pitches.
Generally, the electronic signal processing components disclosed herein can be implemented using a range of software, firmware and hardware. For instance, in a digital signal processor (DSP) implementation, input and output modules and ports may be built into the hardware and other components loaded as software into the digital signal processor or memory accessible to the digital signal processor. In a general purpose CPU implementation, whether using a CISC or RISC processor, most of the processing components can be implemented as software running on the CPU and utilizing resources coupled to the CPU. In a gate array implementation, the components may be cells in the gate array, whether the gate array is programmed or masked. The gate array may be a programmable logic array, a field programmable gate array or another logic implementation. One use of gate array implementations is to prototype custom devices. It should be understood that gate array implementations can often be converted into a custom chip implementation, if sales volume warrants or if power or performance requirements justify the custom design.
A further electronic signal processing component can stand alone or be used in conjunction with the pitch class selection and octave selection components. This component is a pitch adjustment processor that processes the spectrogram data for the data frames. This pitch adjustment processor includes logic to calculate a first harmonic that would be consistent with harmonic peaks in the spectrogram. The logic compares the calculated first harmonic to a selected pitch for the frame and adjusts the selected pitch to increase consistency between the selected pitch and frequencies of harmonic peaks in the spectrogram data. This adjustment is limited to changes within a predetermined adjustment band.
The adjustment processor can further include logic to search the spectrogram data for the frame to find any additional harmonic peaks in the spectrogram data that are relevant to the adjusted pitch. If additional peaks are found, the logic invokes the small pitch adjustment logic to make a further adjustment in the estimated pitch.
Components for pitch selection also can be described as means for functions, relying on the corresponding technical details given above. In one embodiment, an electronic signal processing component includes an input port adapted to receive a stream of data frames including at least one pitch estimate per frame. It includes transformation means for assigning equal pitch classes to pitches in different octaves that have equal positions within their respective octaves. The transformation means is coupled to the input port. It further includes streak constructor means for assigning the frames to one or more pitch class consistency streaks. The pitch class consistency streaks include consecutive frames that have pitch class estimates within a predetermined pitch margin of one another. The streak constructor means is coupled to the transformation means and to an output port.
A further aspect of this electronic signal processing component combines an octave assessment means with the foregoing transformation and streak constructor means. The octave assessment means is coupled between the streak constructor means and the output port. It is a means for selecting a pitch in the frames based on analysis of the estimated pitch classes in some or all the frames in the pitch class consistency streak.
As described below, any of the methods related to pitch selection may be practiced as a computer readable storage medium including program instructions for carrying out a method. In addition, we disclose that program instructions on a computer readable storage medium may be used to manufacture hardware or hardware loaded with firmware that is adapted to carry out the methods described. For instance, the program instructions may be adapted to run on a DSP or to configure a gate array, either a gate array programmed in the field or a gate array programmed during manufacturing.
Module to Detect Voicing
In speech and music technology applications, it is often very useful to know whether or not a particular segment of audio contains a pitch. When performing music transcription, large errors result if a system attempts to estimate a pitch for a segment that is not voiced and locks onto a pitch. Conversely, if it is decided that a pitch does not exist when one in fact does, then another type of error occurs when “silence” is recorded instead of a pitch.
In the speech literature, “voicing” is a term often used to distinguish between the pitched or non-pitched status of a speech segment. “Voiced” means that the segment contains a pitch, “unvoiced” means that it does not, and “mixed voicing” or “partial voicing” means that the segment contains both pitched and unpitched sounds. By convention, voiced sounds are generally understood to be harmonic, meaning that a frequency analysis shows the presence of frequency components at integer multiples of the fundamental frequency. It should be understood that the methods described also have practical application to transcription of some musical content that may not strictly qualify as “voiced,” such as whistled data, or any sufficiently loud pitched sound on a recording, including monophonic musical instruments, sinusoidal non-harmonic instruments such as a Theremin, or the loudest pitched component from a polyphonic instrument, such as the loudest melodic line on a piano, even if other notes are being played. The important concept is that any pitched sound that could be transcribed as a note can be detected as voiced, while any unpitched sound that could not be so transcribed should be detected as unvoiced. If a voiced and unvoiced sound occur simultaneously, we wish to label the segment voiced.
In the singing application, where many syllables are used by the singer, voicing is especially important, because many syllables in English (and in many other languages) alternate between voiced and unvoiced sounds. If voicing is accurately detected, then syllables are also often accurately detected.
In our research, we found that three criteria generally predict whether or not voicing occurs:
As noted above, the voicing decision is used to determine whether or not a pitch was present at a given moment in time. However, in order to make this voicing decision, we also need a pitch estimate for that moment in time. This fact should not lead to confusion. Just because previous processing estimated a pitch for a given moment in time does not mean that a pitch was actually there; it should only be viewed as the best guess at a pitch if in fact one was present. Only if a segment is judged to be voiced will the pitch estimate be considered as a pitch that will ultimately be transcribed.
In one embodiment, the voicing detection module includes tests of these three features of the input. To perform its work, this function takes the following two data as input:
For output, the system produces a binary indicator of voicing for each frame, a quantity we term vu_decision. To earn a vu_decision value of one, a frame passes some or all of three tests confirming each feature that characterizes voicing, as described above. We describe each of these three tests below.
Sufficient Harmonic Energy
The sufficient harmonic energy test is that a given frame contains enough harmonic energy. In general, voiced frames contain more energy than unvoiced frames. However, certain unvoiced frames such as those containing the sounds “sh” and “f” can have high amounts of energy, though that energy is non-harmonic. Therefore, it is specifically high harmonic energy that is detected. In order to determine what constitutes “high,” however, the module normalizes the data.
The first task, however, is to detect the amount of harmonic energy present in each frame within certain frequency limits. We call this the “low-frequency harmonic energy.” For each frame, we do the following:
Take a sum over these magnitudes, ensuring that these magnitudes are in the squared magnitude domain before the sum operation. (To perform the conversion from the dB value X to the squared magnitude Y, apply Y=10^(X/10).) In the above example, because the pitch is 400 Hz, we only consider the first through sixth harmonics, because the harmonic frequencies are 400, 800, 1200, 1600, 2000, 2400, 2800, and 3200 Hz, placing the last two harmonics outside the specified frequency range of 59 through 2500 Hz. For the three frames, we obtain approximately:
Once this is done for each frame, smooth the results versus time (versus frame) by using a smoother. We use a 13 frame (about 650 ms) Hamming window, though other smoothing windows (such as Bartlett or Kaiser) or smoothing lengths (such as 500 or 1000 ms) may be used. The final result is termed the smoothed low-frequency harmonic energy.
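The two steps above (summing harmonic magnitudes in the squared magnitude domain within the 59 through 2500 Hz band, then smoothing the per-frame sums) can be sketched as follows; the function names and the smoother's clamped edge handling are illustrative assumptions:

```python
import math

def low_freq_harmonic_energy(pitch_hz, harmonic_db, f_lo=59.0, f_hi=2500.0):
    """Sum the harmonic magnitudes of one frame within a frequency band.

    harmonic_db holds dB magnitudes for harmonics 1, 2, 3, ... of
    pitch_hz. Each value is converted from dB to the squared magnitude
    domain (Y = 10 ** (X / 10)) before summing, as described above.
    """
    total = 0.0
    for k, x_db in enumerate(harmonic_db, start=1):
        if f_lo <= k * pitch_hz <= f_hi:
            total += 10.0 ** (x_db / 10.0)
    return total

def smooth(values, window_len=13):
    """Smooth the per-frame sums with a normalized Hamming window.

    Edge frames are handled by clamping indices, an assumed policy;
    Bartlett or Kaiser windows could be substituted, as noted above.
    """
    w = [0.54 - 0.46 * math.cos(2 * math.pi * i / (window_len - 1))
         for i in range(window_len)]
    total_w = sum(w)
    half = window_len // 2
    out = []
    for i in range(len(values)):
        acc = 0.0
        for j, wj in enumerate(w):
            k = min(max(i + j - half, 0), len(values) - 1)
            acc += wj * values[k]
        out.append(acc / total_w)
    return out
```

In the 400 Hz example above, only harmonics one through six fall inside the band, so eight dB values of 10 dB each sum to a squared magnitude of 60.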
Next we must determine whether the amount of harmonic energy detected in this way for each frame is sufficient to suggest voicing. Therefore, we create a histogram to determine the typical values for voiced energy in the file. To do so, we first determine the total voiced energy for each frame. (In practice, we found it better to use all frequencies for this rather than limit the frequency range as done just above.) Then, we create a histogram or other frequency distribution (such as Parzen-windowed) analysis using this data, typically with the following features:
An example histogram is shown in
Once the histogram is constructed this way, we find the highest volume index (corresponding to loudest dB SPL) histogram peak that is at least 25 percent as high as the highest value in the histogram. We also require that its corresponding dB SPL value be at least 45.
In the first example plot above, we see a clear and high histogram peak at about 70 dB SPL. This peak is the farthest to the right. It is also the highest peak in the plot, meaning that it is above 25% of the maximum because it is 100% of the maximum. Therefore it qualifies.
Now consider a second example plot in
This latter aspect is required to avoid missing a peak when most of the file is unvoiced and also to avoid spuriously detecting a peak when some of the file is particularly loud. To handle the very rare case, for instance in very brief recordings, when there is no peak that meets the requirements, the dB SPL value representing the top sixty percent of data (everything above the fortieth percentile by loudness) is chosen as the “peak” value.
Next, we wish to see if the previously calculated low-frequency harmonic energy is high enough relative to the typical voiced value we calculated as the peak value. To do so, the module declares any smoothed low-frequency harmonic energy less than 17.5 dB down from the peak value to pass. In the two examples above, the peak values were 70 dB and 55 dB respectively. Therefore, the respective passing energy for identifying a frame as voiced would be 52.5 dB and 37.5 dB.
As a safeguard to make sure the peak value was not too low, there is also a test to see if this results in more than 90 percent of the frames passing. If this occurs, the 17.5 value is repeatedly decremented by one dB until either 85 percent or less of the frames pass, or until the dB value reaches 7 dB. The underlying assumption here is that all languages expected by the system will contain enough unvoiced sound (or enough silence due to breathing) that at least 10 to 15 percent of all frames should be unvoiced. Therefore, this test acts as a safeguard in case a suspiciously high number of frames pass the harmonic energy test.
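Combining the 17.5 dB criterion with the decrementing safeguard, a sketch (names and control structure assumed) is:

```python
def harmonic_energy_pass(smoothed_db, peak_db, down=17.5, floor=7.0):
    """Flag frames whose smoothed low-frequency harmonic energy is within
    `down` dB of the peak value. If more than 90 percent of frames pass,
    the margin is decremented by 1 dB at a time until 85 percent or
    fewer pass, or until the margin reaches the 7 dB floor.
    """
    def passes(margin):
        return [e >= peak_db - margin for e in smoothed_db]

    result = passes(down)
    if sum(result) > 0.90 * len(result):
        while sum(result) > 0.85 * len(result) and down > floor:
            down -= 1.0
            result = passes(down)
    return result
```

With a 70 dB peak value, the passing threshold starts at 52.5 dB, matching the first worked example above.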
Presence of Lower Harmonics
Though the harmonic energy test described just above is often a reliable indicator of pitched sounds, there are still some corner cases of sounds which can “fool” the test. Adding one or more lower harmonics tests increases the reliability of voicing detection. Generally speaking, in order to be perceived as pitched, a sound must contain either the first harmonic for a significant time duration, the second harmonic for a significant time duration, or several harmonics (generally the third to sixth) for a significant time duration. In the example here, with overlapping frames of 25 ms duration at a 10 ms period, a “significant” time duration is 55 ms. This duration might be adjusted to 30 milliseconds or more and still satisfy the “significant time” criterion. For example, using four samples of 25 ms duration and 10 ms period, resulting in a 55 ms window, a harmonic of 30 ms centered in the 55 ms window and spanning the four frames would have a significant time duration across the four samples and would be likely to be detected. It is possible that a sound could pass the harmonic energy test above, but not meet one of these lower harmonics criteria.
The voicing module tests may include tests for some or all of the three situations. To help illustrate this, consider the following example. Again, the first through eighth harmonics are represented as rows, and time is represented as columns. In this case, a one represents that the harmonic peak was detected at all (regardless of magnitude) and a zero represents that no harmonic peak was detected.
In order to pass the test for the presence of lower harmonics in a given frame, at least one of these three tests must pass. Therefore, combining the above tests, we see that the following frames pass:
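The three alternative tests can be sketched as follows, assuming a presence matrix like the example above (harmonics as rows, frames as columns) and using a four-frame run to stand in for the 55 ms significant duration:

```python
def lower_harmonics_pass(present, streak_len=4):
    """Per-frame test for the presence of lower harmonics.

    present[h][t] is 1 if harmonic h+1 was detected at all in frame t.
    A frame passes if it lies inside a run of at least streak_len
    frames in which the first harmonic is present, or the second is
    present, or the third through sixth are all present. The four-frame
    run length is an assumption standing in for 55 ms.
    """
    n = len(present[0])
    # third through sixth harmonics must all be present simultaneously
    combined = [min(present[h][t] for h in range(2, 6)) for t in range(n)]
    rows = [present[0], present[1], combined]

    def in_streak(row):
        ok, run = [False] * n, 0
        for t in range(n):
            run = run + 1 if row[t] else 0
            if run >= streak_len:
                for u in range(t - run + 1, t + 1):
                    ok[u] = True
        return ok

    marks = [in_streak(r) for r in rows]
    return [any(m[t] for m in marks) for t in range(n)]
```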
Pitch Stability
The next test is a pitch streak test that determines if the pitch is stable enough to indicate voicing. Adding this test is useful, because certain unvoiced data might appear to be voiced when using only the other two tests. In certain cases, lower harmonics and high amounts of harmonic energy may be present, but that may only occur because the pitch is rapidly changing in such a way as to capture this energy on an individual frame level. (This would occur because the original pitch detection algorithm tries to find any pitch that could place the loudest peaks in a harmonic series.) Such unstable pitch data does not represent voicing. Thus, a test to ensure frequency stability is useful.
This test takes into account that certain segments of stable, valid pitch may occur next to segments of unstable, invalid pitches. The key question is: if there is a large pitch transition, is it better to label the earlier or later frame as unstable, or both? In practice, unstable pitch occurs both before and after voiced, stable pitch. What we found works well is to declare the pitch of a given frame to be stable if it participates in any five-frame segment whose maximum pitch difference versus time is 1.5 midi units. (This value of 1.5 is called the “midi stability” distance. Another predetermined midi stability criterion or pitch difference criterion can be chosen, as can another predetermined streak length greater or less than 50 ms.) That is, even if all frames to the left of a given frame vary wildly, the frame may be declared stable if all four frames to its right are stable, and vice versa. To help illustrate this, we provide a new example, which does not continue the examples above. Below we show the midi pitch estimate values for several frames. (We recall that to convert to midi value M from a pitch estimate H in Hz, we calculate M=12*log2(H/440)+69.)
The following frames pass this pitch streak test:
In theory, a file that contained a string of five pitch-stable frames followed by a jump to five pitch-stable frames followed by a jump to five pitch-stable frames etc. would be labeled as entirely pitch stable. However, the occurrence of five consecutive stable frames during unvoiced data is very unlikely. The occurrence of five consecutive stable frames with the other two tests passing is extremely rare.
Although the above pitch modules and this pitch stability processing perform well, there is an occasional “frequency burp” for a single frame, generally due to an anomaly in the pitch streak matching algorithm. In such cases, there may be a jump up by a large interval and then a jump back down in the next frame. If there is a jump up and then a jump back down that brings the final pitch to within twice the midi stability distance from the first pitch, then the frequency burp is forgiven, and the pitch stability test passes. We illustrate this with a new example, where we list the midi pitch estimate for each frame, and then show whether each frame would pass. We observe that the third frame shown here is “forgiven.”
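A sketch of the stability test with burp forgiveness, assuming the 1.5 midi stability distance and five-frame segments described above; reading “pitch difference versus time” as the frame-to-frame difference is our interpretation:

```python
def pitch_stable(midi, win=5, stab=1.5):
    """Per-frame pitch stability test.

    A frame is stable if it participates in any win-frame segment whose
    frame-to-frame pitch differences all stay within stab MIDI units. A
    single-frame 'frequency burp' is forgiven when the jump away and
    back leaves the next pitch within twice stab of the pitch before
    the burp; the burp frame is treated as matching its neighbor.
    """
    n = len(midi)
    m = list(midi)
    for t in range(1, n - 1):
        if (abs(midi[t] - midi[t - 1]) > stab
                and abs(midi[t + 1] - midi[t - 1]) <= 2 * stab):
            m[t] = midi[t - 1]  # forgive the burp
    stable = [False] * n
    for start in range(max(0, n - win + 1)):
        seg = m[start:start + win]
        if all(abs(seg[i + 1] - seg[i]) <= stab for i in range(win - 1)):
            for u in range(start, start + win):
                stable[u] = True
    return stable
```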
Combining Tests
At this point, the voicing module has data from three tests: sufficient harmonic energy, presence of lower harmonics, and pitch stability. In one embodiment, for a given frame to pass overall, all three tests are passed. A new example is shown here:
Post-Processing
Once the frame-level final voicing decisions have been made above, there may be some isolated frames that were in disagreement with their neighbors. Voicing may have falsely triggered for a few frames, or falsely failed to trigger for others. We have found that eliminating voicing or non-voicing streaks of certain short lengths can be beneficial in reducing such errors. Thus, the module applies a voicing streak test to eliminate any voicing streak of fewer than 5 frames (though this parameter is user selectable) by setting the vu_decision value to zero. As an example, the raw decision data and resulting post-processed data, after this step, are:
Finally, and only after the previous step, the module eliminates any non-voicing decision of fewer than 3 frames by setting the vu_decision value to one. As an example, the data input and resulting vu_decision values are:
At this point, there is a voicing decision for every frame in the file and the module has completed its task.
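The two post-processing passes, applied strictly in order, can be sketched as:

```python
def postprocess(vu, min_voiced=5, min_unvoiced=3):
    """Clean up per-frame voicing decisions.

    First, voicing streaks shorter than min_voiced frames are zeroed.
    Then, and only afterwards, non-voicing streaks shorter than
    min_unvoiced frames are set to one. Run lengths are user
    selectable, per the description above.
    """
    def runs(seq):
        out, start = [], 0
        for i in range(1, len(seq) + 1):
            if i == len(seq) or seq[i] != seq[start]:
                out.append((seq[start], start, i))  # (value, begin, end)
                start = i
        return out

    vu = list(vu)
    for val, a, b in runs(vu):
        if val == 1 and b - a < min_voiced:
            vu[a:b] = [0] * (b - a)
    for val, a, b in runs(vu):
        if val == 0 and b - a < min_unvoiced:
            vu[a:b] = [1] * (b - a)
    return vu
```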
Generalization of Voicing Detection
Voicing detection principles can be applied to a variety of methods and devices and may be embodied in articles of manufacture. A method of voicing detection can be applied to a series of frames that represent sound including voiced and unvoiced passages. One method includes processing electronically a sequence of frames that include at least one pitch estimate per frame and one or more magnitude data per frame for harmonics of the pitch estimate. Alternatively, the pitch estimate and magnitude data can be derived from more basic input. The method further includes filtering the sequence of frames for harmonic energy that exceeds a dynamically established harmonic energy threshold. This filtering includes calculating a sum per frame that combines at least some of the magnitude data in the frame. It includes dynamically establishing the harmonic energy threshold to be used to identify frames as containing voiced content and identifying frames as containing voiced content by comparing the calculated sum per frame to the dynamically established harmonic energy threshold. The method includes outputting data regarding voice content of the sequence of frames based on at least the identification of frames by the filtering for harmonic energy.
A further aspect of this method includes, prior to calculating sums per frame, excluding from the calculation any magnitude data that have a frequency outside a predetermined frequency band. This exclusion may either be in a choice of data delivered to the method or as a component of the method. The sum per frame may be calculated using the magnitude squared from the input magnitude data.
As part of voicing detection, a smoother may be applied to the calculated sums per frame.
The dynamically established harmonic threshold may be adjusted if the rate at which frames are identified as containing voiced content exceeds a predetermined pass rate. If too many frames pass, the harmonic energy threshold may be adjusted to make it more difficult to pass. After adjusting the harmonic energy threshold, the method proceeds with reevaluating the calculated sums per frame against the adjusted dynamically established harmonic energy threshold.
A further aspect of this method may include dynamically establishing the harmonic energy threshold by estimating a frequency distribution of occurrences at particular magnitudes across magnitudes present in the magnitude data. The dynamically establishing continues with identifying from the frequency distribution a maximum peak with the highest frequency of occurrence and other peaks in the frequency of occurrence. It further includes identifying a threshold peak by qualifying the maximum peak and those other peaks that are at least a predetermined height in relation to the maximum peak, choosing the qualified peak of the greatest harmonic magnitude, and setting the harmonic energy threshold to the magnitude of the chosen peak. This dynamically established harmonic energy threshold is then used in detection of voicing.
Some corner cases of frames with substantial harmonic energy that do not represent voiced content can be addressed by evaluating the presence of one or two base harmonics and the presence of extended harmonics outside the base harmonic range. In one embodiment, the method further includes identifying frames belonging to streaks of successive frames with base harmonics present. The base harmonics are those within a predetermined range of the pitch estimate, for instance, the first and second harmonics. The presence of these harmonics in the frame is based on evaluating magnitude data for the harmonics. Streaks include successive frames in which the base harmonics are determined to be present. Successive frames are not considered to constitute a streak unless the base harmonics are present for a number of frames that equals or exceeds a predetermined base harmonics streak length. The method continues with outputting data regarding voiced content of the sequence of frames based on at least the identified frames that pass the filtering for harmonic energy and that belong to streaks that exceed the predetermined base harmonics streak length. This method for filtering out corner cases can stand on its own or be applied in conjunction with filtering for harmonic energy, qualified filtering for harmonic energy and/or adjusting the dynamically established harmonic energy threshold if a pass rate is exceeded.
Additional corner cases can be addressed by evaluating the presence of extended harmonics, above the one or two base harmonics. The presence of one or more base harmonics or the presence of a plurality of extended harmonics can be used to determine whether the total harmonic energy in a frame is really indicative of voiced content.
Additional corner cases can be addressed by evaluating the streaks of successive frames for pitch stability, because voiced content tends to have greater stability than random noise. For instance, a gradually rising or descending passage of singing or humming will not have the random rises and falls of background noise. This method of evaluating streaks of successive frames for pitch stability includes determining whether pitch estimates for frames in subsequences of a predetermined sequence length exhibit pitch stability among the frames within a predetermined pitch stability range, and retaining those frames exhibiting pitch stability as having voiced content. Optionally, a single frame deviation beyond the pitch stability range may be determined to be forgivable based on pitch stability when it is a single frame in a streak that includes both frames before the single frame and frames after the single frame. The pitch stability range criteria may be relaxed so that the single frame is forgiven if the immediately following frame is within a multiple of the predetermined pitch stability range. The method further includes outputting data that reflects the evaluation of pitch stability.
Determination of pitch stability can be further enhanced by filtering out of the identified frames any frames belonging to short streaks of identified frames that are not part of a streak of successive frames that equals or exceeds a predetermined identified frame streak length.
Corresponding to the methods are a variety of electronic signal processing components.
One voicing detection component includes an input port, first filtering component and an output port. The input port is adapted to receive electronically a stream of data frames including at least one pitch estimate per frame. Alternatively, the input port can receive more basic data and derive the pitch estimate from the basic data. The first filtering component calculates a sum per frame, dynamically establishes a harmonic energy threshold, and identifies frames containing voiced content by comparing the calculated sum with the threshold. The calculated sum represents a combination of at least some of the magnitude data in the frame. The harmonic energy threshold is dynamically established based on harmonic energy distribution across a plurality of identified frames. Frames are identified as containing voiced content by comparing the calculated sum per frame to the dynamically established harmonic energy threshold. The output port coupled to the filtering component is operative to output data regarding voiced content of the sequence of frames based on at least the identification of frames by the first filtering component.
In combination with the first filtering component, a second filtering component, operative before the first filtering component, can be applied to exclude from the sum per frame any magnitude data that have a frequency outside a predetermined frequency band. This reduces the impact of the higher order harmonics of a fundamental frequency. A smoother component optionally may operate on the calculated sums per frame, coupled between the first filtering component and the output port.
As a further aspect, the first filter component may include logic to readjust the dynamically established harmonic energy threshold. This logic may determine if a rate at which frames are identified as containing voiced content exceeds a predetermined pass rate. It adjusts the harmonic energy threshold to reduce the rate at which frames pass and repeats the comparison of the calculated sum per frame against an adjusted dynamically established harmonic energy threshold.
In some embodiments, the first filtering component includes logic to dynamically establish the harmonic energy threshold. This logic estimates a frequency distribution, identifies peaks in the frequency distribution, selects one peak as the maximum peak, and qualifies others as being of a predetermined relation in height to the maximum peak. It chooses the peak of the greatest harmonic magnitude from among the qualified peaks and sets the harmonic energy threshold at the magnitude of the chosen peak.
One or more streak identification subcomponents may be coupled to the first filtering component. One subcomponent identifies frames belonging to streaks of successive frames with base harmonics present. Another subcomponent that optionally may be combined with identifying streaks of successive frames with base harmonics present tests, alternatively, for the presence of extended harmonics. The logic of the streak identification subcomponents follows the methods described above. Each of the streak identification subcomponents sends data regarding voiced content of the sequence of frames to the output port. This data is based on at least the identified frames that pass both the filtering for harmonic energy and one of the streak length criteria.
Signal processing components for detecting voicing also may be described in terms of means and functions, relying on the more detailed technical descriptions given above. A further embodiment of a signal processing component may include an input port adapted to receive a stream of data frames including at least one pitch estimate per frame. It further includes first filtering means for identifying frames as containing voiced content by comparing a calculated sum per frame to a dynamically established harmonic energy threshold. It includes an output port coupled to the first filtering means that outputs data regarding voiced content of the sequence of frames based on at least the identification of frames by the first filtering means.
This means plus function component further may include base harmonic presence means and, optionally, extended harmonic presence means. These harmonic means are coupled to the filtering means and the output port and, when both are present, work cooperatively. These means identify frames belonging to streaks of successive frames with base or extended harmonics present, respectively. Base harmonics are those harmonics within a predetermined range of the pitch estimate. Extended harmonics are outside the predetermined range. The presence means identify streaks having harmonics present in successive frames that equal or exceed a predetermined harmonics streak length.
This component further may include a pitch stability means and/or a short streak exclusion means, both coupled before the output port. The pitch stability means is for evaluating the streaks of successive frames for pitch stability and passing the results of the evaluation to the output port. The short streak exclusion means is for filtering out of the identified frames any frames belonging to short streaks, that is, frames that are not part of a streak of successive identified frames that equals or exceeds a predetermined identified frame streak length.
As with other methods disclosed, we disclose a computer readable storage medium including program instructions for carrying out voicing detection methods. Variations on this computer readable storage medium are as described above in the context of pitch detection and as further described below.
Module to Retain Only Voiced Harmonic Peaks
At this point, the system has data indicating which frames are judged to be voiced and which are judged to be non-voiced. It also has data indicating the magnitudes of harmonics for each frame, whether those frames are voiced or not. This module sets to zero all harmonic magnitude data that occurs in frames judged to be unvoiced.
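The operation of this module is simple enough to sketch directly. In this minimal NumPy illustration, the array names and shapes are our own assumptions (frames as rows, harmonic magnitudes as columns, with a boolean voicing decision per frame):

```python
import numpy as np

# Hypothetical data: harmonic magnitudes per frame (rows) and a
# per-frame voicing decision produced by the voicing detection modules.
harmonic_mags = np.array([[0.9, 0.4, 0.2],
                          [0.8, 0.3, 0.1],
                          [0.2, 0.1, 0.0]])
voiced = np.array([True, False, True])

# Set to zero all harmonic magnitude data in frames judged to be unvoiced.
harmonic_mags[~voiced] = 0.0
```

After this step, only frames judged to be voiced retain nonzero harmonic magnitude data.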
Module to Detect Vibrato
In many music transcription systems, the ultimate goal is to obtain a musical representation that could be played on an instrument such as a piano, where there are discrete pitches. A piano has keys that represent twelve pitches per octave, while the human voice can sound any pitch continuously within an octave. This leads to a number of challenges in automatic music transcription, including how to deal with vibrato.
Vibrato is a performance technique where a singer or instrumentalist uses oscillations of pitch (and sometimes amplitude) on one or more notes, usually for musical expression. Musicians understand these oscillations to be ornamentation or stylization of what should be transcribed or notated as a single pitch. (This single pitch need not be on a Western twelve tone scale such as on a piano; the key point is that there is a single pitch.) An automatic transcription system that is unaware of vibrato, however, might transcribe these oscillations as alternations between pitches, sometimes referred to as trills. Musically, vibrato and trills are very different. Therefore, a system that confuses these two phenomena should be considered erratic. Conversely, a system that can detect vibrato accurately, and use this information to indicate single pitches where appropriate, should be considered accurate.
The goal of this module is to accurately detect vibrato. This way, single pitches containing vibrato will be transcribed as single pitches, not as alternating pitches. A second benefit comes in the area of syllable detection. Because vibrato also tends to include amplitude variation, we can prevent spurious syllable detection by knowing when vibrato occurs. Just as the final transcription will know not to trigger new pitches during a single pitch vibrato segment, it will also know not to trigger new syllables there.
Frame-Level Detector Sub-Module
The means for frame-level vibrato detection uses as input, or derives from more basic inputs, the following:
It outputs: a frame-level binary indicator of whether vibrato was detected in the region beginning on a given frame.
To help illustrate the three features mentioned here,
We find that taking the FFT (fast Fourier transform) of the pitch trajectory over a local time window is a good tool for estimating all three of these quantities. Because the FFT may be thought of as a sinusoidal oscillation detector, it is a good tool to use to detect vibrato. Below, we will employ the FFT, along with a set of requirements on the data found in the FFT, to detect vibrato.
Before moving on, we make two notes. First, we deliberately avoid use of the word “frequency” when discussing vibrato, because this word has a triple meaning: it could refer to pulses per second, pitch depth, or pitch. Second, though there are magnitude-based cues to vibrato that often co-occur with pitch oscillations, we found pitch to be much more reliable.
Experiments Informing This Sub-Module
We conducted semi-formal psychoacoustical experiments to determine (based on one trained listener) what combination of pulses per second and pitch depth would be perceived as vibrato. Though other researchers have claimed that vibrato occurs within fixed, independent ranges of these variables, we found that this is not a sufficient model to predict when a human listener would hear vibrato. Our experiments have shown that the ranges of pitch depth and pulses per second must be considered together to model an envelope of what users perceive as vibrato, as illustrated in
It can be observed that in general, faster vibrato (more pulses per second) allows for larger pitch depth, though this trend reverses when vibrato exceeds about 6.5 pulses per second. When an oscillation of greater pitch depth than the maximum occurs with the given pulses per second, it tends to sound like a trill or a rapidly repeated note rather than vibrato. Hence, the maximum pitch depth may be thought of as a ceiling. There is also a floor. If there is slight oscillation in pitch, but that oscillation has very little pitch depth, then the pitch will be perceived as stable and lacking in vibrato. In the sub-module, there are measures described below to prevent vibrato from triggering in such situations.
Method of Sub-Module
We now return to a discussion of the sub-module's procedures. The sub-module begins with an optional smoothing of the input pitch track 281. We used a 90 millisecond (or nine frame) Hamming window to smooth, though other smoothers (such as a Kaiser or Blackman window, using a length of 50 ms to 130 ms) may be used. It is generally well-advised to apply the smoother. The frame-level pitch data is prone to errors introduced by FFT bias, so smoothing can produce more accurate pitch estimates. These improved pitch estimates are less likely to contain outlier values that spoil the sinusoidal nature of the pitch trajectory that indicates vibrato.
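As a sketch of this optional smoothing step, assuming 10 millisecond frames so that nine frames span 90 milliseconds (the function name is ours):

```python
import numpy as np

def smooth_pitch(track, win_len=9):
    """Smooth a frame-level pitch track with a normalized Hamming window.

    win_len=9 frames corresponds to 90 ms at 10 ms per frame; a Kaiser or
    Blackman window spanning 50 ms to 130 ms could be substituted.
    """
    w = np.hamming(win_len)
    w /= w.sum()
    # mode="same" preserves the track length; values near the edges are
    # attenuated because the window extends past the available data there.
    return np.convolve(track, w, mode="same")
```

Because the window weights are normalized, interior values of a constant pitch track pass through unchanged, while frame-level outliers are averaged down.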
The next major step in the sub-module 282 is to take an FFT of segments of the pitch trajectory. In a sense, it is like taking a spectrogram, except that the input data is the pitch trajectory rather than the raw audio data. As mentioned above, the goal here is to use the FFT to obtain estimates of the pulses per second, pitch depth, and sinusoidal nature of any potential vibrato. We may think of the FFT as a sinusoid detector where sinusoids of frequency F and peak-to-peak magnitude M show up as FFT peaks of height M at frequency index F. Therefore, a pitch trajectory with P pulses per second and pitch depth D should show up in an FFT of the pitch trajectory as peaks of height D at frequency index P. To help illustrate this, in
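The FFT-as-sinusoid-detector idea can be checked on synthetic data. The normalization below is our own: for a rectangular window of n frames, a sinusoid of peak-to-peak depth M produces a raw FFT peak of magnitude n*M/4, so scaling by 4/n lets the peak height be read directly as pitch depth.

```python
import numpy as np

fps = 100.0                      # frames per second (10 ms frames)
n = 32                           # vibrato window length in frames
t = np.arange(n) / fps

pulses, depth = 6.25, 1.0        # 6.25 pulses/sec falls on a bin for n=32
pitch = 60.0 + (depth / 2) * np.sin(2 * np.pi * pulses * t)

# Subtract the median so the tone itself does not dominate bin 0, then
# scale so a sinusoid of peak-to-peak depth D reads as a peak of height D.
spec = np.abs(np.fft.rfft(pitch - np.median(pitch))) * 4.0 / n

k = np.argmax(spec[1:]) + 1      # strongest non-DC bin
est_pulses = k * fps / n         # pulses per second at that bin: 6.25
est_depth = spec[k]              # pitch depth read off the peak height: ~1.0
```

As expected, a pitch trajectory with P pulses per second and pitch depth D shows up as a peak of height D at frequency index P.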
The sub-module considers a hopped vibrato window of length <<vibrato window>>, which we chose as 32 frames or 320 ms. That is, it takes a buffer of the input pitch track values of length <<vibrato window>> starting at frame one, then considers another such buffer starting <<vibrato hop>> frames ahead, which we chose as 2 frames or 20 ms later. There is some latitude in choosing the hop count, e.g., from 10 ms to 30 ms, or 1 to 3 frames of 10 ms spacing. The <<vibrato window>> value represents a tradeoff between (1) choosing a window long enough to catch sufficient cycles of slow vibrato and (2) choosing a window short enough to catch vibrato that occurs for only a brief amount of time or during a short note. For each vibrato window the sub-module does the following:
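The hopped buffering just described might be sketched as a generator (the function name is ours; the 32-frame window and 2-frame hop match the values chosen above):

```python
def vibrato_windows(pitch_track, win=32, hop=2):
    """Yield (start_frame, buffer) pairs over a hopped vibrato window.

    win=32 frames is 320 ms and hop=2 frames is 20 ms at 10 ms per frame;
    a hop of 1 to 3 frames also falls within the latitude described.
    """
    for start in range(0, len(pitch_track) - win + 1, hop):
        yield start, pitch_track[start:start + win]
```

Each yielded buffer is then analyzed independently by the steps that follow.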
We note that in order for a peak to correspond to data whose pitch depth falls below the maximum, this ratio must be at least 1.0.
In typical vibrato, this quantity will be high, because there should be only one FFT peak if the oscillations are truly sinusoidal, as most vibrato is. Non-sinusoidal oscillations such as square waves or triangle waves tend not to sound like vibrato, and are also characterized by additional significant peaks. Therefore, their top_peak_ratio value will be low. This is the test of sinusoidal nature that we alluded to above.
This quantity will be useful in post-processing, so it is also stored in an array, with the swing_covered_by_peak value stored in the frame corresponding to the first frame of the vibrato window. Ideally, for cases of vibrato, a large amount of the swing is explained by (covered by) the peak. This is because the peak represents the amount of constant-pulses per second sinusoidal energy in the signal.
This gives a representation of the detected vibrato size relative to any non-vibrato pitch change in the vibrato window. For most situations in which there are oscillations too small to be perceived as vibrato, there is either no peak detected above, or the top_zero_freq_ratio value is very small. This is how we meet the goal of preventing vibrato from falsely triggering.
Now, we have data for several variables that together can indicate the presence or absence of vibrato. There are two ways to pass the vibrato detection test. The first is for all of the following conditions to be met:
the ceiling_peak_ratio is at least 1.0
the ceiling_swing_ratio is at least 1.0
the swing_covered_by_peak is at least 0.6
the top_peak_ratio is at least 1.20
the top_zero_freq_ratio is at least 1.38, and
the zero_freq_val is less than 1.60.
The second way to pass the test is for the following alternate conditions to be met:
the ceiling_peak_ratio is at least 1.0
the swing_covered_by_peak is at least 0.5
the top_peak_ratio is at least 1.20*1.6
the top_zero_freq_ratio is at least 1.38*1.6, and
the zero_freq_val is less than 1.60*1.6.
In all cases, values for these parameters are user-selectable, though the given values worked well.
Vibrato Post-Processor Sub-Module
At this point, we have three outputs from the vibrato frame detection sub-module that will be used by the vibrato post-processor sub-module. These are:
Now, the vibrato post-processor sub-module 284 will use these data to determine which frames in the audio file contain vibrato. To do so, it will complete a few tasks. Specifically, we recall that each vibrato window is some number of frames long, for example 32 frames. The sub-module must determine how many of the 32 frames in any given vibrato window should be declared vibrato. Also, it must deal with ambiguous situations where there are vibrato windows with vibrato detected near or overlapping vibrato windows without vibrato detected.
The sub-module 284 performs the following steps:
We see three vibrato islands, of lengths 8, 3, and 5.
As an example, let's consider the first island (of length 8) in the example just above, and assume we are using 32 frame vibrato analysis windows. We first look at the swing_covered_by_peak value for the island's leftmost frame (which is the second frame in the total sequence shown) and say we find this value is 0.9. In that case, we would include a vibrato indicator for 14 of the 16 frames on the left of the first vibrato window in the island, recalling that the 16 frames in question are the second through seventeenth frames in the example. In other words, only the first two frames of the vibrato window would not receive new flags. Now, we also check the swing_covered_by_peak value for the island's rightmost frame (which is the ninth frame in the total sequence shown) and say we find this value is 0.6. In that case we would include a vibrato indicator for 8 of the 16 frames on the right of the last vibrato window. We recall that this time, the 16 frames in question begin on the twenty-fifth frame of the example, because that is the seventeenth frame of the last vibrato window in the first island. At the end of processing the first island, this is how the new vibrato flags appear:
We see that there had been 9 non-vibrato frames. Say that after processing the islands, two of the islands were merged as follows (note the appearance of flags further to the right, which reminds us of the change in indexing convention during post-processing):
Now, we have overwritten a streak of nine zeros with ones. Therefore, we set the middle third to zero, noting that their location is shifted to the right by 16 frames (the center of the vibrato windows):
We see that we have prevented the merging of the islands.
At this point, we have binary flags for each frame in the file indicating whether or not vibrato has been detected in that frame.
Generalization of Vibrato Detection
The disclosed principles of vibrato detection can be applied to a variety of methods, devices and articles of manufacture. One method of vibrato detection is applied to a series of detected pitches, beginning with processing electronically a sequence of the detected pitches for frames and estimating a rate and pitch depth of oscillations in the sequence. It includes comparing the estimated rate and pitch depth to a predetermined vibrato detection envelope and determining whether the sequence of detected pitches would be perceived as vibrato. The method repeats the processing and comparing actions to determine vibrato perception of successive sequences of frames. It outputs data regarding whether the successive sequences would be perceived as vibrato.
In a further aspect, the estimating of the rate and pitch depth of oscillations may include applying a zero-padded FFT to the sequence of detected pitches, producing an FFT output including raw rate and pitch depth data for oscillations in the sequence. From the raw rate and pitch depth data, this further aspect includes interpolating the rate and pitch depth of oscillations centered on at least one peak in the FFT output to produce the estimated rate and pitch depth. One useful interpolation is a quadratic interpolation.
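A sketch of this aspect follows, under our own assumptions about frame rate (100 frames per second), zero-padding factor, and normalization, using a standard parabolic (quadratic) fit around the largest peak bin:

```python
import numpy as np

def estimate_rate_and_depth(pitches, fps=100.0, zero_pad=4):
    """Estimate oscillation rate (pulses/sec) and peak-to-peak pitch depth.

    Applies a zero-padded FFT to the median-subtracted pitch sequence,
    then quadratically interpolates rate and depth around the largest
    non-DC peak in the FFT output.
    """
    x = np.asarray(pitches, dtype=float)
    x = x - np.median(x)
    n = len(x)
    m = n * zero_pad                      # zero-padded FFT length
    # Scale so a sinusoid of peak-to-peak depth D peaks at height D.
    spec = np.abs(np.fft.rfft(x, m)) * 4.0 / n
    k = int(np.argmax(spec[1:-1]) + 1)    # largest interior, non-DC bin
    a, b, c = spec[k - 1], spec[k], spec[k + 1]
    denom = a - 2 * b + c
    offset = 0.0 if denom == 0 else 0.5 * (a - c) / denom
    rate = (k + offset) * fps / m         # interpolated pulses per second
    depth = b - 0.25 * (a - c) * offset   # interpolated peak height
    return rate, depth
```

For a 32-frame track oscillating 5.5 times per second with depth 1.0, the estimates land near 5.5 and 1.0 even though 5.5 pulses per second falls between raw FFT bins, which is the point of the zero padding and interpolation.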
A useful method that can operate on its own or in connection with the methods described above is a method of excluding from vibrato perception those sequences of frames in which a plurality of peaks in the FFT output indicate a waveform that would not be perceived as vibrato. For instance, square waves, triangle waves, and sawtooth waves are not perceived as vibrato. Variations on sinusoidal waves are perceived as vibrato.
Also useful as a method of its own or in combination with the methods above is a method of excluding from vibrato perception those sequences of frames in which a DC component in the FFT output indicates a new tone that would not be perceived as vibrato. When a median magnitude is subtracted from the sequence of detected pitches before applying an FFT analysis, a pitch transition from one tone to the next appears in the FFT output as a strong DC component. A clean pitch transition, of course, is perceived as a new tone, rather than as vibrato.
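This exclusion can be illustrated on synthetic data. The normalization and helper name below are our own assumptions:

```python
import numpy as np

def dc_component(pitches):
    """Magnitude of the DC term after median subtraction, scaled per frame."""
    x = np.asarray(pitches, dtype=float)
    x = x - np.median(x)
    return abs(np.fft.rfft(x)[0]) * 2.0 / len(x)

# A clean transition between two tones held for unequal durations leaves
# a strong DC component even after the median is removed ...
step = np.concatenate([np.full(24, 60.0), np.full(8, 62.0)])

# ... while a symmetric vibrato oscillation around a single pitch does not.
t = np.arange(32) / 100.0
vib = 60.0 + 0.5 * np.sin(2 * np.pi * 6.25 * t)
```

A threshold on this DC magnitude can then separate clean pitch transitions, which should be reported as new tones, from true vibrato.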
Methods for detecting vibrato in frames that represent an audio signal also can be practiced in electronic signal processing components. One component embodiment includes an input port adapted to receive a stream of data frames including detected pitches. It further includes an FFT processor coupled to the input that processes sequences of data frames in the stream and estimates rate and pitch depth of oscillations in pitch. A comparison processor receives output of the FFT processor and compares those estimates of rate and pitch depth of oscillations in pitch to an envelope of combinations of rates and pitch depths of oscillations that would be perceived by listeners as vibrato. An output port is coupled to the comparison processor to output results of the comparisons.
The FFT processor and/or the comparison processor of the signal processing component can be implemented using a digital signal processor or software running on a general-purpose central processing unit (CPU). Alternatively, it can be implemented using a gate array or a custom circuit.
In the signal processing component described above, the FFT processor may apply a zero padded FFT to the detected pitches in the sequence of frames and the comparison processor may interpolate the rate and pitch depth of oscillations centered on at least one peak in output from the FFT processor to produce the estimates of rate and pitch depth of oscillations.
A first exclusion filter may be useful by itself or in combination with the components described above. This first exclusion filter processes an FFT analysis of pitch data and senses when a plurality of peaks in the estimates indicate a non-sinusoidal waveform that would not be perceived as vibrato. It excludes the corresponding sequence of frames from being reported as containing vibrato.
A second exclusion filter also may be useful by itself or in combination with the components described above. This exclusion filter processes an FFT analysis of pitch data and senses when a DC component in the estimates indicates a new tone that would not be perceived as vibrato. A DC component is present when a normalizing component subtracts from the sequence of detected pitches a median magnitude value before the sequence is processed by the FFT processor. Then, a clean pitch transition appears in the FFT output as a DC component. The DC component indicates a new tone that would not be perceived as vibrato and the second exclusion filter excludes the corresponding sequence of frames from being reported as containing vibrato.
Electronic signal processing components also may be described in means plus function terms. One electronic signal processing embodiment includes an input port adapted to receive a stream of data frames including detected pitches, and FFT means, comparison means and an output port. The FFT means is coupled to the input port. It is for processing the sequences of data frames in the stream and for estimating rate and pitch depth of oscillations in pitch. The comparison means is for evaluating whether estimated rates and pitch depths would be perceived as vibrato, based on comparison to data representing a psychoacoustic envelope of perceived vibrato. The output port is coupled to the comparison means and reports results.
First and/or second exclusion means may be used in combination with the vibrato detection component. The first exclusion means is for detecting and excluding sequences in which the FFT output includes a plurality of peaks that indicate a non-sinusoidal waveform that would not be perceived as vibrato. The second exclusion means is for detecting and excluding sequences in which a DC component of the FFT output indicates a new tone that would not be perceived as vibrato.
Once again, a computer readable storage medium can be configured with program instructions for carrying out or configuring a device to practice the methods described above.
The methods disclosed can be embodied in articles of manufacture. These articles of manufacture are computer readable storage media, such as rotating or nonvolatile memory. Rotating memory includes magnetic and optical disk drives and drums. Nonvolatile memory may be programmable or flash memory, magnetic or phase change memory or any number of variations on memory media described in the literature. The articles of manufacture also may be volatile memory manufactured by loading compiled or uncompiled computer program instructions from another place into memory. Volatile memory may be working memory or a battery backed quasi non-volatile storage medium.
Embodied in articles of manufacture, program instructions can either be loaded onto processors that carry out the methods described or combined with hardware to manufacture the hardware and software combination that practices the methods disclosed. Program instructions can be loaded onto a variety of processors that have associated memory, such as general-purpose CPUs, DSPs or programmable gate arrays. Program instructions can be used to configure devices to carry out the methods, such as programmable logic arrays or custom processor designs.
While we have disclosed technology by reference to preferred embodiments and examples detailed above, it is understood that these examples are intended in an illustrative rather than limiting sense. Processing by computing devices is implicated in the embodiments described. Accordingly, the technology disclosed can be embodied in methods for pitch selection, voicing detection and vibrato detection, systems including logic and resources to carry out pitch selection, voicing detection and vibrato detection, systems that take advantage of pitch selection, voicing detection and vibrato detection, media impressed with program instructions to carry out pitch selection, voicing detection and vibrato detection, or computer accessible services to carry out pitch selection, voicing detection and vibrato detection. Devices that practice the technology disclosed can be manufactured by combining program instructions with hardware, for instance, by transmitting program instructions to a computing device as described or that carries out a method as described.
Number | Name | Date | Kind |
---|---|---|---|
5536902 | Serra et al. | Jul 1996 | A |
5563358 | Zimmerman | Oct 1996 | A |
5621182 | Matsumoto | Apr 1997 | A |
5973252 | Hildebrand | Oct 1999 | A |
6775650 | Lockwood et al. | Aug 2004 | B1 |
6990453 | Wang et al. | Jan 2006 | B2 |
7592533 | Lee | Sep 2009 | B1 |
7606709 | Yoshioka et al. | Oct 2009 | B2 |
7627477 | Wang et al. | Dec 2009 | B2 |
7825321 | Bloom et al. | Nov 2010 | B2 |
7945446 | Kemmochi et al. | May 2011 | B2 |
8022286 | Neubacker | Sep 2011 | B2 |
20020105359 | Shimizu et al. | Aug 2002 | A1 |
20030009336 | Kenmochi et al. | Jan 2003 | A1 |
20060150805 | Pang | Jul 2006 | A1 |
20070288232 | Kim | Dec 2007 | A1 |
Number | Date | Country |
---|---|---|
2004-102146 | Apr 2004 | JP |
10-2003-0062369 | Jul 2003 | KR |
Entry |
---|
Hee-Suk Pang and Doe-Hyun Yoon, Automatic detection of vibrato in monophonic music, Pattern Recognition, vol. 38, Issue 7, Jul. 2005, pp. 1135-1138. |
Mayor et al., “The Singing Tutor: Expression Categorization and Segmentation of the Singing Voice,” Audio Engineering Society Convention Paper, Oct. 5-8, 2006, San Francisco, CA, 8 pages. |
Nakano et al., “An Automatic Singing Skill Evaluation Method for Unknown Melodies Using Pitch Interval Accuracy and Vibrato Features,” Interspeech, 1706-1709 (2006). |
International Search Report issued in corresponding PCT Application No. PCT/US2008/082256 on Apr. 28, 2009. |
Haus, G., Pollastri, E., “An audio front end for query-by-humming systems”, published in International Symposium on Music Information Retrieval conference proceedings, http://ismir2001.indiana.edu/pdf/haus.pdf, 2001, 8 pages. |
Collins, N., “Using a Pitch Detector for Onset Detection”, http://www.cogs.susx.ac.uk/users/nc81/research/pitchdetectforonset.pdf , Proceedings of 6th Int. Conference on Music Information Retrieval, pp. 100-106, London, UK, 2005. |
Qi, Y., Hunt, B.R., “Voiced-unvoiced-silence classifications of speech using hybrid features and a network classifier”, IEEE Transactions on Speech and Audio Processing, vol. 1, Issue 2, Apr. 1993, pp. 250-255. |
Herrera P., Bonada, J, “Vibrato Extraction and Parameterization in the Spectral Modeling Synthesis Framework”, Proceedings of the Digital Audio Effects Workshop (DAFX98), 1998, 4 pages. |
Rossignol S., Depalle P., Soumagne J., Rodet X., Collette, J. -I., “Vibrato: Detection, estimation, extraction and modification”, In Proceedings of Digital Audio Effects Workshop (DAFx), http://echo.gaps.ssr.upm.es/costg6/bibliography/proceedings/rossignol.pdf, 1999, 4 pages. |
Master, A., “Stereo Music Source Separation via Bayesian Modeling”, Jun. 2006, dissertation, 199 pages. |
Number | Date | Country | |
---|---|---|---|
20090125298 A1 | May 2009 | US |
Number | Date | Country | |
---|---|---|---|
60985181 | Nov 2007 | US |