The invention relates to tracking sound pitch across an audio signal through analysis of audio information that tracks harmonic envelope as well as pitch, and leverages a representation of harmonic envelope in vector form along with pitch to track the pitch of individual sounds.
Systems and techniques for tracking sound pitch across an audio signal are known. Known techniques implement a transform to transform the audio signal into the frequency domain (e.g., Fourier Transform, Fast Fourier Transform, Short Time Fourier Transform, and/or other transforms) for individual time sample windows, and then attempt to identify pitch within the individual time sample windows by identifying spikes in energy at harmonic frequencies. These techniques assume pitch to be static within the individual time sample windows. As such, these techniques fail to account for the dynamic nature of pitch within the individual time sample windows, and may be inaccurate, imprecise, and/or costly from a processing and/or storage perspective.
One aspect of the disclosure relates to a system and method configured to analyze audio information derived from an audio signal. The system and method may track sound pitch across the audio signal. The tracking of pitch across the audio signal may take into account change in pitch by determining at individual time sample windows in the signal duration an estimated pitch and a representation of harmonic envelope at the estimated pitch. The estimated pitch and the representation of harmonic envelope may then be implemented to determine an estimated pitch for another time sample window in the signal duration with an enhanced accuracy and/or precision.
In some implementations, a system configured to analyze audio information may include one or more processors configured to execute computer program modules. The computer program modules may include one or more of an audio information module, a processing window module, a primary window module, a pitch estimation module, an envelope vector module, an envelope correlation module, a weighting module, an estimated pitch aggregation module, a voiced section module, and/or other modules.
The audio information module may be configured to obtain audio information derived from an audio signal representing one or more sounds over a signal duration. The audio information may correspond to the audio signal during a set of discrete time sample windows. The audio information may specify a magnitude of an intensity coefficient related to an intensity of the audio signal as a function of frequency and/or fractional chirp rate during the individual time sample windows. The audio information may specify, as a function of pitch and fractional chirp rate, a pitch likelihood metric for the individual time sample windows. The pitch likelihood metric for a given pitch and a given fractional chirp rate in a given time sample window may indicate the likelihood that a sound represented by the audio signal had the given pitch and the given fractional chirp rate during the given time sample window.
The audio information module may be configured such that the audio information includes transformed audio information. The transformed audio information for a time sample window may specify magnitude of a coefficient related to signal intensity as a function of frequency for an audio signal within the time sample window. In some implementations, the transformed audio information for the time sample window may include a plurality of sets of transformed audio information. The individual sets of transformed audio information may correspond to different fractional chirp rates. Obtaining the transformed audio information may include transforming the audio signal, receiving the transformed audio information in a communications transmission, accessing stored transformed audio information, and/or other techniques for obtaining information.
The processing window module may be configured to define one or more processing time windows within the signal duration. An individual processing time window may include a plurality of time sample windows. The processing time windows may include a plurality of overlapping processing time windows that span some or all of the signal duration. For example, the processing window module may be configured to define the processing time windows by incrementing the boundaries of the processing time window over the span of the signal duration. The processing time windows may correspond to portions of the signal duration during which the audio signal represents voiced sounds.
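By way of non-limiting illustration, defining overlapping processing time windows by incrementing the window boundaries might be sketched as follows. The function name `define_processing_windows`, its parameters, and the representation of a processing time window as a pair of time sample window indices are assumptions made for this sketch and are not taken from the disclosure.

```python
def define_processing_windows(num_sample_windows, window_length, increment=1):
    """Return (start, end) index pairs (end exclusive) of processing time windows.

    Each processing time window spans `window_length` time sample windows, and
    its boundaries are advanced by `increment` time sample windows at a time.
    All names here are hypothetical.
    """
    if window_length <= 0 or increment <= 0:
        raise ValueError("window_length and increment must be positive")
    windows = []
    start = 0
    while start + window_length <= num_sample_windows:
        windows.append((start, start + window_length))
        start += increment
    return windows


# Six time sample windows, processing windows of four, incremented by one.
print(define_processing_windows(num_sample_windows=6, window_length=4, increment=1))
```

With an increment smaller than the window length, most time sample windows fall inside more than one processing time window, which is the situation in which several estimated pitches may later be aggregated.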
The primary window module may be configured to identify, for a processing time window, a primary time sample window within the processing time window. This primary time sample window may become the starting point from which pitch may be tracked forward and/or backward with respect to time through the processing time window.
The pitch estimation module may be configured to determine, for the individual time sample windows in the processing time window, estimated pitch and estimated fractional chirp rate. For the primary time sample window, this may be performed by determining the estimated pitch and the estimated fractional chirp rate randomly, through an analysis of the pitch likelihood metric, by rule, by user selection, and/or based on other criteria. For other time sample windows in the processing time window, the pitch estimation module may be configured to determine estimated pitch and estimated fractional chirp rate by iterating through the processing time window from the primary time sample window and determining the estimated pitch and/or estimated fractional chirp rate for a given time sample window based on (i) the pitch likelihood metric specified by the transformed audio information for the given time sample window, and (ii) a correlation between harmonic envelope at different pitches in the given time sample window and the harmonic envelope at an estimated pitch for a time sample window adjacent to the given time sample window.
To facilitate the determination of an estimated pitch and/or estimated fractional chirp rate for a first time sample window between the primary time sample window and a boundary of the processing time window, the envelope vector module may be configured to determine envelope vectors for sound in the first time sample window as a function of pitch and/or fractional chirp rate. The envelope vector module may be configured to determine the envelope vector for a given pitch and/or fractional chirp rate in the first time sample window based on the values for the intensity coefficient at harmonic frequencies of the given pitch in the first time sample window. For example, the coordinates of the envelope vector for the given pitch and/or fractional chirp rate may be the values for the intensity coefficient at the first n harmonic frequencies (or some other set of harmonic frequencies).
The envelope correlation module may be configured to obtain an envelope vector for a sound represented by the audio signal during a second time sample window. The envelope vector may be for an estimated pitch and/or estimated fractional chirp rate of the second time sample window. The envelope correlation module may be configured to determine, for the first time sample window, values of a correlation metric as a function of pitch from the envelope vectors determined by the envelope vector module for the first time sample window and the obtained envelope vector for the second time sample window. The value of the correlation metric for a given pitch and/or fractional chirp rate in the first time sample window may indicate a level of correlation between the obtained envelope vector for the second time sample window and the envelope vector for the given pitch and/or fractional chirp rate in the first time sample window.
The weighting module may be configured to weight the pitch likelihood metric for the first time sample window. This weighting may be based on one or more of a predicted pitch for the first time sample window, the values for the correlation metric in the first time sample window, and/or other weighting parameters.
The weighting performed by the weighting module may apply relatively larger weights to the pitch likelihood metric at pitches and/or fractional chirp rates having correlation metric values in the first time sample window that indicate relatively high correlation with the envelope vector for the second time sample window. The weighting may apply relatively smaller weights to the pitch likelihood metric at pitches and/or fractional chirp rates having correlation metric values in the first time sample window that indicate relatively low correlation with the envelope vector for the second time sample window.
Once the pitch likelihood metric for the first time sample window has been weighted, the pitch estimation module may be configured to determine an estimated pitch for the first time sample window based on the weighted pitch likelihood metric. This may include identifying the pitch and/or the fractional chirp rate for which the weighted pitch likelihood metric is a maximum in the first time sample window.
In implementations in which the processing time windows include overlapping processing time windows within at least a portion of the signal duration, a plurality of estimated pitches may be determined for the first time sample window. For example, the first time sample window may be included within two or more of the overlapping processing time windows. The paths of estimated pitch and/or estimated chirp rate through the processing time windows may be different for individual ones of the overlapping processing time windows. As a result, the estimated pitch and/or chirp rate upon which the determination of estimated pitch for the first time sample window is based may be different within different ones of the overlapping processing time windows. This may cause the estimated pitches determined for the first time sample window to be different. The estimated pitch aggregation module may be configured to determine an aggregated estimated pitch for the first time sample window by aggregating the plurality of estimated pitches determined for the first time sample window.
The estimated pitch aggregation module may be configured such that determining an aggregated estimated pitch includes determining a mean estimated pitch, determining a median estimated pitch, selecting an estimated pitch that was determined most often, and/or other aggregation techniques. The determination of a mean, a selection of a determined estimated pitch, and/or other aggregation techniques may be weighted (e.g., based on pitch likelihood metric corresponding to the estimated pitches being aggregated).
The voiced section module may be configured to categorize time sample windows into a voiced category, an unvoiced category, and/or other categories. A time sample window categorized into the voiced category may correspond to a portion of the audio signal that represents harmonic sound. A time sample window categorized into the unvoiced category may correspond to a portion of the audio signal that does not represent harmonic sound. Time sample windows categorized into the voiced category may be validated to ensure that the estimated pitches for these time sample windows are accurate. Such validation may be accomplished, for example, by confirming the presence of energy spikes at the harmonics of the estimated pitch in the transformed audio information, confirming the absence in the transformed audio information of periodic energy spikes at frequencies other than those of the harmonics of the estimated pitch, and/or through other techniques.
These and other objects, features, and characteristics of the system and/or method disclosed herein, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the invention. As used in the specification and in the claims, the singular form of “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise.
At an operation 12, audio information derived from an audio signal may be obtained. The audio signal may represent one or more sounds. The audio signal may have a signal duration. The audio information may include audio information that corresponds to the audio signal during a set of discrete time sample windows. The time sample windows may correspond to a period (or periods) of time larger than the sampling period of the audio signal. As a result, the audio information for a time sample window may be derived from and/or represent a plurality of samples in the audio signal. By way of non-limiting example, a time sample window may correspond to an amount of time that is greater than about 15 milliseconds, and/or other amounts of time. In some implementations, the time windows may correspond to about 10 milliseconds, and/or other amounts of time.
The audio information obtained at operation 12 may include transformed audio information. The transformed audio information may include a transformation of an audio signal into the frequency domain (or a pseudo-frequency domain) such as a Fourier Transform, a Fast Fourier Transform, a Short Time Fourier Transform, and/or other transforms. The transformed audio information may include a transformation of an audio signal into a frequency-chirp domain, as described, for example, in U.S. patent application Ser. No. 13/205,424, filed Aug. 8, 2011, and entitled “System And Method For Processing Sound Signals Implementing A Spectral Motion Transform” (“the '424 application”) which is hereby incorporated into this disclosure by reference in its entirety. The transformed audio information may have been transformed in discrete time sample windows over the audio signal. The time sample windows may be overlapping or non-overlapping in time. Generally, the transformed audio information may specify magnitude of an intensity coefficient related to signal intensity as a function of frequency (and/or other parameters) for an audio signal within a time sample window. In the frequency-chirp domain, the transformed audio information may specify magnitude of the coefficient related to signal intensity as a function of frequency and fractional chirp rate. Fractional chirp rate may be, for any harmonic in a sound, chirp rate divided by frequency.
By way of illustration,
Other spikes (e.g., spikes 18 and/or 20) may be present in the transformed audio information. These spikes may not be associated with harmonic sound corresponding to spikes 16. The difference between spikes 16 and spike(s) 18 and/or 20 may not be amplitude, but instead frequency, as spike(s) 18 and/or 20 may not be at a harmonic frequency of the harmonic sound. As such, these spikes 18 and/or 20, and the rest of the amplitude between spikes 16 may be a manifestation of noise in the audio signal. As used in this instance, “noise” may not refer to a single auditory noise, but instead to sound (whether or not such sound is harmonic, diffuse, white, or of some other type) other than the harmonic sound associated with spikes 16.
In some implementations, the transformed audio information may represent all of the energy present in the audio signal, or a portion of the energy present in the audio signal. For example, if the transform performed on the audio signal places the audio signal into a frequency-chirp domain, the coefficient related to energy may be specified as a function of frequency and fractional chirp rate (e.g., as described in the '424 application). In such examples, the transformed audio information for a given time sample window may include a representation of the energy present in the audio signal having a common fractional chirp rate (e.g., a one-dimensional slice through the two-dimensional frequency-chirp domain along a single fractional chirp rate).
Referring back to
By way of illustration,
Turning back to
Referring again to
In some implementations, the processing time windows may include a plurality of overlapping processing time windows. For example, for some or all of the signal duration, the overlapping processing time windows may be defined by incrementing the boundaries of the processing time windows by some increment. This increment may be an integer number of time sample windows (e.g., 1, 2, 3, and/or other integer numbers). By way of illustration,
Turning back to
At an operation 48, an estimated pitch for the primary time sample window may be determined. In some implementations, the estimated pitch may be selected randomly, based on an analysis of pitch likelihood within the primary time sample window, by rule or parameter, based on user selection, and/or based on other criteria. As was mentioned above, the audio information may indicate, for a given time sample window, the pitch likelihood metric as a function of pitch. As such, the estimated pitch for the primary time sample window may be determined as the pitch exhibiting a maximum of the pitch likelihood metric for the primary time sample window.
As was mentioned above, in the audio information the pitch likelihood metric may further be specified as a function of fractional chirp rate. As such, the pitch likelihood metric may indicate likelihood as a function of both pitch and fractional chirp rate. At operation 48, in addition to the estimated pitch, an estimated fractional chirp rate for the primary time sample window may be determined. The estimated fractional chirp rate may be determined as the fractional chirp rate corresponding to a maximum of the pitch likelihood metric along the estimated pitch.
At operation 48, an envelope vector for the estimated pitch of the primary time sample window may be determined. As is described herein, the envelope vector for the estimated pitch of the primary time sample window may represent the harmonic envelope of sound represented in the audio signal at the primary time sample window having the estimated pitch.
At an operation 50, a predicted pitch for a next time sample window in the processing time window may be determined. This time sample window may include, for example, a time sample window that is adjacent to the time sample window having the estimated pitch and estimated fractional chirp rate determined at operation 48. The description of this time sample window as “next” is not intended to limit this time sample window to an adjacent or consecutive time sample window (although this may be the case). Further, the use of the word “next” does not mean that the next time sample window comes temporally in the audio signal after the time sample window for which the estimated pitch and estimated fractional chirp rate have been determined. For example, the next time sample window may occur in the audio signal before the time sample window for which the estimated pitch and the estimated fractional chirp rate have been determined.
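The selection at operation 48 might be sketched as follows for the case in which the estimated pitch is taken at the maximum of the pitch likelihood metric. The nested-list layout `likelihood[i][j]` (pitch index `i`, fractional chirp rate index `j`) and the function name `estimate_pitch_and_chirp` are assumptions made for this sketch.

```python
def estimate_pitch_and_chirp(likelihood, pitches, chirp_rates):
    """Pick the pitch with the highest likelihood, then the best chirp rate along it.

    likelihood[i][j] is the pitch likelihood metric for pitches[i] and
    chirp_rates[j] in one time sample window (hypothetical layout).
    """
    # Pitch exhibiting the maximum of the pitch likelihood metric.
    best_i = max(range(len(pitches)), key=lambda i: max(likelihood[i]))
    # Fractional chirp rate with the maximum metric along that pitch.
    row = likelihood[best_i]
    best_j = max(range(len(chirp_rates)), key=lambda j: row[j])
    return pitches[best_i], chirp_rates[best_j]


likelihood = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.4],
    [0.2, 0.5, 0.3],
]
print(estimate_pitch_and_chirp(likelihood, [100.0, 110.0, 120.0], [-0.5, 0.0, 0.5]))
```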
Determining the predicted pitch for the next time sample window may include, for example, incrementing the pitch from the estimated pitch determined at operation 48 by an amount that corresponds to the estimated fractional chirp rate determined at operation 48 and a time difference between the time sample window being addressed at operation 48 and the next time sample window. For example, this determination of a predicted pitch may be expressed mathematically for some implementations as:
$$\phi_1 = \phi_0 + \Delta t \cdot \frac{d\phi}{dt}$$
where φ0 represents the estimated pitch determined at operation 48, φ1 represents the predicted pitch for the next time sample window, Δt represents the time difference between the time sample window from operation 48 and the next time sample window, and dφ/dt represents an estimated fractional chirp rate of the fundamental frequency of the pitch (which can be determined from the estimated fractional chirp rate).
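The prediction described above, in which the estimated pitch is incremented by an amount that depends on the chirp rate and the time difference, might be sketched as follows. The function names are hypothetical. The conversion in `pitch_rate_from_fractional_chirp` assumes the definition given earlier in this description (fractional chirp rate is chirp rate divided by frequency), applied to the fundamental frequency.

```python
def pitch_rate_from_fractional_chirp(estimated_pitch, fractional_chirp_rate):
    """Rate of change of pitch (Hz/s), assuming fractional chirp rate = chirp rate / frequency."""
    return fractional_chirp_rate * estimated_pitch


def predict_pitch(estimated_pitch, pitch_rate, dt):
    """Predicted pitch after a time difference dt (seconds); dt may be negative."""
    return estimated_pitch + dt * pitch_rate


rate = pitch_rate_from_fractional_chirp(200.0, 0.5)   # 100 Hz/s
print(predict_pitch(200.0, rate, 0.01))    # tracking forward in time
print(predict_pitch(200.0, rate, -0.01))   # tracking backward in time
```

A negative `dt` covers the case in which the next time sample window occurs earlier in the audio signal than the window already estimated.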
At an operation 51, an envelope vector may be determined for the next time sample window as a function of pitch within the next time sample window. The envelope vector for the next time sample at a given pitch may represent the harmonic envelope of sound represented in the audio signal during the next time sample window having the given pitch. Determination of the coordinates for the envelope vector for the given pitch may be based on the values for the intensity coefficient at harmonic frequencies of the given pitch in the next time sample window. In implementations in which the transformed audio information includes, for the next time sample window, different sets of transformed audio information corresponding to different fractional chirp rates, operation 51 may include determining the envelope vectors for the next time sample window as a function both of pitch and fractional chirp rate.
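As a minimal sketch of operation 51, the envelope vector for a candidate pitch might be formed by reading the intensity coefficient at the first n harmonic frequencies of that pitch. The sketch assumes the transformed audio information is a list of magnitudes on a uniform frequency grid with spacing `bin_hz`, and that a nearest-bin lookup is adequate; both are assumptions, as is the function name.

```python
def envelope_vector(magnitudes, bin_hz, pitch, num_harmonics):
    """Return intensity-coefficient magnitudes at the first harmonics of `pitch`.

    magnitudes[k] is assumed to be the magnitude at frequency k * bin_hz.
    Harmonics that fall beyond the end of the spectrum contribute 0.0.
    """
    vector = []
    for harmonic in range(1, num_harmonics + 1):
        index = int(round(harmonic * pitch / bin_hz))  # nearest frequency bin
        vector.append(magnitudes[index] if index < len(magnitudes) else 0.0)
    return vector


# Toy spectrum with 10 Hz bins and energy spikes at 100, 200, and 300 Hz.
spectrum = [0.0] * 50
spectrum[10], spectrum[20], spectrum[30] = 1.0, 0.5, 0.25

print(envelope_vector(spectrum, 10.0, 100.0, 3))  # harmonics of the true pitch
print(envelope_vector(spectrum, 10.0, 150.0, 3))  # harmonics of a wrong candidate
```

Where separate sets of transformed audio information exist for different fractional chirp rates, the same lookup could be repeated per set to obtain envelope vectors as a function of both pitch and fractional chirp rate.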
By way of illustration, turning back to
Referring back to
By way of illustration,
Determination of values of a correlation metric for the second time sample window may include determining values of a metric that indicates correlation between the envelope vectors 118, 120, and 122 for the individual pitches in the second time sample window with the envelope vector 114 for the estimated pitch of the first time sample window. Such a correlation metric may include one or more of, for example, a distance metric, a dot product, a correlation coefficient, and/or other metrics that indicate correlation.
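One possible correlation metric is a normalized dot product (cosine similarity) between envelope vectors. The following sketch uses that choice purely as an assumption; a distance metric or a correlation coefficient could be substituted. The function names and the dictionary layout mapping candidate pitch to envelope vector are hypothetical.

```python
import math


def envelope_correlation(vector_a, vector_b):
    """Normalized dot product of two envelope vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def correlation_by_pitch(candidate_vectors, reference_vector):
    """Map each candidate pitch to its correlation with the reference envelope vector."""
    return {
        pitch: envelope_correlation(vector, reference_vector)
        for pitch, vector in candidate_vectors.items()
    }


reference = [0.9, 0.6, 0.2]  # envelope at the estimated pitch of the other window
candidates = {100.0: [1.0, 0.5, 0.25], 150.0: [0.0, 0.25, 0.0]}
scores = correlation_by_pitch(candidates, reference)
print(max(scores, key=scores.get))  # candidate pitch whose envelope best matches
```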
In the example provided in
It will be appreciated that the illustration of the envelope vectors in
Turning back to
In implementations in which the weighting performed at operation 53 is based on the predicted pitch determined at operation 50, the weighting may apply relatively larger weights to the pitch likelihood metric for pitches in the next time sample window at or near the predicted pitch and relatively smaller weights to the pitch likelihood metric for pitches in the next time sample window that are further away from the predicted pitch. For example, this weighting may include multiplying the pitch likelihood metric by a weighting function that varies as a function of pitch and may be centered on the predicted pitch. The width, the shape, and/or other parameters of the weighting function may be determined based on user selection (e.g., through settings and/or entry or selection), fixed, based on noise present in the audio signal, based on the range of fractional chirp rates in the sample, and/or other factors. As a non-limiting example, the weighting function may be a Gaussian function.
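The predicted-pitch weighting might be sketched with a Gaussian weighting function as follows. The width parameter `width_hz` and the function names are assumptions for illustration; the disclosure leaves the width and shape open.

```python
import math


def gaussian_pitch_weights(pitches, predicted_pitch, width_hz):
    """Gaussian weights centered on the predicted pitch (peak value 1.0)."""
    return [
        math.exp(-0.5 * ((pitch - predicted_pitch) / width_hz) ** 2)
        for pitch in pitches
    ]


def apply_weights(likelihood, weights):
    """Multiply the pitch likelihood metric by the weights, pitch by pitch."""
    return [value * weight for value, weight in zip(likelihood, weights)]


pitches = [190.0, 200.0, 210.0, 400.0]
weights = gaussian_pitch_weights(pitches, predicted_pitch=200.0, width_hz=10.0)
print(apply_weights([0.5, 0.6, 0.5, 0.7], weights))
```

In this toy case the candidate at 400 Hz has the largest unweighted likelihood but is suppressed almost entirely because it lies far from the predicted pitch.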
In implementations in which the weighting performed at operation 53 is based on the correlation metric determined at operation 52, relatively larger weights may be applied to the pitch likelihood metric at pitches having values of the correlation metric that indicate relatively high correlation with the envelope vector for the estimated pitch in the other time sample window. The weighting may apply relatively smaller weights to the pitch likelihood metric at pitches having correlation metric values in the next time sample window that indicate relatively low correlation with the envelope vector for the estimated pitch in the other time sample window.
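A simple way to realize correlation-based weighting is to multiply the pitch likelihood metric by the correlation value at each pitch. The sketch below assumes correlation values where larger means more similar, and clamps negative values to zero; that mapping from correlation to weight is an assumption, not something the disclosure specifies.

```python
def weight_by_correlation(likelihood, correlation):
    """Scale the pitch likelihood metric by a non-negative correlation weight."""
    return [value * max(corr, 0.0) for value, corr in zip(likelihood, correlation)]


pitches = [100.0, 200.0]
likelihood = [0.8, 0.9]      # the 200 Hz candidate is slightly more likely on its own
correlation = [0.95, 0.40]   # but its envelope matches the other window poorly
weighted = weight_by_correlation(likelihood, correlation)
best = pitches[max(range(len(pitches)), key=lambda i: weighted[i])]
print(best)
```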
At an operation 54, an estimated pitch for the next time sample window may be determined based on the weighted pitch likelihood metric for the next sample window. Determination of the estimated pitch for the next time sample window may include, for example, identifying a maximum in the weighted pitch likelihood metric and determining the pitch corresponding to this maximum as the estimated pitch for the next time sample window.
At operation 54, an estimated fractional chirp rate for the next time sample window may be determined. The estimated fractional chirp rate may be determined, for example, by identifying the fractional chirp rate for which the weighted pitch likelihood metric has a maximum along the estimated pitch for the time sample window.
At operation 56, a determination may be made as to whether there are further time sample windows in the processing time window for which an estimated pitch and/or an estimated fractional chirp rate are to be determined. Responsive to there being further time sample windows, method 10 may return to operations 50 and 51, and operations 50, 51, 52, 53, and/or 54 may be performed for a further time sample window. In this iteration through operations 50, 51, 52, 53, and/or 54, the further time sample window may be a time sample window that is adjacent to the next time sample window for which operations 50, 51, 52, 53, and/or 54 have just been performed. In such implementations, operations 50, 51, 52, 53, and/or 54 may be iterated over the time sample windows from the primary time sample window to the boundaries of the processing time window in one or both temporal directions. During the iteration(s) toward the boundaries of the processing time window, the estimated pitch and estimated fractional chirp rate implemented at operation 50 may be the estimated pitch and estimated fractional chirp rate determined at operation 48, or may be an estimated pitch and estimated fractional chirp rate determined at operation 54 for a time sample window adjacent to the time sample window for which operations 50, 51, 52, 53, and/or 54 are being iterated.
Responsive to a determination at operation 56 that there are no further time sample windows within the processing time window, method 10 may proceed to an operation 58. At operation 58, a determination may be made as to whether there are further processing time windows to be processed. Responsive to a determination at operation 58 that there are further processing time windows to be processed, method 10 may return to operation 47, and may iterate over operations 47, 48, 50, 51, 52, 53, 54, and/or 56 for a further processing time window. It will be appreciated that iterating over the processing time windows may be accomplished in the manner shown in
Responsive to a determination at operation 58 that there are no further processing time windows to be processed, method 10 may proceed to an operation 60. Operation 60 may be performed in implementations in which the processing time windows overlap. In such implementations, iteration of operations 47, 48, 50, 51, 52, 53, 54, and/or 56 for the processing time windows may result in multiple determinations of estimated pitch for at least some of the time sample windows. For time sample windows for which multiple determinations of estimated pitch have been made, operation 60 may include aggregating such determinations for the individual time sample windows to determine aggregated estimated pitch for the individual time sample windows.
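The iteration outward from the primary time sample window might be sketched as below. To stay short, the sketch is deliberately simplified: it ignores fractional chirp rate (so the predicted pitch is just the neighboring window's estimated pitch), omits the envelope correlation weighting, and uses a Gaussian weight of assumed width `width_hz`. The function name and data layout are hypothetical.

```python
import math


def track_pitch(likelihoods, pitches, primary, width_hz):
    """Estimate one pitch per time sample window, iterating outward from `primary`.

    likelihoods[t][i] is the pitch likelihood metric for window t at pitches[i].
    """
    def argmax(values):
        return max(range(len(values)), key=lambda i: values[i])

    count = len(likelihoods)
    estimates = [None] * count
    estimates[primary] = pitches[argmax(likelihoods[primary])]
    for step in (1, -1):  # forward in time, then backward in time
        t = primary + step
        while 0 <= t < count:
            predicted = estimates[t - step]  # estimate from the adjacent window
            weighted = [
                value * math.exp(-0.5 * ((pitch - predicted) / width_hz) ** 2)
                for value, pitch in zip(likelihoods[t], pitches)
            ]
            estimates[t] = pitches[argmax(weighted)]
            t += step
    return estimates


pitches = [100.0, 110.0, 120.0, 200.0]
likelihoods = [
    [0.2, 0.6, 0.3, 0.7],    # spurious maximum at 200 Hz
    [0.1, 0.9, 0.2, 0.3],    # primary window: clear maximum at 110 Hz
    [0.1, 0.3, 0.6, 0.65],   # spurious maximum at 200 Hz
]
print(track_pitch(likelihoods, pitches, primary=1, width_hz=10.0))
```

Taking the unweighted maximum in each window independently would select 200 Hz in the first and last windows; the weighting keeps the track near the pitch established at the primary window.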
By way of non-limiting example, determining an aggregated estimated pitch for a given time sample window may include determining a mean estimated pitch, determining a median estimated pitch, selecting an estimated pitch that was determined most often for the time sample window, and/or other aggregation techniques. At operation 60, the determination of a mean, a selection of a determined estimated pitch, and/or other aggregation techniques may be weighted. For example, the individually determined estimated pitches for the given time sample window may be weighted according to their corresponding pitch likelihood metrics. These pitch likelihood metrics may include the pitch likelihood metrics specified in the audio information obtained at operation 12, the weighted pitch likelihood metric determined for the given time sample window at operation 53, and/or other pitch likelihood metrics for the time sample window.
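The aggregation techniques listed above might be sketched as follows, using only the Python standard library. The function name `aggregate_pitch`, the `method` strings, and the use of likelihood values as weights for the mean are assumptions for illustration.

```python
from collections import Counter
from statistics import median


def aggregate_pitch(estimates, weights=None, method="mean"):
    """Aggregate several estimated pitches for one time sample window."""
    if not estimates:
        raise ValueError("at least one estimated pitch is required")
    if method == "mean":
        if weights is None:
            weights = [1.0] * len(estimates)
        total = sum(weights)
        if total == 0:
            raise ValueError("weights must not sum to zero")
        return sum(p * w for p, w in zip(estimates, weights)) / total
    if method == "median":
        return median(estimates)
    if method == "mode":
        # Estimated pitch that was determined most often.
        return Counter(estimates).most_common(1)[0][0]
    raise ValueError("unknown method: " + method)


print(aggregate_pitch([100.0, 120.0], weights=[3.0, 1.0]))        # weighted mean
print(aggregate_pitch([100.0, 110.0, 200.0], method="median"))
print(aggregate_pitch([110.0, 110.0, 120.0], method="mode"))
```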
At an operation 62, individual time sample windows may be divided into voiced and unvoiced categories. The voiced time sample windows may be time sample windows during which the sounds represented in the audio signal are harmonic or “voiced” (e.g., spoken vowel sounds). The unvoiced time sample windows may be time sample windows during which the sounds represented in the audio signal are not harmonic or “unvoiced” (e.g., spoken consonant sounds).
In some implementations, the categorization at operation 62 may be made based on a harmonic energy ratio. The harmonic energy ratio for a given time sample window may be determined based on the transformed audio information for the given time sample window. The harmonic energy ratio may be determined as the ratio of the sum of the magnitudes of the coefficient related to energy at the harmonics of the estimated pitch (or aggregated estimated pitch) in the time sample window to the sum of the magnitudes of the coefficient related to energy at the harmonics across the spectrum for the time sample window. The transformed audio information implemented in this determination may be specific to an estimated fractional chirp rate (or aggregated estimated fractional chirp rate) for the time sample window (e.g., a slice through the frequency-chirp domain along a common fractional chirp rate). Alternatively, the transformed audio information implemented in this determination may not be specific to a particular fractional chirp rate.
For a given time sample window if the harmonic energy ratio is above some threshold value, a determination may be made that the audio signal during the time sample window represents voiced sound. If, on the other hand, for the given time sample window the harmonic energy ratio is below the threshold value, a determination may be made that the audio signal during the time sample window represents unvoiced sound. The threshold value may be determined, for example, based on user selection (e.g., through settings and/or entry or selection), fixed, based on noise present in the audio signal, based on the fraction of time the harmonic source tends to be active (e.g. speech has pauses), and/or other factors.
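A threshold test on a harmonic energy ratio might be sketched as below. The sketch makes several assumptions that go beyond the text: the spectrum is a list of magnitudes on a uniform grid with spacing `bin_hz`, harmonics are read with a nearest-bin lookup, and the denominator is taken as the sum of magnitudes over all bins of the spectrum. The function names and the threshold value are hypothetical.

```python
def harmonic_energy_ratio(magnitudes, bin_hz, pitch, num_harmonics):
    """Sum of magnitudes at harmonics of `pitch` divided by the sum over all bins."""
    total = sum(magnitudes)
    if total == 0 or pitch <= 0:
        return 0.0
    harmonic_sum = 0.0
    for harmonic in range(1, num_harmonics + 1):
        index = int(round(harmonic * pitch / bin_hz))  # nearest frequency bin
        if index < len(magnitudes):
            harmonic_sum += magnitudes[index]
    return harmonic_sum / total


def is_voiced(magnitudes, bin_hz, pitch, num_harmonics, threshold):
    """Categorize a time sample window as voiced if the ratio exceeds the threshold."""
    return harmonic_energy_ratio(magnitudes, bin_hz, pitch, num_harmonics) > threshold


# Low-level noise floor with spikes at 100, 200, and 300 Hz (10 Hz bins).
spectrum = [0.01] * 50
spectrum[10] = spectrum[20] = spectrum[30] = 1.0

print(is_voiced(spectrum, 10.0, 100.0, 3, threshold=0.5))  # pitch matches the spikes
print(is_voiced(spectrum, 10.0, 130.0, 3, threshold=0.5))  # pitch does not match
```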
In some implementations, the categorization at operation 62 may be made based on the pitch likelihood metric for the estimated pitch (or aggregated estimated pitch). For example, for a given time sample window if the pitch likelihood metric is above some threshold value, a determination may be made that the audio signal during the time sample window represents voiced sound. If, on the other hand, for the given time sample window the pitch likelihood metric is below the threshold value, a determination may be made that the audio signal during the time sample window represents unvoiced sound. The threshold value may be determined, for example, based on user selection (e.g., through settings and/or entry or selection), fixed, based on noise present in the audio signal, based on the fraction of time the harmonic source tends to be active (e.g. speech has pauses), and/or other factors.
Responsive to a determination at operation 62 that the audio signal during a time sample window represents unvoiced sound, the estimated pitch (or aggregated estimated pitch) for the time sample window may be set to some predetermined value at an operation 64. For example, this value may be set to 0, or some other value. This may cause the tracking of pitch accomplished by method 10 to designate that harmonic speech may not be present or prominent in the time sample window.
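Operations 62 and 64 might be combined in a sketch such as the following, which categorizes windows by thresholding the pitch likelihood metric at the estimated pitch and then replaces the pitch of unvoiced windows with a predetermined value. The function names, the threshold, and the default value of 0.0 are assumptions for illustration.

```python
def voiced_by_likelihood(likelihood_at_estimate, threshold):
    """Return one voiced/unvoiced flag per time sample window."""
    return [value > threshold for value in likelihood_at_estimate]


def mask_unvoiced(estimated_pitches, voiced_flags, unvoiced_value=0.0):
    """Replace the estimated pitch of unvoiced windows with a predetermined value."""
    return [
        pitch if voiced else unvoiced_value
        for pitch, voiced in zip(estimated_pitches, voiced_flags)
    ]


flags = voiced_by_likelihood([0.9, 0.2, 0.8], threshold=0.5)
print(mask_unvoiced([110.0, 310.0, 112.0], flags))
```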
Responsive to a determination at operation 62 that the audio signal during a time sample window represents voiced sound, method 10 may proceed to an operation 68.
At operation 68, a determination may be made as to whether further time sample windows should be processed by operations 62 and/or 64. Responsive to a determination that further time sample windows should be processed, method 10 may return to operation 62 for a further time sample window. Responsive to a determination that there are no further time sample windows for processing, method 10 may end.
It will be appreciated that the description above of estimating an individual pitch for the time sample windows is not intended to be limiting. In some implementations, the portion of the audio signal corresponding to one or more time sample windows may represent two or more harmonic sounds. In such implementations, the principles of pitch tracking described above with respect to an individual pitch may be implemented to track a plurality of pitches for simultaneous harmonic sounds without departing from the scope of this disclosure. For example, if the audio information specifies the pitch likelihood metric as a function of pitch and fractional chirp rate, then maxima for different pitches and different fractional chirp rates may indicate the presence of a plurality of harmonic sounds in the audio signal. These pitches may be tracked separately in accordance with the techniques described herein.
The operations of method 10 presented herein are intended to be illustrative. In some embodiments, method 10 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of method 10 are illustrated in
In some embodiments, method 10 may be implemented in one or more processing devices (e.g., a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information). The one or more processing devices may include one or more devices executing some or all of the operations of method 10 in response to instructions stored electronically on an electronic storage medium. The one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of method 10.
The processor 82 may be configured to execute one or more computer program modules. Processor 82 may be configured to execute the computer program module(s) by software; hardware; firmware; some combination of software, hardware, and/or firmware; and/or other mechanisms for configuring processing capabilities on processor 82. In some implementations, the one or more computer program modules may include one or more of an audio information module 84, a processing window module 86, a primary window module 88, a pitch estimation module 90, a pitch prediction module 92, an envelope vector module 93, an envelope correlation module 94, a weighting module 95, an estimated pitch aggregation module 96, a voiced section module 98, and/or other modules.
The audio information module 84 may be configured to obtain audio information derived from an audio signal. Obtaining the audio information may include deriving audio information, receiving a transmission of audio information, accessing stored audio information, and/or other techniques for obtaining information. The audio information may be divided into time sample windows. In some implementations, audio information module 84 may be configured to perform some or all of the functionality associated herein with operation 12 of method 10 (shown in
The processing window module 86 may be configured to define processing time windows across the signal duration of the audio signal. The processing time windows may be overlapping or non-overlapping. An individual processing time window may span a plurality of time sample windows. In some implementations, processing window module 86 may perform some or all of the functionality associated herein with operation 30 of method 10 (shown in
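As a rough sketch of how processing time windows might be laid out over a sequence of time sample windows, consider the following. The function name and the `span` and `hop` parameters (both counted in time sample windows) are hypothetical; choosing `hop` smaller than `span` gives overlapping processing time windows, and `hop` equal to `span` gives non-overlapping ones.

```python
from typing import List, Tuple


def define_processing_windows(num_sample_windows: int,
                              span: int,
                              hop: int) -> List[Tuple[int, int]]:
    """Return (start, stop) index ranges over time sample windows, stop exclusive.

    The final processing time window is truncated at the end of the signal.
    """
    if span <= 0 or hop <= 0:
        raise ValueError("span and hop must be positive")

    windows = []
    start = 0
    while start < num_sample_windows:
        stop = min(start + span, num_sample_windows)
        windows.append((start, stop))
        if stop == num_sample_windows:
            break  # the signal duration is fully covered
        start += hop
    return windows
```

With overlapping windows, a given time sample window falls inside more than one processing time window, which is what later gives rise to a plurality of estimated pitches for that time sample window.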
The primary window module 88 may be configured to identify a primary time sample window. In some implementations, primary window module 88 may be configured to perform some or all of the functionality associated herein with operation 47 of method 10 (shown in
The pitch estimation module 90 may be configured to determine an estimated pitch and/or an estimated fractional chirp rate for the primary time sample window. In some implementations, pitch estimation module 90 may be configured to perform some or all of the functionality associated herein with operation 48 in method 10 (shown in
The pitch prediction module 92 may be configured to determine a predicted pitch for a first time sample window within the same processing time window as a second time sample window for which an estimated pitch and an estimated fractional chirp rate have previously been determined. The first and second time sample windows may be adjacent. Determination of the predicted pitch for the first time sample window may be made based on the estimated pitch and the estimated fractional chirp rate for the second time sample window. In some implementations, pitch prediction module 92 may be configured to perform some or all of the functionality associated herein with operation 50 of method 10 (shown in
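A minimal sketch of such a prediction is given below. It assumes, for illustration only, that the fractional chirp rate is the time derivative of pitch divided by pitch, and that a first-order extrapolation over the time offset `dt` between the two time sample windows is adequate. Neither the formula nor the names `predict_pitch` and `dt` come from this passage.

```python
def predict_pitch(estimated_pitch: float,
                  fractional_chirp_rate: float,
                  dt: float) -> float:
    """First-order prediction of pitch in a neighboring time sample window.

    estimated_pitch:        pitch (Hz) estimated for the second window
    fractional_chirp_rate:  assumed to equal (d pitch / dt) / pitch, in 1/s
    dt:                     time from the second window to the first window, in
                            seconds; negative if the first window comes earlier
    """
    # pitch' = pitch * chi, so pitch(t + dt) is approximately pitch * (1 + chi * dt).
    return estimated_pitch * (1.0 + fractional_chirp_rate * dt)
```

Because `dt` is signed, the same expression covers stepping forward or backward in time from the window whose pitch is already known.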
The envelope vector module 93 may be configured to determine, as a function of pitch in the first time sample window, an envelope vector having coordinates. The envelope vector module 93 may be configured to determine the envelope vector for a given pitch in the first time sample window based on the values for the intensity coefficient at harmonic frequencies of the given pitch in the first time sample window. In some implementations, envelope vector module 93 may be configured to perform some or all of the functionality associated herein with operation 51 of method 10 (shown in
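The idea of sampling the intensity coefficient at the harmonics of a candidate pitch can be sketched as follows. The representation of the intensity coefficient as a list on a uniform frequency grid with spacing `bin_hz`, the nearest-bin lookup, and the zero fill for harmonics beyond the grid are assumptions made for this example.

```python
from typing import List, Sequence


def envelope_vector(intensity: Sequence[float],
                    bin_hz: float,
                    pitch: float,
                    num_harmonics: int) -> List[float]:
    """Return one coordinate per harmonic of the given pitch.

    intensity[k] is the intensity coefficient at frequency k * bin_hz.
    Coordinate h - 1 holds the intensity nearest to harmonic h * pitch.
    """
    vector = []
    for harmonic in range(1, num_harmonics + 1):
        index = int(round(harmonic * pitch / bin_hz))
        # Harmonics beyond the analyzed band contribute zero (assumption).
        vector.append(intensity[index] if index < len(intensity) else 0.0)
    return vector
```

Repeating this for every candidate pitch in the first time sample window yields the envelope vector as a function of pitch.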
The envelope correlation module 94 may be configured to obtain an envelope vector for a sound represented by the audio signal during the second time sample window (e.g., as previously determined by envelope vector module 93). The envelope correlation module 94 may be configured to determine, for the first time sample window, values of a correlation metric as a function of pitch, wherein the value of the correlation metric for a given pitch in the first time sample window may indicate a level of correlation between the envelope vector for the second time sample window and the envelope vector for the given pitch in the first time sample window. In some implementations, envelope correlation module 94 may be configured to perform some or all of the functionality associated herein with operation 52 (shown in
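A correlation metric of this kind could, for example, be computed as a cosine similarity between envelope vectors. The sketch below makes that choice purely for illustration; the disclosure does not state which correlation measure is used, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def envelope_correlation(reference: Sequence[float],
                         candidate: Sequence[float]) -> float:
    """Cosine similarity between two envelope vectors (0.0 if either is all zero)."""
    if len(reference) != len(candidate):
        raise ValueError("envelope vectors must have the same number of coordinates")
    dot = sum(a * b for a, b in zip(reference, candidate))
    norm = (math.sqrt(sum(a * a for a in reference))
            * math.sqrt(sum(b * b for b in candidate)))
    return dot / norm if norm > 0.0 else 0.0


def correlation_by_pitch(reference: Sequence[float],
                         candidates: Dict[float, Sequence[float]]) -> Dict[float, float]:
    """Map each candidate pitch in the first window to its correlation value.

    reference:  envelope vector for the second time sample window
    candidates: envelope vector for each candidate pitch in the first window
    """
    return {pitch: envelope_correlation(reference, vector)
            for pitch, vector in candidates.items()}
```

A candidate pitch whose harmonic envelope has the same shape as the reference scores near 1 regardless of overall loudness, which is one reason a normalized measure is a plausible choice here.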
The weighting module 95 may be configured to weight the pitch likelihood metric for the first time sample window. This weighting may be based on one or more of the predicted pitch determined by pitch prediction module 92, the values of the correlation metric determined by envelope correlation module 94, and/or other weighting parameters.
The weighting module 95 may be configured to weight the pitch likelihood metric for the first time sample window such that relatively larger weights may be applied to the pitch likelihood metric at pitches having correlation metric values in the first time sample window that indicate relatively high correlation with the envelope vector for the estimated pitch in the second time sample window. The weighting module 95 may be configured to weight the pitch likelihood metric for the first time sample window such that relatively smaller weights may be applied to the pitch likelihood metric at pitches having correlation metric values in the first time sample window that indicate relatively low correlation with the envelope vector for the estimated pitch in the second time sample window. In some implementations, weighting module 95 may be configured to perform some or all of the functionality associated herein with operation 53 in method 10 (shown in
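The weighting can be sketched as a product of two factors: one favoring pitches near the predicted pitch and one favoring pitches with high envelope correlation. The Gaussian proximity weight, its width `sigma_hz`, the clipping of negative correlation to zero, and the multiplicative combination are all assumptions made for this illustration rather than details taken from the disclosure.

```python
import math
from typing import List, Sequence


def weight_likelihood(pitches: Sequence[float],
                      likelihood: Sequence[float],
                      predicted_pitch: float,
                      correlation: Sequence[float],
                      sigma_hz: float) -> List[float]:
    """Return the pitch likelihood metric weighted per candidate pitch.

    pitches[i], likelihood[i], and correlation[i] all refer to candidate i
    in the first time sample window.
    """
    if not (len(pitches) == len(likelihood) == len(correlation)):
        raise ValueError("inputs must have one entry per candidate pitch")
    if sigma_hz <= 0.0:
        raise ValueError("sigma_hz must be positive")

    weighted = []
    for pitch, value, corr in zip(pitches, likelihood, correlation):
        # Larger weight for pitches close to the predicted pitch.
        proximity = math.exp(-0.5 * ((pitch - predicted_pitch) / sigma_hz) ** 2)
        # Larger weight for pitches whose envelope correlates with the reference.
        weighted.append(value * proximity * max(corr, 0.0))
    return weighted
```

Under this sketch, a candidate at the predicted pitch with a correlation of 0.9 keeps 90% of its likelihood, while an equally likely candidate with a correlation of 0.1 keeps only 10%.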
The pitch estimation module 90 may be further configured to determine an estimated pitch and/or an estimated fractional chirp rate for the first time sample window based on the weighted pitch likelihood metric for the first time sample window. This may include identifying a maximum in the weighted pitch likelihood metric for the first time sample window. The estimated pitch and/or estimated fractional chirp rate for the first time sample window may be determined as the pitch and/or fractional chirp rate corresponding to the maximum weighted pitch likelihood metric for the first time sample window. In some implementations, pitch estimation module 90 may be configured to perform some or all of the functionality associated herein with operation 54 in method 10 (shown in
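Selecting the pitch and fractional chirp rate at the maximum of the weighted metric might look like the following. The two-dimensional layout `weighted[i][j]` (pitch index `i`, chirp index `j`) and the function name are assumptions made for the example; ties are resolved in favor of the first maximum encountered.

```python
from typing import Sequence, Tuple


def estimate_from_weighted_metric(weighted: Sequence[Sequence[float]],
                                  pitches: Sequence[float],
                                  chirp_rates: Sequence[float]) -> Tuple[float, float]:
    """Return (pitch, fractional_chirp_rate) at the maximum weighted likelihood."""
    best = None  # (value, pitch, chirp_rate) of the largest entry seen so far
    for i, row in enumerate(weighted):
        for j, value in enumerate(row):
            if best is None or value > best[0]:
                best = (value, pitches[i], chirp_rates[j])
    if best is None:
        raise ValueError("weighted metric is empty")
    return best[1], best[2]
```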
As, for example, described herein with respect to operations 47, 48, 50, 51, 52, 53, 54, and/or 56 in method 10 (shown in
The estimated pitch aggregation module 96 may be configured to aggregate a plurality of estimated pitches determined for an individual time sample window. The plurality of estimated pitches may have been determined for the time sample window during analysis of a plurality of processing time windows that included the time sample window. Operation of estimated pitch aggregation module 96 may be applied to a plurality of time sample windows individually across the signal duration. In some implementations, estimated pitch aggregation module 96 may be configured to perform some or all of the functionality associated herein with operation 60 in method 10 (shown in
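One plausible aggregation is a mean of the estimates, optionally weighted by the pitch likelihood metric associated with each estimate. This passage does not state the aggregation rule, so the weighted mean, the fallback to a plain mean, and the function name below are assumptions for illustration.

```python
from typing import Optional, Sequence


def aggregate_pitch(estimates: Sequence[float],
                    likelihoods: Optional[Sequence[float]] = None) -> float:
    """Combine several estimated pitches for one time sample window.

    estimates:   one estimated pitch per processing time window that
                 included this time sample window
    likelihoods: optional weight per estimate; omitted means equal weights
    """
    if not estimates:
        raise ValueError("at least one estimated pitch is required")
    if likelihoods is None:
        likelihoods = [1.0] * len(estimates)
    if len(likelihoods) != len(estimates):
        raise ValueError("one likelihood value is required per estimate")

    total = sum(likelihoods)
    if total <= 0.0:
        # Degenerate weights: fall back to the unweighted mean.
        return sum(estimates) / len(estimates)
    return sum(p * w for p, w in zip(estimates, likelihoods)) / total
```

For example, estimates of 100 Hz and 200 Hz with weights 3 and 1 aggregate to 125 Hz, pulling the result toward the more likely estimate.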
Processor 82 may be configured to provide information processing capabilities in system 80. As such, processor 82 may include one or more of a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information. Although processor 82 is shown in
It should be appreciated that although modules 84, 86, 88, 90, 92, 93, 94, 95, 96, and 98 are illustrated in
Electronic storage 102 may comprise electronic storage media that stores information. The electronic storage media of electronic storage 102 may include one or both of system storage that is provided integrally (i.e., substantially non-removable) with system 80 and/or removable storage that is removably connectable to system 80 via, for example, a port (e.g., a USB port, a FireWire port, etc.) or a drive (e.g., a disk drive, etc.). Electronic storage 102 may include one or more of optically readable storage media (e.g., optical disks, etc.), magnetically readable storage media (e.g., magnetic tape, magnetic hard drive, floppy drive, etc.), electrical charge-based storage media (e.g., EEPROM, RAM, etc.), solid-state storage media (e.g., flash drive, etc.), and/or other electronically readable storage media. Electronic storage 102 may include virtual storage resources, such as storage resources provided via a cloud and/or a virtual private network. Electronic storage 102 may store software algorithms, information determined by processor 82, information received via user interface 104, and/or other information that enables system 80 to function properly. Electronic storage 102 may be a separate component within system 80, or electronic storage 102 may be provided integrally with one or more other components of system 80 (e.g., processor 82).
User interface 104 may be configured to provide an interface between system 80 and users. This may enable data, results, and/or instructions and any other communicable items, collectively referred to as “information,” to be communicated between the users and system 80. Examples of interface devices suitable for inclusion in user interface 104 include a keypad, buttons, switches, a keyboard, knobs, levers, a display screen, a touch screen, speakers, a microphone, an indicator light, an audible alarm, and a printer. It is to be understood that other communication techniques, either hard-wired or wireless, are also contemplated by the present invention as user interface 104. For example, the present invention contemplates that user interface 104 may be integrated with a removable storage interface provided by electronic storage 102. In this example, information may be loaded into system 80 from removable storage (e.g., a smart card, a flash drive, a removable disk, etc.) that enables the user(s) to customize the implementation of system 80. Other exemplary input devices and techniques adapted for use with system 80 as user interface 104 include, but are not limited to, an RS-232 port, RF link, an IR link, modem (telephone, cable or other). In short, any technique for communicating information with system 80 is contemplated by the present invention as user interface 104.
Although the system(s) and/or method(s) of this disclosure have been described in detail for the purpose of illustration based on what is currently considered to be the most practical and preferred implementations, it is to be understood that such detail is solely for that purpose and that the disclosure is not limited to the disclosed implementations, but, on the contrary, is intended to cover modifications and equivalent arrangements that are within the spirit and scope of the appended claims. For example, it is to be understood that the present disclosure contemplates that, to the extent possible, one or more features of any implementation can be combined with one or more features of any other implementation.
Number | Date | Country | |
---|---|---|---|
20130041657 A1 | Feb 2013 | US |