The disclosure relates to processing an audio signal by leveraging a transform that places the audio signal into a frequency-chirp domain that specifies a coefficient related to signal intensity as a function of frequency and fractional chirp rate.
Generally, conventional sound processing involves converting an audio signal from the time domain into the frequency domain for individual time windows. Various types of signal processing techniques and algorithms may then be performed on the signal in the frequency domain in an attempt to distinguish between sound and noise represented in the signal before further processing can be performed. This processed signal may then be analyzed to determine sound parameters such as pitch, envelope, and/or other sound parameters. Sounds represented in the signal may be classified.
Conventional attempts to distinguish between harmonic sound and noise (whether sonic noise represented in the signal or signal noise) may amount to attempts to “clean” the signal to distinguish between harmonic sounds and background noise. Unfortunately, these conventional techniques often result in a loss of information about harmonic sounds represented in the signal, as well as noise. The loss of this information may impact the accuracy and/or precision of downstream processing to, for example, determine sound parameter(s) of harmonic sound, classify harmonic sounds, and/or other downstream processing.
One aspect of the disclosure relates to a system and method configured to process an audio signal. The system and method may track pitch, chirp rate, and/or harmonic envelope across the audio signal, may reconstruct sound represented in the audio signal, and/or may segment or classify the audio signal. A transform may be performed on the audio signal to place the audio signal in a frequency-chirp domain that enhances the sound parameter tracking, reconstruction, and/or classification.
The system may include one or more processors configured to execute computer program modules. The computer program modules may include one or more of a first set, a second set, a third set, a fourth set, and/or a fifth set of computer program modules.
The first set of computer program modules may be configured to transform individual time sample windows of the audio signal into the frequency-chirp domain. The frequency-chirp domain representation of a given time sample window may specify a transform coefficient as a function of frequency and fractional chirp rate for the signal portion. The first set of computer program modules may include one or more of a signal module, a time sample window module, a transform module, and/or other modules.
The second set of computer program modules may be configured to determine, from the transformed audio information for the given time sample window, a pitch likelihood metric as a function of pitch and fractional chirp rate for the audio signal within the time sample window. The second set of computer program modules may be configured (i) to determine, from the transformed audio information for the given time sample window, a tone likelihood metric as a function of frequency for the audio signal within the time sample window, and (ii) to determine the pitch likelihood metric for the given time sample window from the tone likelihood metric. The second set of computer program modules may include one or more of a tone likelihood module, a pitch likelihood module, and/or other modules.
The third set of computer program modules may be configured to determine, based on the pitch likelihood metric, estimated pitch and estimated fractional chirp rate for the individual time sample windows. The third set of computer program modules may be configured (i) to divide the time sample windows into groups within processing time windows, (ii) to identify a primary time sample window within a given processing time window, and (iii) to determine pitch for time sample windows within the given processing time window by iterating through the processing time window from the primary time sample window toward one or both of the boundaries of the processing time window and determining the estimated pitch and estimated fractional chirp rate for a given time sample window from values of pitch likelihood metric weighted based on parameters of sound determined for a time sample window adjacent to the given time sample window. The third set of one or more computer program modules may be configured such that the parameters of sound within the adjacent time sample window used to weight pitch likelihood metric in the given time sample window include one or more of estimated pitch, estimated chirp, or harmonic envelope. The third set of computer program modules may include one or more of a processing time window module, a primary window module, a pitch estimation module, a pitch prediction module, an envelope vector module, an envelope correlation module, a weighting module, an estimated pitch aggregation module, a voiced section module, and/or other modules.
The fourth set of computer program modules may be configured to reconstruct sound represented in the audio signal based on one or both of the transformed audio information generated by the first set of computer program modules and/or the estimated pitch and estimated chirp rate determined by the third set of computer program modules. The fourth set of computer program modules may include one or more of a resynthesis module, a noise subtraction module, a fence model module, a reconstruction module, and/or other modules.
The fifth set of computer program modules may be configured to classify one or more speakers that voiced one or more sounds represented in the audio signal based on one or both of the transformed audio information generated by the first set of computer program modules and/or the estimated pitch and estimated chirp rate determined by the third set of computer program modules. The fifth set of computer program modules may include one or more of an audio feature module, an information reduction module, a spectral slope module, a noise estimator module, an aggregate score module, a training module, and/or other modules.
These and other objects, features, and characteristics of the system and/or method disclosed herein, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the invention. As used in the specification and in the claims, the singular form of “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise.
The processor 12 may be configured to execute one or more computer program modules. The computer program modules may include one or more of a signal module 18, a time sample window module 20, a transform module 22, a tone likelihood module 24, a pitch likelihood module 26, a processing window module 28, a peak likelihood module 30, a pitch estimation module 32, a pitch prediction module 34, an envelope vector module 36, an envelope correlation module 38, a weighting module 40, an estimated pitch aggregation module 42, a voiced section module 44, a resynthesis module 46, a noise subtraction module 48, a fence model module 50, a reconstruction module 52, an audio features module 54, an information reduction module 56, a spectral slope module 58, a signal-to-noise ratio (SNR) estimator module 60, an aggregate score module 62, a training module 64, and/or other modules.
The signal module 18 may be configured to obtain sound signals for processing. The signal module 18 may be configured to obtain a sound signal from electronic storage 14, from user interface 16 (e.g., a microphone, a transducer, and/or other user interface components), from an external source, and/or from other sources. The sound signals may include electronic analog and/or digital signals that represent sounds generated by sources and/or noise. As used herein, a “source” may refer to an object or set of objects that operate to produce a sound. For example, a stringed instrument, such as a guitar, may be considered as an individual source even though it may itself include a plurality of objects cooperating to generate sounds (e.g., a plurality of strings, the body, and/or other objects). Similarly, a group of singers may generate sounds in concert to produce a single, harmonic sound.
The signal module 18 may be configured such that the obtained sound signals may specify signal intensity as a function of time. An individual sound signal may have a sampling rate at which amplitude is represented. The sampling rate may correspond to a sampling period. The spectral density of a sound signal may be represented, for example, in a spectrogram. By way of illustration,
In a sound signal, contributions attributable to a single sound and/or source may be arranged at harmonic (e.g., regularly spaced) intervals. These spaced apart contributions to the sound signal may be referred to as “harmonics” or “overtones”. For example, spectrogram 70 includes a first set of overtones (labeled in
Referring back to
The parameters of the processing performed by time sample window module 20 may include the type of peaked window function (e.g., Gaussian), the width of this function (for a Gaussian, the standard deviation), the total width of the window (for a Gaussian, typically 6 standard deviations total), the arrangement of the time sample windows (e.g., consecutively, overlapping, spaced apart, and/or other arrangements), and/or other parameters. One or more of these parameters may be set based on user selection, preset settings, the sound signal being processed, and/or other factors. By way of non-limiting example, the time sample windows may correspond to a period of time that is between about 5 milliseconds and about 50 milliseconds, between about 5 milliseconds and about 30 milliseconds, between about 5 milliseconds and about 15 milliseconds, and/or in other ranges. Since the processing applied to sound signals by system 10 accounts for the dynamic nature of the sound signals in the signal portions, the time sample windows may correspond to an amount of time that is greater than in conventional sound processing systems. For example, the time sample windows may correspond to an amount of time that is greater than about 15 milliseconds. In some implementations, the time sample windows may correspond to about 10 milliseconds.
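As a minimal sketch of the windowing described above, the following Python divides a sampled signal into overlapping time sample windows and applies a Gaussian peaked window function whose total width spans 6 standard deviations. The function names, the 8 kHz sampling rate, and the 5 millisecond hop are illustrative assumptions and are not taken from the disclosure.

```python
import math


def gaussian_window(num_samples):
    """Gaussian window whose total width equals 6 standard deviations."""
    sigma = num_samples / 6.0
    center = (num_samples - 1) / 2.0
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(num_samples)]


def time_sample_windows(signal, sample_rate_hz, window_s, hop_s):
    """Split `signal` into Gaussian-weighted signal portions (hypothetical helper)."""
    size = round(window_s * sample_rate_hz)
    hop = round(hop_s * sample_rate_hz)
    window = gaussian_window(size)
    portions = []
    for start in range(0, len(signal) - size + 1, hop):
        chunk = signal[start:start + size]
        portions.append([s * w for s, w in zip(chunk, window)])
    return portions


# Assumed example: 100 ms of a constant signal at 8 kHz, 10 ms windows, 5 ms hop.
portions = time_sample_windows([1.0] * 800, 8000, 0.010, 0.005)
```

A hop shorter than the window length yields overlapping windows; a hop equal to the window length yields consecutive windows.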
The chirp rate variable may be a metric derived from chirp rate (e.g., rate of change in frequency). For example, in some implementations, the chirp rate variable may be the fractional chirp rate. The fractional chirp rate may be expressed as:
χ=X/ω; (1)
where χ represents fractional chirp rate, X represents chirp rate, and ω represents frequency.
The processing performed by transform module 22 may result in a multi-dimensional representation of the audio. This representation, or “space,” may have a domain given by frequency and (fractional) chirp rate. The representation may have a co-domain (output) given by the transform coefficient. As such, upon performance of the transform by transform module 22, a transformed signal portion may specify a transform coefficient as a function of frequency and fractional chirp rate for the time sample window associated with the transformed signal portion. The transform coefficient for a specific frequency and fractional chirp rate pair may represent the complex number directly produced by the transform, the modulus of this complex number, or the square of this modulus, for the specific frequency and fractional chirp rate within the time sample window associated with the transformed signal portion.
By way of illustration,
Referring back to
where i is the imaginary number, t represents time, f represents the center frequency of the filter, c represents the chirp rate of the filter, and σ represents the standard deviation (e.g., the width) of the time sample window of the filter.
The filters applied by transform module 22 may be complex exponentials. This may result in the transform coefficients produced by the filters including both real and imaginary components. As used herein, the “transform coefficient” may refer to a complex number including both real and imaginary components, a modulus of a complex number, the square of a modulus of a complex number, and/or other representations of complex numbers and/or components thereof. Applying the filters to a signal portion may be accomplished, for example, by taking the inner product of the time data of the signal portion and the complex filter. The parameters of the filters, such as central frequency, and chirp rate, may be set based on user selection, preset settings, the sound signal being processed, and/or other factors.
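The inner-product step may be sketched as follows. The expression for the filter is not reproduced in this text, so the Gaussian-windowed linear chirp used below is an assumed form that is merely consistent with the named parameters (center frequency f, chirp rate c, and standard deviation σ). The function names, the use of hertz, and the test signal are likewise hypothetical.

```python
import cmath
import math


def chirp_filter(times, center_freq_hz, chirp_rate_hz_per_s, sigma_s):
    """Assumed filter: Gaussian window times a linearly chirped complex exponential."""
    out = []
    for t in times:
        window = math.exp(-0.5 * (t / sigma_s) ** 2)
        phase = 2 * math.pi * (center_freq_hz * t + 0.5 * chirp_rate_hz_per_s * t * t)
        out.append(window * cmath.exp(1j * phase))
    return out


def transform_coefficient(signal, times, center_freq_hz, frac_chirp, sigma_s):
    """Complex coefficient for one (frequency, fractional chirp rate) pair."""
    chirp_rate = frac_chirp * center_freq_hz  # chirp rate = fractional chirp rate * frequency
    filt = chirp_filter(times, center_freq_hz, chirp_rate, sigma_s)
    # Inner product of the time data with the complex filter.
    return sum(s * f.conjugate() for s, f in zip(signal, filt))


# Assumed test signal: a 200 Hz tone rising at 2000 Hz/s (fractional chirp rate 10 per second).
fs = 8000
times = [n / fs for n in range(-240, 241)]  # 60 ms span, i.e. 6 sigma for sigma = 10 ms
signal = [math.cos(2 * math.pi * (200 * t + 0.5 * 2000 * t * t)) for t in times]

matched = abs(transform_coefficient(signal, times, 200.0, 10.0, 0.01))
flat = abs(transform_coefficient(signal, times, 200.0, 0.0, 0.01))
off = abs(transform_coefficient(signal, times, 300.0, 10.0, 0.01))
```

Under these assumptions, the modulus is largest where both the frequency and the fractional chirp rate of the filter match the signal, smaller where only the frequency matches, and close to zero where the frequency does not match.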
Transforming the audio signal into the frequency-chirp domain may facilitate the identification of signal intensity contributions of individual sounds (e.g., harmonic sounds) within the signal portions. As a given sound changes pitch, the change in frequency (or chirp rate) of a harmonic of the given sound may be characterized as a function of the rate at which the pitch is changing and the current frequency of the harmonic. This may be characterized for the nth harmonic as:
Δφ=ω1(Xn/ωn) (2)
where Δφ represents the rate of change in pitch (φ), or “pitch velocity” of the sound, Xn represents the chirp rate of the nth harmonic, ωn represents the frequency of the nth harmonic, and ω1 represents the frequency of the first harmonic (e.g., the fundamental tone). By referring to equations (1) and (2), it may be seen that the rate of change in pitch of a sound and fractional chirp rate(s) of the nth harmonic of the sound are closely related, and that equation (2) can be rewritten as:
Δφ=ω1·χn. (3)
Since the rate of change in pitch is a sound-wide parameter that holds for the sound as a whole, with all of its underlying harmonics (assuming a harmonic sound/source), it can be inferred from equation (3) that the fractional chirp rate may be the same for all of the harmonics of the sound. The system 10 may be configured to leverage this phenomenon to identify contributions of individual sounds in transformed signal portions.
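The inference above can be checked numerically with a short sketch. It assumes an idealized harmonic sound in which the nth harmonic sits at n times the fundamental frequency, so that its chirp rate is n times the pitch velocity. The function names and the numeric values are illustrative only.

```python
def fractional_chirp_rate(chirp_rate, frequency):
    """Fractional chirp rate: chirp rate divided by frequency."""
    return chirp_rate / frequency


def harmonic_fractional_chirp_rates(fundamental_hz, pitch_velocity_hz_per_s, num_harmonics):
    """Fractional chirp rate of each harmonic of an idealized harmonic sound."""
    rates = []
    for n in range(1, num_harmonics + 1):
        freq_n = n * fundamental_hz             # frequency of the nth harmonic
        chirp_n = n * pitch_velocity_hz_per_s   # chirp rate of the nth harmonic
        rates.append(fractional_chirp_rate(chirp_n, freq_n))
    return rates


# Assumed example: 100 Hz fundamental rising at 50 Hz per second.
rates = harmonic_fractional_chirp_rates(100.0, 50.0, 5)
```

Every entry of `rates` equals the pitch velocity divided by the fundamental frequency, which is why the harmonics of one sound line up along a single fractional chirp rate.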
By way of illustration, referring back to
Similarly, a plurality of sounds present in a single signal portion may be distinguished in the frequency-chirp domain because they would likely have different fractional chirp rates. By way of non-limiting example,
It will be appreciated that a two dimensional view of transformed audio information may be taken, for example, along a slice through the frequency-chirp domain that corresponds to a specific fractional chirp rate (e.g., along one of fractional chirp rate rows 78, 82, or 84, shown in
Other spikes (e.g., spikes 94 and/or 96) may be present in the transformed audio information. These spikes may not be associated with harmonic sound corresponding to spikes 92. The difference between spikes 92 and spike(s) 94 and/or 96 may not be amplitude, but instead frequency, as spike(s) 94 and/or 96 may not be at a harmonic frequency of the harmonic sound. As such, these spikes 94 and/or 96, and the rest of the amplitude between spikes 92 may be a manifestation of noise in the audio signal. As used in this instance, “noise” may not refer to a single auditory noise, but instead to sound (whether or not such sound is harmonic, diffuse, white, or of some other type) other than the harmonic sound associated with spikes 92.
As was mentioned previously, the transformation that yields the transformed audio information from the audio signal may result in the coefficient related to energy being a complex number. In such implementations, the complex number for the coefficient generated by the transform may be preserved. In such implementations, for example, the real and imaginary portions of the coefficient may be analyzed separately, at least at first. By way of illustration, plot 90 may represent the real portion of the coefficient, and a separate plot (not shown) may represent the imaginary portion of the coefficient as a function of frequency. The plot representing the imaginary portion of the coefficient as a function of frequency may have spikes at the harmonics of the harmonic sound that corresponds to spikes 92.
Referring back to
Referring back to
Determination of the tone likelihood metric for a given frequency may be based on a correlation between the transformed audio information at and/or near the given frequency and a peak function having its center at the given frequency. The peak function may include a Gaussian peak function, a χ² distribution, and/or other functions. The correlation may include determination of the dot product of the normalized peak function and the normalized transformed audio information at and/or near the given frequency. The dot product may be multiplied by −1, to indicate a likelihood of a peak centered on the given frequency, as the dot product alone may indicate a likelihood that a peak centered on the given frequency does not exist.
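A minimal sketch of the correlation step follows, assuming a Gaussian peak function and a one-dimensional list of non-negative coefficient magnitudes. The function names, the neighborhood size, and the peak width are hypothetical, and any sign convention applied to the dot product is left to the caller.

```python
import math


def gaussian_peak(num_points, width):
    """Gaussian peak function sampled on `num_points` bins, centered in the middle."""
    center = (num_points - 1) / 2.0
    return [math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(num_points)]


def normalize(values):
    """Scale a vector to unit Euclidean length (unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else list(values)


def tone_likelihood(spectrum, index, half_span=3, width=1.0):
    """Dot product of the normalized local spectrum and a normalized peak function."""
    lo, hi = index - half_span, index + half_span + 1
    if lo < 0 or hi > len(spectrum):
        return 0.0  # not enough neighboring bins to correlate against
    local = normalize(spectrum[lo:hi])
    peak = normalize(gaussian_peak(hi - lo, width))
    return sum(a * b for a, b in zip(local, peak))


# Assumed example: a flat floor with one spike centered on bin 10.
spectrum = [0.1] * 21
spectrum[9], spectrum[10], spectrum[11] = 0.6, 1.0, 0.6
at_peak = tone_likelihood(spectrum, 10)
off_peak = tone_likelihood(spectrum, 5)
```

Because both vectors are normalized, the result depends on the shape of the local spectrum rather than on its amplitude.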
By way of illustration,
Determination of the tone likelihood metric as a function of frequency may result in the creation of a new representation of the data that expresses a tone likelihood metric as a function of frequency. By way of illustration,
Referring back to
The logarithm sub-module may be configured to take the logarithm (e.g., the natural logarithm) of the real and imaginary tone likelihood metrics. This may result in determination of the logarithm of each of the real tone likelihood metric and the imaginary tone likelihood metric as a function of frequency. The aggregation sub-module may be configured to sum the real tone likelihood metric and the imaginary tone likelihood metric for common frequencies (e.g., summing the real tone likelihood metric and the imaginary tone likelihood metric for a given frequency) to aggregate the real and imaginary tone likelihood metrics. This aggregation may be implemented as the tone likelihood metric, the exponential function of the aggregated values may be taken for implementation as the tone likelihood metric, and/or other processing may be performed on the aggregation prior to implementation as the tone likelihood metric.
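The logarithm and aggregation steps may be sketched as follows, assuming the real and imaginary tone likelihood metrics are available as equal-length lists of positive values indexed by frequency. The function name and the small floor that guards against taking the logarithm of zero are assumptions.

```python
import math


def aggregate_tone_likelihood(real_metric, imag_metric, floor=1e-6, exponentiate=False):
    """Sum the logarithms of the real and imaginary metrics at each common frequency."""
    aggregated = []
    for r, i in zip(real_metric, imag_metric):
        log_sum = math.log(max(r, floor)) + math.log(max(i, floor))
        # Optionally return to the linear domain by taking the exponential.
        aggregated.append(math.exp(log_sum) if exponentiate else log_sum)
    return aggregated


# Assumed example values for two frequencies.
linear = aggregate_tone_likelihood([0.5, 0.9], [0.5, 0.9], exponentiate=True)
logs = aggregate_tone_likelihood([0.5, 0.9], [0.5, 0.9])
```

Summing logarithms and then exponentiating is equivalent to multiplying the two metrics, so a frequency scores highly only when both the real and imaginary portions indicate a peak.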
The pitch likelihood module 26 may be configured to determine, based on the determination of tone likelihood metrics by tone likelihood module 24, a pitch likelihood metric as a function of pitch for the audio signal within the time sample window. The pitch likelihood metric for a given pitch may be related to the likelihood that a sound represented by the audio signal has the given pitch during the time sample window. The pitch likelihood module 26 may be configured to determine the pitch likelihood metric for a given pitch by aggregating the tone likelihood metric determined for the tones that correspond to the harmonics of the given pitch.
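The aggregation over harmonics may be sketched as follows. Summing logarithms is assumed here as the aggregation, and the function name, the bin spacing, the number of harmonics, and the floor value are hypothetical choices made for illustration.

```python
import math


def pitch_likelihood(tone_likelihood, bin_hz, pitch_hz, num_harmonics=5, floor=1e-6):
    """Sum the log tone likelihood at the bins where harmonics of `pitch_hz` are expected."""
    total = 0.0
    for n in range(1, num_harmonics + 1):
        index = round(n * pitch_hz / bin_hz)
        if index >= len(tone_likelihood):
            break  # harmonic lies above the represented frequency range
        total += math.log(max(tone_likelihood[index], floor))
    return total


# Assumed example: 10 Hz bins, strong tones at 100, 200, 300, 400, and 500 Hz.
tones = [0.1] * 120
for harmonic_bin in (10, 20, 30, 40, 50):
    tones[harmonic_bin] = 0.9

candidates = [50.0, 100.0, 200.0]
best_pitch = max(candidates, key=lambda p: pitch_likelihood(tones, 10.0, p))
```

In this example the 50 Hz and 200 Hz candidates each land on only two of the strong tones, so the 100 Hz candidate receives the highest aggregate.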
By way of illustration, referring back to
Returning to
The logarithm sub-module may be configured to take the logarithm (e.g., the natural logarithm) of the tone likelihood metrics. In implementations in which tone likelihood module 24 generates the tone likelihood metric in logarithm form (e.g., as discussed above), pitch likelihood module 26 may be implemented without the logarithm sub-module. The aggregation sub-module may be configured to sum, for each pitch (e.g., φk, for k=0 through n) the logarithms of the tone likelihood metric for the frequencies at which harmonics of the pitch would be expected (e.g., as represented in
Operation of pitch likelihood module 26 may result in a representation of the data that expresses the pitch likelihood metric as a function of pitch. By way of illustration,
As was mentioned above, in some implementations, the transformed audio information may have been transformed to the frequency-chirp domain. In such implementations, the transformed audio information may be viewed as a plurality of sets of transformed audio information that correspond to separate fractional chirp rates (e.g., separate one-dimensional slices through the two-dimensional frequency-chirp domain, each one-dimensional slice corresponding to a different fractional chirp rate). These sets of transformed audio information may be processed separately by modules 24 and/or 26 illustrated in
By way of illustration,
Referring back to
Referring again to
Turning back to
The pitch estimation module 32 may be configured to determine, for the individual time sample windows in the processing time window, estimated pitch and estimated fractional chirp rate. For the primary time sample window, this may be performed by determining the estimated pitch and the estimated fractional chirp rate randomly, through an analysis of the pitch likelihood metric, by rule, by user selection, and/or based on other criteria. For other time sample windows in the processing time window, the pitch estimation module may be configured to determine estimated pitch and estimated fractional chirp rate by iterating through the processing time window from the primary time sample window and determining the estimated pitch and/or estimated fractional chirp rate for a given time sample window based on one or more of the pitch likelihood metric specified by the transformed audio information for the given time sample window, a predicted pitch for the given time sample window determined based on an estimated pitch and an estimated fractional chirp rate for another time sample window, a correlation between harmonic envelope at different pitches in the given time sample window and the harmonic envelope at an estimated pitch for a time sample window adjacent to the given time sample window, and/or other criteria.
The pitch prediction module 34 may be configured to determine predicted pitch for time sample windows. This may include, for a first time sample window in the processing time window, determining a predicted pitch for the first time sample window. The pitch prediction module may be configured to determine the predicted pitch for the first time sample window based on an estimated pitch and an estimated chirp rate previously determined (e.g., by pitch prediction module 34) for a second time sample window. The second time sample window may be adjacent to the first time sample window, near the first time sample window, and/or have other relationships with the first time sample window. The second time sample window may be the primary time sample window, and/or some other time sample window in the processing time window. The second time sample window may come before or after the first time sample window with respect to the timeline of the audio signal.
Determining the predicted pitch for the first time sample window may include, for example, incrementing the pitch from the estimated pitch determined for the second time sample window by an amount that corresponds to the estimated fractional chirp rate determined for the second time sample window and a time difference between the first time sample window and the second time sample window. For example, this determination of a predicted pitch may be expressed mathematically for some implementations as:
where φt1 represents the predicted pitch for the first time sample window, φt2 represents the estimated pitch for the second time sample window, Δt represents the time difference between the first time sample window and the second time sample window, and the remaining term represents an estimated fractional chirp rate of the fundamental frequency of the pitch φt2 (which can be determined from the estimated fractional chirp rate for the second time sample window).
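One way to read the prediction step is sketched below. It assumes a linear extrapolation in which the pitch velocity is obtained from the relation Δφ=ω1·χ given earlier, with the estimated pitch standing in for the fundamental frequency. The function name and the numeric values are hypothetical, and the sketch is not presented as the exact expression used by pitch prediction module 34.

```python
def predict_pitch(estimated_pitch_hz, estimated_frac_chirp, dt_s):
    """Extrapolate pitch from one time sample window to another.

    `dt_s` is the time of the target window minus the time of the window
    that already has an estimate; it may be negative when iterating backward.
    """
    pitch_velocity = estimated_pitch_hz * estimated_frac_chirp  # Hz per second
    return estimated_pitch_hz + dt_s * pitch_velocity


# Assumed example: 200 Hz pitch, fractional chirp rate 0.5 per second, 10 ms step.
forward = predict_pitch(200.0, 0.5, 0.010)
backward = predict_pitch(200.0, 0.5, -0.010)
```

The signed time difference lets the same function serve iteration toward either boundary of the processing time window.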
Harmonic sounds may have a feature referred to herein as “harmonic envelope”. By way of illustration, turning back to
Harmonic envelope 150 may be characterized, at least in part, by an envelope vector. The envelope vector of the sound corresponding to spikes 92 may be defined as a vector having coordinates that correspond to values for the coefficient related to signal intensity at two or more of the harmonic frequencies of the sound (e.g., at two or more of spikes 92).
Referring back to
The envelope correlation module 38 may be configured to determine values of a correlation metric for time sample windows as a function of pitch and fractional chirp rate. The values of the correlation metric may indicate correlation between the envelope vector in a time sample window for a specific pitch and fractional chirp rate with the envelope vector in another time sample window for the estimated pitch and estimated fractional chirp rate of the other time sample window. The other time sample window may be, for example, the time sample window having the estimated pitch and estimated fractional chirp rate implemented by pitch prediction module 34 to determine predicted pitch for the time sample window.
For example, envelope correlation module 38 may be configured to determine values of the correlation metric for the first time sample window as a function of pitch and fractional chirp rate based on the envelope vector for the estimated pitch and the estimated fractional chirp rate of the second time sample window. The value of the correlation metric for a given pitch and a given fractional chirp rate in the first time sample window may indicate a level of correlation between the envelope vector for the given pitch and the given fractional chirp rate in the first time sample window and the envelope vector for the estimated pitch and the estimated fractional chirp rate in the second time sample window.
By way of illustration,
Determination of values of a correlation metric for the first time sample window may include determining values of a metric that indicates correlation between the envelope vectors 168, 170, and 172 for the individual pitches in the first time sample window with the envelope vector 164 for the estimated pitch of the second time sample window. Such a correlation metric may include one or more of, for example, a distance metric, a dot product, a correlation coefficient, and/or other metrics that indicate correlation.
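As a minimal sketch, the correlation coefficient is used below as the correlation metric (a distance metric or a dot product could be substituted). The function names, the 10 Hz bin spacing, and the example envelopes are assumptions made for illustration, and the spectrum is treated as a single slice at one fractional chirp rate.

```python
import math


def envelope_vector(spectrum, bin_hz, pitch_hz, num_harmonics=4):
    """Coefficient values at the first few harmonic frequencies of `pitch_hz`."""
    return [spectrum[round(n * pitch_hz / bin_hz)] for n in range(1, num_harmonics + 1)]


def correlation_coefficient(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Envelope vector at the estimated pitch of the second time sample window (assumed).
previous_envelope = [1.0, 0.8, 0.5, 0.2]

# First time sample window: two sounds with different envelopes (assumed).
spectrum = [0.0] * 100
for n, value in enumerate([0.9, 0.75, 0.45, 0.2], start=1):
    spectrum[10 * n] = value   # harmonics of a 100 Hz sound
for n, value in enumerate([0.2, 0.4, 0.9, 1.0], start=1):
    spectrum[13 * n] = value   # harmonics of a 130 Hz sound

corr_100 = correlation_coefficient(envelope_vector(spectrum, 10.0, 100.0), previous_envelope)
corr_130 = correlation_coefficient(envelope_vector(spectrum, 10.0, 130.0), previous_envelope)
```

In this example the 100 Hz candidate has an envelope shaped like that of the previous window and so correlates strongly, while the 130 Hz candidate does not.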
In the example provided in
It will be appreciated that the illustration of the envelope vectors in
Turning back to
In implementations in which weighting module 40 is configured to weight based on the predicted pitch determined by pitch prediction module 34, the weighting may apply relatively larger weights to the pitch likelihood metric for pitches in the next time sample window at or near the predicted pitch and relatively smaller weights to the pitch likelihood metric for pitches in the next time sample window that are further away from the predicted pitch. For example, this weighting may include multiplying the pitch likelihood metric by a weighting function that varies as a function of pitch and may be centered on the predicted pitch. The width, the shape, and/or other parameters of the weighting function may be determined based on user selection (e.g., through settings and/or entry or selection), fixed, based on noise present in the audio signal, based on the range of fractional chirp rates in the sample, and/or other factors. As a non-limiting example, the weighting function may be a Gaussian function.
In implementations in which weighting module 40 is configured to weight based on the correlation metric determined by the envelope correlation module 38, relatively larger weights may be applied to the pitch likelihood metric at pitches having values of the correlation metric that indicate relatively high correlation with the envelope vector for the estimated pitch in the other time sample window. The weighting may apply relatively smaller weights to the pitch likelihood metric at pitches having correlation metric values in the next time sample window that indicate relatively low correlation with the envelope vector for the estimated pitch in the other time sample window.
The pitch estimation module 32 may be configured such that for at least some of the time sample windows other than the primary time sample window, an estimated pitch and an estimated fractional chirp rate may be determined from the weighted pitch likelihood metric generated by weighting module 40. For example, pitch estimation module 32 may be configured such that determination of the estimated pitch for the first time sample window may include identifying a maximum in the weighted pitch likelihood metric and determining the pitch and/or fractional chirp rate corresponding to this maximum as the estimated pitch and/or the estimated fractional chirp rate for the first time sample window.
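The weighting and maximum-finding steps may be sketched as follows, assuming a Gaussian weighting function centered on the predicted pitch and a pitch likelihood metric expressed as non-negative linear values over pitch only (the fractional chirp rate dimension is omitted for brevity). The function names, the 20 Hz width, and the example values are hypothetical.

```python
import math


def weight_by_predicted_pitch(pitches_hz, likelihoods, predicted_pitch_hz, width_hz):
    """Multiply the pitch likelihood metric by a Gaussian centered on the predicted pitch."""
    return [
        value * math.exp(-0.5 * ((pitch - predicted_pitch_hz) / width_hz) ** 2)
        for pitch, value in zip(pitches_hz, likelihoods)
    ]


def estimate_pitch(pitches_hz, metric):
    """Return the pitch at which the (weighted) metric is largest."""
    best = max(range(len(pitches_hz)), key=lambda i: metric[i])
    return pitches_hz[best]


# Assumed example: a competing sound at 300 Hz scores slightly higher than the tracked sound.
pitches = [100.0, 150.0, 200.0, 250.0, 300.0]
likelihoods = [0.2, 0.3, 0.8, 0.3, 0.85]

unweighted_estimate = estimate_pitch(pitches, likelihoods)
weighted = weight_by_predicted_pitch(pitches, likelihoods, predicted_pitch_hz=205.0, width_hz=20.0)
weighted_estimate = estimate_pitch(pitches, weighted)
```

Without weighting, the maximum jumps to the competing sound; with weighting, the estimate stays on the pitch path near the predicted pitch.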
It will be appreciated that as modules 28, 30, 32, 34, 36, 38, and 40 process a plurality of overlapping processing time windows in an iterative fashion, a plurality of estimated pitches and/or estimated fractional chirp rates may be determined for individual time sample windows that lie within two or more of the overlapping processing time windows. The estimated pitches and/or estimated fractional chirp rates determined for an individual time sample window may not all be the same, as different pitch paths through the different processing time windows may result in discrepancies. Estimated pitch aggregation module 42 may be configured to aggregate, over individual time sample windows, determinations of estimated pitch and/or estimated fractional chirp rate. By way of non-limiting example, determining an aggregated estimated pitch for a given time sample window may include determining a mean estimated pitch, determining a median estimated pitch, selecting an estimated pitch that was determined most often for the time sample window, and/or other aggregation techniques. The estimated pitch aggregation module 42 may be configured such that the determination of a mean, a selection of a determined estimated pitch, and/or other aggregation techniques may be weighted. For example, the individually determined estimated pitches for the given time sample window may be weighted according to their corresponding pitch likelihood metrics. These pitch likelihood metrics may include the pitch likelihood metrics determined by pitch likelihood module 26, the weighted pitch likelihood metric determined for the given time sample window by weighting module 40, and/or other pitch likelihood metrics for the given time sample window.
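The aggregation techniques named above may be sketched as follows. The function name, the `method` parameter, and the example values are assumptions. Weighting is shown only for the mean, with the pitch likelihood metric of each estimate serving as its weight.

```python
import statistics
from collections import Counter


def aggregate_estimated_pitch(estimates, weights=None, method="mean"):
    """Combine the estimated pitches found for one time sample window."""
    if method == "mean":
        if weights is None:
            weights = [1.0] * len(estimates)
        return sum(w * p for w, p in zip(weights, estimates)) / sum(weights)
    if method == "median":
        return statistics.median(estimates)
    if method == "mode":
        # The estimate that was determined most often.
        return Counter(estimates).most_common(1)[0][0]
    raise ValueError("unknown aggregation method: " + method)


# Assumed example: four overlapping processing time windows produced these estimates.
estimates = [198.0, 200.0, 200.0, 210.0]
likelihoods = [1.0, 2.0, 2.0, 1.0]

plain_mean = aggregate_estimated_pitch(estimates)
weighted_mean = aggregate_estimated_pitch(estimates, likelihoods)
median_pitch = aggregate_estimated_pitch(estimates, method="median")
most_common = aggregate_estimated_pitch(estimates, method="mode")
```

The same function could be applied to the estimated fractional chirp rates of a time sample window.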
The voiced section module 44 may be configured to divide individual time sample windows into voiced and unvoiced categories. The voiced time sample windows may be time sample windows during which the sounds represented in the audio signal are harmonic or “voiced” (e.g., spoken vowel sounds). The unvoiced time sample windows may be time sample windows during which the sounds represented in the audio signal are not harmonic or “unvoiced” (e.g., spoken consonant sounds).
In some implementations, voiced section module 44 may be configured to make this determination based on a harmonic energy ratio. The harmonic energy ratio for a given time sample window may be determined based on the transformed audio information for the given time sample window. The harmonic energy ratio may be determined as the ratio of the sum of the magnitudes of the coefficient related to energy at the harmonics of the estimated pitch (or aggregated estimated pitch) in the time sample window to the sum of the magnitudes of the coefficient related to energy at the harmonics across the spectrum for the time sample window. The transformed audio information implemented in this determination may be specific to an estimated fractional chirp rate (or aggregated estimated fractional chirp rate) for the time sample window (e.g., a slice through the frequency-chirp domain along a common fractional chirp rate). The transformed audio information implemented in this determination may not be specific to a particular fractional chirp rate.
For a given time sample window, if the harmonic energy ratio is above some threshold value, a determination may be made by voiced section module 44 that the audio signal during the time sample window represents voiced sound. If, on the other hand, for the given time sample window the harmonic energy ratio is below the threshold value, a determination may be made that the audio signal during the time sample window represents unvoiced sound. The threshold value may be fixed, or may be determined, for example, based on user selection (e.g., through settings and/or entry or selection), based on noise present in the audio signal, based on the fraction of time the harmonic source tends to be active (e.g., speech has pauses), and/or based on other factors.
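By way of non-limiting illustration, the harmonic energy ratio and the threshold comparison may be sketched as follows. The sketch assumes a single slice of coefficient magnitudes on a uniform frequency grid and takes the denominator to be the sum over all bins of that slice; the names `harmonic_energy_ratio`, `is_voiced`, `bin_hz`, and the default threshold of 0.5 are hypothetical and are not taken from the disclosure.

```python
def harmonic_energy_ratio(magnitudes, bin_hz, pitch_hz):
    """Ratio of magnitude at harmonics of `pitch_hz` to total magnitude.

    `magnitudes[k]` is assumed to be the coefficient magnitude at
    frequency k * bin_hz for one time sample window.
    """
    total = sum(magnitudes)
    if total == 0 or pitch_hz <= 0:
        return 0.0
    harmonic_bins = set()
    n = 1
    while True:
        k = round(n * pitch_hz / bin_hz)  # nearest bin to the nth harmonic
        if k >= len(magnitudes):
            break
        harmonic_bins.add(k)
        n += 1
    return sum(magnitudes[k] for k in harmonic_bins) / total


def is_voiced(magnitudes, bin_hz, pitch_hz, threshold=0.5):
    """Classify a window as voiced when the ratio exceeds the threshold."""
    return harmonic_energy_ratio(magnitudes, bin_hz, pitch_hz) > threshold
```

A window whose magnitude is concentrated at multiples of the estimated pitch produces a ratio near one, whereas a spectrally flat window produces a ratio near the fraction of bins that happen to coincide with harmonics.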
In some implementations, voiced section module 44 may be configured to divide individual time sample windows based on the pitch likelihood metric for estimated pitch (or aggregated estimated pitch). For example, for a given time sample window, if the pitch likelihood metric is above some threshold value, a determination may be made that the audio signal during the time sample window represents voiced sound. If, on the other hand, for the given time sample window the pitch likelihood metric is below the threshold value, a determination may be made that the audio signal during the time sample window represents unvoiced sound. The threshold value may be fixed, or may be determined, for example, based on user selection (e.g., through settings and/or entry or selection), based on noise present in the audio signal, based on the fraction of time the harmonic source tends to be active (e.g., speech has pauses), and/or based on other factors.
Responsive to a determination that the audio signal during a time sample window represents unvoiced sound, voiced section module 44 may be configured to set the estimated pitch (or aggregated estimated pitch) for the time sample window to some predetermined value. For example, this value may be set to 0, or some other value. This may cause the tracking of pitch accomplished by system 10 to designate that harmonic speech may not be present or prominent in the time sample window.
The resynthesis module 46 may be configured to resynthesize the audio signal based on individual harmonics and corresponding pitches determined from the transformed audio information (e.g., estimated pitches determined by pitch estimation module 32 and/or estimated pitch aggregation module 42). According to some implementations, resynthesizing the audio signal may include tracking one or more pitches of the sound to estimate individual pitches and corresponding amplitudes as a function of time for the sound. Individual harmonics of a sound may be synthesized using oscillators corresponding to individual harmonics. Synthesizing individual harmonics may include, for a given harmonic, integrating a corresponding pitch over time to determine the unwrapped phase of the given harmonic. Individual ones of the oscillators may be based on a cosine function. The synthesized harmonics may be summed to obtain the resynthesized audio signal.
According to some implementations, the output y as a function of time t of the ith oscillator may be expressed as, or similar to,
$$y_i(t) = \cos\left(\int_0^t i\,\phi(\tau)\,d\tau\right),$$
where φ is pitch (first harmonic) as a function of time. This equation may be fixed, so the entire representation of a sound is stored in the pitch and harmonic amplitude parameters. Time courses may be represented sparsely because pitch and envelope (the set of harmonic amplitudes) change slowly relative to the sampling rate. For example, a cubic spline with 20 knots may provide an accurate fit to the pitch time course over one second for a human voice. Similarly, the harmonic amplitudes may be represented with about 10 knots along the frequency dimension and 20 per second in time to form an “amplitude surface” (e.g., amplitude as a function of frequency and time, and/or transformed audio information) expressing the changing envelope. Some or all harmonic amplitudes and envelopes for synthesizing consonants with a white noise source may be shaped by such an amplitude surface.
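By way of non-limiting illustration, the oscillator-based resynthesis may be sketched as follows. The sketch assumes that pitch is given in Hz at every output sample, so the running integral is multiplied by 2π to obtain phase in radians (a factor not written out in the expression above), and that the integral is approximated with a rectangle rule. The function name `resynthesize` and its parameters are hypothetical.

```python
import math


def resynthesize(pitch_hz, amplitudes, sample_rate):
    """Sum cosine oscillators whose phases come from integrating pitch.

    pitch_hz[n]      : pitch (first harmonic) at sample n, in Hz (assumed)
    amplitudes[n][i] : amplitude of harmonic i + 1 at sample n (assumed)
    """
    out = []
    phase = 0.0  # unwrapped phase of the first harmonic, in radians
    for n, f0 in enumerate(pitch_hz):
        # Harmonic number h has phase h * (integral of pitch), so a single
        # running integral serves every oscillator.
        sample = sum(a * math.cos((i + 1) * phase)
                     for i, a in enumerate(amplitudes[n]))
        out.append(sample)
        # Rectangle-rule step of the integral of pitch over time.
        phase += 2.0 * math.pi * f0 / sample_rate
    return out
```

Because phase is obtained by integration, adjusting the pitch or an individual harmonic amplitude from one sample to the next does not introduce a phase discontinuity in this sketch.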
In some implementations, resynthesis module 46 may be configured to solve any phase problems because the audio signal may be built through integration, where phase is a consequence of the audio signal and not something that needs to be factored in. Also, the degree of compression of the resynthesized audio signal may go below a kB per second for voice, which is far better than the current mp3 standard.
The resynthesized audio signal may be built from oscillators and parameters that specify pitch and harmonic amplitudes as a function of time (e.g., the envelope vectors for the estimated pitch determined by envelope vector module 36 for the time sample windows over time). One or more of these parameters may be adjusted independently of the others without altering the phase and without harmonics suddenly dropping out.
In some implementations, individual ones of the oscillators may include a white noise source to simulate a whispered version of the voice that retains word shaping and speech rhythms. Parameters may be altered to adjust for known channel distortions. For example, cell phones vary subtly in their pass-band, but generally have the same approximate high and low roll-offs. A correction may be made by dividing the transformed audio information by the roll-off transfer function.
The noise subtraction module 48 may be configured to subtract noise from the transformed audio information. Subtracting noise may include interpolating across peak points of harmonic pitch paths through the transformed audio information. The peak points may lie along harmonic frequencies in the transformed audio information, and may be determined as a function of frequency and time for a given harmonic. In some implementations, interpolation across the peak points may include polynomial interpolation, use of splines, and/or other interpolation techniques.
Subtracting noise may further include interpolating across trough points of harmonic pitch paths through the transformed audio information. The trough points may be positioned midway between peak points of adjacent harmonic frequencies in the transformed audio information, and may be determined as a function of frequency and time. In some implementations, interpolation across the trough points may include polynomial interpolation, use of splines, and/or other interpolation techniques. Such splines may include linear, quadratic, cubic, and/or other splines. Values associated with individual ones of the trough point interpolations may be subtracted from values associated with individual ones of the peak point interpolations to yield noise-reduced transformed audio information.
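By way of non-limiting illustration, the peak and trough interpolation may be sketched for a single time sample window as follows. The sketch assumes a linear spline across trough points, evaluates the trough curve only at the harmonic frequencies, and clamps the result at zero; the clamping, the function names `lerp` and `subtract_noise`, and the parameter names are assumptions made for this example rather than features recited above.

```python
def lerp(xs, ys, x):
    """Piecewise-linear interpolation through (xs, ys), clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for j in range(1, len(xs)):
        if x <= xs[j]:
            t = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
            return ys[j - 1] + t * (ys[j] - ys[j - 1])


def subtract_noise(magnitudes, bin_hz, pitch_hz, num_harmonics):
    """Return noise-reduced magnitudes at each harmonic of one time window.

    `magnitudes[k]` is assumed to be the magnitude at frequency k * bin_hz,
    and the list must extend past (num_harmonics + 0.5) * pitch_hz.
    """
    def mag_at(freq):
        return magnitudes[round(freq / bin_hz)]

    # Peak points lie at the harmonics; trough points lie midway between.
    peak_f = [n * pitch_hz for n in range(1, num_harmonics + 1)]
    trough_f = [(n + 0.5) * pitch_hz for n in range(0, num_harmonics + 1)]
    peak_m = [mag_at(f) for f in peak_f]
    trough_m = [mag_at(f) for f in trough_f]
    # The trough curve evaluated at a harmonic approximates the noise floor
    # under that harmonic; subtract it from the peak value.
    return [max(0.0, p - lerp(trough_f, trough_m, f))
            for f, p in zip(peak_f, peak_m)]
```

A polynomial or higher-order spline could be substituted for `lerp` without changing the overall structure of the sketch.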
The fence model module 50 may be configured to suppress noise between harmonics of the sound in the transformed audio information by centering functions at individual harmonics in the transformed audio information. The functions may serve to suppress noise between the harmonics in order to yield noise-reduced transformed audio information. The width of a given function may be based on a bandwidth of a corresponding harmonic.
In some implementations, individual ones of the functions utilized by fence model module 50 may include a Gaussian function. Such a Gaussian function may be configured to suppress information between the harmonics. The Gaussian function may be configured to replace information associated with individual harmonics with Gaussian (or other) curves to provide noise-reduced transformed audio information. A given Gaussian curve may be fitted to a corresponding harmonic.
An audio signal may be reconstructed from the noise-reduced transformed audio information, as discussed in connection with the reconstruction module 52. Such a reconstructed audio signal may closely resemble the undistorted original audio signal, even down to 3 dB noise. Additionally, the reconstructed audio signal may be more compact relative to the original audio signal because only the harmonic frequencies and corresponding amplitudes need to be transmitted to resynthesize the reconstructed audio signal.
According to some implementations, individual ones of the functions may include a rectangular fence. Such a fence may be configured to zero information between the harmonics while preserving information associated with the harmonics. In some implementations, one or more functions utilized by fence model module 50 may be separately applied to real and imaginary components of the transformed audio information.
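By way of non-limiting illustration, the Gaussian and rectangular fence functions may be sketched as a per-frequency mask that is applied separately to the real and imaginary components. The sketch assumes a uniform frequency grid and a single width shared by all harmonics; the names `fence_mask`, `apply_fence`, and `width_hz` are hypothetical.

```python
import math


def fence_mask(num_bins, bin_hz, pitch_hz, width_hz, shape="gaussian"):
    """Per-bin weights in [0, 1] that peak at each harmonic of `pitch_hz`."""
    mask = []
    for k in range(num_bins):
        freq = k * bin_hz
        # Distance from this bin to the nearest harmonic (harmonic >= 1).
        n = max(1, round(freq / pitch_hz))
        dist = abs(freq - n * pitch_hz)
        if shape == "gaussian":
            # Smoothly suppress information between the harmonics.
            mask.append(math.exp(-0.5 * (dist / width_hz) ** 2))
        elif shape == "rectangular":
            # Zero everything farther than width_hz from a harmonic.
            mask.append(1.0 if dist <= width_hz else 0.0)
        else:
            raise ValueError(f"unknown shape: {shape}")
    return mask


def apply_fence(real, imag, mask):
    """Apply the same mask separately to real and imaginary components."""
    return ([r * m for r, m in zip(real, mask)],
            [i * m for i, m in zip(imag, mask)])
```

In a fuller implementation the width of each function could be set from the bandwidth of the corresponding harmonic, rather than from one shared value as assumed here.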
The reconstruction module 52 may be configured to reconstruct an audio signal and/or portions of an audio signal (e.g., vowel and/or consonant sounds). In some implementations, one or more reverse transformations may be performed on transformed audio information and/or other non-time-domain information to obtain a reconstructed audio signal. That is, reconstruction may include converting a frequency domain representation and/or frequency-chirp domain representation to a time-domain representation, according to some implementations. The reconstruction module 52 may be configured to reconstruct noise-reduced transformed audio information obtained from noise subtraction module 48, fence model module 50, and/or another source of noise-reduced transformed audio information. A reverse transformation used by reconstruction module 52 may correspond to a reverse and/or inverse of a transform performed on the original audio signal to produce the transformed audio information.
The audio features module 54 may be configured to obtain features associated with the audio signal from the transformed audio information. One or more separate features may correspond to individual methods and/or techniques of obtaining the features such as, for example, the methods and/or techniques described in connection with information reduction module 56, spectral slope module 58, and/or elsewhere herein. A feature may include a set of numbers that determine a vector, for individual segments of the audio signal, in a feature space. The feature space may be a multi-dimensional space where each dimension corresponds to one component (dimension) of the feature vector. Feature vectors corresponding to two or more methods and/or techniques of obtaining features may be combined (e.g., by vector summation) as an aggregate feature vector. Individual feature vectors and/or the aggregate feature vector may be used in the feature space for segmentation and/or classification.
According to some implementations, the features may include reduced transformed audio information, spectral slope information, mel-frequency cepstral coefficient vectors, harmonic amplitude vectors, harmonic amplitude surfaces, pitched harmonic amplitude surfaces, time derivatives of the harmonic amplitude surfaces, Gabor transforms of the fractional harmonic derivatives, and/or other features. In some implementations, a separate harmonic amplitude vector may correspond to individual time windows. Harmonic amplitude surfaces may be defined over multiple time windows (e.g., about 200 msec or 20 time windows). Pitched harmonic amplitude surfaces may have time-corresponding pitch values appended to the amplitude surface feature vector. Time derivatives of the harmonic amplitude surfaces may be normalized for harmonic amplitude (e.g., fractional harmonic derivatives). Gabor transforms of the fractional harmonic derivatives may serve to compress features information.
Individual features may be associated with a feature score relative to a predetermined speaker model. A speaker model may include a set of speech characteristics and/or other speech information unique to a specific person, a group of people, and/or other source of audible speech. A feature score may facilitate a comparison between a corresponding feature to one or more speech characteristics of a speaker model.
The information reduction module 56 may be configured to obtain reduced transformed audio information as a feature associated with the audio signal. For example, a harmonic of a sound may be described as a vector defined as function of time. Such a vector may be one-dimensional, two-dimensional, or three-dimensional, or may have another dimensionality. According to some implementations, a one-dimensional vector may describe frequency versus time, amplitude versus time, chirp rate versus time, and/or other information. A two-dimensional vector may describe frequency and amplitude versus time, frequency and chirp rate versus time, amplitude and chirp rate versus time, and/or other information. A three-dimensional vector may describe frequency, chirp rate, and amplitude versus time, and/or other information. Such information may be determined over the span of a portion of the audio signal (e.g., the span of one vowel), a fixed time span (e.g., 200 msec), the entire audio signal, and/or another duration of time.
Obtaining the reduced transformed audio information may include determining an amplitude value for individual harmonics at individual time windows. The amplitude value may correspond to a magnitude of a coefficient related to energy amplitude of a given harmonic at a given time. The reduced transformed audio information may be constructed using single points that correspond to the harmonic positions for individual time windows. Those points may describe the center frequency, chirp rate, time position, amplitude (or some other coefficient related to signal intensity) value, and/or other information for individual harmonics. In some implementations, values within the reduced transformed audio information between the amplitude values may be set to zero. A filter bank may be applied to the reduced transformed audio information. The filter bank may include one or more mel filters and/or other filters.
The spectral slope module 58 may be configured to obtain spectral slope information based on the transformed audio information as a feature associated with the audio signal. In some implementations, obtaining the spectral slope information may include determining an instantaneous slope of the harmonic envelope (e.g., harmonic envelope 150 shown in
The noise estimator module 60 may be configured to obtain an SNR estimation as a time-varying quantity associated with the audio signal (e.g., for individual time sample windows of the audio signal). In accordance with some implementations, obtaining the SNR estimation may include determining the harmonic energy ratio. In some implementations, the ratio may be averaged over time. The noise estimator module 60 may be configured to compare a determined harmonic energy ratio to a threshold value.
Several considerations may exist in determining the threshold. For example, setting a specific threshold may result in a specific ratio of time when the ratio between harmonic energy and total energy breaches the threshold (e.g., voice is “on”) versus time when the ratio does not breach the threshold (e.g., voice is “off”). The threshold value may be selected such that the ratio of time (e.g., (voice on)/(voice off)) equals an average for a given classification of speaker (e.g., by nationality, gender, age, and/or other speaker classifications) being considered. As another exemplary consideration, for a given classification procedure, a quantity VLL may be derived that equals a variance of log-likelihood ratios across a plurality of speaker models. Regardless of a final decision as to selected speaker or that there is no match, the VLL may tend to be greater when non-voiced time segments are included in the classification. Therefore, the threshold may be adjusted to minimize the VLL.
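By way of non-limiting illustration, the first consideration above (selecting a threshold so that the ratio of voice-on time to voice-off time matches a target value) may be sketched as follows. The sketch assumes one harmonic energy ratio per time sample window with no tied values, and treats a window as "on" when its ratio is strictly above the threshold; the function name `threshold_for_on_off_ratio` and its parameters are hypothetical. Minimization of the VLL is not shown.

```python
def threshold_for_on_off_ratio(ratios, target_on_off):
    """Pick a threshold whose (windows on) / (windows off) is near the target.

    `ratios` holds one harmonic energy ratio per time sample window.
    `target_on_off` is the desired (voice on) / (voice off) time ratio.
    """
    if not ratios:
        raise ValueError("need at least one ratio")
    ordered = sorted(ratios)
    n = len(ordered)
    # Convert the on/off ratio into the fraction of windows that are "on".
    on_fraction = target_on_off / (1.0 + target_on_off)
    # Keep at least one window "off" so the threshold is a real data value.
    num_on = min(max(round(on_fraction * n), 0), n - 1)
    # The threshold is the largest "off" value; windows strictly above it
    # are treated as voice "on".
    return ordered[n - num_on - 1]
```

For example, with ten windows and a target ratio of one, the sketch returns the fifth-smallest harmonic energy ratio, so that five windows lie above the threshold and five do not.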
The aggregate score module 62 may be configured to obtain an aggregate classification score based on the feature scores according to a weighting scheme. The weighting scheme may be associated with a noise and/or SNR estimation. The aggregate classification score may be used to determine a likely speaker model to identify a source of the sound in the audio signal. In some implementations, aggregate score module 62 may be configured to obtain the aggregate classification score by summing weighted feature scores of one or more features obtained by audio features module 54. The weighting performed on the individual feature scores may be performed such that for a given audio signal, the features scores more likely to be accurate are weighted with relatively larger weights. Likelihood of accuracy, and hence feature score weighting, may be determined based on, for example, noise level, acoustic setting in which the audio signal was acquired, language being spoken, and/or other factors.
In accordance with some implementations, an aggregate classification score may be determined for individual speaker models of a set of speaker models. A speaker within the audio signal may be identified based on the aggregate scores of the speaker models in the set. For example, a speaker associated with a speaker model having the maximum aggregate classification score of the set may be identified as the speaker in the audio signal. In some implementations, an aggregate classification score $\hat{z}_j$ may be expressed as
$$\hat{z}_j = \sum_i c_i\, z_{ij},$$
where $z_{ij}$ is the feature score from the ith method pertaining to the jth speaker and/or speaker model, and $c_i$ is a positive real weighting coefficient associated with a noise estimate of the ith method. The ith method may correspond to methods described in connection with
The training module 64 may be configured to determine the weighting scheme utilized by aggregate score module 62. In accordance with some implementations, training module 64 may perform training operations on the audio signal and/or other information related to the audio signal to determine characteristics of the audio signal and/or conditions pertaining to the recording of the audio signal (e.g., type of microphone, closed room versus open air, and/or other conditions) that indicate a particular set of score weights that results in more accurate classification compared to other weighting schemes.
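By way of non-limiting illustration, the summation of weighted feature scores performed by aggregate score module 62, and the selection of the speaker model having the maximum aggregate classification score, may be sketched as follows. The function names `aggregate_scores` and `identify_speaker` and the nested-list layout of the scores are assumptions made for this example; the weights are simply supplied by the caller and are not derived from a noise estimate or from training here.

```python
def aggregate_scores(feature_scores, weights):
    """Weighted sum of feature scores for each speaker model.

    feature_scores[i][j] : score from method i for speaker model j (assumed)
    weights[i]           : positive weighting coefficient for method i
    """
    if len(feature_scores) != len(weights):
        raise ValueError("one weight is required per method")
    num_models = len(feature_scores[0])
    return [sum(c * row[j] for c, row in zip(weights, feature_scores))
            for j in range(num_models)]


def identify_speaker(feature_scores, weights):
    """Index of the speaker model with the maximum aggregate score."""
    totals = aggregate_scores(feature_scores, weights)
    return max(range(len(totals)), key=totals.__getitem__)
```

Changing the weights can change which speaker model is selected, which is why the weighting scheme determined by the training module matters for classification accuracy.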
Processor 12 may be configured to provide information processing capabilities in system 10. As such, processor 12 may include one or more of a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information. Although processor 12 is shown in
It should be appreciated that although modules 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, and 64 are illustrated in
Electronic storage 14 may comprise electronic storage media that stores information. The electronic storage media of electronic storage 14 may include one or both of system storage that is provided integrally (i.e., substantially non-removable) with system 10 and/or removable storage that is removably connectable to system 10 via, for example, a port (e.g., a USB port, a firewire port, etc.) or a drive (e.g., a disk drive, etc.). Electronic storage 14 may include one or more of optically readable storage media (e.g., optical disks, etc.), magnetically readable storage media (e.g., magnetic tape, magnetic hard drive, floppy drive, etc.), electrical charge-based storage media (e.g., EEPROM, RAM, etc.), solid-state storage media (e.g., flash drive, etc.), and/or other electronically readable storage media. Electronic storage 14 may include virtual storage resources, such as storage resources provided via a cloud and/or a virtual private network. Electronic storage 14 may store software algorithms, information determined by processor 12, information received via user interface 16, and/or other information that enables system 10 to function properly. Electronic storage 14 may be a separate component within system 10, or electronic storage 14 may be provided integrally with one or more other components of system 10 (e.g., processor 12).
User interface 16 may be configured to provide an interface between system 10 and users. This may enable data, results, and/or instructions and any other communicable items, collectively referred to as “information,” to be communicated between the users and system 10. Examples of interface devices suitable for inclusion in user interface 16 include a keypad, buttons, switches, a keyboard, knobs, levers, a display screen, a touch screen, speakers, a microphone, an indicator light, an audible alarm, and a printer. It is to be understood that other communication techniques, either hard-wired or wireless, are also contemplated by the present invention as user interface 16. For example, the present invention contemplates that user interface 16 may be integrated with a removable storage interface provided by electronic storage 14. In this example, information may be loaded into system 10 from removable storage (e.g., a smart card, a flash drive, a removable disk, etc.) that enables the user(s) to customize the implementation of system 10. Other exemplary input devices and techniques adapted for use with system 10 as user interface 16 include, but are not limited to, an RS-232 port, an RF link, an IR link, and a modem (telephone, cable, or other). In short, any technique for communicating information with system 10 is contemplated by the present invention as user interface 16.
At an operation 184, the sound signal may be separated into a set of signal portions. The signal portions may be associated with individual time sample windows. The time sample windows may be consecutive across time, may overlap, may be spaced apart, and/or may be arranged over time in other ways. An individual time sample window may correspond to a period of time that is greater than the sampling period of the sound signal being separated into signal portions. As such, the signal portion associated with a time sample window may include a plurality of signal samples. In some implementations, operation 184 may be performed by a time window module that is the same as or similar to time window module 20 (shown in
At an operation 186, the signal portions may be transformed into the frequency-chirp domain. The frequency-chirp domain may be given by frequency and (fractional) chirp rate. The frequency-chirp domain may have a co-domain (output) given by the transform coefficient. The chirp rate variable may be a metric derived from chirp rate (i.e., the rate of change in frequency). As such, upon performance of the transform at operation 186, a transformed signal portion may specify a transform coefficient as a function of frequency and fractional chirp rate for the time window associated with the transformed signal portion. In some implementations, operation 186 may be performed by a transform module that is the same as or similar to transform module 22 (shown in
At an operation 194, a tone likelihood metric may be determined based on the obtained transformed audio information. This determination may specify the tone likelihood metric as a function of frequency for the audio signal within the time sample window. The tone likelihood metric for a given frequency may indicate the likelihood that a sound represented by the audio signal has a tone at the given frequency during the time sample window. In some implementations, operation 194 may be performed by a tone likelihood module that is the same as or similar to tone likelihood module 24 (shown in
At an operation 196, a pitch likelihood metric may be determined based on the tone likelihood metric. Determination of the pitch likelihood metric may specify the pitch likelihood metric as a function of pitch for the audio signal within the time sample window. The pitch likelihood metric for a given pitch may be related to the likelihood that a sound represented by the audio signal has the given pitch. In some implementations, operation 196 may be performed by a pitch likelihood module that is the same as or similar to pitch likelihood module 26 (shown in
In some implementations, the transformed audio information may include a plurality of sets of transformed audio information. Individual ones of the sets of transformed audio information may correspond to individual fractional chirp rates. In such implementations, operations 192, 194, and 196 may be iterated for the individual sets of transformed audio information. At an operation 198, a determination may be made as to whether further sets of transformed audio information should be processed. Responsive to a determination that one or more further sets of transformed audio information are to be processed, method 190 may return to operation 192. Responsive to a determination that no further sets of transformed audio information are to be processed (or if the transformed audio information is not divided according to fractional chirp rate), method 190 may end. In some implementations, operation 198 may be performed by a processor that is the same as or similar to processor 12 (shown in
At an operation 204, a plurality of processing time windows may be defined across the signal duration. A processing time window may include a plurality of time sample windows. The processing time windows may correspond to a common time length. In some implementations, operation 204 may be performed by a processing window module that is the same as or similar to processing window module 28 (shown in
At an operation 206, for a processing time window defined at operation 204, a primary time sample window within the processing time window may be identified. In some implementations, the primary time sample window may be identified randomly, based on some analysis of pitch likelihood, by rule or parameter, based on user selection, and/or based on other criteria. In some implementations, identifying the primary time sample window may include identifying a maximum pitch likelihood. In some implementations, operation 206 may be performed by a primary window module that is the same as or similar to primary window module 30 (shown in
At an operation 208, an estimated pitch for the primary time sample window may be determined. In some implementations, the estimated pitch may be selected randomly, based on an analysis of pitch likelihood within the primary time sample window, by rule or parameter, based on user selection, and/or based on other criteria. As was mentioned above, the audio information may indicate, for a given time sample window, the pitch likelihood metric as a function of pitch. As such, the estimated pitch for the primary time sample window may be determined as the pitch exhibiting a maximum of the pitch likelihood metric for the primary time sample window.
As was mentioned above, in the audio information the pitch likelihood metric may further be specified as a function of fractional chirp rate. As such, the pitch likelihood metric may indicate chirp likelihood as a function of the pitch likelihood metric and pitch. At operation 208, in addition to the estimated pitch, an estimated fractional chirp rate for the primary time sample window may be determined. The estimated fractional chirp rate may be determined as the fractional chirp rate corresponding to a maximum of the pitch likelihood metric along the estimated pitch.
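By way of non-limiting illustration, determining the estimated pitch and the estimated fractional chirp rate for the primary time sample window may be sketched as a two-step maximum search. The sketch assumes the pitch likelihood metric is available as a small table indexed by fractional chirp rate and pitch; the function name `estimate_pitch_and_chirp` and the table layout are hypothetical.

```python
def estimate_pitch_and_chirp(likelihood, pitches, chirp_rates):
    """Return (estimated pitch, estimated fractional chirp rate).

    likelihood[c][p] is assumed to hold the pitch likelihood metric for
    chirp_rates[c] and pitches[p] in one time sample window.
    """
    # Step 1: the pitch exhibiting the maximum pitch likelihood metric,
    # taken over every fractional chirp rate.
    best_p = max(range(len(pitches)),
                 key=lambda p: max(row[p] for row in likelihood))
    # Step 2: the fractional chirp rate for which the metric has its
    # maximum along the estimated pitch.
    best_c = max(range(len(chirp_rates)),
                 key=lambda c: likelihood[c][best_p])
    return pitches[best_p], chirp_rates[best_c]
```

Other selection criteria mentioned above (random selection, rule or parameter, user selection) would replace the first step while leaving the second step unchanged.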
At operation 208, an envelope vector for the estimated pitch of the primary time sample window may be determined. As is described herein, the envelope vector for the estimated pitch of the primary time sample window may represent the harmonic envelope of sound represented in the audio signal at the primary time sample window having the estimated pitch. In some implementations, operation 208 may be performed by a pitch estimation module and/or an envelope vector module that are the same as or similar to pitch estimation module 32 and/or envelope vector module 36, respectively (shown in
At an operation 210, a predicted pitch for a next time sample window in the processing time window may be determined. This time sample window may include, for example, a time sample window that is adjacent to the time sample window having the estimated pitch and estimated fractional chirp rate determined at operation 208. The description of this time sample window as “next” is not intended to limit this time sample window to an adjacent or consecutive time sample window (although this may be the case). Further, the use of the word “next” does not mean that the next time sample window comes temporally in the audio signal after the time sample window for which the estimated pitch and estimated fractional chirp rate have been determined. For example, the next time sample window may occur in the audio signal before the time sample window for which the estimated pitch and the estimated fractional chirp rate have been determined. In some implementations, operation 210 may be performed by a pitch prediction module that is the same as or similar to pitch prediction module 34 (shown in
At an operation 212, an envelope vector may be determined for the next time sample window as a function of pitch within the next time sample window. In some implementations, operation 212 may be performed by an envelope vector module that is the same as or similar to envelope vector module 36 (shown in
At an operation 214, values of a correlation metric for the next time sample window may be determined as a function of pitch. In some implementations, operation 214 may be performed by an envelope correlation module that is the same as or similar to envelope correlation module 38 (shown in
At an operation 216, for the next time sample window, the pitch likelihood metric may be weighted. This weighting may be performed based on one or more of the predicted pitch determined at operation 210, the correlation metric determined at operation 214, and/or other weighting metrics. In some implementations, operation 216 may be performed by a weighting module that is the same as or similar to weighting module 40 (shown in
At an operation 218, an estimated pitch for the next time sample window may be determined based on the weighted pitch likelihood metric for the next time sample window. Determination of the estimated pitch for the next time sample window may include, for example, identifying a maximum in the weighted pitch likelihood metric and determining the pitch corresponding to this maximum as the estimated pitch for the next time sample window.
At operation 218, an estimated fractional chirp rate for the next time sample window may be determined. The estimated fractional chirp rate may be determined, for example, by identifying the fractional chirp rate for which the weighted pitch likelihood metric has a maximum along the estimated pitch for the time sample window. In some implementations, operation 218 may be performed by a pitch estimation module that is the same as or similar to pitch estimation module 32 (shown in
At operation 220, a determination may be made as to whether there are further time sample windows in the processing time window for which an estimated pitch and/or an estimated fractional chirp rate are to be determined. Responsive to there being further time sample windows, method 200 may return to operations 210 and 212, and operations 210, 212, 214, 216, and/or 218 may be performed for a further time sample window. In this iteration through operations 210, 212, 214, 216, and/or 218, the further time sample window may be a time sample window that is adjacent to the next time sample window for which operations 210, 212, 214, 216, and/or 218 have just been performed. In such implementations, operations 210, 212, 214, 216, and/or 218 may be iterated over the time sample windows from the primary time sample window to the boundaries of the processing time window in one or both temporal directions. During the iteration(s) toward the boundaries of the processing time window, the estimated pitch and estimated fractional chirp rate implemented at operation 210 may be the estimated pitch and estimated fractional chirp rate determined at operation 208, or may be an estimated pitch and estimated fractional chirp rate determined at operation 218 for a time sample window adjacent to the time sample window for which operations 210, 212, 214, 216, and/or 218 are being iterated.
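By way of non-limiting illustration, the iteration from the primary time sample window toward both boundaries of a processing time window may be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the description above: the predicted pitch for a window is simply the estimated pitch of its already-processed neighbor (fractional chirp rate is ignored), the weighting is a Gaussian centered on the predicted pitch, and the envelope correlation metric is omitted. The function name `track_pitch` and the parameter `sigma_hz` are hypothetical.

```python
import math


def track_pitch(likelihoods, pitches, primary, sigma_hz=10.0):
    """Estimate one pitch per time sample window in a processing window.

    likelihoods[w][p] is assumed to hold the pitch likelihood metric for
    window w and candidate pitch pitches[p]; `primary` is the index of the
    primary time sample window.
    """
    def argmax(values):
        return max(range(len(values)), key=values.__getitem__)

    estimates = [None] * len(likelihoods)
    # Primary window: take the unweighted maximum of the metric.
    estimates[primary] = pitches[argmax(likelihoods[primary])]

    def step(w, neighbor):
        # Assumed prediction: pitch changes slowly between adjacent windows.
        predicted = estimates[neighbor]
        # Assumed weighting: favor candidate pitches near the prediction.
        weighted = [l * math.exp(-0.5 * ((p - predicted) / sigma_hz) ** 2)
                    for l, p in zip(likelihoods[w], pitches)]
        estimates[w] = pitches[argmax(weighted)]

    for w in range(primary + 1, len(likelihoods)):  # later windows
        step(w, w - 1)
    for w in range(primary - 1, -1, -1):            # earlier windows
        step(w, w + 1)
    return estimates
```

In this sketch the weighting keeps the pitch path continuous: a candidate far from the predicted pitch is rejected even when its unweighted likelihood is the largest in its window.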
Responsive to a determination at operation 220 that there are no further time sample windows within the processing time window, method 200 may proceed to an operation 222. At operation 222, a determination may be made as to whether there are further processing time windows to be processed. Responsive to a determination at operation 222 that there are further processing time windows to be processed, method 200 may return to operation 206, and may iterate over operations 206, 208, 210, 212, 214, 216, 218, and/or 220 for a further processing time window. It will be appreciated that iterating over the processing time windows may be accomplished in the manner shown in
Responsive to a determination at operation 222 that there are no further processing time windows to be processed, method 200 may proceed to an operation 224. Operation 224 may be performed in implementations in which the processing time windows overlap. In such implementations, iteration of operations 206, 208, 210, 212, 214, 216, 218, and/or 220 for the processing time windows may result in multiple determinations of estimated pitch for at least some of the time sample windows. For time sample windows for which multiple determinations of estimated pitch have been made, operation 224 may include aggregating such determinations for the individual time sample windows to determine an aggregated estimated pitch for the individual time sample windows. In some implementations, operation 224 may be performed by an estimated pitch aggregation module that is the same as or similar to estimated pitch aggregation module 42 (shown in
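One possible form of the aggregation of operation 224 is sketched below. The disclosure does not state how the determinations are combined, so the arithmetic mean used here is an assumption, as are the function name `aggregate_estimated_pitch` and the representation of each processing time window as a start index plus an array of per-window estimates.

```python
import numpy as np

def aggregate_estimated_pitch(num_windows, processing_windows):
    """Average overlapping pitch determinations per time sample window.

    num_windows        : total number of time sample windows.
    processing_windows : iterable of (start_index, estimates) pairs, where
                         estimates[i] is the pitch determined for time
                         sample window start_index + i.
    Returns an array of aggregated pitches (NaN where no estimate exists).
    """
    total = np.zeros(num_windows)
    count = np.zeros(num_windows)
    for start, estimates in processing_windows:
        estimates = np.asarray(estimates, dtype=float)
        stop = start + len(estimates)
        total[start:stop] += estimates
        count[start:stop] += 1
    aggregated = np.full(num_windows, np.nan)
    covered = count > 0
    # Assumed aggregation rule: arithmetic mean of all determinations.
    aggregated[covered] = total[covered] / count[covered]
    return aggregated
```

Other rules, such as a median or a mean weighted by the pitch likelihood metric, could be substituted in the final step without changing the surrounding bookkeeping.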
At an operation 226, individual time sample windows may be divided into voiced and unvoiced categories. In some implementations, operation 226 may be performed by a voiced section module that is the same as or similar to voiced section module 44 (shown in
Responsive to a determination at operation 226 that the audio signal during a time sample window represents unvoiced sound, the estimated pitch (or aggregated estimated pitch) for the time sample window may be set to some predetermined value at an operation 228. For example, this value may be set to 0, or some other value. This may cause the tracking of pitch accomplished by method 200 to designate that harmonic speech may not be present or prominent in the time sample window.
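A minimal sketch of operation 228 follows. It assumes that the voiced/unvoiced decision of operation 226 is already available as a Boolean mask with one entry per time sample window; how that decision is reached is not modeled, and the name `zero_unvoiced_pitch` is hypothetical.

```python
import numpy as np

def zero_unvoiced_pitch(estimated_pitch, voiced_mask, unvoiced_value=0.0):
    """Replace the pitch of unvoiced windows with a predetermined value."""
    pitch = np.asarray(estimated_pitch, dtype=float)
    voiced = np.asarray(voiced_mask, dtype=bool)
    # Keep the estimate where voiced; otherwise use the predetermined value.
    return np.where(voiced, pitch, unvoiced_value)
```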
It will be appreciated that the description above of estimating an individual pitch for the time sample windows is not intended to be limiting. In some implementations, the portion of the audio signal corresponding to one or more time sample windows may represent two or more harmonic sounds. In such implementations, the principles of pitch tracking above with respect to an individual pitch may be implemented to track a plurality of pitches for simultaneous harmonic sounds without departing from the scope of this disclosure. For example, if the audio information specifies the pitch likelihood metric as a function of pitch and fractional chirp rate, then maxima for different pitches and different fractional chirp rates may indicate the presence of a plurality of harmonic sounds in the audio signal. These pitches may be tracked separately in accordance with the techniques described herein.
At an operation 244, one or more pitches of the sound may be tracked to determine individual pitches and corresponding amplitudes as a function of time for individual harmonics of the sound. In some implementations, operation 244 may include some or all of the operations of method 200 (shown in
At an operation 246, individual harmonics may be synthesized using oscillators corresponding to individual harmonics. According to some implementations, only harmonics associated with a desired sound (e.g., speech from a particular speaker) within the original audio signal may be synthesized, thus excluding unwanted sounds. In some implementations, operation 246 may be performed by a resynthesis module that is the same as or similar to resynthesis module 46 (shown in
At an operation 248, the synthesized harmonics may be summed to obtain the resynthesized audio signal. In some implementations, operation 248 may be performed by a resynthesis module that is the same as or similar to resynthesis module 46 (shown in
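Operations 246 and 248 may be illustrated with the following sketch, in which each harmonic is generated by a sinusoidal oscillator and the oscillator outputs are summed. The sketch assumes that the tracked pitch and the harmonic amplitudes have already been interpolated to one value per output sample, and that the k-th harmonic lies at k times the pitch; these assumptions and the name `resynthesize` are illustrative only.

```python
import numpy as np

def resynthesize(pitch_hz, harmonic_amps, sample_rate):
    """Sum sinusoidal oscillators, one per harmonic.

    pitch_hz      : array (num_samples,), per-sample pitch in Hz.
    harmonic_amps : array (num_harmonics, num_samples), per-sample amplitudes.
    sample_rate   : output sampling rate in Hz.
    """
    pitch_hz = np.asarray(pitch_hz, dtype=float)
    harmonic_amps = np.asarray(harmonic_amps, dtype=float)
    # Integrate the instantaneous frequency to obtain the fundamental phase.
    phase = 2.0 * np.pi * np.cumsum(pitch_hz) / sample_rate
    output = np.zeros_like(pitch_hz)
    for k, amps in enumerate(harmonic_amps, start=1):
        # Oscillator for harmonic k runs at k times the fundamental phase.
        output += amps * np.sin(k * phase)
    return output
```

Restricting `harmonic_amps` to the harmonics of a desired sound corresponds to synthesizing only that sound and excluding the others.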
At an operation 254, peak points of harmonic pitch paths may be interpolated through the transformed audio information. The peak points may lie along harmonic frequencies in the transformed audio information, and may be determined as a function of frequency and time for a given harmonic. In some implementations, operation 254 may be performed by a noise subtraction module that is the same as or similar to noise subtraction module 48 (shown in
At an operation 256, trough points of harmonic pitch paths may be interpolated through the transformed audio information. The trough points may be positioned midway between peak points of adjacent harmonic frequencies in the transformed audio information, and may be determined as a function of frequency and time. In some implementations, operation 256 may be performed by a noise subtraction module that is the same as or similar to noise subtraction module 48 (shown in
At an operation 258, values associated with individual ones of the trough point interpolations may be subtracted from values associated with individual ones of the peak point interpolations to yield noise-reduced transformed audio information. In some implementations, operation 258 may be performed by a noise subtraction module that is the same as or similar to noise subtraction module 48 (shown in
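The peak and trough processing of operations 254, 256, and 258 may be sketched for a single time window as follows. The sketch interpolates across frequency only, uses linear interpolation, and clips negative results to zero; all three choices are assumptions, as are the name `subtract_trough_from_peak` and the representation of the transformed audio information as a magnitude spectrum on an ascending frequency grid.

```python
import numpy as np

def subtract_trough_from_peak(freqs, magnitude, pitch_hz, num_harmonics):
    """Subtract an interpolated trough curve from an interpolated peak curve.

    freqs     : ascending frequency grid in Hz.
    magnitude : magnitude of the transformed audio information on that grid.
    """
    freqs = np.asarray(freqs, dtype=float)
    magnitude = np.asarray(magnitude, dtype=float)
    harmonics = np.arange(1, num_harmonics + 1)
    peak_freqs = harmonics * pitch_hz            # peaks lie on the harmonics
    trough_freqs = (harmonics + 0.5) * pitch_hz  # troughs lie midway between
    peak_vals = np.interp(peak_freqs, freqs, magnitude)
    trough_vals = np.interp(trough_freqs, freqs, magnitude)
    # Interpolate each set of points back onto the full frequency grid.
    peak_curve = np.interp(freqs, peak_freqs, peak_vals)
    trough_curve = np.interp(freqs, trough_freqs, trough_vals)
    # Assumed post-processing: negative differences are clipped to zero.
    return np.maximum(peak_curve - trough_curve, 0.0)
```

Here the trough curve acts as an estimate of the noise floor between harmonics, so the difference approximates the harmonic content with that floor removed.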
At an operation 260, the audio signal may be reconstructed based on a reverse transformation of the noise-reduced transformed audio information. In some implementations, operation 260 may be performed by a reconstruction module that is the same as or similar to reconstruction module 52 (shown in
At an operation 274, noise between harmonics of the sound may be suppressed in the transformed audio information by centering functions at individual harmonics in the transformed audio information. The functions may serve to suppress noise between the harmonics to yield noise-reduced transformed audio information. The width of a given function may be based on a bandwidth of a corresponding harmonic. In some implementations, operation 274 may be performed by a fence model module that is the same as or similar to fence model module 50 (shown in
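A hedged sketch of operation 274 for one time window is given below. The disclosure does not name the functions that are centered at the harmonics; a Gaussian is assumed here purely for illustration, with a single `bandwidth_hz` parameter standing in for the per-harmonic bandwidth. The name `apply_fence` is hypothetical.

```python
import numpy as np

def apply_fence(freqs, magnitude, pitch_hz, num_harmonics, bandwidth_hz):
    """Suppress content between harmonics with bumps centered on each harmonic."""
    freqs = np.asarray(freqs, dtype=float)
    mask = np.zeros_like(freqs)
    for k in range(1, num_harmonics + 1):
        # Assumed function shape: Gaussian centered at the k-th harmonic,
        # with width set by the harmonic bandwidth.
        bump = np.exp(-0.5 * ((freqs - k * pitch_hz) / bandwidth_hz) ** 2)
        mask = np.maximum(mask, bump)
    # The mask is 1 at the harmonics and decays toward 0 between them.
    return np.asarray(magnitude, dtype=float) * mask
```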
At operation 276, the audio signal may be reconstructed based on a reverse transformation of the noise-reduced transformed audio information. In some implementations, operation 276 may be performed by a reconstruction module that is the same as or similar to reconstruction module 52 (shown in
At an operation 284, features associated with the audio signal may be obtained from the transformed audio information. Individual ones of the features may be associated with a feature score relative to a predetermined speaker model. According to some implementations, one or more of the features may be obtained from modules that are the same as or similar to information reduction module 58 (shown in
At an operation 286, an aggregate score may be obtained based on the feature scores according to a weighting scheme. The weighting scheme may be associated with a noise and/or SNR estimation. The aggregate score may be used in conjunction with the predetermined speaker model to identify a source of the sound (e.g., a specific person) in the audio signal. In some implementations, operation 286 may be performed by an aggregate score module that is the same as or similar to aggregate score module 62 (shown in
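The weighting scheme of operation 286 may, for example, take the form of a normalized weighted sum, as in the sketch below. The normalized sum, the name `aggregate_score`, and the assumption that the weights have already been derived from a noise and/or SNR estimation are illustrative and are not specified by the disclosure.

```python
import numpy as np

def aggregate_score(feature_scores, weights):
    """Combine per-feature scores into one score using non-negative weights."""
    scores = np.asarray(feature_scores, dtype=float)
    w = np.asarray(weights, dtype=float)
    if scores.shape != w.shape:
        raise ValueError("feature_scores and weights must have the same shape")
    if w.sum() <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumed weighting scheme: weighted mean of the feature scores.
    return float(np.dot(scores, w) / w.sum())
```

Under such a scheme, features that are less reliable at the estimated noise level would receive smaller weights and contribute less to the aggregate score.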
At an operation 294, harmonic paths may be determined for individual harmonics of the sound based on fractional chirp rate and harmonic number. In some implementations, operation 294 may be performed by an information reduction module that is the same as or similar to information reduction module 58 (shown in
At an operation 296, an amplitude value for individual harmonics may be determined at individual time windows. In some implementations, operation 296 may be performed by an information reduction module that is the same as or similar to information reduction module 58 (shown in
At an operation 298, the reduced transformed audio information may be constructed to include the amplitude values for individual harmonics as a function of frequency and time. Values within the reduced transformed audio information between the amplitude values may be set to zero. In some implementations, operation 298 may be performed by an information reduction module that is the same as or similar to information reduction module 58 (shown in
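The construction described for operation 298 may be sketched as follows, assuming that the harmonic paths are given by integer multiples of a per-window pitch and that each amplitude is placed in the nearest bin of an ascending frequency grid. These assumptions and the name `build_reduced_information` are not taken from the disclosure.

```python
import numpy as np

def build_reduced_information(freqs, pitch_track, harmonic_amps):
    """Place harmonic amplitudes on a frequency-by-time grid, zero elsewhere.

    freqs         : ascending frequency grid in Hz.
    pitch_track   : sequence of per-window pitches in Hz.
    harmonic_amps : array (num_harmonics, num_windows) of amplitudes.
    """
    freqs = np.asarray(freqs, dtype=float)
    harmonic_amps = np.asarray(harmonic_amps, dtype=float)
    reduced = np.zeros((len(freqs), len(pitch_track)))
    for t, pitch in enumerate(pitch_track):
        for k in range(harmonic_amps.shape[0]):
            target = (k + 1) * pitch
            if target > freqs[-1]:
                break  # harmonic lies above the frequency grid
            # Assumed placement: nearest frequency bin to the harmonic.
            bin_idx = int(np.argmin(np.abs(freqs - target)))
            reduced[bin_idx, t] = harmonic_amps[k, t]
    return reduced
```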
At an operation 300, a filter bank may be applied to the reduced transformed audio information. In some implementations, operation 300 may be performed by an information reduction module that is the same as or similar to information reduction module 58 (shown in
At an operation 314, instantaneous slope of the harmonic envelope of a sound having a specific pitch may be obtained as a function of frequency as the spectral slope information. In accordance with some implementations, the instantaneous slope may be determined by taking a derivative over frequency of the harmonic envelope at the specific pitch. In some implementations, operation 314 may be performed by a spectral slope module that is the same as or similar to spectral slope module 60 (shown in
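The derivative over frequency described for operation 314 may be approximated numerically, for example with finite differences, as in the sketch below. The harmonic envelope is assumed to be sampled on a frequency grid for one specific pitch; the name `spectral_slope` is hypothetical.

```python
import numpy as np

def spectral_slope(freqs, envelope):
    """Finite-difference derivative of the harmonic envelope over frequency."""
    freqs = np.asarray(freqs, dtype=float)
    envelope = np.asarray(envelope, dtype=float)
    # np.gradient accepts the sample coordinates, so the grid may be non-uniform.
    return np.gradient(envelope, freqs)
```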
At an operation 316, a filter bank may be applied to the spectral slope information. In some implementations, operation 316 may be performed by a spectral slope module that is the same as or similar to spectral slope module 60 (shown in
At an operation 324, a harmonic energy of individual harmonics may be determined as a function of frequency and time based on the transformed audio information. In some implementations, operation 324 may be performed by a noise estimator module that is the same as or similar to noise estimator module 62 (shown in
At an operation 326, a total energy of the audio signal may be determined based on the transformed audio information. In some implementations, operation 326 may be performed by a noise estimator module that is the same as or similar to noise estimator module 62 (shown in
At an operation 328, a ratio between the harmonic energy and the total energy may be determined. In some implementations, operation 328 may be performed by a noise estimator module that is the same as or similar to noise estimator module 62 (shown in
At an operation 330, the ratio between the harmonic energy and the total energy may be averaged over time. In some implementations, operation 330 may be performed by a noise estimator module that is the same as or similar to noise estimator module 62 (shown in
At an operation 332, noise may be estimated as portions of the transformed audio information where the ratio breaches a predetermined threshold. In some implementations, operation 332 may be performed by a noise estimator module that is the same as or similar to noise estimator module 62 (shown in
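Operations 328, 330, and 332 may be sketched as follows. The sketch works with one harmonic energy value and one total energy value per time window, averages the ratio with a moving average, and treats a ratio below the threshold as the breach that indicates noise. The direction of the breach, the moving-average form, and the names `estimate_noise_mask` and `smooth_len` are assumptions.

```python
import numpy as np

def estimate_noise_mask(harmonic_energy, total_energy, threshold, smooth_len=3):
    """Flag time windows whose averaged harmonic-to-total energy ratio is low."""
    harmonic_energy = np.asarray(harmonic_energy, dtype=float)
    total_energy = np.asarray(total_energy, dtype=float)
    # Ratio of harmonic energy to total energy; 0 where total energy is 0.
    ratio = np.divide(
        harmonic_energy,
        total_energy,
        out=np.zeros_like(harmonic_energy),
        where=total_energy > 0,
    )
    # Average the ratio over time. Windows near the edges are averaged
    # against implicit zeros, which biases them toward "noise".
    kernel = np.ones(smooth_len) / smooth_len
    averaged = np.convolve(ratio, kernel, mode="same")
    # Assumed breach direction: ratio falling below the threshold.
    return averaged < threshold
```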
The operations of methods 180, 190, 200, 240, 250, 270, 280, 290, 310, and 320 presented herein are intended to be illustrative. In some embodiments, one or more of methods 180, 190, 200, 240, 250, 270, 280, 290, 310, and/or 320 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of methods 180, 190, 200, 240, 250, 270, 280, 290, 310, and 320 are illustrated in
In some embodiments, one or more of methods 180, 190, 200, 240, 250, 270, 280, 290, 310, and/or 320 may be implemented in one or more processing devices (e.g., a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information). The one or more processing devices may include one or more devices executing some or all of the operations of one or more of methods 180, 190, 200, 240, 250, 270, 280, 290, 310, and/or 320 in response to instructions stored electronically on an electronic storage medium. The one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of one or more of methods 180, 190, 200, 240, 250, 270, 280, 290, 310, and/or 320.
Although the system(s) and/or method(s) of this disclosure have been described in detail for the purpose of illustration based on what is currently considered to be the most practical and preferred implementations, it is to be understood that such detail is solely for that purpose and that the disclosure is not limited to the disclosed implementations, but, on the contrary, is intended to cover modifications and equivalent arrangements that are within the spirit and scope of the appended claims. For example, it is to be understood that the present disclosure contemplates that, to the extent possible, one or more features of any implementation can be combined with one or more features of any other implementation.
Number | Date | Country |
---|---|---|
20140037095 A1 | Feb 2014 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 13205535 | Aug 2011 | US |
Child | 14040418 | | US |