This disclosure relates to estimating pitch in audio signals based on symmetry characteristics independent of harmonic amplitudes.
Existing speech- and speaker-recognition technology is typically based on a feature space related to a cepstrum. A cepstrum may result from taking an inverse Fourier transform (IFT) of the logarithm of the power spectrum of a signal. There may be a complex cepstrum, a real cepstrum, a power cepstrum, and/or phase cepstrum. The power cepstrum in particular finds applications in the analysis of human speech, essentially as a smoothed energy profile reflecting the power spectrum without the peaks. Feature vectors may contain values of the power cepstrum at discrete points. Occasionally feature vectors may be extended with a pitch estimate to enhance speaker-specific information. In such cases, pitch may be referred to as a “prosodic” feature, meaning it conditions or nuances the speech. Ironically, if the pitch was known with any accuracy, cepstral features may generally not be used in the first place because harmonic amplitudes would have been used instead. The set of complex harmonic amplitudes may contain most of the information in a voice. The cepstral profile may be described as a crude approximation of this set of amplitudes. But to know the amplitudes, generally speaking, the harmonic frequencies must be known, which means the pitch must be known. The prosodic pitch estimates appended to cepstral vectors may have nowhere near the precision needed to specify the harmonic frequencies.
One aspect of the disclosure relates to a system configured to estimate pitch in audio signals based on symmetry characteristics independent of harmonic amplitudes. According to some implementations, such independence may be important because co-estimating pitch and amplitudes may lead to bias in both types of estimate. In some implementations, the system may include a computing platform and/or other components. The computing platform may be configured to execute computer program instructions. The computer program instructions may include one or more of a magnitude spectrum component, a partition prediction component, a normalization component, a local frequency domain component, a pitch estimation component, and/or other components.
The magnitude spectrum component may be configured to provide a magnitude spectrum of an audio signal. The magnitude spectrum may be provided based on a Fourier transform, a spectral motion transform, and/or other transforms.
The partition prediction component may be configured to partition the magnitude spectrum by dividing a frequency axis into equal-sized cells. Individual cells may be centered on corresponding harmonic frequencies of a hypothesized pitch. In some implementations, the partition may include between eight and twelve cells, inclusive. Other values for the number of cells may be used. Individual cells may span a range of approximately fifty to 300 Hertz.
The normalization component may be configured to normalize the magnitude spectrum contained in individual cells to have equal mean magnitudes and equal standard deviations. The magnitude spectrum contained in individual cells may be normalized to have mean magnitudes of zero and standard deviations of one.
The local frequency domain component may be configured to define local frequency values such that individual cells have a local frequency domain centered at zero. The normalized magnitude spectrum of a given cell may be compared to its mirror obtained about a vertical line at zero-frequency of the given cell in order to determine a symmetry of the magnitude spectrum in the given cell. The comparison may be based on a product-moment correlation.
The pitch estimation component may be configured to determine a likelihood that the hypothesized pitch is an actual pitch of the audio signal based on symmetries of magnitude spectra contained in individual cells. Determining the likelihood that the hypothesized pitch is the actual pitch of the audio signal may be based on a commonality of shapes of individual magnitude spectra in the cells.
These and other features, and characteristics of the present technology, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the invention. As used in the specification and in the claims, the singular form of “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise.
The comb approach in exemplary implementations may not codetermine pitch φ and amplitude c. Comb techniques for pitch estimation may take many forms, but the underlying mathematical model has generally been the Fourier series. To estimate the fundamental, some evidence may be sought that repeats at some fixed interval. The evidence may be associated with one or more of energy, probability density, magnitude, logic (e.g., on or off), energy and/or magnitude relative to surrounding locations, definition (e.g., information with respect to abscissa), and/or other evidence. The Fourier series, as a model, may predict more than just presence at these frequencies. It may also predict that the object at each frequency is a sinusoid. From this insight, two more predictions may be made.
First, the Fourier transform of a sinusoid is a delta function. Some implementations may use a Gaussian time window, whose Fourier transform is also a Gaussian. Therefore, a given harmonic may be the convolution, in the frequency domain, of a delta and a Gaussian, and thus may be a Gaussian. Individual harmonics may be predicted as being symmetric about corresponding center frequencies. According to some implementations, complex data may be converted to magnitudes, so all values are positive. Now, if all harmonics are normalized to the same amplitude scale, they may look the same: a fixed-amplitude version of the transform of the time window.
The second prediction may be that the harmonics should be interchangeable. That is, any operation on the spectrum as a whole (i.e., the harmonics as a set) may evaluate the same regardless of how the harmonics are arranged. These predictions may not reflect reality, but they are testable, nontrivial predictions. It may be possible to construct a series with wave components at evenly-spaced frequencies, for which none of the above predictions apply.
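The first prediction above may be made concrete with a short worked derivation. The sketch below assumes a single complex sinusoid of amplitude $c$ at frequency $\omega_0$, a Gaussian time window of width $\sigma$, and the transform convention $\hat{x}(\omega)=\int x(t)e^{-i\omega t}\,dt$. These symbols are illustrative assumptions and are not taken from the disclosure.

```latex
\begin{aligned}
x(t) &= c\, e^{i\omega_0 t}\, e^{-t^2/(2\sigma^2)}, \\
\hat{x}(\omega) &= \int_{-\infty}^{\infty} x(t)\, e^{-i\omega t}\, dt
  = c\,\sigma\sqrt{2\pi}\; e^{-\sigma^2(\omega-\omega_0)^2/2}, \\
\left|\hat{x}(\omega_0 + w)\right| &= |c|\,\sigma\sqrt{2\pi}\; e^{-\sigma^2 w^2/2}
  = \left|\hat{x}(\omega_0 - w)\right|.
\end{aligned}
```

Under these assumptions the magnitude depends on the offset $w$ only through $w^2$, so each harmonic may be symmetric about its center frequency, and once the factor $|c|$ is normalized away every harmonic may have the same shape.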
In some implementations, system 100 may include a computing platform 102 and/or other components. By way of non-limiting example, computing platform 102 may include a mobile communications device such as a smart phone, according to some implementations. Other types of computing platforms are contemplated by the disclosure, as described further herein. The computing platform 102 may be configured to execute computer program instructions 104. The computer program instructions 104 may include one or more of a magnitude spectrum component 106, a partition prediction component 108, a normalization component 110, a local frequency domain component 111, a pitch estimation component 112, and/or other components.
The magnitude spectrum component 106 may be configured to provide a magnitude spectrum of an audio signal. A magnitude spectrum may be expressed as:
$$m(\omega) = \left|\hat{x}(\omega)\right| \qquad \text{(EQN. 1)}$$
where x(t) is the audio time series and $\hat{x}(\omega)$ is its Fourier transform. In some implementations, instead of the Fourier transform, the magnitude spectrum may be provided based on a spectral motion transform and/or other transforms. Examples of spectral motion transforms are described in U.S. patent application Ser. No. 13/205,424 filed on Aug. 8, 2011 and entitled “SYSTEM AND METHOD FOR PROCESSING SOUND SIGNALS IMPLEMENTING A SPECTRAL MOTION TRANSFORM,” which is incorporated herein by reference.
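By way of non-limiting illustration, the magnitude spectrum of EQN. 1 may be sketched as a directly evaluated, Gaussian-windowed discrete Fourier transform. The function name `magnitude_spectrum`, the window width of one sixth of the frame, and the 200 Hz test tone are assumptions made for this example only; a practical implementation would more likely use a fast Fourier transform.

```python
import cmath
import math

def magnitude_spectrum(x, sample_rate, freqs_hz):
    """Return |X(f)| of a Gaussian-windowed signal at the requested frequencies."""
    n = len(x)
    center = (n - 1) / 2.0
    sigma = n / 6.0  # assumed window width: roughly +/-3 sigma across the frame
    window = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n)]
    spectrum = []
    for f in freqs_hz:
        acc = 0j
        for i in range(n):
            acc += x[i] * window[i] * cmath.exp(-2j * math.pi * f * i / sample_rate)
        spectrum.append(abs(acc))  # magnitudes are non-negative by construction
    return spectrum

# Usage: a pure 200 Hz tone evaluated on a 5 Hz grid from 150 to 250 Hz.
sample_rate = 8000
n = 800
tone = [math.sin(2 * math.pi * 200.0 * i / sample_rate) for i in range(n)]
freqs = [150.0 + 5.0 * k for k in range(21)]
mags = magnitude_spectrum(tone, sample_rate, freqs)
peak_freq = freqs[mags.index(max(mags))]
```

In this sketch the largest magnitude falls at the tone frequency, and the values on either side of it are nearly equal, which is the symmetry the later components rely on.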
The partition prediction component 108 may be configured to partition the magnitude spectrum by dividing a frequency axis into equal-sized cells. Individual cells may be centered on corresponding harmonic frequencies of a hypothesized pitch. According to various implementations, the partition may include between eight and twelve cells, inclusive. However, other numbers of cells may be used. The cells may span a range encompassing approximately fifty to 300 Hertz, the range of the human voice.
In a maximum-likelihood analysis, pitch may be treated as a hypothesis, sweeping it across values, in each case predicting something specific, then determining the probability that the prediction was compatible with the data. The prediction may begin with the harmonic frequencies, followed by something expected to happen at these frequencies (e.g., large amplitude). Exemplary implementations may, instead, predict a partition because what events occur at harmonic frequencies may be inconsequential for many purposes.
Some implementations may define $\Phi = \{\phi_k \in \mathbb{R}^{+}\}_{k=1}^{K}$ as an indexed set of hypotheses. Individual hypotheses may each represent a different pitch. The hypotheses may span the human range of approximately fifty to 300 Hertz. The increments may be small. In some implementations, $\Delta\phi = 0.2$ Hz. A given hypothesis $\phi_k$ may define a partition as
$$\pi_p = \left[\left(p - \tfrac{1}{2}\right)\phi_k,\ \left(p + \tfrac{1}{2}\right)\phi_k\right), \quad p = 1, 2, \ldots, P \qquad \text{(EQN. 2)}$$
where P is the number of cells to be established. The partition may divide the frequency axis into equal-size cells. Individual cells may be centered on one of the predicted harmonic frequencies $p\phi_k$.
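As a minimal sketch of the partition, each cell may be taken to extend half a pitch interval on either side of the harmonic frequency on which it is centered. The helper names `partition_cells` and `hypotheses` are hypothetical, and the default grid simply restates the fifty to 300 Hertz range and 0.2 Hz increment mentioned above.

```python
def partition_cells(pitch_hz, num_cells):
    """Return half-open cells [low, high) centered on p * pitch_hz for p = 1..num_cells."""
    return [((p - 0.5) * pitch_hz, (p + 0.5) * pitch_hz)
            for p in range(1, num_cells + 1)]

def hypotheses(low_hz=50.0, high_hz=300.0, step_hz=0.2):
    """Return the grid of pitch hypotheses, inclusive of both ends."""
    count = int(round((high_hz - low_hz) / step_hz)) + 1
    return [low_hz + k * step_hz for k in range(count)]

# Usage: three cells for a hypothesized pitch of 100 Hz.
cells = partition_cells(100.0, 3)
grid = hypotheses()
```

The cells are equal in width (one pitch interval each) and tile the frequency axis without gaps, so changing the hypothesis changes every cell boundary at once.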
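Within individual cells, the magnitudes may be z-scored, such as by:
$$z_j = \frac{m_j - \bar{m}}{s} \qquad \text{(EQN. 3)}$$
where $z_j$ is the score of the jth value in the cell, $m_j$ is the jth magnitude in the cell, and $\bar{m}$ and $s$ are the mean and standard deviation, respectively, of the magnitudes in the cell.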
The normalization component 110 may be configured to normalize the magnitude spectrum contained in individual cells to have equal mean magnitudes and equal standard deviations. The magnitude spectrum contained in individual cells may be normalized to have mean magnitudes of zero and standard deviations of one, so the cells are normalized to scale.
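The normalization may be sketched as an ordinary z-score applied cell by cell. The function name `z_score` is hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption; the disclosure does not state which convention is intended.

```python
import math

def z_score(values):
    """Normalize one cell to zero mean and unit (sample) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # assumed n - 1 divisor
    if std == 0.0:
        raise ValueError("cell has constant magnitude and cannot be z-scored")
    return [(v - mean) / std for v in values]

# Usage: two cells that differ only in amplitude scale normalize to the same values.
z = z_score([1.0, 2.0, 3.0, 4.0, 5.0])
z_scaled = z_score([10.0, 20.0, 30.0, 40.0, 50.0])
```

Because scale and offset are removed, a weak harmonic and a strong harmonic of the same shape become indistinguishable after this step.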
The local frequency domain component 111 may be configured to define local frequency values such that individual cells have a local frequency domain centered at zero. The normalized magnitude spectrum of a given cell may be compared to its mirror obtained about a zero-frequency line of the given cell in order to determine a symmetry of the magnitude spectrum in the given cell. The comparison may be based on a product-moment correlation.
According to some implementations, local frequency values for the cells may be defined such that:
$$w_j = \omega_j - p\,\phi_k \qquad \text{(EQN. 4)}$$
which may cause each cell to have a local frequency domain centered at zero. Individual harmonics may therefore be defined as:
$$\hat{x}_p(w_j) = z_{jp} \qquad \text{(EQN. 5)}$$
where $z_{jp}$ is the jth magnitude of the pth harmonic under a given pitch hypothesis. The mirror image of the pth harmonic may be expressed as:
$$\hat{x}'_p(w_j) = \hat{x}_p(-w_j) \qquad \text{(EQN. 6)}$$
That is, the order of the coupling between frequencies and magnitudes may simply be reversed. This mirror image may create a “new” harmonic in the sense of an observation to be compared with a nontrivial model prediction—namely that the mirror image transformation should not change the harmonic shape. This is a consequence of the symmetry of the model with respect to each harmonic.
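The local frequency shift and the mirror-image operation may be sketched as follows. The helper names `local_frequencies` and `mirror_image` and the sample values are hypothetical. The sketch represents a harmonic as paired lists of local frequencies and normalized magnitudes, which is an assumption about data layout rather than something the disclosure specifies.

```python
def local_frequencies(freqs_hz, harmonic_index, pitch_hz):
    """Shift global frequencies so the cell is centered at zero (w = omega - p * phi)."""
    center = harmonic_index * pitch_hz
    return [f - center for f in freqs_hz]

def mirror_image(local_freqs, z_values):
    """Reflect (w, z) pairs about w = 0 and return them sorted by local frequency."""
    pairs = sorted((-w, z) for w, z in zip(local_freqs, z_values))
    return [w for w, _ in pairs], [z for _, z in pairs]

# Usage: second harmonic of a 100 Hz hypothesis, sampled at five frequencies.
w = local_frequencies([190.0, 195.0, 200.0, 205.0, 210.0], 2, 100.0)
mw, mz = mirror_image(w, [0.1, 0.5, 1.0, 0.7, 0.2])
```

When the local frequencies are sampled symmetrically about zero, as here, the mirror image reduces to reversing the list of magnitudes.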
Because individual harmonics may be normalized to the same magnitude scale, and their respective frequency domains may be centered, a given harmonic definition $\hat{x}_p(w_j)$ may be entirely local. Individual harmonics may be effectively encapsulated. Therefore, notation for global and local frequency variables may be unnecessary. Instead, the harmonic function $\hat{x}_p(w_j)$ may be abbreviated as $\hat{x}_p$. For the correlations discussed below, the original harmonics may not be distinguished from the mirror images.
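Any two functions $\hat{x}_i$ and $\hat{x}_j$ may be compared with a product-moment correlation, which may be defined for a population as:
$$\rho_{ij} = \frac{E\left[(\hat{x}_i - \mu_i)(\hat{x}_j - \mu_j)\right]}{\sigma_i\,\sigma_j} \qquad \text{(EQN. 7)}$$
where i and j denote two different partition cells, and $\mu$ and $\sigma$ denote the mean and standard deviation of the corresponding function. For a sample, $\rho$ may be estimated as:
$$r_{ij} = \frac{x_i^{T} x_j}{n - 1} \qquad \text{(EQN. 8)}$$
where $x_i$ is the vector of z-scored magnitudes in cell i, $x_i^{T}$ is the transpose of vector $x_i$, and n is the number of points in each vector.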
Given a total of P functions $\hat{x}_p$ plus P mirror images, the total number of non-redundant correlations possible may be $N_p = P^2$. In implementations involving twelve partitions, there may be 144 coefficients. Individual coefficients may be “symmetric” in the sense of meaning the same thing regardless of position. That is, small-amplitude harmonics do not count less and harmonics at high frequencies do not pull harder.
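One way to arrive at a count of P squared is sketched below. It assumes that correlating two mirror images repeats the correlation of the two originals, and that correlating harmonic i with the mirror of harmonic j repeats the correlation of harmonic j with the mirror of harmonic i. This enumeration is an interpretation offered for illustration; the disclosure does not spell out which pairs are counted. The names `correlation` and `symmetric_correlations` are hypothetical.

```python
import math

def correlation(a, b):
    """Product-moment (Pearson) correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def symmetric_correlations(harmonics):
    """Correlate each harmonic with later harmonics and with mirrors; yields P**2 values."""
    mirrors = [list(reversed(h)) for h in harmonics]
    count = len(harmonics)
    coeffs = []
    for i in range(count):
        for j in range(i + 1, count):   # original vs. original: P(P-1)/2 pairs
            coeffs.append(correlation(harmonics[i], harmonics[j]))
        for j in range(i, count):       # original vs. mirror: P(P+1)/2 pairs
            coeffs.append(correlation(harmonics[i], mirrors[j]))
    return coeffs

# Usage: three illustrative cells; the first two share one symmetric shape.
coeffs = symmetric_correlations([
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [1.0, 3.0, 5.0, 3.0, 1.0],
    [0.0, 2.0, 1.0, 2.0, 0.0],
])
```

Each coefficient is computed from normalized shapes only, so its value does not depend on which harmonic index or amplitude the cells came from.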
The pitch estimation component 112 may be configured to determine a likelihood that the hypothesized pitch is an actual pitch of the audio signal based on symmetries of magnitude spectra contained in individual cells. Determining the likelihood that the hypothesized pitch is the actual pitch of the audio signal may be based on a commonality of shapes of individual magnitude spectra in the cells.
Indeed, a final pitch estimate may not depend so much on the shape of the harmonics as it does the commonality of shape. As such, in some implementations, each harmonic of interest may be correlated with every other harmonic of interest, along with the mirror images. According to a Fourier series model, correlation operations may not make any difference. To the extent that they do, the model may be failing. For example, correlation operations may make a difference, and the model predictions may fail, responsive to the φ value inserted in EQN. 4 being far from the true value. Some implementations may involve a maximum-likelihood (ML) approach. A measure of success may be assigned to each φ hypothesis. The most successful hypothesis may be chosen to be the pitch. An ML estimator may be based on a probabilistic measure of success.
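For a single correlation coefficient $r_j$, the Fisher transformation may be expressed as:
$$F(r_j) = \frac{1}{2}\ln\left(\frac{1 + r_j}{1 - r_j}\right) \qquad \text{(EQN. 9)}$$
The Fisher transformation may be approximately Gaussian distributed with standard error $SE = 1/\sqrt{n-3}$. The Fisher score (i.e., the output of the Fisher transformation) may be approximately linear with r over most of the ±1 range (see, e.g., the accompanying drawings). The slope of the transformation, $dF/dr = 1/(1 - r^2)$, may be within 10% of the value one whenever $|r| \le 0.3$. The value of |r| may rarely exceed 0.1. Thus, the Fisher transformation may be approximated as $F(r) \approx r$. In such a case, the probability density of r may be approximated as:
$$f(r) = \sqrt{\frac{n-3}{2\pi}}\;\exp\left(-\frac{(n-3)\,r^2}{2}\right) \qquad \text{(EQN. 10)}$$
Determining the density $f(r_c)$ for every correlation $r_c$, $c = 1, 2, \ldots, P^2$, the overall probability may be expressed as:
$$f(r_1, r_2, \ldots, r_{P^2}) = \prod_{c=1}^{P^2} f(r_c) \qquad \text{(EQN. 11)}$$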
This equality may hold only for uncorrelated rc's. This may be justified because the scales have been normalized, so amplitude trends along the frequency dimension would not be preserved.
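The Fisher transformation and the small-r approximation may be checked numerically with a brief sketch. The function names are hypothetical, and n = 28 is an arbitrary example value chosen so that the standard error is exactly 1/5.

```python
import math

def fisher(r):
    """Fisher transformation of a correlation coefficient (equal to atanh(r))."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def fisher_slope(r):
    """Derivative dF/dr of the Fisher transformation."""
    return 1.0 / (1.0 - r * r)

def null_density(r, n):
    """Gaussian density of r with mean zero and standard error 1/sqrt(n - 3)."""
    se = 1.0 / math.sqrt(n - 3)
    return math.exp(-0.5 * (r / se) ** 2) / (se * math.sqrt(2.0 * math.pi))

# Usage: slope at the edge of the quoted range, and the density at r = 0 for n = 28.
slope_at_0_3 = fisher_slope(0.3)
density_at_zero = null_density(0.0, 28)
```

At |r| = 0.3 the slope is about 1.099, which is consistent with the 10% figure quoted above, and at |r| = 0.1 the transformed value differs from r by less than 0.001.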
Because the r values {r1, r2, . . . , rp
The probability density of the data, given the parameter φ at some value, may be indicated as $f_\phi(X)$. The value of this function may be called the likelihood, and may be the same as the value of the likelihood function $L_X(\phi)$. While these functions may produce the same value, they are different functions because the density function treats φ as a fixed parameter and the data X as its argument, whereas the likelihood function treats the data X as fixed and φ as an independent variable, that is, a changing argument. For individual values of the argument, L may call $f_\phi(X)$ to acquire the likelihood value. The function $f_\phi(X)$ may be determined in two stages. First, the r's may be derived from the raw data X in a way parameterized by the pitch hypothesis, such as:
$$\{r_1, r_2, \ldots, r_{P^2}\} = R_\phi(X) \qquad \text{(EQN. 12)}$$
The second stage of determining the function $f_\phi(X)$ may be expressed as:
$$f_\phi(X) = \prod_{c=1}^{P^2} f(r_c) \qquad \text{(EQN. 13)}$$
The probability determination of EQN. 11 may be with respect to a null hypothesis distribution with mean zero and standard error $(n-3)^{-1/2}$. And yet, when a pitch hypothesis φ is accurate, r values may be expected to move away from zero. Thus, when φ approaches the true value, the likelihood L may start to fall, not rise. This may cause the function $L_X(\phi)$ to reach a minimum, not a maximum, when φ aligns with the true pitch. This may be viewed as a technicality; the nadir may be singular and may accurately signal the pitch. The area under $L_X(\phi)$ versus φ may be normalized to unity. This may be achieved by dividing by the negative of the area.
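Likelihoods may be converted to log-likelihoods as:
$$\ell_X(\phi) = \ln L_X(\phi) \qquad \text{(EQN. 14)}$$
From EQN. 11, it may be approximated that:
$$\ell_X(\phi) \approx \sum_{c=1}^{P^2} \ln f(r_c) \qquad \text{(EQN. 15)}$$
Thus, it may be written that:
$$\ell_X(\phi) \approx \frac{P^2}{2}\ln\left(\frac{n-3}{2\pi}\right) - \frac{n-3}{2}\sum_{c=1}^{P^2} r_c^2 \qquad \text{(EQN. 16)}$$
where $r_c^2$ is the coefficient of determination.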
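A minimal sketch of the log-likelihood and of the selection rule follows. It assumes independent coefficients, each with a zero-mean Gaussian null density of standard error 1/sqrt(n - 3), and it selects the hypothesis at which the null-model likelihood is lowest, in keeping with the observation above that the likelihood reaches a minimum at the true pitch. The function names and the correlation values in the usage example are invented for illustration and are not measured data.

```python
import math

def log_likelihood(r_values, n):
    """Sum of log null densities: a constant term minus (n - 3)/2 times the sum of r**2."""
    const = 0.5 * math.log((n - 3) / (2.0 * math.pi))
    return len(r_values) * const - 0.5 * (n - 3) * sum(r * r for r in r_values)

def select_pitch(correlations_by_pitch, n):
    """Return the hypothesis whose null-model log-likelihood is lowest."""
    return min(correlations_by_pitch,
               key=lambda phi: log_likelihood(correlations_by_pitch[phi], n))

# Usage with made-up coefficients for three neighboring hypotheses.
example = {
    99.8: [0.02, -0.05, 0.01, 0.03],
    100.0: [0.90, 0.80, 0.95, 0.85],
    100.2: [0.04, 0.00, -0.02, 0.06],
}
best = select_pitch(example, n=28)
```

Since the constant term is the same for every hypothesis, the selection depends only on the sum of the coefficients of determination.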
In some implementations, computing platform 102 may be operatively linked via one or more electronic communication links to one or more other components of system 100 (e.g., other computing platforms not depicted). For example, such electronic communication links may be established, at least in part, via a network such as the Internet, a telecommunications network, and/or other networks. It will be appreciated that this is not intended to be limiting, and that the scope of this disclosure includes implementations in which one or more components of system 100 may be operatively linked via some other communication media.
The computing platform 102 may include electronic storage 116, one or more processors 118, and/or other components. The computing platform 102 may include communication lines, or ports to enable the exchange of information with a network and/or other platforms. Illustration of computing platform 102 in
The electronic storage 116 may comprise electronic storage media that electronically stores information. The electronic storage media of electronic storage 116 may include one or both of system storage that is provided integrally (i.e., substantially non-removable) with computing platform 102 and/or removable storage that is removably connectable to computing platform 102 via, for example, a port (e.g., a USB port, a firewire port, etc.) or a drive (e.g., a disk drive, etc.). The electronic storage 116 may include one or more of optically readable storage media (e.g., optical disks, etc.), magnetically readable storage media (e.g., magnetic tape, magnetic hard drive, floppy drive, etc.), electrical charge-based storage media (e.g., EEPROM, RAM, etc.), solid-state storage media (e.g., flash drive, etc.), and/or other electronically readable storage media. The electronic storage 116 may include one or more virtual storage resources (e.g., cloud storage, a virtual private network, and/or other virtual storage resources). The electronic storage 116 may store software algorithms, information determined by processor(s) 118, information received from a remote device, information received from source 114, and/or other information that enables computing platform 102 to function as described herein.
The processor(s) 118 may be configured to provide information processing capabilities in computing platform 102. As such, processor(s) 118 may include one or more of a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information. Although processor(s) 118 is shown in
It should be appreciated that although modules 106, 108, 110, 111, and 112 are illustrated in
In some implementations, method 300 may be implemented in one or more processing devices (e.g., a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information). The one or more processing devices may include one or more devices executing some or all of the operations of method 300 in response to instructions stored electronically on an electronic storage medium. The one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of method 300.
At an operation 302, a magnitude spectrum of an audio signal may be provided.
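At an operation 304, the magnitude spectrum may be partitioned by dividing a frequency axis into equal-sized cells. Individual cells may be centered on corresponding harmonic frequencies of a hypothesized pitch.
At an operation 306, the magnitude spectrum contained in individual cells may be normalized to have equal mean magnitudes and equal standard deviations.
At an operation 308, local frequency values may be defined such that individual cells have a local frequency domain centered at zero.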
At an operation 310, a likelihood that the hypothesized pitch is an actual pitch of the audio signal may be determined based on symmetries of magnitude spectra contained in individual cells.
According to some implementations, invariance may be required in an operation associated with mirror imaging, or rotating the harmonic about its vertical midline. That operation is not illustrated in the figures, but the correlation results discussed below are based in all cases on the base set of partition cells and the mirror image of each included as a separate observation.
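The operations of method 300, including the mirror-image step, may be sketched end to end on a synthetic signal. Everything specific in this example is an assumption made for illustration: the function names, the four-cell partition, the 21 sample points per cell, the window width, the three candidate pitches, and the use of the mean coefficient of determination as the score (the largest mean corresponds to the lowest null-model likelihood). The sketch evaluates the transform directly at each frequency, which is simple but slow.

```python
import cmath
import math

def dft_magnitude(x, window, sample_rate, freq_hz):
    """Magnitude of the windowed discrete-time Fourier transform at one frequency."""
    step = -2j * math.pi * freq_hz / sample_rate
    return abs(sum(s * w * cmath.exp(step * i)
                   for i, (s, w) in enumerate(zip(x, window))))

def z_score(values):
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]

def correlation(a, b):
    # Both inputs are z-scored with the (n - 1) convention, so r = a.b / (n - 1).
    return sum(p * q for p, q in zip(a, b)) / (len(a) - 1)

def symmetry_score(x, window, sample_rate, pitch_hz, num_cells=4, points=21):
    """Mean r**2 over the P**2 harmonic and mirror-image correlations."""
    half = points // 2
    cells = []
    for p in range(1, num_cells + 1):
        # Local frequencies are symmetric about zero, so reversing a cell mirrors it.
        freqs = [p * pitch_hz + pitch_hz * (j - half) / points for j in range(points)]
        cells.append(z_score([dft_magnitude(x, window, sample_rate, f) for f in freqs]))
    mirrors = [c[::-1] for c in cells]
    r_squared = []
    for i in range(num_cells):
        r_squared += [correlation(cells[i], cells[j]) ** 2 for j in range(i + 1, num_cells)]
        r_squared += [correlation(cells[i], mirrors[j]) ** 2 for j in range(i, num_cells)]
    return sum(r_squared) / len(r_squared)

# Synthetic signal: four harmonics of 100 Hz with deliberately unequal amplitudes.
sample_rate = 4000
n = 800
amplitudes = [1.0, 0.3, 0.6, 0.1]
signal = [sum(a * math.sin(2 * math.pi * (h + 1) * 100.0 * i / sample_rate)
              for h, a in enumerate(amplitudes)) for i in range(n)]
window = [math.exp(-0.5 * ((i - (n - 1) / 2.0) / (n / 8.0)) ** 2) for i in range(n)]
scores = {phi: symmetry_score(signal, window, sample_rate, phi)
          for phi in (90.0, 100.0, 110.0)}
best = max(scores, key=scores.get)
```

The unequal amplitudes in the test signal are intentional: because each cell is normalized before comparison, the score depends on the shared, symmetric shape of the harmonics rather than on how strong each one is.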
Although the present technology has been described in detail for the purpose of illustration based on what is currently considered to be the most practical and preferred implementations, it is to be understood that such detail is solely for that purpose and that the technology is not limited to the disclosed implementations, but, on the contrary, is intended to cover modifications and equivalent arrangements that are within the spirit and scope of the appended claims. For example, it is to be understood that the present technology contemplates that, to the extent possible, one or more features of any implementation can be combined with one or more features of any other implementation.