1. Statement of the Technical Field
The present application relates generally to the perception and recognition of an audio signal input and, more particularly, to a signal processing method and apparatus for providing a nonlinear frequency analysis of structured audio signals which mimics the operation of the human ear.
2. Description of the Related Art
Many well-known signal processing techniques are used to extract spectral features, separate signals from background sounds, and find periodicities at the time scale of music and speech rhythms. Generally, features are extracted and used to generate reference patterns (models) for certain identifiable sound structures, such as phonemes, musical pitches, or rhythmic meters.
Referring now to
Typically, an acoustic front end (not shown) includes a microphone or some other similar device to convert acoustic signals into analog electric signals having a voltage that varies over time in correspondence to the variation in air pressure caused by the input sounds. The acoustic front end also includes an analog-to-digital (A/D) converter for digitizing the analog signal by sampling the voltage of the analog waveform at a desired sampling rate and converting the sampled voltage to a corresponding digital value. The sampling rate is typically selected to be at least twice the highest frequency component in the input signal.
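As a rough illustration of the digitization step described above, the following sketch samples a sinusoid at a rate above twice its frequency and quantizes each sample to a signed integer. The function name, the 16-bit depth, and the 1 kHz test tone are assumptions chosen for the example and are not taken from the description.

```python
import math

def sample_and_quantize(freq_hz, sample_rate_hz, duration_s, bits=16):
    """Sample a unit-amplitude sine wave and quantize it to signed integers."""
    if sample_rate_hz < 2 * freq_hz:
        raise ValueError("sampling rate must be at least twice the signal frequency")
    full_scale = 2 ** (bits - 1) - 1           # 32767 for 16 bits
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                  # sampling instant in seconds
        voltage = math.sin(2 * math.pi * freq_hz * t)  # stand-in for the analog voltage
        samples.append(round(voltage * full_scale))     # nearest quantization level
    return samples

# One period of a 1 kHz tone sampled at 8 kHz gives eight samples.
digital = sample_and_quantize(freq_hz=1000.0, sample_rate_hz=8000.0, duration_s=0.001)
print(digital)
```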
In processing system 100, spectral features can be extracted in a transform module 102 by computing a wavelet transform of the acoustic signal. Alternatively, a sliding window Fourier transform may be used for providing a time-frequency analysis of the acoustic signals. Following the initial frequency analysis performed by transform module 102, one or more analytic transforms may be applied in an analytic transform module 103. For example, a “squashing” function (such as square root and sigmoid functions) may be applied to modify the amplitude of the result. Alternatively, a synchro-squeeze transform may be applied to improve the frequency resolution of the output. Transforms of this type are described in U.S. Pat. No. 6,253,175 to Basu et al. Next, a cepstrum may be applied in a cepstral analysis module 104 to recover or enhance structural features (such as pitch) that may not be present or resolvable in the input signal. Finally, a feature extraction module 105 extracts from the fully transformed signal those features that are relevant to the structure(s) to be identified. The output of this system may then be passed to a recognition system that identifies specific structures (e.g. phonemes) given the features thus extracted from the input signal. Processes for the implementation of each of the aforementioned modules are well-known in the art of signal processing.
The foregoing audio processing techniques, which are primarily linear, have proven useful in many applications. However, they have not addressed some important problems. For example, as is now known in the art, the ear and brain process sound in a nonlinear manner utilizing nonlinear oscillation. Inputs are received at the cochlea, dorsal cochlear nucleus, inferior colliculus and other brain areas, where excitatory and inhibitory processes interact to produce nonlinear neural oscillations whose outputs are processed by still other brain areas. The prior art suffers from the shortcoming that it uses a linear oscillation model to mimic the nonlinear way in which the brain processes complex signals. As a result, these conventional approaches are not always effective for determining the structure of a time-varying input signal, because they do not effectively recover components that are not present or fully resolvable in the input signal. Therefore, the full range of audio responses cannot be mimicked.
To overcome these shortcomings, it is known from U.S. Pat. No. 7,376,562 (Large) to process audio signals using networks of nonlinear oscillators. This is conceptually similar to signal processing by a bank of linear oscillators, with the important difference that the processing units are nonlinear and can resonate nonlinearly. Nonlinear resonance provides a wide variety of behaviors that are not observed in linear resonance (e.g., neural oscillations). Moreover, oscillators can be connected into complex networks.
As seen from
A common signal processing operation is frequency decomposition of a complex input signal, for example by a Fourier transform. Often this operation is accomplished via a bank of linear bandpass filters processing an input signal, s(t). For example, a widely used model of the cochlea is a gammatone filter bank (Patterson, et al., 1992). For comparison with the Large model, it can be written as a differential equation
ż=z(α+iω)+s(t) (1)
where the overdot denotes differentiation with respect to time (for example, dz/dt), z is a complex-valued state variable (a function of time), ω is radian frequency (ω=2πf, f in Hz), and α, for which α<0 in the prior art model, is a linear damping parameter. The term s(t) denotes linear forcing by a time-varying external signal. For simplicity, in the above and following equations, we write z for the ith filter or oscillator z_i. Because z is a complex number at every time t, it can be rewritten in polar coordinates, revealing system behavior in terms of amplitude, r, and phase, φ. Resonance in a linear system means that the system oscillates at the frequency of stimulation, with amplitude and phase determined by system parameters. As stimulus frequency, ω0, approaches the oscillator frequency, ω, oscillator amplitude, r, increases, providing band-pass filtering behavior.
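The band-pass behavior of the linear oscillator can be checked with a short calculation. The sketch below assumes a complex sinusoidal stimulus s(t) = F·exp(iω0t) and uses the resulting closed-form steady state rather than a numerical integration. The function name, the 440 Hz tuning, and the damping value are illustrative assumptions.

```python
import math

def linear_steady_state_amplitude(alpha, omega, omega0, force=1.0):
    """Steady-state amplitude r of dz/dt = z*(alpha + i*omega) + F*exp(i*omega0*t).

    Substituting z = A*exp(i*omega0*t) gives A = F / (i*(omega0 - omega) - alpha),
    which stays bounded for a damped oscillator (alpha < 0).
    """
    if alpha >= 0:
        raise ValueError("the linear model requires alpha < 0")
    return abs(force / complex(-alpha, omega0 - omega))

omega = 2 * math.pi * 440.0   # oscillator tuned to 440 Hz (illustrative)
alpha = -50.0                 # illustrative linear damping
for f0 in (220.0, 400.0, 440.0, 480.0, 880.0):
    r = linear_steady_state_amplitude(alpha, omega, 2 * math.pi * f0)
    print(f"stimulus {f0:6.1f} Hz -> amplitude {r:.5f}")
```

The amplitude is largest when the stimulus frequency equals the oscillator frequency and falls off symmetrically on either side, which is the filtering behavior described above.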
Recently, nonlinear models of the cochlea have been proposed to simulate the nonlinear responses of outer hair cells. It is important to note that outer hair cells are thought to be responsible for the cochlea's extreme sensitivity to soft sounds, excellent frequency selectivity and amplitude compression (e.g., Eguiluz, Ospeck, Choe, Hudspeth, & Magnasco, 2000). Models of nonlinear resonance that explain these properties have been based on the Hopf normal form for nonlinear oscillation, and are generic. Truncated normal form models, as known from Large, may be expressed as
ż=z(α+iω+β|z|²)+s(t)+h.o.t. (2)
Note the surface similarities between this form and the linear oscillator of Equation 1. Again, z is the state of an oscillator represented by the real and imaginary parts of z at a point of time within a cycle, ω is radian frequency, and α is again a linear damping parameter. However in this nonlinear formulation, α becomes a bifurcation parameter which can assume both positive and negative values, as well as α=0. The value α=0 is termed a bifurcation point. β<0 is a nonlinear damping parameter, which prevents amplitude from blowing up when α>0. Again, s(t) denotes linear forcing by an external signal. The term h.o.t. denotes higher-order terms of the nonlinear expansion that are truncated (i.e., ignored) in normal form models. Like linear oscillators, nonlinear oscillators come to resonate with the frequency of an auditory stimulus; consequently, they offer a sort of filtering behavior in that they respond maximally to stimuli near their own frequency. However, there are important differences in that nonlinear models address behaviors that linear ones do not, such as extreme sensitivity to weak signals, amplitude compression and high frequency selectivity. The compressive gammachirp filterbank exhibits nonlinear behaviors similar to Equation 2, but is formulated within a signal processing framework (Irino & Patterson, 2006).
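The amplitude compression of the truncated normal form can be illustrated numerically. The sketch below integrates Equation 2 without the higher-order terms, at the bifurcation point α=0, for a stimulus at the oscillator's own frequency. The classical fourth-order Runge-Kutta integrator, the 1 Hz tuning, the step size, and all parameter values are assumptions made for this example rather than part of the described system.

```python
import cmath
import math

def hopf_rhs(z, t, alpha, omega, beta, force, omega0):
    """Right-hand side of Equation 2 with the higher-order terms omitted."""
    stimulus = force * cmath.exp(1j * omega0 * t)   # s(t), a complex sinusoid
    return z * (alpha + 1j * omega + beta * abs(z) ** 2) + stimulus

def steady_amplitude(force, alpha=0.0, beta=-1.0, freq_hz=1.0,
                     dt=0.02, duration=500.0):
    """Integrate from z = 0 with classical RK4 and return the final amplitude |z|."""
    omega = omega0 = 2 * math.pi * freq_hz          # stimulus at the natural frequency
    args = (alpha, omega, beta, force, omega0)
    z, t = 0j, 0.0
    for _ in range(round(duration / dt)):
        k1 = hopf_rhs(z, t, *args)
        k2 = hopf_rhs(z + 0.5 * dt * k1, t + 0.5 * dt, *args)
        k3 = hopf_rhs(z + 0.5 * dt * k2, t + 0.5 * dt, *args)
        k4 = hopf_rhs(z + dt * k3, t + dt, *args)
        z += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
    return abs(z)

weak = steady_amplitude(0.001)     # forcing amplitude F
strong = steady_amplitude(0.008)   # forcing amplitude 8F
print(f"F=0.001 -> r={weak:.4f}; F=0.008 -> r={strong:.4f}")
```

Under these assumed values the steady state satisfies F = |β|r³, so an eightfold increase in forcing should yield only about a twofold increase in response amplitude, whereas the linear oscillator of Equation 1 would scale its response by eight.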
Although the application of nonlinear oscillators and nonlinear modeling lends itself to mimic and produce outputs which represent very complex behaviors, previously unobtainable with linear models, the Large system suffers from the disadvantage that it too did not adequately process the entire frequency spectrum. The high order terms were not fully expanded. Rather, it was required that the characteristics of the wave form be known in advance, particularly the frequencies, so that only the most significant higher order terms are processed while the less significant terms are ignored even if their values do not go to 0. Therefore, a system for processing nonlinear oscillators to take advantage of and mimic substantially the entire complexity of an audio sound input is desired.
The present invention is directed to systems and methods designed to ascertain the structure of acoustic signals. The approach involves an alternative transform of an acoustic input signal, utilizing a network of nonlinear oscillators in which each oscillator is tuned to a distinct frequency, referred to as the natural or intrinsic frequency. Each oscillator receives input and interacts with the other oscillators in the network, yielding nonlinear resonances that are used to identify structure in an acoustic input signal. The output of the nonlinear frequency transform can be used as input to a system that will provide further analysis of the signal. According to one embodiment, the nonlinear responses are defined as a network of n expanded canonical oscillators z_i, with an input for each oscillator as a function of an external stimulus. In this way, the response of each oscillator to inputs that are not close to its natural frequency is accounted for.
Other objects, features and advantages of the present invention will be apparent from the written description and the drawings in which:
a is a diagram illustrating the basic structure of a nonlinear neural network showing an input signal;
b shows the graphical representation of an individual oscillator in a nonlinear oscillator network;
a and
In the current invention, a canonical model is utilized to account for all of the frequencies of the higher order terms. As a result, modeling the response of the nonlinear neural network requires no advance knowledge of the waveform: unlike the nonlinear operation of Large, which selects only the most significant higher order terms, the present method solves for all of the higher order terms.
This enables efficient computation of gradient frequency networks of nonlinear oscillators, representing a radical improvement to the technology. The canonical model (Equation 3, below) is related to the normal form (Equation 2; see e.g., Hoppensteadt & Izhikevich, 1997; Murdock, 2003), but it has properties beyond those of Hopf normal form models because the underlying, more realistic oscillator model is fully expanded, rather than truncated. The complete expansion of higher-order terms produces a model of the form
Equation 3 describes a network of n nonlinear oscillators and, as will be discussed, solves for the response of each oscillator, i.e., the response at each frequency of the system. The oscillatory dynamics of Equation 3 follow well-known cases such as Andronov-Hopf and generalized Andronov-Hopf (Bautin) bifurcations (Guckenheimer & Holmes, 1983; Guckenheimer & Kuznetsov, 2007; Wiggins, 1990; Murdock, 2003).
There are surface similarities between the models of Equations 2 and 3. The parameters ω, α and β1 correspond to the parameters of the truncated model of Equation 2. However, β2 is an additional amplitude compression parameter. Two frequency detuning parameters, δ1 and δ2, are new in this formulation; they make oscillator frequency dependent upon amplitude, to better mimic the real-world behavior of the hair cell inputs found in the ear. The parameter ε controls the amount of nonlinearity in the system.
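For orientation, one way the parameters named above could enter a fully expanded oscillator equation is sketched below. This is an assumed form, patterned on published canonical models of gradient frequency neural networks, and is not offered as the exact expression of Equation 3. In particular, the placement of ε and the closed-form fraction are assumptions.

```latex
\dot{z} = z\left(\alpha + i\omega + (\beta_1 + i\delta_1)\,|z|^{2}
        + \frac{\varepsilon\,(\beta_2 + i\delta_2)\,|z|^{4}}{1 - \varepsilon\,|z|^{2}}\right) + RT
```

In this assumed form the fraction is the sum of a geometric series in ε|z|², so it is finite only while |z| < 1/√ε. Setting ε = 0 removes the fraction and leaves a truncated model with complex cubic coefficient β1 + iδ1.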
RT (resonant terms) represents a general expression mainly consisting of nonlinear (resonant) monomials. These nonlinearities are critical for pattern recognition and auditory scene analysis capabilities. In general, the canonical model given by Equation 3 is more general than the Hopf normal form and encompasses a wide variety of behaviors that are observed neither in the Large use of Hopf normal form, nor in linear oscillators (filters).
Higher order terms of the normal form are necessary to capture the response of an oscillator to input that is not close to its natural frequency. In Large, coupling terms were written as sums of higher order terms based on normal form theory, which is known in the art. The present invention employs the linear relationship, or resonance, given by Equation 4 in terms of the system's eigenvalues. The behavior of the system is a function of the intrinsic frequency of each oscillator in the system; this method automatically accounts for those values which go to zero, and those which remain with significant resonance. Note that near an Andronov-Hopf bifurcation, the absolute values of the eigenvalues of a canonical oscillator system are the same as their natural frequencies {ω1, . . . , ωn} (Hoppensteadt & Izhikevich, 1996, 1997). In this case, the resonance relationship satisfies:
ωr=m1ω1+ . . . +mnωn; n∈ℕ; m1, . . . , mn∈ℤ; ωr, ω1, . . . , ωn∈ℝ (4)
wherein ℤ=set of all integers, ℕ=set of all positive integers, and ℝ=set of all real numbers.
The number ωr is known as the resonant frequency and is typically restricted to be positive.
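The resonance relation of Equation 4 can be explored by brute force. The sketch below enumerates integer coefficient tuples up to a chosen bound and keeps the combinations that give a positive resonant frequency. The function name and the bound `max_order` are assumptions introduced for the example. Frequencies are given in Hz, which differs from radian frequency only by the constant factor 2π.

```python
from itertools import product

def resonant_frequencies(natural_freqs, max_order=1):
    """Map each positive resonant frequency to the integer tuples m that produce it.

    Evaluates omega_r = m_1*omega_1 + ... + m_n*omega_n for every tuple m whose
    entries are integers in [-max_order, max_order], keeping only omega_r > 0.
    """
    resonances = {}
    for m in product(range(-max_order, max_order + 1), repeat=len(natural_freqs)):
        omega_r = sum(m_i * w_i for m_i, w_i in zip(m, natural_freqs))
        if omega_r > 0:
            resonances.setdefault(omega_r, []).append(m)
    return resonances

table = resonant_frequencies([200, 300], max_order=1)
for omega_r in sorted(table):
    print(omega_r, table[omega_r])
```

For natural frequencies of 200 and 300 with coefficients limited to −1, 0 and 1, the positive resonances are the two natural frequencies, their sum (500), and their difference (100).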
These considerations lead to an expanded canonical oscillator model (e.g., Equation 3) for a nonlinear neural oscillator z under the influence of input x(t). In the expanded model, the resonant terms RT include all monomials obtained (as described above) satisfying Equation 4. Including all resonant monomials in RT allows the model to respond appropriately to external stimuli, regardless of frequency, because only the monomials that are resonant with the stimulus will have a significant effect on oscillator dynamics in the long term.
We can now define a network of n expanded canonical oscillators z_i, with external input x(t). From now on, to avoid notational complexity and depending on the context, it is assumed that x represents a function of time t, that is, x=x(t). In most applications, x is either an input signal s(t) or a signal originating from other oscillators. In more general cases, x may represent a set of parameters and functions of time.
As a first case, we consider an expansion of RT for a sinusoidal external stimulus of unknown frequency, x(t) = F e^(i(2πft+φ)); F, f, φ ∈ ℝ,
wherein F is the force (amplitude) of the signal, f is the frequency of the signal, and φ is the phase.
Equation 5 contains infinite geometric series that converge (see Equation 6) when |z|<1/√ε and |x|<1/√ε. Thus, the choice of ε constrains both the magnitude of the input and the magnitude of the oscillation.
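The convergence condition can be illustrated with a generic geometric series. The sketch below assumes a series whose common ratio is √ε·x, which is consistent with the stated bound |x| < 1/√ε but is an assumption about the form of the series. The values of ε and x are illustrative.

```python
def geometric_partial_sum(ratio, n_terms):
    """Sum of ratio**k for k = 0 .. n_terms - 1."""
    total, term = 0j, 1 + 0j
    for _ in range(n_terms):
        total += term
        term *= ratio
    return total

epsilon = 0.25                    # illustrative; the bound 1/sqrt(epsilon) is 2
x = 1.2 + 0.9j                    # |x| = 1.5, inside the bound
ratio = epsilon ** 0.5 * x        # assumed common ratio, |ratio| = 0.75
closed_form = 1 / (1 - ratio)     # limit of the series when |ratio| < 1
errors = [abs(geometric_partial_sum(ratio, n) - closed_form) for n in (5, 20, 80)]
print(errors)
```

The partial sums approach the closed form as more terms are included. For |x| at or beyond 1/√ε the ratio has magnitude of at least one and the partial sums no longer settle.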
The series converge as follows,
Consider the relation between Equation 3 and the result shown in Equation 6, derived in the prior Large art. Equation 6 suggests, here presented as new art, a generalization for RT defined as a product of a coupling factor c and two functions: a passive factor 𝒫(ε, x) and an active factor 𝒜(ε, z). We can write Equation 6 as
In the above case, x represents a single component frequency (sinusoidal) signal. In this new art we generalize RT. In the general case, x can represent an external input (e.g., a sound) of any complexity, or x can represent a coupling matrix, A, times a vector of oscillators, z. In the latter case,
x = Σ_j a_j z_j
where a_j ranges over the entries of a row of the matrix A (i.e., the a_j form a row vector) and z_j is the jth oscillator in a column vector representing the network state. Note that in both cases, x is a complex input signal to an oscillator. Also, in both cases x(t) can be written as a sum of frequency components
x(t) = Σ_j x_j(t)
where x_j represents a frequency component of the input signal defined as
x_j(t) = F_j e^(i(2πf_j t + φ_j)).
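Here, F_j represents the forcing amplitude, f_j the component's frequency, φ_j the phase, and t is time. Given the general definition of x and x_j above, the passive factor 𝒫(ε, x) can be formulated as a function consisting of (resonant) monomials from the set
{ ε^((−1+Σ_j(p_j+q_j))/2) x_1^(p_1) x̄_1^(q_1) ⋯ x_n^(p_n) x̄_n^(q_n) : n ∈ ℕ, ε ∈ ℝ }
where the p_j and q_j are non-negative integer exponents, an overbar denotes complex conjugation, and the coefficient ε^((−1+Σ_j(p_j+q_j))/2) is set by the total degree Σ_j(p_j+q_j) of the monomial.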
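The coupled-input case described above, in which x is a row of the coupling matrix A applied to the network state, amounts to a matrix-vector product. The sketch below computes that input for every oscillator at once. The function name, the nearest-neighbor coupling pattern, and the state values are assumptions chosen for illustration.

```python
def coupled_inputs(coupling, state):
    """Return x_i = sum_j a_ij * z_j for every oscillator i."""
    if any(len(row) != len(state) for row in coupling):
        raise ValueError("each row of the coupling matrix needs one entry per oscillator")
    return [sum(a_ij * z_j for a_ij, z_j in zip(row, state)) for row in coupling]

# Three oscillators, each coupled to its immediate neighbors with strength 0.5.
A = [[0.0, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.0]]
z = [0.1 + 0.0j, 0.0 + 0.2j, -0.1 + 0.0j]   # complex oscillator states
x = coupled_inputs(A, z)
print(x)
```

Each entry of the result is the complex input signal received by one oscillator, so the contributions of the two outer oscillators cancel at the middle one in this example.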
The formulation of the passive factor 𝒫(ε, x) in Equation 7 can be generalized to include other components as follows.
The generalized form of the passive nonlinearity 𝒫(ε, x) consists of a sum of expressions formed from elements of the set above. More specifically, 𝒫(ε, x) consists of the sum of all monomials which correspond to positive frequencies ωr in the resonance relation Equation 4. It is expressed as:
𝒫(ε, x) = Σ ε^((−1+Σ_j(p_j+q_j))/2) x_1^(p_1) x̄_1^(q_1) ⋯ x_n^(p_n) x̄_n^(q_n) (8)
To clarify, a monomial from the set is included in the sum of Equation 8 if the following four conditions are satisfied. 1) n is the number of (frequency) components of a signal or of oscillators, etc. 2) The p's and q's are positive integers or 0, at least one of the p's is not zero. 3) The total number of nonzero p's and q's is less than or equal to n. 4) The resonance relation Equation 4 is satisfied with a positive resonant frequency, i.e.,
ωr=p1ω1+ . . . +pnωn−(q1ω1+ . . . +qnωn)>0
and by rewriting we get
ωr=(p1−q1)ω1+ . . . +(pn−qn)ωn>0
where the coefficients m1, . . . , mn of Equation 4 become
m1=(p1−q1), . . . , mn=(pn−qn)
Using this form of the passive part 𝒫(ε, x) provides a very general form of RT, where RT = c 𝒫(ε, x) 𝒜(ε, z).
A more explicit way of expressing this form of the passive nonlinearity 𝒫(ε, x) follows.
Let n = number of oscillators in a network or of frequency components of a signal, and let:
{1, 2, 3, . . . , n} = the index set of the oscillators or components.
{ω1, . . . , ωn} = the set of the natural frequencies of the oscillators or components.
N({1, . . . , n}) = power set of {1, . . . , n} \ {{ }, {1}, . . . , {n}} = set of all subsets of {1, . . . , n} minus the empty set and the singleton sets.
Recall that a partition of a set S is a set of nonempty subsets of S such that every element of S is in exactly one of these subsets. A k-partition of a set S is a partition of S of cardinality k. Also let:
P({1, . . . , n}) = a partition of {1, . . . , n}
P({1, . . . , n}, k) = a k-partition of {1, . . . , n}, 1 ≤ k ≤ n
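The set constructions defined above can be enumerated directly for small n. The sketch below lists the subsets of {1, . . . , n} that remain after removing the empty set and the singletons, and generates all partitions and k-partitions of the index set. The function names are hypothetical and the recursive enumeration is one possible approach among several.

```python
from itertools import combinations

def non_singleton_subsets(n):
    """Subsets of {1, ..., n} with at least two elements."""
    items = range(1, n + 1)
    return [set(c) for size in range(2, n + 1) for c in combinations(items, size)]

def partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + smaller

def k_partitions(items, k):
    """All partitions of `items` that have exactly k blocks."""
    return [p for p in partitions(items) if len(p) == k]

index_set = [1, 2, 3]
all_parts = list(partitions(index_set))
two_parts = k_partitions(index_set, 2)
print(len(non_singleton_subsets(3)), len(all_parts), two_parts)
```

The number of partitions grows rapidly with n (5 for n=3, 15 for n=4, 52 for n=5), which is one reason sums over partitions become expensive for large networks.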
Now we can write the passive part as:
where l is an index set and
h1 and h2 are frequency correcting factors.
Equation 9 provides a method for computing coupling within and/or between gradient frequency oscillator networks. The expression
contained in Equation 9 represents the complete set of harmonics present in a stimulus to which oscillators, e.g., in a GFNN, can resonate. Similarly, S1 and S2 represent a complete set of combination and difference frequencies. Thus, all higher order resonances are accounted for in this formulation.
There is another form of 𝒫(ε, x) similar to the one above (Equation 9) which simplifies further and reduces to a real-valued expression because S1 and S2 are complex conjugates. For this case, the frequency correcting factors H1 and H2 are not used.
Since the geometric series converge, S1 and S2 simplify further to produce:
where
Equation 10 provides a method for computing coupling within and/or between gradient frequency oscillator networks when there is no frequency correction on the resonant monomials. In this case 𝒫(ε, x) consists of finite expressions and is a real-valued signal.
The above are complicated expressions for the passive part of RT. They contain infinite sums as described above or large numbers of partitions to sum over for large n's. In practice these forms of RT may be difficult to use. The precise form of these expressions depends upon the frequencies present in the stimulus or frequencies of oscillators. To compute with the above expressions, one would have to obtain the frequency components of an input signal by Fourier analysis or some other technique. Moreover, because the computation is expensive in both space and time, one would have to limit the number of components and truncate the expansion of resonant monomials in Equation 9. This leads us to seek suitable approximations. One approximation is given by:
where x=Σxi or an input signal x=s(t).
Equation 11 provides a method for computing coupling within and/or between gradient frequency oscillator networks. It has the advantage that it can be applied to 1) external input comprised of any number of unknown frequency components 2) input from other oscillators within the same GFNN, or 3) input from oscillators in another GFNN. It is also far more efficient to compute than Equations 9 and 10, and it approximates Equation 9 quite closely.
An example comparing this approximation (gray curves) and the generalized RT (black dashed curves) is shown in
x1 = 0.1e^(2πi·200t), x2 = 0.1e^(2πi·300t), x3 = 0.1e^(2πi·400t)
From
Finally, we write RT in a general abstract form covering the entire class of scenarios including separate coupling terms for inputs from different sources. This includes internal couplings, external input and input from other networks as illustrated in
(t, xk) is the kth passive part,
(ε, z) is the kth active part, ck corresponds to the strength of coupling, and l is some index set. As an example employing this generalized RT, Equation 3 can be restated to include network layers and external input signals as in
where ω is the oscillator frequency in radians, α is a linear damping parameter, β is a nonlinear damping parameter, and δ is a detuning parameter that determines how the oscillator frequency depends upon amplitude.
Each Rk has a unique passive nonlinearity corresponding to the internal, external, afferent, and efferent couplings respectively. The active nonlinearities are as in Equation 7.
Reference is now made to
The oscillators of oscillator network 704 may be implemented in the form of a computer, which generates at least one frequency output useful for describing the time-varying structure of the input signal s(t). A transmitter 706 receives the signal and transmits it to an audio or visual display output. The computing device can be any computing device capable of analyzing a mathematical representation of a sound signal, such as a central processing unit (CPU), a field programmable gate array (FPGA) or an ASIC chip.
As can be seen from the above, it is possible to analyze complex wave signals utilizing an array of nonlinear oscillators in a manner which takes into account much more of the signal. By accounting for resonant terms and analyzing the acoustic signal in a nonlinear manner, the analysis may more closely mimic the manner in which the brain and auditory system actually operate on signals, so that more of the full range of audio responses can be mimicked. It is understood that modifications can be made to the described preferred embodiments of the invention by those skilled in the art. Therefore, it is intended that all matters in the foregoing description and shown in the accompanying drawings be interpreted as illustrative and not in a limiting sense. Thus, the scope of the invention is determined by the appended claims.
This application claims priority to U.S. Provisional Patent Application No. 61/299,743, filed Jan. 29, 2010, the entirety of which is hereby incorporated by reference.
The United States Government has rights in this invention pursuant to Contract No. FA9550-07-00095 between Air Force Office of Scientific Research and Circular Logic, LLC and Contract No. FA9550-07-C-0017 between Air Force Office of Scientific Research and Circular Logic, LLC.
Number | Date | Country
---|---|---
61299743 | Jan 2010 | US