Field of the Invention
The present invention relates to acoustics, and, in particular but not exclusively, to techniques for the capture of the spatial sound field on mobile devices, such as laptop computers, cell phones, and cameras.
Description of the Related Art
This section introduces aspects that may help facilitate a better understanding of the invention. Accordingly, the statements of this section are to be read in this light and are not to be understood as admissions about what is prior art or what is not prior art.
Due to the low cost of high-performance matched microphones and the commensurate increase in digital signal processing capabilities in mobile communication devices, realistic high-quality spatial audio pick-up from mobile devices is now becoming possible. Recording of spatial audio signals has been known since the invention of stereo recording at Bell Labs in the early 1930's. Gibson, Christensen, and Limberg in 1972, gave a fundamental description of three-dimensional audio spatial playback. See J. J. Gibson, R. M. Christensen, and A. L. R. Limberg, “Compatible FM Broadcasting of Panoramic Sound,” J. Audio Eng. Soc., vol. 20, pp. 816-822, December 1972, the teachings of which are incorporated herein by reference in their entirety. It is interesting that these authors discussed higher-order playback systems.
A first-order three-dimensional spatial recording was later proposed by Fellgett and Gerzon in 1975 who described a first-order “B-format ambisonic” SoundField® microphone array constructed of four cardioid capsules mounted in a tetrahedral arrangement. See Peter Fellgett, “Ambisonics, Part One: General System Description,” Studio Sound, vol. 17, no. 8, pp. 20-22, 40, August 1975; Michael Gerzon, “Ambisonics, Part Two: Studio Techniques,” Studio Sound, vol. 17, no. 8, pp. 24, 26, 28-30, August 1975; and U.S. Pat. No. 4,042,779, the teachings of all three of which are incorporated by reference in their entirety.
Later, Elko proposed a spherical microphone array with six pressure microphones mounted on a rigid sphere that utilized first-order spherical harmonics. See G. W. Elko, “A steerable and variable first-order differential microphone array,” IEEE ICASSP proceedings, April 1997, and U.S. Pat. No. 6,041,127, the teachings of both of which are incorporated herein by reference in their entirety.
More-accurate spatial recording using higher-order spherical harmonics or, equivalently, Higher-Order Ambisonics (HOA) was thought to be difficult to construct due to the required measurement of higher-order spatial derivative signals of the acoustic pressure field. The measurement of higher-order spatial derivatives is problematic due to the loss of SNR due to the natural high-pass nature of the acoustic pressure derivative signals and the commensurate need in post-processing to equalize these high-pass signals with a corresponding low-pass filter. Since the uncorrelated microphone self-noise and electrical noises of preamplifiers are invariant under differential processing, the low-pass equalization filter can amplify these noise components greatly, especially at lower frequencies and higher differential orders. One practical solution to extracting the higher-order differential modes by employing many pressure microphones mounted on a rigid spherical baffle and associated signal processing to extract the higher-order spatial spherical harmonics was proposed and patented by Meyer and Elko. See U.S. Pat. No. 7,587,054 (the “'054 patent”) and U.S. Pat. No. 8,433,075 (the “'075 patent”), the teachings of both of which are incorporated herein by reference in their entirety.
A mathematical series representation of a three-dimensional (3D) scalar pressure field is based on signals that are proportional to the zero-order and the higher-order pressure gradients of the field up to the desired highest order of the field series expansion. The basic zero-order omnidirectional term is the scalar acoustic pressure that can be measured by one or more of the pressure microphone elements. For all three first-order components, the acoustic pressure field is sufficiently sampled so that the three Cartesian orthogonal differentials can be resolved along with the acoustic pressure. Three first-order spatial derivatives in mutually orthogonal directions can be used to estimate the first-order gradient of the scalar pressure field. The smallest number of pressure microphones that span 3D space for up to first-order operation is therefore four microphones, preferably in a tetrahedral arrangement.
Certain embodiments of the present invention relate to a technique that processes audio signals from multiple microphones to generate a basis set of signals that are used for further post-processing for the manipulation or playback of spatial audio signals. Playback can be either over one or more loudspeakers or binaurally rendered over headphones.
Embodiments of the invention will become more fully apparent from the following detailed description, the appended claims, and the accompanying drawings in which like reference numerals identify similar or identical elements.
Detailed illustrative embodiments of the present invention are disclosed herein. However, specific structural and functional details disclosed herein are merely representative for purposes of describing example embodiments of the present invention. The present invention may be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein. Further, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments of the invention.
As used herein, the singular forms “a,” “an,” and “the,” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It further will be understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” specify the presence of stated features, steps, or components, but do not preclude the presence or addition of one or more other features, steps, or components. It also should be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
As used in this specification, the term “acoustic signals” refers to sounds, while the term “audio signals” refers to the analog or digital electronic signals that represent sounds, such as the electronic signals generated by microphones based on incoming acoustic signals and/or the electronic signals used by loudspeakers to render outgoing acoustic signals.
As used in this specification, the term “loudspeaker” refers to any suitable transducer for converting electronic audio signals into acoustic signals (including headphones), while the term “microphone” refers to any suitable transducer for converting acoustic signals into electronic audio signals. The electronic audio signal generated by a microphone is also referred to herein as a “microphone signal.”
Spatial Sound Fields
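An acoustic scalar pressure sound field can be expressed as the superposition of acoustic waves that obey the acoustic wave equation, which can be written for spherical coordinates according to Equation (1) as follows:
$$\frac{1}{r^2}\frac{\partial}{\partial r}\left(r^2\frac{\partial p}{\partial r}\right)+\frac{1}{r^2\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\frac{\partial p}{\partial\theta}\right)+\frac{1}{r^2\sin^2\theta}\frac{\partial^2 p}{\partial\phi^2}-\frac{1}{c^2}\frac{\partial^2 p}{\partial t^2}=0, \tag{1}$$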
where c is the speed of sound, and the pressure field p is a function of radial distance r, polar angle θ, azimuthal angle ϕ, and time t. For 3D sound fields, it is convenient (but not necessary) to express the wave equation in spherical coordinates.
The general solution for the scalar acoustic pressure field can be written as a separation of variables according to Equation (2) as follows:
p(r,θ,ϕ,t)=R(r)Θ(θ)Φ(ϕ)T(t), (2)
The general solution contains the radial spherical Hankel function R(r), the angular functions Θ(θ) and Φ(ϕ), as well as the time function T(t). If it is assumed that the time signal is periodic, then the time dependence can be dropped from Equation (2) without losing generality, where the periodicity is now represented as a spatial frequency (or wavenumber) k=ω/c=2π/λ, where ω is the angular frequency and λ is the acoustic wavelength. The angular functions include the associated Legendre function Θ(θ) in terms of the standard spherical polar angle θ (that is, the angle from the z-axis) and the complex exponential function Φ(ϕ) in terms of the standard spherical azimuthal angle ϕ (that is, the longitudinal angle in the x-y plane from the x-axis, where the counterclockwise direction is the positive direction).
The angular component (Θ(θ)Φ(ϕ)) of the solution is often condensed and written in terms of the complex spherical harmonics $Y_n^m(\theta,\phi)$ that are defined according to Equation (3) as follows:
$$Y_n^m(\theta,\phi)=\sqrt{\frac{(2n+1)}{4\pi}\frac{(n-m)!}{(n+m)!}}\,P_n^m(\cos\theta)\,e^{im\phi}, \tag{3}$$
where the index n is the order and the index m is the degree of the function (flipped from conventional terminology), the term under the square-root is a normalization factor to maintain orthonormality of the spherical harmonic functions (i.e., the inner product is unity for two functions with the same order and degree and zero for any other inner product of two functions where the order and/or the degree are not the same), $P_n^m(\cos\theta)$ is the associated Legendre function of order n and degree m, and i is the square root of −1.
The radial term (R(r)) of the solution can be written according to Equation (4) as follows:
$$R(r)=A\,h^{(1)}(kr)+B\,h^{(2)}(kr), \tag{4}$$
where A and B are general weighting coefficients and $h^{(1)}(kr)$ and $h^{(2)}(kr)$ are the spherical Hankel functions of the first and second kind, respectively. The first term on the right-hand side (RHS) of Equation (4) indicates an outgoing wave, while the second RHS term contains the form for incoming waves. The use of either Hankel function depends on the type of acoustic field problem that is being solved: either the first kind for the exterior field problem or the second kind for the solution to an interior field problem. An exterior problem determines an equation for the sound propagating from a region containing a sound source. An interior problem determines an equation for sound entering a region from one or more sound sources located outside the region of interest, like sound impinging on a microphone array from the farfield.
By completeness of the spherical harmonic functions, any traveling wave solution p(r,θ,ϕ,ω) that is continuous and mean-square integrable can be expanded as an infinite series according to Equation (5) as follows:
$$p(r,\theta,\phi,\omega)=\sum_{n=0}^{\infty}\sum_{m=-n}^{n}\left[A_{mn}\,h_n^{(1)}(kr)+B_{mn}\,h_n^{(2)}(kr)\right]Y_n^m(\theta,\phi). \tag{5}$$
For an interior problem with all sources outside the region of interest, the solution of Equation (5) can be reduced to a solution containing only the incoming wave component according to Equation (6) as follows:
$$p(r,\theta,\phi,\omega)=\sum_{n=0}^{\infty}\sum_{m=-n}^{n}B_{mn}\,j_n(kr)\,Y_n^m(\theta,\phi), \tag{6}$$
where the incoming wave represented by $h^{(2)}(kr)$ has to be finite at the origin and therefore the solution reduces to the spherical Bessel function $j_n$. At radius $r_0$, which defines the outer boundary of the surface of the interior region, the values of the weighting coefficients $B_{mn}$ are computed according to Equation (7) as follows:
$$B_{mn}=\frac{1}{j_n(kr_0)}\int_0^{2\pi}\!\!\int_0^{\pi}p(r_0,\theta,\phi,\omega)\,Y_n^m(\theta,\phi)^{*}\sin\theta\,d\theta\,d\phi, \tag{7}$$
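where the * indicates the complex conjugate. The terms $B_{mn}$ are the complex spherical harmonic Fourier coefficients, sometimes referred to as the multipole coefficients since they are related to the strength of the various “poles” that are represented by terms of a multipole expansion (monopole, dipole, quadrupole, etc.). Thus, the complete interior solution for any point (r,θ,ϕ) within the measurement radius ($r\le r_0$) can be written according to Equation (8) as follows:
$$p(r,\theta,\phi,\omega)=\sum_{n=0}^{\infty}\sum_{m=-n}^{n}\frac{j_n(kr)}{j_n(kr_0)}\,Y_n^m(\theta,\phi)\int_0^{2\pi}\!\!\int_0^{\pi}p(r_0,\theta',\phi',\omega)\,Y_n^m(\theta',\phi')^{*}\sin\theta'\,d\theta'\,d\phi'. \tag{8}$$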
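From the above equations, it can be seen that a scalar acoustic sound field can be represented by an infinite number of weighted spherical harmonic functions. Equation (9) shows a collection of the complex spherical harmonics up through first order as follows:
$$Y_0^0=\frac{1}{\sqrt{4\pi}},\quad Y_1^{-1}=\sqrt{\frac{3}{8\pi}}\,\sin\theta\,e^{-i\phi},\quad Y_1^{0}=\sqrt{\frac{3}{4\pi}}\,\cos\theta,\quad Y_1^{1}=-\sqrt{\frac{3}{8\pi}}\,\sin\theta\,e^{i\phi}. \tag{9}$$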
The zeroth order of the field represents the “omnidirectional” component in that this spherical harmonic does not have any dependency on θ or ϕ. The first-order terms contain three components that are equivalent to three orthogonal dipoles, one along each Cartesian axis. The weighting of each spherical harmonic in the representation depends on the actual acoustic field. Additionally, as mentioned previously, the solution to the wave equation also contains frequency-dependent weighting terms that are the spherical Bessel functions of the first kind, which are related to the Hankel functions of the first kind.
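By way of illustration only, the orthonormality property described above can be checked numerically for the zeroth-order and first-order complex spherical harmonics. The following Python sketch is a non-limiting example. It assumes the common sign convention that includes the Condon-Shortley phase (other conventions differ only in the sign of individual terms), and the function names and the midpoint-rule grid sizes are hypothetical choices made for this example.

```python
import cmath
import math

def sph_harm_first_order(n, m, theta, phi):
    """Complex spherical harmonic Y_n^m for n <= 1 (Condon-Shortley phase assumed)."""
    if (n, m) == (0, 0):
        return complex(1.0 / math.sqrt(4.0 * math.pi))
    if (n, m) == (1, 0):
        return complex(math.sqrt(3.0 / (4.0 * math.pi)) * math.cos(theta))
    if n == 1 and m in (-1, 1):
        # Under the assumed convention, Y_1^{+1} carries a minus sign and Y_1^{-1} does not.
        amp = -m * math.sqrt(3.0 / (8.0 * math.pi)) * math.sin(theta)
        return amp * cmath.exp(1j * m * phi)
    raise ValueError("this sketch only covers n = 0 and n = 1")

def inner_product(nm_a, nm_b, n_theta=200, n_phi=16):
    """Midpoint-rule estimate of the integral of Y_a * conj(Y_b) over the unit sphere."""
    d_theta = math.pi / n_theta
    d_phi = 2.0 * math.pi / n_phi
    total = 0j
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        for k in range(n_phi):
            phi = (k + 0.5) * d_phi
            ya = sph_harm_first_order(nm_a[0], nm_a[1], theta, phi)
            yb = sph_harm_first_order(nm_b[0], nm_b[1], theta, phi)
            total += ya * yb.conjugate() * math.sin(theta) * d_theta * d_phi
    return total

# The four basis functions up through first order: one monopole and three dipole terms.
BASIS = [(0, 0), (1, -1), (1, 0), (1, 1)]
for a in BASIS:
    print(a, [round(abs(inner_product(a, b)), 3) for b in BASIS])
```

Each printed row is expected to be close to a row of the 4-by-4 identity matrix, within the accuracy of the midpoint rule.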
If the sound field is sampled on a small sphere of radius a<r0, then the above field equations can be used to compute any of the spherical harmonic components at radius a from only the knowledge of the acoustic pressure on the surface defined by r=r0. If it is assumed that (i) the signal is from a farfield source and can be modeled as an incident plane wave with wavevector k and (ii) r is defined as the radius vector from the origin of the coordinate system, then the solution can be simplified according to Equation (10) as follows:
$$e^{i\mathbf{k}\cdot\mathbf{r}}=4\pi\sum_{n=0}^{\infty}i^{n}\,j_n(kr)\sum_{m=-n}^{n}Y_n^m(\theta_r,\phi_r)\,Y_n^m(\theta_k,\phi_k)^{*}. \tag{10}$$
See Earl G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustic Holography, Academic Press, 1999, the teachings of which are incorporated herein by reference in their entirety.
The spherical Bessel function $j_n(kr)$ near the origin (where kr<<1) can be approximated by the small-argument approximation according to Equation (11) as follows:
$$j_n(kr)\approx\frac{(kr)^n}{(2n+1)!!}, \tag{11}$$
where the double factorial indicates the product of only odd integers up to and including the argument. Equation (11) shows that a spherical harmonic expansion of an incident plane wave around the origin contains frequency-dependent terms that are proportional to $\omega^n$ (recall that k=ω/c) where n is the order. Only the zeroth-order term is non-zero in the limit as r→0, which is intuitive since this would represent the case of a single pressure microphone which can sample only the zeroth-order component of the incident wave. It should also be noted that the frequency-response term $(kr)^n$ in Equation (11) is identical to that of an nth-order differential microphone. Differential microphone arrays are closely related to the multipole expansion of sound fields where the source is modeled in terms of spatial derivatives along the Cartesian axes. The spherical harmonic expansion is not the same as the multipole expansion since the multipole expansion cannot be represented as a set of orthogonal polynomials beyond first order. For first-order expansions, both the multipole and the spherical harmonic expressions contain the zeroth-order pressure term and three orthogonal dipoles with the dipole terms having a first-order high-pass response for spatial sampling when kr<<1.
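By way of illustration only, the small-argument behavior described above, in which the spherical Bessel function of order n is approximated by (kr)^n divided by the double factorial (2n+1)!!, can be checked numerically. The following Python sketch is a non-limiting example. The function names are hypothetical, and only the well-known closed-form expressions for orders 0 through 2 are implemented.

```python
import math

def spherical_jn(n, x):
    """Spherical Bessel function of the first kind, closed forms for n = 0, 1, 2."""
    if n == 0:
        return math.sin(x) / x
    if n == 1:
        return math.sin(x) / x**2 - math.cos(x) / x
    if n == 2:
        return (3.0 / x**3 - 1.0 / x) * math.sin(x) - 3.0 * math.cos(x) / x**2
    raise ValueError("only n = 0, 1, 2 are implemented in this sketch")

def double_factorial(m):
    """Product m * (m - 2) * (m - 4) * ... down to 1 or 2."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def small_arg_jn(n, x):
    """Small-argument approximation x**n / (2n + 1)!!, valid for x << 1."""
    return x**n / double_factorial(2 * n + 1)

kr = 0.1  # example value satisfying kr << 1
for n in range(3):
    exact = spherical_jn(n, kr)
    approx = small_arg_jn(n, kr)
    print(f"n={n}: exact={exact:.6e}, approx={approx:.6e}")
```

Halving kr reduces the order-n term by a factor of about 2^n, which is the nth-order high-pass behavior noted above.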
From the previous discussion, first-order scalar acoustic field decomposition requires only the zeroth-order monopole and three first-order orthogonal dipole components as defined in Equation (9). These four basis signals define the Ambisonics “B-Format” spatial audio recording scheme. Thus, spatial recording of a soundfield with a small device (a device that can be smaller than the acoustic wavelength) can involve the measurement of signals that are related to spatial pressure and pressure differentials of at least first order. The next section describes how to measure the first-order pressure differential. Higher-order decompositions are described in the '054 patent, the '075 patent, and Boaz Rafaely, Fundamentals of Spherical Array Processing, Springer 2015, the teachings of which are incorporated herein by reference in their entirety.
Differential Microphone Arrays
Differential microphones respond to spatial differentials of a scalar acoustic pressure field. The highest order of the differential components that the microphone responds to denotes the order of the microphone. Thus, a microphone that responds to both the acoustic pressure and the first-order difference of the pressure is denoted as a first-order differential microphone. One requisite for a microphone to respond to the spatial pressure differential is the implicit constraint that the microphone size is smaller than the acoustic wavelength. Differential microphone arrays can be seen as directly analogous to finite-difference estimators of continuous spatial-field derivatives along the direction of the microphone elements. Differential microphones also share strong similarities to superdirectional arrays used in electromagnetic antenna design and multipole expansions used to model acoustic radiation. The well-known problems with implementation of superdirectional arrays are the same as those encountered in the realization of differential microphone arrays. It has been found that a practical limit for differential microphones using currently available transducers is at third order. See G. W. Elko, “Superdirectional Microphone Arrays,” Acoustic Signal Processing for Telecommunication, Kluwer Academic Publishers, Chapter 10, pp. 181-237, March, 2000, the teachings of which are incorporated herein by reference in their entirety.
First-Order Dual-Microphone Array
The output $m_i(t)$ of each of two microphones spaced a distance d apart, for a time-harmonic plane wave of amplitude $S_o$ and frequency ω incident from angle θ, can be written according to Equation (12) as follows:
$$\begin{aligned}m_1(t)&=S_o\,e^{j\omega t-jkd\cos(\theta)/2}\\ m_2(t)&=S_o\,e^{j\omega t+jkd\cos(\theta)/2},\end{aligned} \tag{12}$$
where j is the square root of −1.
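The output E(θ,t) of a weighted addition of the two microphones can be written according to Equation (13) as follows:
$$E(\theta,t)=w_1m_1(t)+w_2m_2(t)=S_o\,e^{j\omega t}\left[(w_1+w_2)+(w_2-w_1)\,\frac{jkd\cos(\theta)}{2}+\text{h.o.t.}\right], \tag{13}$$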
where w1 and w2 are weighting values applied to the first and second microphone signals, respectively, and “h.o.t.” denotes higher-order terms.
When kd<<π, the higher-order terms can be neglected. If w1=−w2, then we have the pressure difference between two closely spaced microphones. This specific case results in a dipole directivity pattern cos (θ) as can easily be seen in Equation (13), which is also the pattern of the first-order spherical harmonic. Any first-order differential microphone beampattern can be written as the sum of a zero-order (omnidirectional) term and a first-order dipole term (cos(θ)). Thus, a first-order differential microphone has a normalized directional pattern E that can be written according to Equation (14) as follows:
E(θ)=α±(1−α)cos(θ), (14)
where typically 0≤α≤1, such that the response is normalized to have a maximum value of 1 at θ=0°, and for generality, the ± indicates that the pattern can be defined as having a maximum either at θ=0° or θ=π. One implicit property of Equation (14) is that, for 0≤α≤1, there is a maximum at θ=0° and a minimum at an angle between π/2 and π. For values of 0.5<α≤1, the response has a minimum at π, although there is no zero in the response. A microphone with this type of directivity is typically called a “sub-cardioid” microphone.
When α=0.5, the parametric algebraic equation has a specific form called a cardioid. The cardioid pattern has a zero response at θ=180°. For values of 0≤α≤0.5, there is a null at angle $\theta_{null}$ as given by Equation (15) as follows:
$$\theta_{null}=\cos^{-1}\!\left(\frac{\alpha}{\alpha-1}\right). \tag{15}$$
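By way of illustration only, the first-order pattern of Equation (14) (taking the positive sign) and its null angle can be evaluated numerically. The following Python sketch is a non-limiting example. The function names are hypothetical, and the null angle is obtained by solving α+(1−α)cos(θ)=0 for θ.

```python
import math

def first_order_pattern(alpha, theta):
    """Normalized first-order response E(theta) = alpha + (1 - alpha) * cos(theta)."""
    return alpha + (1.0 - alpha) * math.cos(theta)

def null_angle(alpha):
    """Null angle in radians for 0 <= alpha <= 0.5 (no null exists for larger alpha)."""
    if not 0.0 <= alpha <= 0.5:
        raise ValueError("a null exists only for 0 <= alpha <= 0.5")
    return math.acos(alpha / (alpha - 1.0))

# alpha = 0 is the dipole, alpha = 0.25 the hypercardioid, alpha = 0.5 the cardioid.
for name, alpha in [("dipole", 0.0), ("hypercardioid", 0.25), ("cardioid", 0.5)]:
    angle = null_angle(alpha)
    print(f"{name}: null at {math.degrees(angle):.1f} degrees, "
          f"response there = {first_order_pattern(alpha, angle):.1e}")
```

As α increases from 0 to 0.5, the null moves from 90° toward 180°, consistent with the cardioid having its zero at θ=180°.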
A computationally simple and elegant way to form a general first-order differential microphone is to form a scalar combination of forward-facing and backward-facing cardioid signals. These signals can be obtained by using both solutions in Equation (14) and setting α=0.5. The sum of these two cardioid signals is omnidirectional (since the cos (θ) terms subtract out), and the difference is a dipole pattern (since the constant term α subtracts out).
A practical way to realize the back-to-back cardioid arrangement shown in
By combining the microphone signals defined in Equation (12) with the delay and subtraction as shown in
$$C_F(kd,\theta)=-2jS_o\sin\!\left(kd[1+\cos\theta]/2\right). \tag{16}$$
Similarly, the backward-facing cardioid signal $C_B(kd,\theta)$ can be written according to Equation (17) as follows:
$$C_B(kd,\theta)=-2jS_o\sin\!\left(kd[1-\cos\theta]/2\right). \tag{17}$$
If both the forward-facing and backward-facing cardioid signals are averaged together, then the resulting output is given according to Equation (18) as follows:
$$E_{c\text{-}omni}(kd,\theta)=\tfrac{1}{2}\left[C_F(kd,\theta)+C_B(kd,\theta)\right]=-2jS_o\sin(kd/2)\cos\!\left([kd/2]\cos\theta\right). \tag{18}$$
For small kd, Equation (18) has a frequency response that is a first-order high-pass function, and the directional pattern is omnidirectional.
The subtraction of the forward-facing and backward-facing cardioids yields the dipole response according to Equation (19) as follows:
$$E_{c\text{-}dipole}(kd,\theta)=C_F(kd,\theta)-C_B(kd,\theta)=-4jS_o\cos(kd/2)\sin\!\left([kd/2]\cos\theta\right). \tag{19}$$
A dipole constructed by subtracting the two pressure microphone signals has the response given by Equation (20) as follows:
$$E_{dipole}(kd,\theta)=-2jS_o\sin\!\left([kd/2]\cos\theta\right). \tag{20}$$
One observation to be made from Equation (20) is that, for signals arriving along the axis of the microphone pair, the first zero of the pressure-difference dipole occurs at kd=2π, which is twice the value (kd=π) at which the first zero occurs for both the cardioid-derived omnidirectional term of Equation (18) (i.e., an omnidirectional signal formed by summing two back-to-back cardioids) and the cardioid-derived dipole term of Equation (19) (i.e., a dipole signal formed by differencing two back-to-back cardioids).
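By way of illustration only, the on-axis zeros discussed above can be located numerically by evaluating the cardioid expressions of Equations (16) and (17) and forming their average and difference directly. The following Python sketch is a non-limiting example. The function names are hypothetical, and the common time factor is omitted.

```python
import math

def c_forward(kd, theta, s0=1.0):
    """Forward-facing cardioid response of Equation (16)."""
    return -2j * s0 * math.sin(kd * (1.0 + math.cos(theta)) / 2.0)

def c_backward(kd, theta, s0=1.0):
    """Backward-facing cardioid response of Equation (17)."""
    return -2j * s0 * math.sin(kd * (1.0 - math.cos(theta)) / 2.0)

def cardioid_omni(kd, theta, s0=1.0):
    """Average of the two back-to-back cardioids."""
    return 0.5 * (c_forward(kd, theta, s0) + c_backward(kd, theta, s0))

def cardioid_dipole(kd, theta, s0=1.0):
    """Difference of the two back-to-back cardioids."""
    return c_forward(kd, theta, s0) - c_backward(kd, theta, s0)

def pressure_dipole(kd, theta, s0=1.0):
    """Dipole formed by subtracting the two pressure signals, Equation (20)."""
    return -2j * s0 * math.sin((kd / 2.0) * math.cos(theta))

# On-axis (theta = 0) magnitudes at kd = pi and kd = 2*pi.
for kd in (math.pi, 2.0 * math.pi):
    print(f"kd={kd:.4f}: "
          f"|omni|={abs(cardioid_omni(kd, 0.0)):.2e}, "
          f"|cardioid dipole|={abs(cardioid_dipole(kd, 0.0)):.2e}, "
          f"|pressure dipole|={abs(pressure_dipole(kd, 0.0)):.2e}")
```

At kd=π, both cardioid-derived responses vanish on-axis while the pressure-difference dipole is at its maximum magnitude; the pressure-difference dipole does not reach its first on-axis zero until kd=2π.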
Diffractive Differential Beamformer
Under real-world design constraints, it is usually not possible to place a pair of microphones on the device such that a simple delay filter as discussed above can be used to form the desired cardioid base beampatterns. Devices like laptop computers, tablets, and cell phones are typically thin and do not provide a microphone baseline spacing that is sufficient for good endfire dual-microphone operation. As the inter-microphone spacing decreases, the commensurate loss in SNR (similar to small kr in spherical beamforming as shown in Equation (11)) and increase in sensitivity to microphone-element mismatch can severely limit the performance of the beamformer. However, it is possible to exploit the acoustic scattering and diffraction by properly placing the microphones on thin devices.
It is well known that acoustic diffraction and scattering can dramatically change the phase and amplitude differences between pressure microphones as the sound propagates around a device. The resulting phase and magnitude differences are also dependent on frequency and angle of incidence of the impinging sound wave. Acoustic diffraction and filtering is a complicated process, and a full closed-form mathematical solution is possible with only a few limited diffractive bodies (infinite cylinder, sphere, disk, etc.). However, at frequencies where the acoustic wavelength is much larger than the body on which the microphones are mounted, it is possible to make general statements as to how the magnitude and phase delay will change as a result of the diffraction and scattering of an impinging sound wave.
In general, at frequencies where the device body is much smaller than the acoustic wavelength, the amplitude differences will be small and the phase delay is typically (but not necessarily) a monotonically increasing function as the frequency increases (just like the on-axis phase for microphones that are not mounted on any device). The phase delay can depend greatly on the positions of the microphones on the supporting device body, the angle of sound incidence, and the geometric shape of the boundaries.
The resulting equalized signals 607₁ and 607₂ are respectively applied to diffraction filters 608₁ and 608₂, which apply respective transfer functions $h_{12}$ and $h_{21}$, where the transfer function $h_{12}$ represents the effect that the device has on the acoustic pressure for a first acoustic signal arriving at microphone 602₁ along a first propagation axis and propagating around and through the device to microphone 602₂, and transfer function $h_{21}$ represents the effect that the device has on the acoustic pressure for a second acoustic signal arriving at microphone 602₂ along a second propagation axis and propagating around and through the device to microphone 602₁. The transfer functions may be based on measured impulse responses. For an adaptive beamformer, the first and second propagation axes should be collinear with the line passing through the two microphones, with the first and second acoustic signals arriving from opposite directions. Note that, in other implementations, the first and second propagation axes may be non-collinear. Diffraction filters 608₁ and 608₂ may be implemented using finite impulse response (FIR) filters whose order (e.g., number of taps and coefficients) is based on the timing of the measured impulse responses around the device. The length of the filter could be less than the full impulse response length but should be long enough to capture the bulk of the impulse response energy. Although the causes of the impact of the physical device on the characteristics of the acoustic signals are referred to as diffraction and scattering, it will be understood that, since the diffraction filters 608 are derived from actual measurements, the diffraction filters take into account any effects on the acoustic signals resulting from the device including, but not necessarily limited to, acoustic diffraction, acoustic scattering, and acoustic porting.
Subtraction node 610₁ subtracts the filtered signal 609₁ received from the diffraction filter 608₁ from the equalized signal 607₂ received from the matching filter 606₂ to generate a first difference signal 611₁. Similarly, subtraction node 610₂ subtracts the filtered signal 609₂ received from the diffraction filter 608₂ from the equalized signal 607₁ received from the matching filter 606₁ to generate a second difference signal 611₂. Equalization filters 612₁ and 612₂ apply equalization functions $h_{1eq}$ and $h_{2eq}$, respectively, to the difference signals 611₁ and 611₂ to generate the backward and forward base beampatterns 613₁ ($c_B(n)$) and 613₂ ($c_F(n)$). Measurements of the two transfer functions $h_{12}$ and $h_{21}$ made on cell phone and tablet bodies for on-axis sound for both the forward and backward directions have shown that it is possible to form the first-order cardioid base beampatterns $c_B(n)$ and $c_F(n)$ at lower frequencies. Equalizers $h_{1eq}$ and $h_{2eq}$ are post filters that set the desired frequency responses for the two output beampatterns.
Beampattern selection block 614 generates the scale factor β that is applied to the backward base beampattern 613₁ by the multiplication node 616. The resulting scaled signal 617 is subtracted from the forward base beampattern 613₂ at the subtraction node 618, and the resulting beampattern difference signal 619 is applied to output equalizer 620 to generate the output beampattern signal 621. The parameter β is used to control the desired output beampattern. To obtain the zero-order omnidirectional component, the parameter is set to β=−1, and to β=1 for the pressure differential dipole term. Output equalizer 620 applies an output equalization filter $h_L$ that compensates for the overall output beamformer frequency response. See U.S. Pat. Nos. 8,942,387 and 9,202,475, the teachings of which are incorporated herein by reference in their entirety.
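By way of illustration only, the signal flow described above (diffraction filtering, cross-subtraction to form the backward and forward base beampatterns, and combination using the scale factor β) can be sketched in the time domain. The following Python sketch is a non-limiting example under several simplifying assumptions that are not taken from the description: the matching filters, the equalization filters, and the output equalizer are treated as identity operations; the transfer functions h12 and h21 are modeled as a pure two-sample delay rather than being derived from measured impulse responses; and all function and variable names are hypothetical.

```python
def fir_filter(x, h):
    """Causal FIR filtering of sequence x with taps h, truncated to len(x) samples."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]

def base_beams(x1, x2, h12, h21):
    """Form the backward and forward base beams from two microphone signals.

    Matching and equalization filters are assumed to be identity in this sketch.
    """
    filtered_1 = fir_filter(x1, h12)  # microphone-1 signal through diffraction filter h12
    filtered_2 = fir_filter(x2, h21)  # microphone-2 signal through diffraction filter h21
    c_b = [b - a for a, b in zip(filtered_1, x2)]  # backward beam: x2 - h12 * x1
    c_f = [b - a for a, b in zip(filtered_2, x1)]  # forward beam:  x1 - h21 * x2
    return c_b, c_f

def combine(c_f, c_b, beta):
    """Output beam: forward beam minus beta times backward beam."""
    return [f - beta * b for f, b in zip(c_f, c_b)]

# Assumed example: sound arriving from the microphone-1 side reaches microphone 2
# after the same two-sample delay that the diffraction filter models.
h_delay = [0.0, 0.0, 1.0]
x1 = [1.0, 0.5, -0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
x2 = fir_filter(x1, h_delay)

c_b, c_f = base_beams(x1, x2, h_delay, h_delay)
dipole = combine(c_f, c_b, 1.0)   # beta = 1 selects the dipole term
omni = combine(c_f, c_b, -1.0)    # beta = -1 selects the omnidirectional term
print("backward beam:", c_b)
print("forward beam: ", c_f)
```

In this example, the backward base beam is identically zero for sound arriving from the microphone-1 side, which is the null expected of a backward-facing cardioid, so the output equals the forward base beam for any value of β.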
Although the beampattern selection block 614 can generate β=−1 for the omni component or β=1 for the dipole term, the beampattern selection block 614 can also generate values for β that are between −1 and 1. Positive values of β can be used to control where the single conical null in the beampattern will be located. For a diffuse sound field, the directivity index (DI), which is the directional gain in a diffuse noise field for a desired source direction, reaches its maximum of 6 dB for a two-element beamformer when β is 0.5. The front-to-rear power ratio is maximized (with a DI of 5.8 dB) when β is about 0.26.
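By way of illustration only, the dependence of the directivity index on β can be computed in closed form if the two base beampatterns are assumed to be ideal frequency-independent cardioids, (1+cos θ)/2 and (1−cos θ)/2. Under that assumption, which is a simplification made for this example, the output pattern is a+b·cos θ with a=(1−β)/2 and b=(1+β)/2, and its spherically averaged power is a²+b²/3. The function names in the following Python sketch are hypothetical.

```python
import math

def pattern_coefficients(beta):
    """Coefficients (a, b) of a + b*cos(theta) for forward minus beta times backward cardioid."""
    a = (1.0 - beta) / 2.0
    b = (1.0 + beta) / 2.0
    return a, b

def directivity_index_db(beta):
    """On-axis power relative to the spherically averaged power, in dB."""
    a, b = pattern_coefficients(beta)
    on_axis_power = (a + b) ** 2
    diffuse_power = a * a + b * b / 3.0
    return 10.0 * math.log10(on_axis_power / diffuse_power)

def null_angle_deg(beta):
    """Angle of the conical null in degrees; a null exists only for 0 <= beta <= 1."""
    a, b = pattern_coefficients(beta)
    if a > b:
        raise ValueError("no null exists for negative beta")
    return math.degrees(math.acos(-a / b))

for beta in (-1.0, 0.0, 0.26, 0.5, 1.0):
    print(f"beta={beta:+.2f}: DI={directivity_index_db(beta):.2f} dB")
```

Under these assumptions, the computed DI peaks at β=0.5, where it equals 10·log10(4), or about 6 dB, and it is 0 dB for the omnidirectional case β=−1.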
When there is wind noise, self-noise (e.g., low external acoustic energy), or some other type of noise not associated with the soundfield (like mechanical structural noise or noise from someone touching a microphone input port), β may be selected to be negative. If β is between 0 and −1, then the beampattern will have a “subcardioid” shape that does not have a null. As β approaches −1, the beampattern moves toward the omnidirectional pattern that is achieved when β=−1. If there is a relatively small amount of noise, then some advantages in beamformer gain can be achieved by selecting a negative value for β other than −1.
Note that, in certain implementations, the output filter 620 can be embedded into the front-end matching filters 606₁ and 606₂. For certain implementations in which the microphones 602₁ and 602₂ are sufficiently matched, the front-end matching filters 606₁ and 606₂ can be omitted. For certain implementations, such as the symmetric case where the transfer functions $h_{12}$ and $h_{21}$ are substantially equal, the equalization filters 612₁ and 612₂ can be omitted.
As the sound wave frequency increases, at some frequency, the smooth monotonic phase delay and amplitude variation impact of the device body on the diffraction and scattering of the sound begins to deviate from a generally smooth function into a more-varying and complex spatial response. This is due to the onset of higher-order modes becoming significant relative to the lower-order modes that dominate the response at lower frequencies where the wavelength is much larger than the device body size. The term “higher-order modes” refers to the higher-order spatial response terms. These modes can be decomposed as orthogonal eigenmodes in a spatial decomposition of the sound field either through a closed-form expansion, a spatial singular value decomposition, or a similar orthogonal decomposition of the sound field. These modes can be also thought of as higher-order components of a closed-form or series approximation of the acoustic diffraction and scattering process.
As noted above, closed-form solutions for diffraction and scattering are not usually available for arbitrary diffracting body shapes. Instead, approximations or numerical solutions based on measurements or computer models may be used. These solutions can be represented in matrix form where the eigenvectors are representative of an orthonormal (or at least orthogonal) modal spatial decomposition of the scattering and diffraction physics. The eigenvectors represent the complex spatial responses due to diffraction and scattering of the sound around the body of the device. Spatial modes can be sorted into orders that move from simple smooth functions to ones that show increasing variation in their equivalent spatial responses. Smoothly fluctuating modes are those associated with low-frequency diffraction and scattering effects, and the rapidly varying modes are representative of the response at frequencies where the wavelength is smaller than or similar in size to the device body. Decomposition of the sound field into underlying modes is a classic analytical approach and is related to previous work by Meyer and Elko on the use of spherical harmonics and a rigid sphere baffle and brings up a general approach that could be utilized to obtain the desired first-order B-format and higher-order decompositions of the sound field that can be used as input signals to a general spatial playback system. See U.S. Pat. No. 7,587,054, the teachings of which are incorporated herein by reference in their entirety. The general approach based on using all microphones on a device to implement spatial decomposition is discussed below.
The placement of microphones on the device surface does not have to be symmetric. There are, however, microphone positions that are preferential to others for improved operation. Symmetrical positioning of microphone pairs on opposing surfaces of a device is preferred since that will result, for each microphone pair, in the two back-to-back beams that are formed having similar output SNR and frequency responses. A microphone pair is said to be symmetrically positioned when the microphones are located on opposite sides of a device along a line that is substantially normal to those two sides. A possible advantageous result of the process of diffraction and scattering can be obtained when the microphone axis (i.e., the line connecting a pair of microphones) is not aligned to the normal of the device. The angular dependence of scattering and diffraction has the effect of moving the main beam axis towards the axis determined by the line between the two microphones. Another advantage that results from exploiting diffraction and scattering is that the phase delay between the microphone pairs can be much larger than the phase delay between the two microphones in an acoustic free field as determined by the line connecting the two microphones. The increase in the phase delay can result in a large increase in the output SNR relative to what would be obtained without a diffracting and scattering body between the microphone pairs.
The two back-to-back equalized beamformers that are derived as described above can then be used to form a general beampattern by combining the two output signals as described above using cardioid beampatterns. One can also use the above measurement to define where the position of the null is in the first-order differential beampattern. If only one directional beam is desired, then one could save computational cost and form only the desired beampattern. One could also store multiple transfer function measurements and then enable multiple simultaneous beams and/or the ability to select the desired beampattern.
Gradient Differential Beamformer and B-Format
The previous discussion has shown that, by appropriately combining the outputs of back-to-back cardioid signals or, equivalently, the combination of an omnidirectional microphone and a dipole microphone with matched frequency responses, any general first-order pattern can be obtained. However, the main lobe response is limited to the microphone pair axis since the pair can deduce the scalar pressure differential only along the pair axis. It is straightforward to extend the one-dimensional differential to 3D by measuring the true field gradient and not just one component of the gradient.
Fortunately, this problem can be effectively dealt with by increasing the number of microphones used to derive the three orthogonal dipole signals (that are also the first-order spherical harmonics) and the omnidirectional pressure signal (i.e., the zero-order spherical harmonic) (recall Equation (9)). As mentioned previously, computing a B-format set of signals requires a minimum of four “closely spaced” pressure signals, where “closely spaced” means that the inter-microphone distances are smaller than the shortest acoustic wavelength of interest. Vectors that are defined by the lines that connect the four spatial locations must span the three-dimensional space so that the spatial acoustic pressure gradient signals can be derived (in other words, all microphones are not coplanar).
More microphones can be used to increase the accuracy and SNR of the derived spatial acoustic derivative signals. For instance, a simple configuration of six microphones spaced along the Cartesian axes with the origin between each orthogonal pair allows all dipole and monopole signals to have a common phase center (meaning that all four B-Format signals are in phase relative to each other) as well as increasing the resulting SNR for all signals. However, it is not required that all orthogonal pairs have a common phase center, but it is desirable to have the phase centers of each pair relatively close to each other (e.g., the spacing between phase centers should be less than ½ of the wavelength at the upper frequency where precise 3D spatial control is required).
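The six-microphone configuration described above can be illustrated with a minimal free-field sketch. It assumes ideal matched pressure microphones, no diffracting body, a nominal speed of sound of 343 m/s, and a hypothetical pair spacing; the function names are illustrative only and the simple 1/(jkd) equalizer stands in for the equalization post-filters.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed nominal value


def six_mic_positions(spacing):
    """Positions (+x, -x, +y, -y, +z, -z) of an assumed six-microphone layout
    whose three orthogonal pairs share a common center at the origin."""
    a = spacing / 2.0
    return a * np.array([[1, 0, 0], [-1, 0, 0],
                         [0, 1, 0], [0, -1, 0],
                         [0, 0, 1], [0, 0, -1]], dtype=float)


def plane_wave_pressures(positions, arrival_dir, freq):
    """Free-field complex pressures for a unit plane wave arriving from the
    unit vector arrival_dir (no diffraction or scattering is modeled)."""
    k = 2.0 * np.pi * freq / SPEED_OF_SOUND
    return np.exp(1j * k * positions @ arrival_dir)


def b_format_from_six(pressures, spacing, freq):
    """Return [W, X, Y, Z]: the omni term is the average of all six pressures,
    and each dipole is the difference of an opposing pair, equalized by
    1/(j*k*spacing) so that it matches the omni response on its axis."""
    k = 2.0 * np.pi * freq / SPEED_OF_SOUND
    equalizer = 1j * k * spacing
    w = pressures.mean()
    x = (pressures[0] - pressures[1]) / equalizer
    y = (pressures[2] - pressures[3]) / equalizer
    z = (pressures[4] - pressures[5]) / equalizer
    return np.array([w, x, y, z])


# Example: 20 mm pair spacing (hypothetical), 500 Hz plane wave.
direction = np.array([0.6, 0.0, 0.8])
p = plane_wave_pressures(six_mic_positions(0.02), direction, 500.0)
wxyz = b_format_from_six(p, 0.02, 500.0)
```

Under these assumptions the dipole outputs approximate the Cartesian components of the arrival direction, and all four outputs are essentially real, which reflects the common phase center of the three pairs.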
Implementation
For most practical cases, only the four microphones 705-708 at the top of the device are used to derive the B-format signals. The x-axis component can be obtained by forming an x-axis dipole signal using only microphones 705 and 706, while the z-axis component can be obtained by forming a z-axis dipole signal using only microphones 707 and 708. The y-axis component can be obtained using any three or all four microphones 705-708. For example, the audio signals from microphones 705 and 706 can be averaged to obtain a signal that has a pressure response with a phase center midway between the two microphones. This averaged signal can then be combined with the audio signal from either microphone 707 or microphone 708 (or a weighted average of the audio signals from microphones 707 and 708) to obtain a dipole signal that has a pressure response that is aligned with the y axis.
It should be noted that all three computed dipole component signals can have different sensitivities as well as different frequency responses, and that these differences can be compensated for with an appropriate equalization post-filter on each dipole signal. Similarly, the zero-order pressure term will also need to be compensated to match the responses of the three dipole signals. For a practical implementation, these post-filters are extremely important. Moreover, for best performance, the post-filters are “complex,” such that both amplitude and phase are equalized to match the amplitude and phase of the omnidirectional response along the axes.
Note also that, in
The zero-order (omni) term can be computed as a pressure average over some or all of the microphones 705-708 or can even be formed from a single microphone. When using all four microphones 705-708, the omni component will advantageously provide a phase center that is “the closest” possible to the phase centers of the x, y, and z axes defined by microphones 705-708. Any other omni component formed from fewer microphones will have a poorer phase center relative to the y and z axes. Choosing a “good” phase center will help when the components are equalized for matching.
Similar processing can also be performed using the bottom microphone sub-array consisting of microphones 701-704 so that one could have the output of two B-format signals with a spatial offset in their respective phase centers. This arrangement might be useful in rendering a different spatial playback when using the device in landscape mode since one could exploit the impact of having a binaural signal with angularly dependent phase delay, which may improve the spatial playback quality of the sound field when rendering the playback signal. Alternatively, all eight microphones 701-708 could be used to generate a single B-format signal having greater SNR. In some cases, the signal processing for lower frequencies can be based on one set of microphones, while the signal processing for higher frequencies can be based on a different set of microphones. For low frequencies where the wavelengths are much larger than the dimensions of the device, using microphones that are spaced as far apart as possible is preferred (due to output signal level). As the frequency increases, it is preferable to use microphones that are closer together to satisfy the differential processing requirement that the microphones be spaced apart by less than ½ wavelength. In general, SNR and estimation of the pressure field spatial gradients can both be improved by increasing the number of microphones.
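The trade-off between wide spacing at low frequencies and the half-wavelength limit at high frequencies can be illustrated numerically. The following is a minimal sketch that assumes a nominal speed of sound of 343 m/s; the function names and the list of candidate spacings are hypothetical and are not taken from any particular device.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed nominal value


def max_differential_frequency(spacing_m, c=SPEED_OF_SOUND):
    """Frequency at which the spacing equals one half wavelength."""
    return c / (2.0 * spacing_m)


def widest_usable_spacing(freq_hz, spacings_m, c=SPEED_OF_SOUND):
    """Pick the largest spacing that is still below one half wavelength at
    freq_hz; return None if no candidate spacing qualifies."""
    usable = [d for d in spacings_m
              if freq_hz < max_differential_frequency(d, c)]
    return max(usable) if usable else None


# Hypothetical candidate spacings (meters) available on a device.
candidates = [0.006, 0.02, 0.12]
low_band_choice = widest_usable_spacing(1000.0, candidates)    # 0.12
mid_band_choice = widest_usable_spacing(5000.0, candidates)    # 0.02
high_band_choice = widest_usable_spacing(20000.0, candidates)  # 0.006
```

A practical design would likely cross over between microphone sets well below these limits, since the differential approximation degrades as the spacing approaches one half wavelength.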
Here, the x-axis component can be obtained by forming an x-axis dipole signal using only microphones 752 and 753, the y-axis component can be obtained by forming a y-axis dipole signal using only microphones 751 and 752, and the z-axis component can be obtained by forming a z-axis dipole signal using only microphones 754 and 755.
One potential advantage for this microphone configuration is that the y-axis microphones are on the same side of the device 750, and therefore the diffraction effects would be smaller than for the arrangement shown in
One can further “tune” the design such that the z-axis pair (microphones 754 and 755) and thus make the unprocessed dipole signal SNR and frequency response better matched before post-processing. By matching the three orthogonal raw dipole responses as close as possible in terms of sensitivity and response, the outputs can be of similar SNR, which is highly desirable. Again, the zero-order (omni) term can be computed as a pressure average over some or all of the microphones or can even be formed from a single microphone. Furthermore, averaging of microphones can be done differently depending on frequency. For example, it could be advantageous to use more or even all microphones for low frequencies while using fewer or even just one microphone for high frequencies.
Although device 750 of
Although
For the microphone configuration of
For the microphone configuration of
Note that one or more of the microphones can be used in multiple pairs as would be the case for the microphone arrangement shown in
For the B-format dipole outputs, βi=1, while the zero-order component can be the average of one or more of the three zero-order components (obtained by using βi=−1). Note that, here too, βi can have values between −1 and 1.
In certain implementations, all of the processing shown in
While
When used herein to refer to directions, the term “orthogonal” implies that the directions are at right angles to one another. Thus, the x, y, and z axes of a Cartesian coordinate system are mutually orthogonal, and three pairs of microphones, each pair configured parallel to a different Cartesian axis, are said to be mutually orthogonal. When used herein to refer to beampatterns, the term “orthogonal” implies that the spatial integration of the product of one beampattern with another different beampattern is zero (or at least substantially close to zero). Thus, the four beampatterns (i.e., x, y, and z component dipole beampatterns and one omnidirectional beampattern) of a set of first-order B format ambisonics are mutually orthogonal. Mutually orthogonal beampatterns are also referred to as eigen or modal beampatterns.
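The orthogonality of the four first-order beampatterns can be checked numerically. The sketch below assumes unnormalized patterns (omni equal to 1, dipoles equal to the direction cosines) and a simple midpoint-rule integration over the sphere; the grid sizes and function names are illustrative choices.

```python
import numpy as np


def first_order_patterns(theta, phi):
    """Omni and x, y, z dipole beampatterns (unnormalized) evaluated at
    polar angle theta and azimuth phi."""
    return np.stack([np.ones_like(theta),
                     np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])


def gram_matrix(n_theta=200, n_phi=400):
    """Spatial integral of every pairwise product of the four patterns,
    normalized by the sphere area 4*pi (midpoint rule)."""
    d_theta = np.pi / n_theta
    d_phi = 2.0 * np.pi / n_phi
    theta = (np.arange(n_theta) + 0.5) * d_theta
    phi = (np.arange(n_phi) + 0.5) * d_phi
    T, P = np.meshgrid(theta, phi, indexing="ij")
    patterns = first_order_patterns(T, P)          # shape (4, n_theta, n_phi)
    weights = np.sin(T) * d_theta * d_phi / (4.0 * np.pi)
    return np.einsum("iab,jab,ab->ij", patterns, patterns, weights)


G = gram_matrix()
```

The off-diagonal entries of `G` come out at essentially zero, and the diagonal is approximately 1 for the omni and 1/3 for each dipole, so the set is orthogonal but would need scaling to be orthonormal.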
While the previous development has been focused on the first-order spherical harmonic decomposition of the incident sound field (B-Format signals), it is possible that more microphones could be used to resolve higher-order spherical harmonics. For Nth-order spherical harmonics, the minimum number Nmin of microphones is given by Equation (21) as follows:
$N_{\min}=(N+1)^2$, (21)
where N is the highest desired order. Thus, for second-order spherical harmonics, the minimum number of microphones is nine, sixteen for third-order, and so on. The next section discusses the concept of using all microphones simultaneously to derive a practical implementation of first- and higher-order beamformers.
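Equation (21) and its inverse can be sketched as two small helpers. The function names are hypothetical, and the count is only a necessary condition: the microphones must also sample the sound field with sufficient geometric diversity.

```python
import math


def min_microphones(order):
    """Minimum number of microphones for an order-N decomposition."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


def max_order(num_microphones):
    """Highest order whose minimum microphone count does not exceed
    num_microphones."""
    if num_microphones < 1:
        raise ValueError("at least one microphone is required")
    return math.isqrt(num_microphones) - 1


counts = [min_microphones(n) for n in range(4)]  # [1, 4, 9, 16]
order_for_eight = max_order(8)                   # 1
```

For example, a device with eight microphones has enough sensors for a first-order decomposition but falls one short of the nine needed for second order.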
General Beamformer Decomposition Approach
As mentioned earlier, it is also possible to form a general decomposition of the incident sound field by using all microphones and not just pairs or simple combinations of pairs of microphones to obtain a set of desired modal beampatterns. This approach has been used for a spherical microphone array where the spherical geometry led to a relatively simple and elegant way to obtain the desired “eigenbeam” modal beampatterns. For a more-general diffractive case where the geometry does not fit into one of the separable coordinate systems to enable a closed-form solution, one can use a least-squares or other approximate numerical beamformer design to best resolve the desired eigenbeams for further processing or for the natural representation that allows for easy post-processing manipulation that may be in a standard format like the natural spherical harmonic expansion.
Finding the “best” filter weights that result in a spatial response (beampattern) that matches a desired response requires many independent diffraction measurements around the device. It is preferable to have a somewhat uniform sampling of the spherical angular space. The measured diffraction response, relative to the acoustic pressure at a selected spatial reference point or the actual broadband signal that is used to insonify the device for the diffraction transfer function measurement, is used to build a matrix of directional diffraction measurements. The resulting diffraction measurement data matrix is then used with an optimization algorithm to find the filter weights that best approximate a set of desired eigenbeam beampatterns. When these optimum weights are applied to the measured diffraction matrix, the output beampattern is an approximation of the desired eigenbeam beampattern.
A unique set of weights is designed for each desired eigenbeam beampattern as a function of frequency. Thus, if L diffractive impulse response measurements are made around the device with J microphones, then the diffraction data matrix is of size L*J for each frequency. It should be noted that, typically, L>>J so that the solution for the optimum filter weights is for an overdetermined set of equations.
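In the time domain, the beamformer output y(k) at sample index k can be written according to Equation (22) as follows:
$y(k)=\mathbf{w}^H\mathbf{m}(k)$, (22)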
where H represents the Hermitian conjugate matrix operator and the overall filter weight vector w of length J*M is defined as a set of J concatenated FIR filter weight vectors wi, each of length M, according to Equation (23) as follows:
$\mathbf{w}=[\mathbf{w}_1,\mathbf{w}_2,\ldots,\mathbf{w}_J]^T$, (23)
where T is the transpose matrix operator. The i-th filter weight vector wi is given according to Equation (24) as follows:
$\mathbf{w}_i(k)=[w_i(1),w_i(2),\ldots,w_i(M)]$, $i=1,\ldots,J$. (24)
Similarly, the overall microphone input signal vector m(k) can be written according to Equation (25) as follows:
$\mathbf{m}(k)=[\mathbf{m}_1(k),\mathbf{m}_2(k),\ldots,\mathbf{m}_J(k)]^T$, (25)
where the overall microphone vector m(k) contains the J concatenated microphone signal slices of M samples each from the incident acoustic signal, where the i-th microphone signal slice mi(k) is given according to Equation (26) as follows:
$\mathbf{m}_i(k)=[m_i(k),m_i(k-1),\ldots,m_i(k-M+1)]$. (26)
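The filter-and-sum structure of Equations (22) through (26) can be sketched as follows. This is a minimal illustration that assumes each slice holds the M most recent samples of a microphone signal, newest first, and that the sample index k is at least M-1; the function names are hypothetical.

```python
import numpy as np


def stacked_snapshot(signals, k, M):
    """Build the length J*M vector m(k) from a (J, T) array of microphone
    signals: for each microphone, the samples at k, k-1, ..., k-M+1."""
    return np.concatenate([sig[k - M + 1:k + 1][::-1] for sig in signals])


def beamformer_output(w, signals, k, M):
    """y(k) = w^H m(k); np.vdot conjugates its first argument."""
    return np.vdot(w, stacked_snapshot(signals, k, M))


# Example with J = 3 microphones and M = 4 taps per FIR filter.
rng = np.random.default_rng(0)
J, M, T = 3, 4, 16
signals = rng.standard_normal((J, T))
w = rng.standard_normal(J * M)   # concatenated per-microphone FIR weights
y10 = beamformer_output(w, signals, 10, M)
```

For real-valued weights, the result equals the sum over microphones of each signal convolved with its own FIR filter and evaluated at sample k, which is the usual filter-and-sum beamformer.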
For simplicity and without loss of generality, we can convert to the frequency domain and define the diffraction response function to a plane wave from the spherical angles as the vector d. The frequency-domain output $\tilde{b}_i(\theta,\phi,\omega)$ of the i-th beamformer can be written according to Equation (27) as follows:
$\tilde{b}_i(\theta,\phi,\omega)=\mathbf{d}^H(\theta,\phi,\omega)\mathbf{h}_i(\omega)$, (27)
where the diffraction response function (i.e., the microphone output signal vector) d(θ,ϕ,ω) is given by Equation (28) as follows:
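$\mathbf{d}(\theta,\phi,\omega)=[\alpha_1(\theta,\phi,\omega)e^{i\omega\tau_1(\theta,\phi,\omega)},\alpha_2(\theta,\phi,\omega)e^{i\omega\tau_2(\theta,\phi,\omega)},\ldots,\alpha_J(\theta,\phi,\omega)e^{i\omega\tau_J(\theta,\phi,\omega)}]^T$, (28)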
and the complex, frequency-domain weight vector hi(ω) contains the Fourier coefficients for the M/2+1 frequencies, generated by taking the Fourier transform of the overall weight vector w of Equation (23). The frequency-domain band center frequencies are defined by the sampling rate used in the A/D conversion and the length of the discrete FIR filter used in the beamformer. The amplitude coefficients αi(θ,ϕ,ω) and time delay functions τi(θ,ϕ,ω) are the amplitudes and phase delays due to the diffraction process around the device.
As an example, in order to generate the four frequency-domain eigenbeam outputs $Y_0^0(\theta,\phi)$, $Y_1^{-1}(\theta,\phi)$, $Y_1^0(\theta,\phi)$, and $Y_1^1(\theta,\phi)$ for a first-order spherical decomposition of the incoming sound field, Equation (27) is applied four different times to the microphone output signals d(θ,ϕ,ω), once for each different eigenbeam output and using a different weight vector hi(ω) corresponding to the i-th eigenbeam output.
For a device having a complicated geometry that does not enable a straightforward closed-form solution of the diffraction around the device, the four weight vectors hi(ω) are computed from measured data generated by placing the device in an anechoic chamber and sequentially insonifying the device with different, appropriate acoustic signals from many different spherical angles around the device. At each direction θl and ϕl and frequency ωm, the microphone output signal vector d(θl,ϕl,ωm) is recorded. All of the measured diffraction filters are then represented as a matrix D whose rows are the conjugate (Hermitian) transposes of the vectors d for each direction and frequency. The number of different directions chosen for sampling the spatial response measurements depends on the accuracy with which the complex weights must be computed to meet a desired beamformer response design criterion. A minimum number of angles is needed in order to sample the beampattern shape sufficiently so that the optimization results in the desired eigenbeam beampattern. For orders below third order, spherical angles in increments of 5 degrees or less should be sufficient.
As an example, for each of the four different spherical harmonics of a first-order 3D decomposition, the corresponding weight vector hi(ω) can be numerically obtained by solving the following Equation (29), which minimizes the mean square error between the desired beampattern bi(θl,ϕl) at the L measurement angles and the measured beampattern D(ω)hi(ω) as follows:
$\mathbf{h}_i(\omega)=\arg\min_{\mathbf{h}_i(\omega)}\left\|\mathbf{D}(\omega)\mathbf{h}_i(\omega)-\mathbf{b}_i\right\|^2$, (29)
where the “arg min” function returns a value for the weight vector hi(ω) that minimizes the mean square error term, and the vector bi contains the desired beampattern values bi(θl,ϕl) at the L measurement angles.
The above optimization is done for each of the 1+M/2 frequencies in the frequency domain. The solution to the least-squares problem of Equation (29) can be derived using Equation (30) as follows:
$\mathbf{h}_i(\omega)=(\mathbf{D}(\omega)^H\mathbf{D}(\omega))^{-1}\mathbf{D}(\omega)^H\mathbf{b}_i$. (30)
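The closed-form solution of Equation (30) can be checked against a standard least-squares solver. The sketch below assumes that the data matrix D has one row per measurement direction and one column per microphone, arranged so that the matrix-vector product of D and the weight vector gives the beampattern samples; the synthetic random data stands in for real diffraction measurements, and the function names are hypothetical.

```python
import numpy as np


def solve_weights_normal(D, b):
    """Normal-equation form of Equation (30): (D^H D)^-1 D^H b."""
    DH = D.conj().T
    return np.linalg.solve(DH @ D, DH @ b)


def solve_weights_lstsq(D, b):
    """Same minimizer via numpy.linalg.lstsq, which avoids forming D^H D
    explicitly and is usually better conditioned."""
    h, *_ = np.linalg.lstsq(D, b, rcond=None)
    return h


# Synthetic stand-in for measurements: L directions, J microphones, L >> J.
rng = np.random.default_rng(0)
L, J = 50, 4
D = rng.standard_normal((L, J)) + 1j * rng.standard_normal((L, J))
h_true = rng.standard_normal(J) + 1j * rng.standard_normal(J)
b = D @ h_true                     # a target that is exactly achievable

h_normal = solve_weights_normal(D, b)
h_lstsq = solve_weights_lstsq(D, b)
```

In this well-conditioned synthetic case both routes recover `h_true`; with measured data the two can differ noticeably when the columns of D are nearly dependent.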
The least-squares solution of Equation (30) can lead to beamformer designs that are not robust since the problem can be ill-posed, resulting in the matrix $D^HD$ being singular or nearly singular due to the specific geometry and positioning of the microphones on the device. Robustness is of great importance since it directly relates to realization issues like microphone mismatch and self-noise as well as limitations due to the front-end electronics, and the solution typically becomes more sensitive at lower frequencies where the acoustic wavelength is much larger than the distance between pairs of microphones. To deal with the lack of robustness, it is common either to add an uncorrelated “diagonal noise” term, sometimes referred to as regularization, to the matrix $D(\omega)^HD(\omega)$ or to add specific constraints to force the solution towards something more robust. One such constraint is the White-Noise-Gain (WNG) constraint, which can be added to the optimization given in Equation (29) according to Equation (31) as follows:
where δ is a desired threshold value that is set to control the robustness of the solution. For practical implementations using off-the-shelf microphones, the threshold value is typically set to δ≥0.25, which means that the desired beamformer is allowed to lose 12 dB of SNR through the beamforming process in order to match the desired beampattern.
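The “diagonal noise” (regularization) alternative mentioned above can be sketched as follows. This illustrates diagonal loading only, not the WNG-constrained optimization itself; the loading value, the synthetic data, and the function names are assumptions. The squared norm of the weights is used as a simple measure of how much uncorrelated, equal-power sensor noise is amplified.

```python
import numpy as np


def solve_weights_loaded(D, b, loading):
    """Least-squares weights with diagonal loading:
    (D^H D + loading * I)^-1 D^H b."""
    DH = D.conj().T
    return np.linalg.solve(DH @ D + loading * np.eye(D.shape[1]), DH @ b)


def noise_amplification_db(h):
    """10*log10(h^H h): output noise power for unit-power sensor noise that
    is uncorrelated between microphones."""
    return 10.0 * np.log10(np.vdot(h, h).real)


rng = np.random.default_rng(1)
L = 60
col = rng.standard_normal(L) + 1j * rng.standard_normal(L)
# Two nearly identical columns mimic closely spaced microphones at a low
# frequency, which makes D^H D nearly singular.
D = np.column_stack([col,
                     col + 1e-3 * rng.standard_normal(L),
                     rng.standard_normal(L)])
b = rng.standard_normal(L) + 0j

h_plain = solve_weights_loaded(D, b, 0.0)
h_loaded = solve_weights_loaded(D, b, 1e-2)
```

The loaded solution has a much smaller weight norm than the unregularized one, at the cost of a somewhat larger beampattern error; the loading value plays a role loosely analogous to the threshold in a constrained design.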
Additional linear and/or quadratic constraints can be added depending on the desired properties of the solution. It is also possible to bias the solution to be more precise at certain angles or angular regions by weighting the solution properly by assigning more weight to the fidelity of the solution at specific angles or angular regions. Assuming that the optimization problem as stated by Equations (29) and (31) is a convex problem, a solution to this quadratically constrained quadratic problem (QCQP) can be obtained by using numerical optimization software such as provided by the Matlab Optimization Toolbox or CVX. See Michael Grant and Stephen Boyd, “CVX: Matlab software for disciplined convex programming,” Version 2.0 beta (http://cvxr.com/cvx, September 2013), and Michael Grant and Stephen Boyd, “Graph implementations for nonsmooth convex programs,” Recent Advances in Learning and Control (a tribute to M. Vidyasagar), V. Blondel, S. Boyd, and H. Kimura, editors, pages 95-110, Lecture Notes in Control and Information Sciences (http://stanford.edu/˜boyd/graph_dcp.html, Springer, 2008), the teachings of both of which are incorporated herein by reference in their entirety. If D is positive semidefinite, then the problem as defined by Equations (29) and (31) is convex, since the function is convex and the quadratic constraint is convex.
Any number of desired beampatterns can be formed so it would be straightforward to form $(N+1)^2$ beampatterns that are the spherical harmonics up to order N as represented by Equation (32) as follows:
$b_i(\theta_l,\phi_l)\approx Y_n^m(\theta_l,\phi_l)$ for $l=1,\ldots,L$ and $i=1,\ldots,(N+1)^2$, (32)
where the vector Ynm(θl,ϕl) contains the samples of the spherical harmonics at the L measurement spherical angles used in the measurement of the diffraction and scattering transfer functions on the device on which the microphones are mounted.
Since any beampattern of order N can be formed using at least $(N+1)^2$ microphones that have sufficient geometric sampling of the sound field, a selective subset of basis beampatterns can be formed. These basis beampatterns are desired to be spatially orthonormal (or at least orthogonal), but they could be non-orthogonal or approximately orthogonal. For instance, if it is desired to steer in only two dimensions, only three basis beampatterns would be required and not four as for a general first-order 3D decomposition. Similarly, it is possible to choose other subsets of the basis decomposition that have other implementation restrictions such as limited steering angles.
Although the above discussion has been focused on a spherical harmonic decomposition, it is also possible to use the method for other desired orthogonal expansions such as oblate and prolate spheroidal expansions, circular and elliptic cylinders, and conical and wedge expansions as well as non-orthogonal expansions.
When a device of the present invention is a handheld device such as a cell phone or a camera, the frame of reference of the audio data generated by the device relative to the ambient acoustic environment will move (i.e., translate and/or rotate) as the device moves. In certain situations, such as recording a live concert, it might be desired to keep the acoustic scene stable and independent of the device motion. In certain embodiments, devices of the present invention include motion sensors that can be used to characterize the motion of the device. Such motion sensors may include, for example, multi-axis accelerometers, magnetometers, and/or gyroscopes as well as one or more cameras, where the image data generated by the cameras can be processed to characterize the motion of the device. Such motion-sensor signals can be utilized to generate a steady, fixed audio scene even though the device was moving when the original audio data was generated. To allow for a fixed auditory scene perspective in this case, the spatial eigenbeam signal could be dynamically adjusted based on the motion-sensor signals to rotate the basis eigenbeam signals to compensate for the device motion. For instance, if the device has an initial or desired orientation, and the user rotates the device to some other direction such that the microphone axes have a different orientation, the motion-sensor signals can be used to electronically rotate the audio data to the original orientation directions to keep the audio frame of reference constant. In this way, electronic motion compensation of the underlying basis signals will keep the auditory perspective on playback fixed and stable with respect to the original recording position of the device. 
If the motion-sensor signals are also stored for later playback (either on or off the device), then the sound perspective relative to the device can also be stored using the unmodified basis signals, where the end user could still select a fixed auditory perspective by using the stored motion-sensor signals to adjust the unmodified basis signals.
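The electronic rotation of first-order basis signals can be sketched for the simple case of a yaw rotation about the z axis. The sketch assumes ideal first-order signals ordered W, X, Y, Z, and a device-to-world rotation matrix derived from the motion sensors; the function names and the fixed (not time-varying) orientation are simplifying assumptions.

```python
import numpy as np


def yaw_matrix(psi):
    """Rotation matrix for a yaw of psi radians about the z axis, mapping
    device-frame coordinates to world-frame coordinates."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def stabilize_b_format(wxyz, device_to_world):
    """Rotate the three dipole signals of a (4, T) first-order signal set
    from the device frame into the fixed world frame. The omni signal is
    rotation invariant and is left unchanged."""
    out = np.array(wxyz, dtype=float, copy=True)
    out[1:] = device_to_world @ out[1:]
    return out


# Example: a source on the world +x axis, recorded after the device has
# been yawed by 90 degrees.
t = np.arange(480) / 48000.0
s = np.sin(2.0 * np.pi * 1000.0 * t)
source_dir_world = np.array([1.0, 0.0, 0.0])
R = yaw_matrix(np.pi / 2.0)
recorded = np.vstack([s, np.outer(R.T @ source_dir_world, s)])
stabilized = stabilize_b_format(recorded, R)
```

After compensation the dipole signals again point to the world +x axis. If the unmodified signals and the motion-sensor data are stored together, the same rotation can be applied at playback time, block by block with a time-varying matrix.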
In a single device, such as a camera, that has both an audio system for generating audio data as described herein and a video system for generating image data, motion of the camera is inherently synchronized to the geometry of the microphone array since both systems are part of the same device. In other situations, the device that generates the audio data may be different from and may move relative to the device that generates the image data. Here, too, motion-sensor signals from either or both devices can be used to correlate and adjust the audio frame of reference with respect to the video frame of reference. For example, signals from motion sensors in the camera can be used to post-process the audio data from a fixed microphone array to follow the translation and rotation of the camera. For instance, if the camera has been oriented in some new direction, then the motion-sensor signals can be used to rotate the audio device eigenbeamformers to align with the new camera orientation by electronically manipulating the audio signals from the fixed microphone array. Similarly, if the camera is fixed and the audio device containing the microphone array is moving, then motion sensors in the moving audio device can be used to modify the basis signals so that they maintain a fixed audio frame of reference that is consistent with the fixed orientation of the camera. In general, movement of one or both devices can be compensated to maintain a desired fixed perspective on the image and acoustic scenes that are being transmitted and/or recorded. It should be noted that one could also record the motion-sensor signals themselves and use these signals in post-processing to effect the audio and image stabilization from the original recordings. One could also have the visual frame and acoustic frame rotated relative to each other by some desired offset.
Alternatively or in addition, two or more different audio devices of the present invention may be used to generate different sets of audio data in parallel. Here, too, motion-sensor signals from one or more of the audio devices can be used to compensate for relative motion between different audio devices and/or relative motion between the audio devices and the ambient acoustic environment. Whether or not the different sets of audio data are adjusted for motion, in some embodiments, the different sets of audio data generated by the different audio devices can be combined to provide a single set of audio data. For example, the omni signals of multiple first-order B format outputs from the multiple devices can be combined (e.g., averaged) to form a single, higher-fidelity omni signal. Similarly, the different x-component dipole signals of those first-order B format outputs can be combined to form a single, higher-fidelity x-component dipole signal and similarly for the y and z components.
In step 1002, one or more sets of audio data are generated using one or more audio devices of the present invention, such as device 700 or 750 of
Equation (31) is an expression to compute the White-Noise-Gain (WNG) for any of the designed basis beampatterns. Since a general, desired spatial response beampattern for spatial rendering of the sound field typically involves all basis beampattern signals, it is undesirable to have widely varying noise between the basis beampatterns. Thus, the computed WNG can be used for each basis beampattern to identify issues related to widely varying WNG for each of the basis beampatterns. A widely varying WNG would indicate a spatially deficient microphone placement or geometry. It could be possible to use the varying WNG between basis beampatterns as a guide to what dimensions in the design are deficient in spatial sampling. Therefore, differences in the WNG could offer guidance on how the microphone positions might be adjusted to improve the design.
Due to the practical limitations on the number of microphones and the number of microphone positions, it might not be possible to realize all the basis beampatterns with similar WNG values. In this case, a noise suppression algorithm could be employed that would increase the amount of noise suppression on basis patterns that had lower WNG (i.e., noisier basis beampatterns). The amount of noise suppression could be directly related to the differences in WNG or some function of WNG. Noise suppression algorithms can also be tailored to exploit the known self-noise from the selected microphones and the associated electronics used in the device design.
Another possible method to deal with widely varying WNG between the basis beampatterns would be to form these basis beampatterns in other “directions” by choosing different directions for the underlying axes so that the WNGs between the various basis beampatterns are more closely matched. Finally, since the WNG variable is a strong function of frequency, the basis beampatterns could be identified with some metadata information that indicates at what frequencies the basis beampattern's WNG falls below some set threshold. If the WNG falls below that threshold at some frequency, then these basis signals would no longer be utilized below the cutoff frequency when forming a desired spatial beampattern or spatial playback signal. Thus, the maximum order of basis beampatterns as a function of frequency can be set by identifying at what frequencies the WNG falls below some desired minimum.
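The frequency-dependent limit on the basis order can be sketched as a simple threshold test. The sketch assumes that a WNG value in dB is available for each order and frequency bin, and that a higher order is usable only if all lower orders are also usable; the example values and the threshold are hypothetical.

```python
import numpy as np


def max_usable_order(wng_db, threshold_db):
    """For each frequency bin, return the highest order n such that the WNG
    of every order from 0 to n meets the threshold (-1 if none does).

    wng_db has shape (num_orders, num_freqs); row n holds the WNG of the
    order-n basis beampatterns (e.g., the worst case within that order)."""
    meets = np.asarray(wng_db) >= threshold_db
    usable = np.logical_and.accumulate(meets, axis=0)
    return usable.sum(axis=0) - 1


# Hypothetical WNG values (dB) for orders 0..2 at three frequencies.
wng_db = [[0.0, 0.0, 0.0],
          [-30.0, -10.0, -3.0],
          [-50.0, -25.0, -8.0]]
orders = max_usable_order(wng_db, -12.0)   # array([0, 1, 2])
```

The resulting per-frequency order could be stored as metadata with the basis signals so that a renderer drops the noisier higher-order signals below their cutoff frequencies.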
Another metric that can be used to identify possible design implementation issues is the least-square error (i.e., the term contained by the magnitude squared expression in Equation (29)) of the desired basis beampatterns as a function of frequency. Since spatial aliasing can become an issue at higher frequencies (where the average spacing between microphones exceeds a fraction of the acoustic wavelength), a change in the least-square error as frequency increases could be used to detect and therefore address the aliasing problem. If this problem is observed, then the designer can be alerted that the microphone spacings should be investigated due to a rapidly increasing error at higher frequencies. It should be possible to determine what microphones are improperly spaced by examining the error as a function of the basis beampatterns and the weights used to build the beampatterns.
As the frequency increases, at some higher frequency, acoustic spatial aliasing from beamforming with the spaced microphone array will become a design problem for the optimized basis beamformers, and either no solution for the desired basis beamformer can be found or the solution is non-robust to implementation or both. One possible way to deal with the eventual undesired effects of spatial aliasing at higher frequencies is to use the natural scattering and diffraction of the device's physical body to attain a higher directivity that could result in a relatively narrow beam in fixed directions. A subset of clustered microphones that utilize a different optimized beampattern designed to maximize directional gain from the subset could be realized to form beams in specific directions around the device. These angularly distinct beams could then be used to approximate the desired spatial signal coming from the beam directions. Using these multiple, high-frequency beams (which might not be related to the lower-frequency basis beampatterns) could allow one to virtualize these optimized diffractive beams into signals that could be used to extend the lower-frequency basis domain to increase the bandwidth of any spatial audio system that utilizes the basis signals' design approach.
Yet another potential issue that can dynamically impact proper operation of the optimized basis beamformer design is that the user's hand can drastically change the scattering and diffraction around the phone and even possibly occlude one or more microphones during operation. There is also the potential for one or more microphones to fail in a way that makes them unusable in processing. In order to address these possibilities, different sets of optimizations could be stored in the device that would be used when detrimental hand presence near the microphones or microphone failure is detected. Capacitive transducers, ultrasonic transducers, and cameras in the phone could be used to detect a detrimental near-field acoustic impact of the hand. For example, in the arrangement of
Therefore, an increased ratio of basis signal powers between different orders of the basis beampatterns can also be used to detect wind and structural handling noise. Comparison of the output energies could be utilized to detect these potential issues and either reduce the maximum order of the basis beampatterns or choose another set of weight optimizations based on measurements made that include the impact of the detrimental effects of hand presence near the microphones. Optimizations can also be obtained to deal with asymmetric wind ingestion or localized structural handling noise at some subset of microphones. Similarly, when an occluded or failed microphone is detected, another set of optimized basis beamformers can be utilized based on optimizations made during the design phase based on leaving out microphones in the optimization. Depending on the actual microphones that failed or were occluded, it could be optimum to reduce the highest-order basis beampatterns.
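The comparison of output energies between basis orders can be sketched as a power-ratio test. The sketch assumes equalized zero-order and first-order basis signals for one analysis block; the 6 dB threshold and the function names are hypothetical and would need tuning against real recordings.

```python
import numpy as np


def order_power_ratio_db(zero_order, first_order, eps=1e-20):
    """Ratio in dB of the mean first-order signal power to the mean
    zero-order signal power over one analysis block."""
    p0 = np.mean(np.abs(zero_order) ** 2)
    p1 = np.mean(np.abs(first_order) ** 2)
    return 10.0 * np.log10((p1 + eps) / (p0 + eps))


def flags_wind_or_handling(zero_order, first_order, threshold_db=6.0):
    """Flag a block when the higher-order power is implausibly large
    relative to the omni power (threshold is an assumed tuning value)."""
    return bool(order_power_ratio_db(zero_order, first_order) > threshold_db)


t = np.arange(480) / 48000.0
s = np.sin(2.0 * np.pi * 1000.0 * t)
acoustic_block = flags_wind_or_handling(s, np.vstack([0.5 * s, 0.5 * s, 0.5 * s]))
noisy_block = flags_wind_or_handling(s, np.vstack([10.0 * s, 10.0 * s, 10.0 * s]))
```

A flagged block could then trigger a reduction of the maximum basis order or a switch to an alternative set of stored weight optimizations.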
Other optimization techniques could be utilized to compute the optimum weights for the basis beampatterns such as iterative methods (e.g., Newton's method), genetic algorithms, simulated annealing, total least squares (TLS), and relaxation methods. See David G. Luenberger, Y. Ye, Linear and nonlinear programming: International Series in Operations Research & Management Science 116 (Third ed.), New York: Springer, 2008, the teachings of which are incorporated herein by reference in their entirety.
The use of multiple microphones on a mobile device like a cell phone, camera, or tablet can enable, through signal processing of the microphone signals, the decomposition of the incident spatial sound field into canonical spatial outputs (eigenbeams or, equivalently, Higher-Order Ambisonics (HOA)) that can be used later to render spatial audio playback. The eigenbeams can be processed by relatively straightforward transformations to allow the spatial playback to be rendered such that a listener or listeners can angularly move their heads and the rendering can be modified dependent on their individual head motion. The ability to render dynamic, real-time, spatially accurate binaural audio, or playback on loudspeaker systems that can render spatialized audio, can be used to enhance a listener's virtual auditory experience of a real event. Combining spatially realistic audio with spatially rendered and linked video (either stereoscopic or on a screen display) that can be dynamically rotated can significantly increase the impression of virtually being at the location where the recording was made.
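As an illustration of such a transformation, the following sketch rotates a set of first-order eigenbeams to compensate for a listener's head yaw. It assumes a convention that is not stated in the text: the zeroth-order signal is `w`, the first-order signals `x`, `y`, and `z` are proportional to the direction cosines of the arriving sound, and a positive yaw is a counterclockwise head turn seen from above. The function name is hypothetical.

```python
import numpy as np


def rotate_first_order_yaw(w, x, y, z, head_yaw_rad):
    """Re-express first-order eigenbeams in the frame of a head turned by head_yaw_rad.

    A source at azimuth phi has x ~ cos(phi) and y ~ sin(phi); after the head
    turns by psi the source appears at phi - psi, which gives the rotation below.
    The zeroth-order signal w and the vertical component z are unaffected by yaw.
    """
    c, s = np.cos(head_yaw_rad), np.sin(head_yaw_rad)
    x_rot = c * x + s * y
    y_rot = -s * x + c * y
    return w, x_rot, y_rot, z
```

With this convention, a source directly to the listener's left (energy only in `y`) moves to the front (energy only in `x`) once the head has turned 90 degrees to the left. Higher-order eigenbeams would need the corresponding higher-order rotation matrices, which this sketch does not cover.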
Mobile devices such as tablets and cell phones are usually thin parallelepipeds with the screen area defining the two larger dimensions. For accurate spatial decomposition of the sound field, signals related to the first and higher-order pressure differences are employed. As shown above, the output SNR of a differential beamformer is directly related to the distance between the microphones. Since the device is much thinner in depth than the screen size, it is commensurately difficult to obtain a signal in the direction normal to the plane of the screen with an SNR similar to that of the signals corresponding to the larger spacings supported by the two larger dimensions. One apparent problem is the very small geometric spacing (typically around 6 mm) between microphones on opposite sides of the device, in the front and back planes defined by the screen and the back of the device, relative to the other pairs (having a typical spacing of approximately 20 mm) that are mounted along the larger dimensions of the device. However, it is shown here that it is possible to exploit the effects of acoustic scattering and diffraction around the device to obtain a much higher SNR output than could be obtained by the microphones without taking into account the body of the device. In fact, it is possible to obtain a higher SNR for pressure differentials along this normal axis than for those along the other orthogonal axes, which have larger geometric spacing between the microphones but exhibit minimal diffraction effects.
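The spacing penalty can be made concrete with a short calculation. The sketch below deliberately ignores the device body: it models two ideal omnidirectional microphones in free field whose signals are subtracted, with uncorrelated self-noise of equal power in each microphone. The function name, the speed of sound, and the 1 kHz evaluation frequency are assumptions chosen for illustration.

```python
import numpy as np

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def dipole_on_axis_snr_gain_db(spacing_m, freq_hz):
    """SNR of a two-microphone difference relative to a single microphone, in dB.

    For an on-axis plane wave the difference signal has magnitude
    |1 - exp(-j*k*d)| = 2*|sin(k*d/2)|, while the uncorrelated self-noise of
    the two microphones adds in power, giving a noise amplitude gain of sqrt(2).
    """
    k = 2.0 * np.pi * freq_hz / SPEED_OF_SOUND_M_S
    signal_gain = 2.0 * np.abs(np.sin(k * spacing_m / 2.0))
    noise_gain = np.sqrt(2.0)
    return 20.0 * np.log10(signal_gain / noise_gain)


if __name__ == "__main__":
    thin = dipole_on_axis_snr_gain_db(0.006, 1000.0)   # front-to-back pair
    wide = dipole_on_axis_snr_gain_db(0.020, 1000.0)   # pair along a larger dimension
    print(f"6 mm: {thin:.1f} dB, 20 mm: {wide:.1f} dB, gap: {wide - thin:.1f} dB")
```

At low frequencies the gap approaches 20·log10(20/6), or roughly 10 dB, in favor of the wider pair. This free-field deficit is what the scattering and diffraction of the device body can offset.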
It was shown above how to form the first-order B-format decomposition by utilizing at least four microphones mounted on a mobile device surface by appropriately combining these microphones in a differential manner. One arrangement using five microphones was shown where one of the microphones was shared in the array to form three orthogonal first-order differential dipole signals. A numerical design method was described where the eigenbeam signals (e.g., HOA components) are computed from a number of microphones distributed on the surface of the device. The method involves measuring transfer functions at multiple spherical angles around a scattering and diffractive device and computing a constrained optimization solution for the corresponding weights that result in the desired spatial response, such as the spherical harmonic eigenbeams (e.g., HOA). As discussed, adding a White-Noise-Gain quadratic constraint to the weight optimization problem can be used to control the robustness of a matrix-inverse solution. There are also other methods that can be utilized to compute the “optimal” desired beampattern weights, including weighted least squares, total least squares, and optimization with respect to other norms such as the l1-norm and the l∞-norm.
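A minimal sketch of the weight computation at one frequency is given below. It does not reproduce the constrained formulation described above. Instead it substitutes diagonal loading (Tikhonov regularization), which penalizes the squared norm of the weights and is a common stand-in for a quadratic white-noise-gain constraint. The function name, the array layout, and the `loading` parameter are assumptions.

```python
import numpy as np


def solve_beamformer_weights(transfer, target, loading=1e-2):
    """Regularized least-squares weights for one frequency bin (illustrative only).

    transfer: complex array (num_directions, num_mics) of measured transfer
              functions from each measurement direction to each microphone.
    target:   complex array (num_directions,) holding the desired beampattern
              (e.g., a spherical harmonic) sampled at the same directions.
    loading:  diagonal loading; larger values shrink the weight norm, trading
              beampattern accuracy for robustness to sensor noise and mismatch.

    Minimizes ||transfer @ w - target||^2 + loading * ||w||^2.
    """
    a = np.asarray(transfer, dtype=complex)
    b = np.asarray(target, dtype=complex)
    gram = a.conj().T @ a
    rhs = a.conj().T @ b
    return np.linalg.solve(gram + loading * np.eye(a.shape[1]), rhs)
```

In use, the solve would be repeated for every frequency bin and every desired eigenbeam, with `loading` raised until the resulting weight norm meets whatever robustness target the designer has chosen.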
Although the above development discussed forming a time-domain set of basis beampattern signals, the implementation can be equivalently realized in the frequency domain or subband domain. Also, the time- or frequency-domain signals can be recorded and used for later formation and editing to allow for non-realtime operation.
Although the invention has been described in the context of microphone arrays having arrangements for omnidirectional microphones, in other embodiments, the arrays can have one or more higher-order microphones instead of or in addition to omni pressure microphones.
Although the invention has been described in the context of mobile devices, such as cell phones and tablets, having general parallelepiped shapes, the invention can be applied to any devices having a non-spheroidal shape. For example, a camera (or camcorder) that records both acoustic and (motion or still) images can be configured with an array of microphones and an audio processing system in accordance with the present invention.
The present invention can be implemented for a wide variety of applications requiring spatial audio signals, including, but not limited to, consumer devices such as laptop computers, hearing aids, cell phones, tablets, and consumer recording devices such as audio recorders, cameras, and camcorders.
Although the present invention has been described in the context of air applications, the present invention can also be applied in other applications, such as underwater applications. The invention can also be useful for determining the location of an acoustic source, which involves a decomposition of the sound field into an orthogonal or desired set of spatial modes or spatial audio playback of the spatial sound field as a preprocessor step in more-standard source localization systems.
In certain embodiments, an article of manufacture comprises (i) a device body (e.g., 700, 750) having a non-spheroidal shape; (ii) a plurality of microphones (e.g., 701-708, 751-755, 9021-902J) configured at a plurality of different locations on the device body, each microphone configured to generate a corresponding microphone signal from an incoming acoustic signal; and (iii) a signal processing system (e.g., 800, 900) configured to process the microphone signals to generate a plurality of different output beampatterns (e.g., 8211-8213, 921) in at least two non-parallel directions (e.g., x, y, z). The signal processing system is configured to generate at least one of the output beampatterns based on effects of the device body on the incoming acoustic signal.
In at least some of the above embodiments, the device body has a general parallelepiped shape.
In at least some of the above embodiments, the signal processing system comprises, for the at least one output beampattern (e.g., 621; 8211), a signal processing subsystem (e.g., 600; 8011) comprising:
In at least some of the above embodiments, the signal processing subsystem further comprises one or more of:
In at least some of the above embodiments, the signal processing system comprises three instances (e.g., 8011-8013) of the signal processing subsystem for three mutually orthogonal output beampatterns (e.g., 8211-8213).
In at least some of the above embodiments:
In at least some of the above embodiments:
In at least some of the above embodiments:
In at least some of the above embodiments:
In at least some of the above embodiments, in generating the third output beampattern, the signal processing system applies (1) a corresponding diffraction filter (e.g., h56) that takes into account the effects of the device body on the incoming acoustic signal for the third microphone and (2) a different corresponding diffraction filter (e.g., h65) that takes into account the effects of the device body on the incoming acoustic signal for the fourth microphone.
In at least some of the above embodiments, in generating the first output beampattern, the signal processing system applies (1) a different corresponding diffraction filter (e.g., h12) that takes into account the effects of the device body on the incoming acoustic signal for the first microphone and (2) a different corresponding diffraction filter (e.g., h21) that takes into account the effects of the device body on the incoming acoustic signal for the second microphone.
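The per-microphone filtering recited above can be pictured with the following sketch, which filters each of two microphone signals with its own FIR filter and then takes the difference. The subtraction, the FIR representation, and the function name are assumptions made for illustration; the text only states that a different diffraction filter is applied to each microphone.

```python
import numpy as np


def filtered_difference(mic_a, mic_b, h_a, h_b):
    """Apply a separate FIR filter to each microphone signal, then subtract.

    mic_a, mic_b: real-valued 1-D sample arrays from the two microphones.
    h_a, h_b:     real-valued FIR coefficients standing in for the two
                  diffraction filters (hypothetical values supplied by the caller).
    """
    ya = np.convolve(mic_a, h_a)
    yb = np.convolve(mic_b, h_b)
    # The two filtered signals may differ in length, so pad to the longer one.
    out = np.zeros(max(len(ya), len(yb)))
    out[:len(ya)] += ya
    out[:len(yb)] -= yb
    return out
```

Because the two filters are independent, they can absorb the different scattering and diffraction that the device body imposes on each microphone position before the difference is formed.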
In at least some of the above embodiments, in generating the second output beampattern, the signal processing system:
(a) combines the microphone signals from the first and second microphones to generate a first effective microphone signal; and
(b) applies (1) a different corresponding diffraction filter (e.g., h34) that takes into account the effects of the device body on the incoming acoustic signal for the first effective microphone and (2) a different corresponding diffraction filter (e.g., h43) that takes into account the effects of the device body on the incoming acoustic signal for at least the third microphone.
In at least some of the above embodiments, in generating the second output beampattern, the signal processing system combines the microphone signals from the third and fourth microphones to generate a second effective microphone signal, wherein the second output beampattern is based on the first and second effective microphone signals.
In at least some of the above embodiments:
In at least some of the above embodiments, the audio processing system comprises:
In at least some of the above embodiments, the plurality of different output beampatterns comprise three first-order beampatterns and a zeroth-order beampattern.
In at least some of the above embodiments, the plurality of different output beampatterns further comprises beampatterns of order two or greater.
In certain embodiments, a method comprises:
(a) receiving an incoming acoustic signal at a device body (e.g., 700, 750) having a non-spheroidal shape;
(b) generating, in response to the incoming acoustic signal, a microphone signal by each of a plurality of microphones (e.g., 701-708, 751-755, 9021-902J) configured at a plurality of different locations on the device body; and
(c) processing, by a signal processing system (e.g., 800, 900), the microphone signals to generate a plurality of different output beampatterns (e.g., 8211-8213, 921) in at least two non-parallel directions (e.g., x, y, z), wherein the signal processing system generates at least one of the output beampatterns based on effects of the device body on the incoming acoustic signal.
In at least some of the above embodiments, the method further comprises:
(d) generating motion-sensor signals characterizing motion of or with respect to the device body; and
(e) adjusting a frame of reference of one or more of the output beampatterns based on the motion-sensor signals.
In at least some of the above embodiments, step (e) comprises:
(e1) storing the output beampatterns of step (c) and the motion-sensor signals of step (d);
(e2) subsequently retrieving the stored output beampatterns and the stored motion-sensor signals; and
(e3) then adjusting the frame of reference of the one or more retrieved output beampatterns based on the retrieved motion-sensor signals.
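Steps (e1) through (e3) can be sketched for the first-order case as follows. The sketch assumes that the stored motion-sensor signals have already been converted into one 3×3 device-to-world rotation matrix per audio frame, and that the three first-order beampattern signals behave like the components of a vector along the device axes. The function name and array layout are hypothetical.

```python
import numpy as np


def stabilize_first_order(xyz_frames, device_to_world):
    """Re-express stored first-order beampattern signals in a fixed world frame.

    xyz_frames:      array (num_frames, 3, frame_len) holding the x, y, and z
                     first-order signals retrieved from storage, frame by frame.
    device_to_world: array (num_frames, 3, 3) of rotation matrices derived from
                     the retrieved motion-sensor signals (assumed precomputed).

    Each frame's three signals are mixed by that frame's rotation matrix; the
    zeroth-order signal needs no adjustment and is therefore not handled here.
    """
    return np.einsum("fij,fjn->fin", device_to_world, xyz_frames)
```

Because the adjustment only needs the stored signals and the stored orientations, it can be run long after capture, which matches the non-realtime operation described for step (e).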
In at least some of the above embodiments, the output beampatterns are combined with corresponding output beampatterns generated by one or more other devices to generate combined output beampatterns.
The present invention may be implemented as analog or digital circuit-based processes, including possible implementation on a single integrated circuit. As would be apparent to one skilled in the art, various functions of circuit elements may also be implemented as processing steps in a software program. Such software may be employed in, for example, a digital signal processor, micro-controller, or general-purpose computer.
The present invention can be embodied in the form of methods and apparatuses for practicing those methods. The present invention can also be embodied in the form of program code embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. The present invention can also be embodied in the form of program code, for example, whether stored in a storage medium, loaded into and/or executed by a machine, or transmitted over some transmission medium or carrier, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits.
Unless explicitly stated otherwise, each numerical value and range should be interpreted as being approximate as if the word “about” or “approximately” preceded the value or range.
Reference herein to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of other embodiments. The same applies to the term “implementation.”
The use of figure numbers and/or figure reference labels in the claims is intended to identify one or more possible embodiments of the claimed subject matter in order to facilitate the interpretation of the claims. Such use is not to be construed as necessarily limiting the scope of those claims to the embodiments shown in the corresponding figures.
It will be further understood that various changes in the details, materials, and arrangements of the parts which have been described and illustrated in order to explain the nature of this invention may be made by those skilled in the art without departing from the principle and scope of the invention as expressed in the following claims. Although the steps in the following method claims, if any, are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those steps, those steps are not necessarily intended to be limited to being implemented in that particular sequence.
Embodiments of the invention may be implemented as (analog, digital, or a hybrid of both analog and digital) circuit-based processes, including possible implementation as a single integrated circuit (such as an ASIC or an FPGA), a multi-chip module, a single card, or a multi-card circuit pack. As would be apparent to one skilled in the art, various functions of circuit elements may also be implemented as processing blocks in a software program. Such software may be employed in, for example, a digital signal processor, micro-controller, general-purpose computer, or other processor.
Also for purposes of this description, the terms “couple,” “coupling,” “coupled,” “connect,” “connecting,” or “connected” refer to any manner known in the art or later developed in which energy is allowed to be transferred between two or more elements, and the interposition of one or more additional elements is contemplated, although not required. Conversely, the terms “directly coupled,” “directly connected,” etc., imply the absence of such additional elements.
Signals and corresponding terminals, nodes, ports, or paths may be referred to by the same name and are interchangeable for purposes here.
As used herein in reference to an element and a standard, the term “compatible” means that the element communicates with other elements in a manner wholly or partially specified by the standard, and would be recognized by other elements as sufficiently capable of communicating with the other elements in the manner specified by the standard. The compatible element does not need to operate internally in a manner specified by the standard.
Embodiments of the invention can be manifest in the form of methods and apparatuses for practicing those methods. Embodiments of the invention can also be manifest in the form of program code embodied in tangible media, such as magnetic recording media, optical recording media, solid state memory, floppy diskettes, CD-ROMs, hard drives, or any other non-transitory machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. Embodiments of the invention can also be manifest in the form of program code, for example, stored in a non-transitory machine-readable storage medium including being loaded into and/or executed by a machine, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits.
Any suitable processor-usable/readable or computer-usable/readable storage medium may be utilized. The storage medium may be (without limitation) an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device. A more-specific, non-exhaustive list of possible storage media includes a magnetic tape, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or Flash memory, a portable compact disc read-only memory (CD-ROM), an optical storage device, and a magnetic storage device. Note that the storage medium could even be paper or another suitable medium upon which the program is printed, since the program can be electronically captured via, for instance, optical scanning of the printing, then compiled, interpreted, or otherwise processed in a suitable manner including but not limited to optical character recognition, if necessary, and then stored in a processor or computer memory. In the context of this disclosure, a suitable storage medium may be any medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The functions of the various elements shown in the figures, including any functional blocks labeled as “processors,” may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
It should be appreciated by those of ordinary skill in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the invention. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
Embodiments of the invention can also be manifest in the form of a bitstream or other sequence of signal values stored in a non-transitory recording medium generated using a method and/or an apparatus of the invention.
It will be further understood that various changes in the details, materials, and arrangements of the parts which have been described and illustrated in order to explain embodiments of this invention may be made by those skilled in the art without departing from embodiments of the invention encompassed by the following claims.
In this specification including any claims, the term “each” may be used to refer to one or more specified characteristics of a plurality of previously recited elements or steps. When used with the open-ended term “comprising,” the recitation of the term “each” does not exclude additional, unrecited elements or steps. Thus, it will be understood that an apparatus may have additional, unrecited elements and a method may have additional, unrecited steps, where the additional, unrecited elements or steps do not have the one or more specified characteristics.
It should be understood that the steps of the exemplary methods set forth herein are not necessarily required to be performed in the order described, and the order of the steps of such methods should be understood to be merely exemplary. Likewise, additional steps may be included in such methods, and certain steps may be omitted or combined, in methods consistent with various embodiments of the invention.
Although the elements in the following method claims, if any, are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.
The embodiments covered by the claims in this application are limited to embodiments that (1) are enabled by this specification and (2) correspond to statutory subject matter. Non-enabled embodiments and embodiments that correspond to non-statutory subject matter are explicitly disclaimed even if they fall within the scope of the claims.
This application claims the benefit of the filing date of U.S. provisional application No. 62/350,240, filed on Jun. 15, 2016, the teachings of which are incorporated herein by reference in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2017/036988 | 6/12/2017 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2017/218399 | 12/21/2017 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
4042779 | Craven et al. | Aug 1977 | A |
5473701 | Cezanne et al. | Dec 1995 | A |
6041127 | Elko | Mar 2000 | A |
7587054 | Elko et al. | Sep 2009 | B2 |
8204252 | Avendano | Jun 2012 | B1 |
8433075 | Elko et al. | Apr 2013 | B2 |
8942387 | Elko et al. | Jan 2015 | B2 |
9202475 | Elko et al. | Dec 2015 | B2 |
9729994 | Eddins | Aug 2017 | B1 |
9980075 | Benattar | May 2018 | B1 |
10206040 | Kolb et al. | Feb 2019 | B2 |
20110235822 | Jeong et al. | Sep 2011 | A1 |
20120128160 | Kim et al. | May 2012 | A1 |
20140105416 | Huttunen et al. | Apr 2014 | A1 |
20150055796 | Nugent et al. | Feb 2015 | A1 |
20160066117 | Chen et al. | Mar 2016 | A1 |
20160071526 | Wingate et al. | Mar 2016 | A1 |
20160165341 | Benattar | Jun 2016 | A1 |
Number | Date | Country |
---|---|---|
2375276 | Nov 2002 | GB |
2495131 | Apr 2013 | GB |
WO 2014062152 | Apr 2014 | WO |
Entry |
---|
Written Opinion; dated May 7, 2015 for PCT Application No. PCT/US2017/036988. |
International Search Report and Written Opinion; dated Oct. 4, 2017 for PCT Application No. PCT/US2017/036988. |
Gibson, J. J., et al. “Compatible FM Broadcasting of Panoramic Sound.” IEEE Transactions on Broadcast and Television Receivers 4 (1973): 286-293. |
Fellgett, P., “Ambisonics. Part one: General system description.” Studio Sound 17.8 (1975): 20-22. |
Gerzon, Michael A. “Ambisonics. Part two: Studio techniques.” Studio Sound 17.8 (1975): 24-26. |
Elko, G. W., “A Steerable and Variable First-Order Differential Microphone Array.” IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, vol. 1, pp. 223-226. |
Williams, Earl G. Fourier acoustics: sound radiation and nearfield acoustical holography. Academic Press, 1999. |
McGowan, I., “Microphone arrays: A tutorial.” Queensland University, Australia (2001), pp. 1-38. |
Grant, M., and S. Boyd. “The CVX Users' Guide. Release 2.1, 2017.”. |
Rafaely, Boaz. Fundamentals of spherical array processing. vol. 8. Berlin: Springer, 2015. |
Number | Date | Country | |
---|---|---|---|
20180227665 A1 | Aug 2018 | US |
Number | Date | Country | |
---|---|---|---|
62350240 | Jun 2016 | US |