With the advancement of technology, the use and popularity of electronic devices have increased considerably. Electronic devices are commonly used to capture and process audio data.
For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.
Electronic devices may be used to capture audio and process audio data. The audio data may be used for voice commands and/or sent to a remote device as part of a communication session. To process voice commands from a particular user or to send audio data that only corresponds to the particular user, the device may attempt to isolate desired speech associated with the user from undesired speech associated with other users and/or other sources of noise, such as audio generated by loudspeaker(s) or ambient noise in an environment around the device. For example, the device may perform Sound Source Localization (SSL) processing to determine a direction associated with the user and may isolate the audio data associated with this direction.
To improve sound source localization processing, offered is a technique for determining a direction of arrival using a combination of timing information and energy information. For example, a device may decompose an observed sound field into directional components, then estimate a time-delay likelihood value and an energy-based likelihood value for each of the directional components. The time-delay likelihood value indicates a likelihood that a particular directional component has a shortest time delay (e.g., arrived at the device first) of the directional components, while the energy-based likelihood value indicates a likelihood that the particular directional component has a highest energy value of the directional components. The device may use these likelihood values to identify a dominant directional component that corresponds to a direct path (e.g., line-of-sight) and distinguish the dominant directional component from other directional components that correspond to acoustic reflections. For example, the device may determine aggregate likelihood values and select a direction of arrival (e.g., azimuth) that corresponds to a maximum aggregate likelihood value.
In some examples, the device may perform Acoustic Wave Decomposition (AWD) processing to decompose the observed sound field into directional components, although the disclosure is not limited thereto. In order to reduce a processing consumption associated with performing AWD processing, the device may optionally split this estimation into two phases: a search phase that selects a subset of a device dictionary to reduce a complexity, and a decomposition phase that solves an optimization problem using the subset of the device dictionary.
As illustrated in
The device 110 may include a microphone array 120 configured to generate microphone audio data 112. As is known and as used herein, “capturing” an audio signal and/or generating audio data includes a microphone transducing audio waves (e.g., sound waves) of captured sound to an electrical signal and a codec digitizing the signal to generate the microphone audio data 112.
As described in greater detail below with regard to
In some examples, the device acoustic characteristics data 114 may include frequencies of interest up to a threshold value (e.g., 8 kHz), although the disclosure is not limited thereto and the device acoustic characteristics data 114 may include all frequencies without departing from the disclosure. Additionally or alternatively, the device acoustic characteristics data 114 may discretize azimuth values (e.g., azimuth angles) and elevation values (e.g., elevation angles) in a three-dimensional space with a first angle resolution (e.g., less than 10°), which may result in a first number of entries (e.g., ~800 entries), although the disclosure is not limited thereto.
As illustrated in
In some examples, the device 110 may reduce a processing consumption associated with performing AWD processing by splitting this estimation into two phases, as described in greater detail below with regard to
As illustrated in
Using the microphone audio data 112 and the device acoustic characteristics data 114, the device 110 may optionally select (134) a subset of the device acoustic characteristics data 114 and may perform (136) decomposition to determine the complex amplitude data 116, as described in greater detail below with regard to
To perform sound source localization, the device 110 may use the complex amplitude data 116 to determine a direction of arrival, as described in greater detail below with regard to
Using these likelihood values, the device 110 may determine (142) a direction of arrival associated with the sound source (e.g., user 5). In some examples, the device 110 may determine aggregate likelihood values and select a direction of arrival (e.g., azimuth value) that corresponds to a first acoustic plane-wave having a maximum aggregate likelihood value. Thus, the device 110 may use the aggregate likelihood values to identify a dominant acoustic plane-wave that corresponds to a direct path (e.g., line-of-sight) between the sound source (e.g., user 5) and the device 110, distinguishing this acoustic plane-wave from other acoustic plane-waves that correspond to acoustic reflections.
In some examples, the device 110 may determine the direction of arrival using a combination of spatial aggregation (e.g., local aggregation) and/or temporal aggregation (e.g., global aggregation), as described below with regard to
In some examples, the desired time window may correspond to an acoustic event, such as a time boundary associated with a wakeword detected by the device 110. For example, the device 110 may detect a wakeword, determine a time boundary associated with the wakeword, and perform temporal aggregation within the time boundary to generate a single azimuth value associated with the wakeword. However, the disclosure is not limited thereto, and in some examples the desired time window may instead correspond to a fixed duration of time. For example, the device 110 may perform temporal aggregation to determine an azimuth value for each individual audio frame (e.g., 8 ms) without departing from the disclosure. In this example, the device 110 may determine an azimuth value for a series of audio frames and, if a wakeword is detected within a time boundary, may optionally determine a final azimuth value associated with the wakeword based on the audio frames within the time boundary.
In some examples, the device 110 may determine the direction of arrival by determining an azimuth value. However, the disclosure is not limited thereto, and in other examples the device 110 may determine the direction of arrival by determining an azimuth value and an elevation value without departing from the disclosure.
Acoustic theory tells us that a point source produces a spherical acoustic wave in an ideal isotropic (uniform) medium such as air. Further, the sound from any radiating surface can be computed as the sum of spherical acoustic wave contributions from each point on the surface, including any relevant reflections. In addition, acoustic wave propagation is the superposition of spherical acoustic waves generated at each point along a wavefront. Thus, all linear acoustic wave propagation can be seen as a superposition of spherical traveling waves.
Additionally or alternatively, acoustic waves can be visualized as rays emanating from the source 212, especially at a distance from the source 212. For example, the acoustic waves between the source 212 and the microphone array can be represented as acoustic plane waves. As illustrated in
Acoustic plane waves are a good approximation of a far-field sound source (e.g., sound source at a relatively large distance from the microphone array), whereas spherical acoustic waves are a better approximation of a near-field sound source (e.g., sound source at a relatively small distance from the microphone array). For ease of explanation, the disclosure may refer to acoustic waves with reference to acoustic plane waves. However, the disclosure is not limited thereto, and the illustrated concepts may apply to spherical acoustic waves without departing from the disclosure. For example, the device acoustic characteristics data may correspond to acoustic plane waves, spherical acoustic waves, and/or a combination thereof without departing from the disclosure.
In some examples, the device 410 illustrated in
The acoustic wave equation is the governing law for acoustic wave propagation in fluids, including air. In the time domain, the homogeneous wave equation has the form:

$$\nabla^2 p(t) - \frac{1}{c^2}\frac{\partial^2 p(t)}{\partial t^2} = 0 \qquad [1a]$$
where p(t) is the acoustic pressure and c is the speed of sound in the medium. Alternatively, the acoustic wave equation may be solved in the frequency domain using the Helmholtz equation to find p(f):

$$\nabla^2 p(f) + k^2 p(f) = 0 \qquad [1b]$$
where k≙2πf/c is the wave number. At steady state, the time-domain and the frequency-domain solutions are Fourier pairs. The boundary conditions are determined by the geometry and the acoustic impedance of the different boundaries. The Helmholtz equation is typically solved using Finite Element Method (FEM) techniques, although the disclosure is not limited thereto and the device 110 may solve it using the boundary element method (BEM), the finite difference method (FDM), and/or other techniques without departing from the disclosure.
To analyze the microphone array 412, the system 100 may determine device acoustic characteristics data 114 associated with the device 410. For example, the device acoustic characteristics data 114 represents scattering due to the device surface (e.g., acoustic plane wave scattering caused by a surface of the device 410). Therefore, the system 100 needs to compute the scattered field at all microphones 402 for each plane-wave of interest impinging on a surface of the device 410. The total wave-field at each microphone of the microphone array 412 when an incident plane-wave p_i(k) impinges on the device 410 has the general form:

$$p_t = p_i + p_s \qquad [2]$$

where p_t is the total wave-field, p_i is the incident plane-wave, and p_s is the scattered wave-field.
The device acoustic characteristics data 114 may represent the acoustic response of the device 410 associated with the microphone array 412 to each acoustic wave of interest. The device acoustic characteristics data 114 may include a plurality of vectors, with a single vector corresponding to a single acoustic wave. The number of acoustic waves may vary, and in some examples the acoustic characteristics data may include acoustic plane waves, spherical acoustic waves, and/or a combination thereof. In some examples, the device acoustic characteristics data 114 may include 1024 frequency bins (e.g., frequency ranges) up to a maximum frequency (e.g., 8 kHz, although the disclosure is not limited thereto). Thus, the system 100 may use the device acoustic characteristics data 114 to generate RIR data with a length of up to 2048 taps, although the disclosure is not limited thereto.
The entries (e.g., values) for a single vector represent an acoustic pressure indicating a total field at each microphone (e.g., incident acoustic wave and scattering caused by the microphone array) for a particular background acoustic wave. The device acoustic characteristics data 114 has the form {z(ω, ϕ, θ)} over all ω, ϕ, and θ, where each entry z(ω, ϕ, θ) is the acoustic pressure vector (at all microphones) at frequency ω for an acoustic wave of azimuth θ and elevation ϕ. Thus, a length of each entry of the device acoustic characteristics data 114 corresponds to a number of microphones included in the microphone array.
These values may be simulated by solving a Helmholtz equation or may be directly measured using a physical measurement in an anechoic room (e.g., a room configured to deaden sound, such that there is no echo) with a distant point source (e.g., loudspeaker). For example, using techniques such as finite element method (FEM), boundary element method (BEM), finite difference method (FDM), and/or the like, the system 100 may calculate the total wave-field at each microphone. Thus, a number of entries in each vector corresponds to a number of microphones in the microphone array, with a first entry corresponding to a first microphone, a second entry corresponding to a second microphone, and so on.
In some examples, the system 100 may determine the device acoustic characteristics data 114 by simulating the microphone array 412 using wave-based acoustic modeling. For example,
The system 100 may calculate the total wave-field at all frequencies of interest with a background acoustic wave, where the surface of the device 410 is modeled as a sound hard boundary. If a surface area of an individual microphone is much smaller than a wavelength of the acoustic wave, the microphone is modeled as a point receiver on the surface of the device 410. If the surface area is not much smaller than the wavelength, the microphone response is computed as an integral of the acoustic pressure over the surface area.
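The point-receiver criterion above can be illustrated with a short numeric sketch. The speed of sound (343 m/s), the helper names, and the 10% size-to-wavelength ratio are assumptions chosen for illustration; the disclosure does not specify what "much smaller" means numerically.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air (about 20 degrees C)


def wavelength_m(frequency_hz: float, speed_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Acoustic wavelength in meters: lambda = c / f."""
    return speed_m_s / frequency_hz


def wave_number(frequency_hz: float, speed_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Wave number k = 2*pi*f / c, in radians per meter."""
    return 2.0 * math.pi * frequency_hz / speed_m_s


def use_point_receiver(mic_size_m: float, max_frequency_hz: float,
                       ratio: float = 0.1) -> bool:
    """Return True when the microphone is small relative to the shortest wavelength.

    `ratio` is a hypothetical threshold standing in for "much smaller".
    """
    return mic_size_m < ratio * wavelength_m(max_frequency_hz)


if __name__ == "__main__":
    # At 8 kHz the wavelength is roughly 43 mm, so a 1 mm port is treated as a point.
    print(wavelength_m(8000.0), use_point_receiver(0.001, 8000.0))
```

Under these assumptions, a 1 mm microphone port qualifies as a point receiver up to 8 kHz, while a 10 mm aperture would call for integrating the pressure over its surface.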
Using the FEM model, the system 100 may calculate an acoustic pressure at each microphone (at each frequency) by solving the Helmholtz equation numerically with a background acoustic wave. This procedure is repeated for each possible acoustic wave and each possible direction to generate a full dictionary that completely characterizes a behavior of the device 410 for each acoustic wave (e.g., device response for each acoustic wave). Thus, the system 100 may simulate the device acoustic characteristics data 114 and may apply the device acoustic characteristics data 114 to any room configuration.
In other examples, the system 100 may determine the device acoustic characteristics data 114 described above by physical measurement 460 in an anechoic room 465, as illustrated in
To model all of the potential acoustic waves, the system 100 may generate the input using the loudspeaker 470 in all possible locations in the anechoic room 465. For example,
After determining the complex amplitude data 116, the device 110 may use the complex amplitude data 116 to perform a variety of functions. As illustrated in
The device 110 may also perform (516) acoustic mapping using the complex amplitude data 116. In some examples, the device 110 may perform acoustic mapping such as generating a room impulse response (RIR). The RIR corresponds to an impulse response of a room or environment surrounding the device, such that the RIR is a transfer function of the room between sound source(s) and the microphone array 120 of the device 110. For example, the device 110 may generate the RIR by using the complex amplitude data 116 to determine an output signal corresponding to the sound source(s) and/or an input signal corresponding to the microphone array 120. The disclosure is not limited thereto, and in other examples, the device 110 may perform acoustic mapping to generate an acoustic map (e.g., acoustic source map, heatmap, and/or other representation) indicating acoustic sources in the environment. For example, the device 110 may locate sound source(s) in the environment and/or estimate their strength, enabling the device 110 to generate an acoustic map indicating the relative positions and/or strengths of each of the sound source(s). These sound source(s) include users within the environment, loudspeakers or other device(s) in the environment, and/or other sources of audible noise that the device 110 may detect.
Finally, the device 110 may perform (518) sound field reconstruction using the complex amplitude data 116. For example, the device 110 may perform sound field reconstruction to reconstruct a magnitude of sound pressure at various points in the room (e.g., spatial variation of the sound field), although the disclosure is not limited thereto. While
As described above, the propagation of acoustic waves in nature is governed by the acoustic wave equation, whose representation in the frequency domain (e.g., Helmholtz equation), in the absence of sound sources, is illustrated in Equation [1b]. In this equation, p(ω) denotes the acoustic pressure at frequency ω, and k denotes the wave number. Acoustic plane waves are powerful tools for analyzing the wave equation, as acoustic plane waves are a good approximation of the wave-field emanating from a far-field point source. The acoustic pressure of a plane-wave with vector wave number k is defined at point r=(x, y, z) in the three-dimensional space as:
where k is the three-dimensional wavenumber vector. For free-field propagation, k has the form:
where c is the speed of sound, and ϕ and θ are respectively the elevation and azimuth of the vector normal to the plane wave propagation. Note that k in Equation [1b] is ∥k∥. A local solution to the homogeneous Helmholtz equation can be approximated by a linear superposition of plane waves:
where Λ is a set of indices that defines the directions of plane waves {ϕ, θ}, each ψ(k) is a plane wave as in Equation [3] with k as in Equation [4], and {αl} are complex scaling factors (e.g., complex amplitude data 116) that are computed to satisfy the boundary conditions. In
When an incident plane wave ψ(k) impinges on a rigid surface, scattering takes effect on the surface. The total acoustic pressure at a set of points on the surface is the superposition of incident acoustic pressure (e.g., free-field plane wave) and scattered acoustic pressure caused by the device 110. The total acoustic pressure can be either measured in an anechoic room or simulated by numerically solving the Helmholtz equation with background acoustic plane wave ψ(k).
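As a rough illustration of the free-field plane-wave superposition discussed above, the following sketch evaluates a weighted sum of plane waves at a point. The sign convention exp(-j k·r), unit amplitude, the elevation measured from the z-axis, and all function names are assumptions made for this example; the referenced equations are not reproduced in this text, and scattering by the device is not modeled.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air


def wave_vector(omega, azimuth, elevation, c=SPEED_OF_SOUND_M_S):
    """Wavenumber vector of norm omega / c (angles in radians).

    Assumption: elevation is measured from the +z axis.
    """
    k = omega / c
    return (k * math.cos(azimuth) * math.sin(elevation),
            k * math.sin(azimuth) * math.sin(elevation),
            k * math.cos(elevation))


def plane_wave(k_vec, r):
    """Unit-amplitude free-field plane wave exp(-j k.r) at point r (assumed sign)."""
    phase = sum(ki * ri for ki, ri in zip(k_vec, r))
    return cmath.exp(-1j * phase)


def superposed_pressure(amplitudes, directions, omega, r):
    """Weighted sum of plane waves; `amplitudes` play the role of the complex weights."""
    return sum(a * plane_wave(wave_vector(omega, az, el), r)
               for a, (az, el) in zip(amplitudes, directions))


if __name__ == "__main__":
    omega = 2.0 * math.pi * 1000.0
    dirs = [(0.0, math.pi / 2), (math.pi / 3, math.pi / 2)]
    print(superposed_pressure([1.0, 0.4 - 0.2j], dirs, omega, (0.05, 0.0, 0.0)))
```

At the origin every plane wave has unit value, so the superposed pressure there reduces to the sum of the complex weights.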
The total acoustic pressure on the device surface is illustrated in
where ψ(θl(t), ϕl(t); ω) denotes the free-field acoustic plane waves from Equation [5] and {αl} denotes the corresponding weights (e.g., complex amplitude data 116).
The ensemble of all vectors that span the three-dimensional space at all frequencies ω may be referred to as the acoustic dictionary of the device (e.g., device acoustic characteristics data 114). Each entry of the device dictionary can be either measured in an anechoic room with single-frequency far-field sources, or computed numerically by solving the Helmholtz equation on the device surface with background plane-wave using a simulation or model of the device (e.g., computer-assisted design (CAD) model). Both methods yield the same result, but the numerical method has a lower cost and is less error-prone because it does not require human labor. For the numerical method, each entry in the device dictionary is computed by solving the Helmholtz equation, using Finite Element Method (FEM) techniques, Boundary Element Method (BEM) techniques, and/or the like, for the total field at the microphones with a given background plane wave ψ(k). The device model is used to specify the boundary in the simulation, and it is modeled as a sound hard boundary. To have a true background plane-wave, the external boundary should be open and non-reflecting. In the simulation, the device is enclosed by a closed boundary (e.g., a cylindrical or spherical surface). To mimic an open-ended boundary, the simulation may use a Perfectly Matched Layer (PML) that defines a special absorbing domain that eliminates reflections and refractions in the internal domain that encloses the device. The acoustic dictionary (e.g., device acoustic characteristics data 114) has the form:
where each entry in the dictionary is a vector whose size equals the microphone array size, and each element in the vector is the total acoustic pressure at one microphone in the microphone array when a plane wave with k(ωl, θl, ϕl) hits the device 110. The dictionary also covers all frequencies of interest, which may be up to 8 kHz, although the disclosure is not limited thereto. The dictionary discretizes the azimuth and elevation angles in the three-dimensional space, with angle resolution typically less than 10°. Therefore, the device dictionary may include roughly 800 entries.
The objective of the decomposition algorithm is to find the best representation of the observed sound field (e.g., microphone audio data 112 y(ω)) at the microphone array 120, using the device dictionary. A least-squares formulation can solve this optimization problem, where the objective is to minimize:
where g(·) is a regularization function and ρ(·) is a weighting function. An equivalent matrix form (e.g., optimization model 620) is:
where the columns of A(ω) are the individual entries of the acoustic characteristics data 114 at frequency ω. In Equation [8], Λ refers to the nonzero indices of the dictionary entries, which represent directions in the three-dimensional space, and is independent of ω. This independence stems from the fact that when a sound source emits broadband frequency content, it is reflected by the same boundaries in its propagation path to the receiver. Therefore, all frequencies have components from the same directions but with different strengths (e.g., due to the variability of reflection index with frequency), which is manifested by the components {αl(ω)}. Each component is a function of the source signal, the overall length of the acoustic path of its direction, and the reflectivity of the surfaces across its path. This independence between Λ and ω is a key property in characterizing the optimization problem in Equation [9].
The typical size of an acoustic dictionary is ~10^3 entries, which corresponds to an azimuth resolution of 5° and an elevation resolution of 10°. In a typical indoor environment, approximately 20 acoustic plane waves are sufficient for a good approximation in Equation [6]. Moreover, the variability in the acoustic path of the different acoustic waves at each frequency further reduces the effective number of acoustic waves at individual frequencies. Hence, the optimization problem in Equation [9] is a sparse recovery problem and proper regularization is needed to promote a sparse α. This requires L1-regularization, such as the L1-regularization used in standard least absolute shrinkage and selection operator (LASSO) optimization. To improve the perceptual quality of the reconstructed audio, L2-regularization is added, and the regularization function g(α) (e.g., regularization function 630) has the general form of elastic net regularization:
The strategy for solving the elastic net optimization problem in Equation [9] depends on the size of the microphone array. If the microphone array is large (e.g., greater than 20 microphones), then the observation vector is bigger than the typical number of nonzero components in α, making the problem relatively simple with several efficient solutions. However, the problem becomes much harder when the microphone array is relatively small (e.g., fewer than 10 microphones). In this case, the optimization problem at each frequency ω becomes an underdetermined least-squares problem because the number of observations is less than the expected number of nonzero elements in the output. Thus, the elastic net regularization illustrated in Equation [10] is necessary. Moreover, the invariance of directions (e.g., indices of nonzero elements Λ) with frequency could be exploited to reduce the search space for a more tractable solution, which is computed in two steps. Two example methods for solving this optimization problem are illustrated in
The first step computes a pruned set of indices Λ that contains the nonzero coefficients at all frequencies. This effectively reduces the problem size from the size of the full device dictionary to |Λ|, which is a reduction of about two orders of magnitude. The pruned set Λ is computed by a two-dimensional matched filter followed by a small-scale LASSO optimization. In some examples, the device 110 may determine (714) energy values for each angle in the device acoustic characteristics data 114. For example, for each angle (θl, ϕl) in the device dictionary, the device 110 may calculate:
where the weighting W(ω;t) is a function of the signal-to-noise-ratio (SNR) (e.g., signal quality metric) of the corresponding time-frequency cell (e.g., frequency band). This metric is only calculated when the target signal is present. To account for variation across elevation, the above score may be averaged over its neighboring angles:
where (k) is the set of neighboring angles to (ϕk, θk).
The device 110 may identify (716) local maxima represented in the energy values. For example, the device 110 may identify local maxima of Γ(k, t) and discard values in the neighborhood of the stronger maxima (e.g., values for angles within 10° of the local maxima). This pruning is needed to improve the numerical stability of the optimization problem.
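The peak-picking step above can be sketched as a simple non-maximum suppression. For brevity this example works on a one-dimensional circular azimuth grid rather than the two-dimensional azimuth/elevation grid; the function names and the 10° separation default are assumptions for illustration.

```python
def angular_distance_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two angles on a circle, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def prune_local_maxima(angles_deg, energies, min_separation_deg=10.0):
    """Return indices of local maxima, strongest first, after suppression.

    A local maximum is discarded when it lies within `min_separation_deg`
    of a stronger maximum that has already been kept.
    """
    n = len(energies)
    # Circular local maxima: strictly above the left neighbor, at least the right one.
    peaks = [i for i in range(n)
             if energies[i] > energies[i - 1] and energies[i] >= energies[(i + 1) % n]]
    peaks.sort(key=lambda i: energies[i], reverse=True)
    kept = []
    for i in peaks:
        if all(angular_distance_deg(angles_deg[i], angles_deg[j]) > min_separation_deg
               for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    angles = [5.0 * i for i in range(72)]
    energy = [1.0] * 72
    energy[18], energy[19], energy[20], energy[40] = 5.0, 2.0, 4.0, 3.0
    print([angles[i] for i in prune_local_maxima(angles, energy)])
```

In this usage example the peak at 100° is dropped because it sits within 10° of the stronger peak at 90°, while the peak at 200° survives.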
The device 110 may determine (718) a pruned set with indices of the strongest surviving local maxima. For example, the device 110 may find a superset
The second step in the solution procedure solves the elastic net optimization problem in Equation [9] with the pruned set Λ to calculate the complex amplitude data 116 (e.g., {α_l(ω)}_{l∈Λ}) for all ω. Thus, the device 110 may solve (722) the optimization problem with the pruned set to determine the complex amplitude data 116. For example, the device 110 may use the optimization model 620 and the regularization function 630 described above with regard to
Similar to the method illustrated in
The search phase is solved using a combination of sparse recovery and correlation methods. The main issue is that the number of microphones (e.g., M) is smaller than the number of acoustic waves (e.g., N), making it an underdetermined problem that requires design heuristics (e.g., through regularization). As illustrated in
In the second stage, the device 110 may run a limited broadband coordinate-descent (CD) solver on a subset of the subbands with a small number of iterations to further refine the component selection to the subset whose size equals the target number of output components N. For example,
Using the pruned device dictionary (e.g., of size N), the device 110 may run (822) the broadband CD solver at all subband frequencies to generate the complex amplitude data 116. The regularization parameters in step 822 may be less strict than the regularization parameters of step 818 because of the smaller dictionary size. Further, the regularization parameters for each component may be weighted to be inversely proportional to its energy value calculated in step 814.
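To make the coordinate-descent idea concrete, the following is a minimal single-frequency sketch of elastic net coordinate descent with complex values. It is not the broadband solver of the disclosure: the objective parameterization (0.5·‖y − Aα‖² + λ1·‖α‖₁ + 0.5·λ2·‖α‖²), the function names, and the fixed iteration count are assumptions, and per-component weighting of the regularization parameters is not modeled.

```python
def soft_threshold(z: complex, lam: float) -> complex:
    """Complex soft-thresholding: shrink the magnitude by lam, keep the phase."""
    mag = abs(z)
    if mag <= lam:
        return 0j
    return z * (1.0 - lam / mag)


def elastic_net_cd(A, y, lam1, lam2, iterations=50):
    """Cyclic coordinate descent for one frequency.

    A is a list of M rows (microphones), each with N entries (dictionary columns);
    y is the length-M observation vector. Returns the N complex amplitudes.
    """
    m, n = len(A), len(A[0])
    alpha = [0j] * n
    residual = list(y)  # equals y - A @ alpha while alpha is all zeros
    col_energy = [sum(abs(A[i][j]) ** 2 for i in range(m)) for j in range(n)]
    for _ in range(iterations):
        for j in range(n):
            if col_energy[j] == 0.0:
                continue
            # Correlate column j with the residual that excludes its own contribution.
            z = sum(A[i][j].conjugate() * residual[i] for i in range(m))
            z += col_energy[j] * alpha[j]
            new = soft_threshold(z, lam1) / (col_energy[j] + lam2)
            delta = new - alpha[j]
            if delta != 0:
                for i in range(m):
                    residual[i] -= A[i][j] * delta
                alpha[j] = new
    return alpha


if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0]]
    print(elastic_net_cd(A, [3.0, 0.5], lam1=1.0, lam2=0.0))
```

With an identity dictionary the result is easy to check by hand: the first coefficient shrinks from 3 to 2 and the second is zeroed out, which is the sparsifying behavior the L1 term is meant to provide.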
Using the microphone audio data 112 and the device acoustic characteristics data 114, the device 110 may determine (914) a subset of the device acoustic characteristics data 114 and may solve (916) an optimization problem with the subset of the device acoustic characteristics data 114 to determine the complex amplitude data 116, as described in greater detail below with regard to
As illustrated in
Using these likelihood values, the device 110 may determine (922) aggregate likelihood values and may determine (924) a direction of arrival associated with the sound source (e.g., user 5) using the aggregate likelihood values. In some examples, the device 110 may determine the aggregate likelihood values by adding the time-delay likelihood values and the energy likelihood values, although the disclosure is not limited thereto. To determine the direction of arrival (e.g., azimuth value), the device 110 may determine a maximum aggregate likelihood value (e.g., highest value of the aggregate likelihood values), may identify a first acoustic plane-wave associated with the maximum aggregate likelihood value, and determine the azimuth value corresponding to the first acoustic plane-wave, although the disclosure is not limited thereto. Thus, the device 110 may use the aggregate likelihood values to identify a dominant acoustic plane-wave that corresponds to a direct path (e.g., line-of-sight) between the sound source (e.g., user 5) and the device 110, distinguishing this acoustic plane-wave from other acoustic plane-waves that correspond to acoustic reflections.
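The selection step described above (add the two likelihood values per component, then take the maximum) can be sketched as follows. The function name and the example values are hypothetical.

```python
def select_direction_of_arrival(azimuths_deg, time_delay_ll, energy_ll):
    """Add the two log-likelihoods per component and return the best azimuth.

    All three sequences are indexed by directional component.
    """
    aggregate = [t + e for t, e in zip(time_delay_ll, energy_ll)]
    best = max(range(len(aggregate)), key=aggregate.__getitem__)
    return azimuths_deg[best], aggregate


if __name__ == "__main__":
    azimuth, scores = select_direction_of_arrival(
        [30.0, 120.0, 250.0],       # component azimuths (illustrative)
        [-0.2, -1.5, -3.0],         # time-delay log-likelihoods (illustrative)
        [-0.7, -0.7, -20.0],        # energy log-likelihoods (illustrative)
    )
    print(azimuth, scores)
```

Here the component at 30° wins because it is both early and energetic, while the component at 250° is heavily penalized by its low energy likelihood.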
Assuming that a source audio signal X(ω) experiences multiple reflections in the acoustic path towards a microphone array, the reflections at the receiving microphone (e.g., {Xk(ω)}k) may be calculated as:
where τk>0 is the corresponding delay, and δk is a real-valued propagation loss. Define:
which can be further simplified as:
The device 110 may use the above relation to find the time delay between two components. However, this relation is susceptible to phase wrapping at high frequencies ω, and one extra step is needed to mitigate its impact. Define for a frequency shift Δ:
This eliminates a dependence on frequency ω, and if the frequency shift Δ is chosen small enough, this eliminates the issue of phase wrapping. Then, the device 110 may determine an estimated delay between components l and k (e.g.,
where a weighting function W(ω) is proportional to a signal-to-noise-ratio (SNR) (e.g., signal quality metric) of the corresponding frequency band, as in Equation [11]. Note that if
The system 100 may assume that ρlk˜(
where erfc(·) is the complementary error function. Note that P(τk<τl)=1−P(τk>τl), hence the device 110 may calculate ρlk once for each pair of components.
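One way to read the delay estimation and pairwise probability above is sketched below. The referenced equations are not reproduced in this text, so everything here is an assumption: each component is modeled as δ·exp(−jωτ)·X(ω), the relative delay is taken from the phase of a frequency-shifted product (which depends only on the shift Δ, not on ω), and the estimate is treated as Gaussian with standard deviation `sigma`.

```python
import cmath
import math


def estimate_relative_delay(alpha_l, alpha_k, weights, delta_omega):
    """Estimate tau_k - tau_l from amplitudes on a uniform frequency grid.

    `delta_omega` is the grid spacing in rad/s; it must satisfy
    |delta_omega * (tau_k - tau_l)| < pi to avoid phase wrapping.
    """
    acc = 0j
    for i in range(len(alpha_l) - 1):
        cross_lo = alpha_k[i] * alpha_l[i].conjugate()
        cross_hi = alpha_k[i + 1] * alpha_l[i + 1].conjugate()
        # The phase of this product is -delta_omega * (tau_k - tau_l) at every bin.
        acc += weights[i] * cross_hi * cross_lo.conjugate()
    return -cmath.phase(acc) / delta_omega


def prob_k_after_l(rho_lk: float, sigma: float) -> float:
    """P(tau_k > tau_l) for a Gaussian delay estimate rho_lk with std sigma."""
    return 0.5 * math.erfc(-rho_lk / (sigma * math.sqrt(2.0)))


if __name__ == "__main__":
    d_omega = 2.0 * math.pi * 100.0
    omegas = [d_omega * (i + 1) for i in range(40)]
    a_l = [cmath.exp(-1j * w * 1e-3) for w in omegas]        # direct path, 1 ms
    a_k = [0.6 * cmath.exp(-1j * w * 3e-3) for w in omegas]  # reflection, 3 ms
    rho = estimate_relative_delay(a_l, a_k, [1.0] * 40, d_omega)
    print(rho, prob_k_after_l(rho, 1e-3))
```

Because the two probabilities for a pair are complementary, only one delay estimate per pair is needed, which matches the note above.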
The acoustic reflections are approximated by {αl(ω;t)}ω. Denoting by βl the probability that the l-th component is the first to arrive at the microphone array, βl can be expressed as:
which, using Equation [18], can be expressed in the log-domain as time-delay likelihood estimation 1120:
The time-delay likelihood estimation 1120 is an accurate approximation of the time-delay likelihood function in certain conditions. For example, the device 110 may validate that this approximation is accurate by calculating a correlation coefficient between the two components and determining that the correlation coefficient is above a predetermined threshold, although the disclosure is not limited thereto.
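A possible log-domain form of the first-arrival likelihood is sketched below. It assumes the pairwise comparisons are independent, so that the log-likelihood of component l arriving first is the sum over k ≠ l of log P(τ_k > τ_l), with each pairwise probability given by a Gaussian model with standard deviation `sigma`. That independence assumption, the function name, and the numerical floor are illustrative choices, not taken from the disclosure.

```python
import math


def first_arrival_log_likelihood(rho, sigma, floor=1e-300):
    """Log-likelihood that each component arrived first.

    rho[l][k] is the estimated delay tau_k - tau_l between components l and k.
    `floor` avoids log(0) when a pairwise probability underflows.
    """
    n = len(rho)
    result = []
    for l in range(n):
        total = 0.0
        for k in range(n):
            if k == l:
                continue
            p = 0.5 * math.erfc(-rho[l][k] / (sigma * math.sqrt(2.0)))
            total += math.log(max(p, floor))
        result.append(total)
    return result


if __name__ == "__main__":
    tau = [1e-3, 3e-3, 5e-3]  # illustrative arrival times in seconds
    rho = [[tk - tl for tk in tau] for tl in tau]
    print(first_arrival_log_likelihood(rho, sigma=1e-3))
```

In this example the component with the 1 ms arrival time receives a log-likelihood close to zero, while the later components are strongly penalized.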
The true energy of the line-of-sight component is theoretically higher than the energy of each individual reflection. However, due to the finite number of microphones, the calculated directional components may have errors. Nevertheless, the line-of-sight energy is usually among the highest energy components. The device 110 may calculate an amount of energy for each component using energy estimation 1130:
where W(ω) is a weighting function that is proportional to SNR for each frequency band (e.g., frequency range), as described above. Thus, the device 110 is not weighting all frequencies evenly, but is instead weighting frequency bands based on an SNR value (e.g., higher SNR value, more weight associated with the frequency band). Therefore, the weighting function W(ω) is determined based on system conditions and may be identical when calculating the time-delay likelihood estimation 1120 (e.g., Equation [17]) and when calculating the energy estimation 1130 (e.g., Equation [21]).
A directional component may be a candidate to be the line-of-sight component if an energy value is above a threshold value (e.g., El>ν·max{Ek}, where ν is a predetermined threshold). If the number of directional components that satisfy this condition is M, then the energy-based likelihood is computed as an energy likelihood estimation 1140:
where ε ≪ −log M corresponds to a small probability value to account for computation errors.
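The energy estimation and energy-based likelihood described above can be sketched as follows. The text suggests that the M candidate components share the likelihood uniformly (log-likelihood −log M) and that all other components receive the small value ε; the exact equations are not reproduced here, so this reading, the function names, and the defaults for ν and ε are assumptions.

```python
import math


def component_energy(alpha_l, weights):
    """SNR-weighted energy of one component: sum over bands of W * |alpha|^2."""
    return sum(w * abs(a) ** 2 for a, w in zip(alpha_l, weights))


def energy_log_likelihood(energies, nu=0.5, epsilon=-20.0):
    """Energy-based log-likelihood per component.

    Components with energy above nu * max(energies) are line-of-sight candidates
    and share the likelihood uniformly; the rest get `epsilon` (much less than
    -log M). `nu` and `epsilon` are hypothetical values.
    """
    threshold = nu * max(energies)
    is_candidate = [e > threshold for e in energies]
    m = sum(is_candidate)
    return [-math.log(m) if c else epsilon for c in is_candidate]


if __name__ == "__main__":
    print(energy_log_likelihood([4.0, 3.0, 0.5]))
```

With energies of 4.0, 3.0, and 0.5 and ν = 0.5, the first two components are candidates and each receives −log 2, while the third receives ε.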
At each time frame, the device may calculate log-likelihoods
and the likelihood function χ(·) of all azimuth angles is updated according to Equation [23] with every new directional component.
The final step is for the device 110 to calculate the maximum-likelihood estimate of the azimuth angle by aggregating the local likelihood χ(ϕ). Thus, the device 110 may calculate a global likelihood for each azimuth angle as global aggregate likelihood estimation 1160:
To illustrate an example, the local aggregate likelihood estimation 1150 may correspond to spatial aggregation, as the device 110 looks at every azimuth value and aggregates everything within a particular range (e.g., aggregating from directional components to azimuth angles). In contrast, the global aggregate likelihood estimation 1160 may correspond to temporal aggregation over a desired time window, as the device 110 looks at selected azimuth values for a period of time and generates a single output azimuth value.
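The two aggregation stages can be sketched as below. The aggregation equations are not reproduced in this text, so this is only one plausible reading: spatial aggregation assigns each azimuth bin the best log-likelihood among components within an angular window, and temporal aggregation sums the per-frame values over the time window before taking the maximum. The window half-width, the floor value for empty bins, and the function names are assumptions.

```python
def angular_distance_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two angles on a circle, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def local_aggregate(grid_deg, components, half_width_deg=10.0, floor=-50.0):
    """Spatial aggregation for one frame.

    `components` is a list of (azimuth_deg, aggregate_log_likelihood) pairs.
    Bins with no nearby component keep the `floor` value.
    """
    chi = [floor] * len(grid_deg)
    for azimuth, log_likelihood in components:
        for i, g in enumerate(grid_deg):
            if angular_distance_deg(g, azimuth) <= half_width_deg:
                chi[i] = max(chi[i], log_likelihood)
    return chi


def global_aggregate(grid_deg, frames, half_width_deg=10.0, floor=-50.0):
    """Temporal aggregation: sum per-frame values and return the best azimuth."""
    total = [0.0] * len(grid_deg)
    for components in frames:
        chi = local_aggregate(grid_deg, components, half_width_deg, floor)
        total = [t + c for t, c in zip(total, chi)]
    best = max(range(len(total)), key=total.__getitem__)
    return grid_deg[best], total


if __name__ == "__main__":
    grid = [10.0 * i for i in range(36)]
    frames = [
        [(92.0, -1.0), (200.0, -3.0)],
        [(88.0, -1.2), (310.0, -2.5)],
        [(90.0, -0.8)],
    ]
    print(global_aggregate(grid, frames)[0])
```

In this usage example only the 90° bin is supported in all three frames, so it is selected as the single output azimuth for the window.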
In some examples, the desired time window may correspond to an acoustic event, such as a time boundary associated with a wakeword detected by the device 110. For example, the device 110 may detect a wakeword, determine a time boundary associated with the wakeword, and calculate the global aggregate likelihood estimation 1160 within the time boundary to generate a single azimuth value associated with the wakeword. However, the disclosure is not limited thereto, and in some examples the desired time window may instead correspond to a fixed duration of time. For example, the device 110 may use the global aggregate likelihood estimation 1160 to determine an azimuth value for each individual audio frame (e.g., 8 ms) without departing from the disclosure. In this example, the device 110 may determine an azimuth value for a series of audio frames and, if a wakeword is detected within a time boundary, may optionally determine a final azimuth value associated with the wakeword based on the audio frames within the time boundary.
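As a non-limiting illustration, the temporal aggregation over a desired time window may be sketched as follows. The sketch assumes that the local aggregate likelihood values are stored per audio frame and per azimuth bin, that the global value for each azimuth is the sum of the local log-likelihood values over the frames in the window, and that the window is given as a pair of frame indices (e.g., a wakeword time boundary). The summation rule and the function name `global_azimuth` are assumptions and do not reproduce Equation [24].

```python
def global_azimuth(local_likelihoods, start_frame, end_frame):
    """Aggregate per-frame local log-likelihoods over [start_frame, end_frame).

    local_likelihoods: list of frames, each a list with one value per
    azimuth bin. Returns (index of best azimuth bin, per-bin totals).
    Assumption: temporal aggregation is a plain sum over frames.
    """
    window = local_likelihoods[start_frame:end_frame]
    if not window:
        raise ValueError("empty time window")
    num_bins = len(window[0])
    totals = [sum(frame[b] for frame in window) for b in range(num_bins)]
    best_bin = max(range(num_bins), key=totals.__getitem__)
    return best_bin, totals


frames = [
    [-1.0, -3.0, -2.0],
    [-2.0, -1.0, -4.0],
    [-3.0, -1.0, -1.0],
    [-0.1, -9.0, -9.0],
]
# Window covering the first three frames (e.g., a wakeword boundary).
wakeword_bin, wakeword_totals = global_azimuth(frames, 0, 3)
# Window covering a single frame (e.g., per-frame output).
frame_bin, _ = global_azimuth(frames, 3, 4)
```

The same function covers both cases described above: a window that spans an acoustic event yields one azimuth value for the event, while a one-frame window yields an azimuth value per audio frame.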
In some examples, the device 110 may determine the direction of arrival by determining an azimuth value. Thus, while the directional components may correspond to both an azimuth angle and an elevation angle, the device 110 may average across elevation and select an azimuth value associated with a range of elevation angles. However, the disclosure is not limited thereto, and in other examples the device 110 may determine the direction of arrival by determining an azimuth value and an elevation value without departing from the disclosure.
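As a non-limiting illustration, averaging across elevation to select an azimuth value may be sketched as follows. The sketch assumes a grid of likelihood values indexed by azimuth and elevation and takes an arithmetic mean over the elevation entries of each azimuth. The grid layout, the use of an arithmetic mean, and the function name `azimuth_from_grid` are assumptions made for illustration.

```python
def azimuth_from_grid(scores, azimuths_deg):
    """Select an azimuth by averaging likelihood values across elevation.

    scores[i][j]: likelihood value for azimuth index i and elevation
    index j (assumed layout). Returns (best azimuth in degrees, means).
    """
    means = [sum(row) / len(row) for row in scores]
    best = max(range(len(means)), key=means.__getitem__)
    return azimuths_deg[best], means


scores = [
    [-4.0, -2.0],  # azimuth 0 degrees, two elevation angles
    [-1.0, -1.0],  # azimuth 120 degrees
    [-3.0, -5.0],  # azimuth 240 degrees
]
best_azimuth, elevation_means = azimuth_from_grid(scores, [0.0, 120.0, 240.0])
```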
Based on these time delay values, the device 110 may determine time-delay likelihood values using time-delay likelihood estimation 1120 calculated using Equation [20] above. Thus, a time-delay likelihood value may indicate a likelihood that a particular acoustic plane-wave has a shortest time delay (e.g., arrived at the device first) of a plurality of acoustic plane-waves.
The device 110 may determine (1016) energy values using the complex amplitude data 116 and may determine (1018) energy likelihood values (e.g., energy-based likelihood values) using the energy values. For example, the device 110 may determine the energy values using energy estimation 1130, shown as Equation [21] above.
Based on these energy values, the device 110 may determine energy likelihood values using energy likelihood estimation 1140, shown as Equation [22] above. Thus, an energy-based likelihood value may indicate a likelihood that the particular acoustic plane-wave has a highest energy value of the plurality of acoustic plane-waves.
Using the time-delay likelihood values and the energy likelihood values, the device 110 may determine (1020) local aggregate likelihood values and may determine (1022) global aggregate likelihood values. In some examples, the device 110 may determine the aggregate likelihood values by adding the time-delay likelihood values and the energy likelihood values, although the disclosure is not limited thereto.
This corresponds to azimuth value ϕl of the corresponding entry of the device acoustic characteristics data 114 (e.g., a device dictionary), and the total likelihood at this azimuth value ϕl is a sum of the two likelihood values. However, due to the finite dictionary size and the finite precision of the computation, the true angle of the l-th component can be an angle adjacent to the azimuth value (ϕl). Assuming a normal distribution (with variance κ) of the true azimuth value around ϕl, the likelihood for azimuth angles adjacent to the azimuth value (ϕl) is approximated using local aggregate likelihood estimation 1150, shown as Equation [23] above. These local aggregate likelihood values can be used to determine global aggregate likelihood values for each azimuth value, using global aggregate likelihood estimation 1160, shown as Equation [24] above.
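As a non-limiting illustration, spreading the total likelihood of a directional component to adjacent azimuth angles may be sketched as follows. The sketch assumes log-domain likelihood values, angles in degrees, a variance κ in degrees squared, and a Gaussian penalty of (ϕ − ϕl)²/(2κ) with the constant normalization term omitted. Combining the contributions of several components by keeping the maximum at each azimuth is also an assumption. None of these choices reproduces Equation [23], and all function names are hypothetical.

```python
def circular_diff_deg(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def spread_component(grid_deg, phi_l, total_log_likelihood, kappa):
    """Log-likelihood contribution of one component at each grid azimuth.

    Assumption: the true azimuth is normally distributed around phi_l
    with variance kappa, so the log-likelihood drops by d^2 / (2 * kappa).
    """
    return [
        total_log_likelihood - circular_diff_deg(phi, phi_l) ** 2 / (2.0 * kappa)
        for phi in grid_deg
    ]


def update_local(chi, contribution):
    """Update the local likelihood with a new component (assumed: max rule)."""
    return [max(c, n) for c, n in zip(chi, contribution)]


grid = [0.0, 90.0, 180.0, 270.0]
chi = [float("-inf")] * len(grid)
# Total likelihood = time-delay likelihood + energy likelihood (log domain).
chi = update_local(chi, spread_component(grid, 0.0, -1.0, kappa=100.0))
chi = update_local(chi, spread_component(grid, 180.0, -2.0, kappa=100.0))
```

The circular difference ensures that azimuth values near 0 degrees and 360 degrees are treated as adjacent.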
Finally, the device 110 may determine (1024) a direction of arrival associated with the sound source (e.g., user 5) using the global aggregate likelihood values. To determine the direction of arrival (e.g., azimuth value), the device 110 may determine a maximum global aggregate likelihood value (e.g., highest value of the global aggregate likelihood values), may identify a first acoustic plane-wave associated with the maximum global aggregate likelihood value, and determine the azimuth value corresponding to the first acoustic plane-wave, although the disclosure is not limited thereto. Thus, the device 110 may use the global aggregate likelihood values to identify a dominant acoustic plane-wave that corresponds to a direct path (e.g., line-of-sight) between the sound source (e.g., user 5) and the device 110, distinguishing this acoustic plane-wave from other acoustic plane-waves that correspond to acoustic reflections.
Computer instructions for operating the device 110 and its various components may be executed by the respective device's controller(s)/processor(s) 1204, using the memory 1206 as temporary “working” storage at runtime. A device's computer instructions may be stored in a non-transitory manner in non-volatile memory 1206, storage 1208, or an external device(s). Alternatively, some or all of the executable instructions may be embedded in hardware or firmware on the respective device in addition to or instead of software.
The device 110 includes input/output device interfaces 1202. A variety of components may be connected through the input/output device interfaces 1202, as will be discussed further below. Additionally, the device 110 may include an address/data bus 1224 for conveying data among components of the respective device. Each component within a device 110 may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus 1224.
Referring to
Via antenna(s) 1214, the input/output device interfaces 1202 may connect to one or more networks 199 via a wireless local area network (WLAN) (such as WiFi) radio, Bluetooth, and/or wireless network radio, such as a radio capable of communication with a wireless communication network such as a Long Term Evolution (LTE) network, WiMAX network, 3G network, 4G network, 5G network, etc. A wired connection such as Ethernet may also be supported. Through the network(s) 199, the system may be distributed across a networked environment. The I/O device interface 1202 may also include communication components that allow data to be exchanged between devices such as different physical servers in a collection of servers or other components.
The components of the device 110 may include their own dedicated processors, memory, and/or storage. Alternatively, one or more of the components of the device 110 may utilize the I/O interfaces 1202, processor(s) 1204, memory 1206, and/or storage 1208 of the device 110, respectively.
As noted above, multiple devices may be employed in a single system. In such a multi-device system, each of the devices may include different components for performing different aspects of the system's processing. The multiple devices may include overlapping components. The components of the device 110, as described herein, are illustrative, and may be located as a stand-alone device or may be included, in whole or in part, as a component of a larger device or system.
Multiple devices 110 and/or other components may be connected over the network(s) 199. The network(s) 199 may include a local or private network or may include a wide network such as the Internet. Devices may be connected to the network(s) 199 through either wired or wireless connections. For example, the devices 110 may be connected to the network(s) 199 through a wireless service provider, over a WiFi or cellular network connection, or the like. Other devices may be included as network-connected support devices, which may connect to the network(s) 199 through a wired connection or wireless connection.
The concepts disclosed herein may be applied within a number of different devices and computer systems, including, for example, general-purpose computing systems, speech processing systems, and distributed computing environments.
The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects may be apparent to those of skill in the art. Persons having ordinary skill in the field of computers and speech processing should recognize that components and process steps described herein may be interchangeable with other components or steps, or combinations of components or steps, and still achieve the benefits and advantages of the present disclosure. Moreover, it should be apparent to one skilled in the art, that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.
Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure. The computer readable storage medium may be implemented by a volatile computer memory, non-volatile computer memory, hard drive, solid-state memory, flash drive, removable disk, and/or other media. In addition, components of the system may be implemented in firmware or hardware, such as an acoustic front end (AFE), which comprises, among other things, analog and/or digital filters (e.g., filters configured as firmware to a digital signal processor (DSP)).
Conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements, and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without other input or prompting, whether these features, elements, and/or steps are included or are to be performed in any particular embodiment. The terms “comprising,” “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list.
Disjunctive language such as the phrase “at least one of X, Y, Z,” unless specifically stated otherwise, is understood with the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present.
As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.
This application is a continuation of, and claims priority to, U.S. Non-Provisional patent application Ser. No. 17/952,806, entitled “SOUND SOURCE LOCALIZATION USING ACOUSTIC WAVE DECOMPOSITION,” filed on Sep. 26, 2022, and scheduled to issue as U.S. Pat. No. 12,101,599. The above application is herein incorporated by reference in its entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 17952806 | Sep 2022 | US |
Child | 18889896 | | US |