In audio systems, beamforming refers to techniques that are used to isolate audio from a particular direction. Beamforming may be particularly useful when filtering out noise from non-desired directions. Beamforming may be used for various tasks, including isolating desired speech and/or voice commands to be executed by a speech-processing system.
For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.
Some electronic devices may include an audio-based input/output interface. A user may interact with such a device—which may be, for example, a smartphone, tablet, computer, or other speech-controlled device—partially or exclusively using his or her voice and ears. Exemplary interactions include listening to music or other audio, communications such as telephone calls, audio messaging, and video messaging, and/or audio input for search queries, weather forecast requests, navigation requests, or other such interactions. The device may include one or more microphones for capturing voice input and hardware and/or software for converting the voice input into audio data. As explained in greater detail below, the device may further include hardware and/or software for analyzing the audio data and determining commands and requests therein and/or may send the audio data to a remote device for such analysis. The device may include an audio output device, such as a speaker, for outputting audio that in some embodiments responds to and/or prompts for the voice input.
Use of the above-described electronic device may, at times, be inconvenient, difficult, or impossible. Sometimes, such as while exercising, working, or driving, the user's hands may be occupied, and the user may not be able to hold the device in such a fashion as to effectively interact with the device's audio interface. Other times, the level of ambient noise may be too high for the device to accurately detect speech from the user or too high for the user to understand audio output from the device. In these situations, the user may prefer to connect headphones to the device. Headphones may also be used by a user to interact with a variety of other devices. As the term is used herein, “headphones” may refer to any wearable audio input/output device and includes headsets, earphones, earbuds, or any similar device. For added convenience, the user may choose to use wireless headphones, which communicate with the device—and optionally each other—via a wireless connection, such as Bluetooth, Wi-Fi, near-field magnetic induction (NFMI), Long-Term Evolution (LTE), 5G, or any other type of wireless connection.
In the present disclosure, for clarity, headphone components that are capable of wireless communication with both a third device and each other are referred to as “wireless earbuds,” but the term “earbud” does not limit the present disclosure to any particular type of wired or wireless headphones. The present disclosure may further differentiate between a “right earbud,” meaning a headphone component disposed in or near a right ear of a user, and a “left earbud,” meaning a headphone component disposed in or near a left ear of a user. A “primary” earbud communicates with a “secondary” earbud using a first wireless connection (such as a Bluetooth connection); the primary earbud further communicates with a third device (such as a smartphone, smart watch, or similar device) using a second connection (such as a Bluetooth connection). The secondary earbud communicates directly only with the primary earbud and does not communicate using a dedicated connection directly with the smartphone; communication therewith may instead pass through the primary earbud via the first wireless connection.
The primary and secondary earbuds may include similar hardware and software; in other instances, the secondary earbud contains only a subset of the hardware/software included in the primary earbud. If the primary and secondary earbuds include similar hardware and software, they may trade the roles of primary and secondary prior to or during operation. In the present disclosure, the primary earbud may be referred to as the “first device,” the secondary earbud may be referred to as the “second device,” and the smartphone or other device may be referred to as the “third device.” The first, second, and/or third devices may communicate over a network, such as the Internet, with one or more server devices, which may be referred to as “remote device(s).”
Beamforming systems isolate audio from a particular direction in a multi-directional audio capture system. As the terms are used herein, an azimuth direction refers to a direction in the XY plane with respect to the system, and elevation refers to a direction in the Z plane with respect to the system. One technique for beamforming involves boosting target audio received from a desired azimuth direction and/or elevation while dampening noise audio received from a non-desired azimuth direction and/or non-desired elevation. Existing beamforming systems, however, may perform poorly in noisy environments and/or when the target audio is low in volume; in these systems, the audio may not be boosted enough to accurately separate speech signals and/or perform additional processing, such as automatic speech recognition (ASR) or speech-to-text processing.
Target speech separation is an important feature for next-generation hearable devices. Conventional beamformers often underperform in strong noise fields due to their high sidelobes. For example, performing beamforming forms a beampattern, which generates directions of effective gain or attenuation. The beampattern may exhibit a plurality of lobes, or regions of gain, with gain predominating in a particular direction (e.g., beampattern direction). While a main lobe may extend along the beampattern direction, the beampattern also includes side lobes, which are locations where the beampattern extends perpendicular to the beampattern direction that may capture undesired speech or acoustic noise.
Deep neural network (DNN) based methods for speech separation may perform well, but they may carry assumptions, for example, about the number of concurrent talkers. In the case of multiple concurrent talkers, certain DNN methods cannot unambiguously determine which voice the user intends to follow without additional cues reflecting the user's intention. In addition, many DNN based methods process large time windows and therefore suffer from high latency.
A neural sidelobe canceller is proposed to overcome the limitations of conventional beamformers and DNN-based systems for target speech separation on hearable devices. The neural sidelobe canceller may selectively enhance speech only from the user's desired direction, while suppressing all non-speech sources and competing speech streams from other directions at a low algorithmic latency of 2.5 ms.
In the proposed system, microphone inputs are first processed through a fixed superdirective beamformer (SDB) with a steering vector pointing to the desired enhancement direction. SDB output is then concatenated with raw microphone inputs to form the input to a DNN-based sidelobe canceller, which operates in the time domain with 2.5-ms-long frames. Using such input, the DNN further enhances speech in the target direction, with all the sidelobes, noise sources and reverberation removed. The proposed system generalizes well to different enhancement directions, which can be controlled by the user by adjusting the input SDB direction.
As illustrated in
The device 110 may generate (136) first feature data using the first audio data. For example, the device 110 may include an encoder component configured to apply a first number of filters (e.g., N=512 filters) to generate a high-dimensional representation of the third audio data. The device 110 may generate (138) mask data using the first feature data and a deep neural network (DNN) component, as described in greater detail below with regard to
The devices 110a/110b may include one or more loudspeaker(s) 114 (e.g., loudspeaker 202a/202b), one or more external microphone(s) 112 (e.g., first microphones 204a/204b and second microphones 205a/205b), and one or more internal microphone(s) 112 (e.g., third microphones 206a/206b). The loudspeaker 114 may be any type of loudspeaker, such as an electrodynamic speaker, electrostatic speaker, diaphragm speaker, or piezoelectric loudspeaker; the microphones 112 may be any type of microphones, such as piezoelectric or MEMS microphones. Each device 110a/110b may include one or more microphones 112.
As illustrated in
One or more batteries 207a/207b may be used to supply power to the devices 110a/110b. One or more antennas 210a/210b may be used to transmit and/or receive wireless signals over the first connection 124a and/or second connection 124b; an I/O interface 212a/212b contains software and hardware to control the antennas 210a/210b and transmit signals to and from other components. A processor 214a/214b may be used to execute instructions in a memory 216a/216b; the memory 216a/216b may include volatile memory (e.g., random-access memory) and/or non-volatile memory or storage (e.g., flash memory). One or more sensors 218a/218b, such as accelerometers, gyroscopes, or any other such sensor may be used to sense physical properties related to the devices 110a/110b, such as orientation; this orientation may be used to determine whether either or both of the devices 110a/110b are currently disposed in an ear of the user (i.e., the “in-ear” status of each device).
An audio signal is a representation of sound and an electronic representation of an audio signal may be referred to as audio data, which may be analog and/or digital without departing from the disclosure. For ease of illustration, the disclosure may refer to either audio data (e.g., microphone audio data, input audio data, etc.) or audio signals (e.g., microphone audio signal, input audio signal, etc.) without departing from the disclosure. Additionally or alternatively, portions of a signal may be referenced as a portion of the signal or as a separate signal and/or portions of audio data may be referenced as a portion of the audio data or as separate audio data. For example, a first audio signal may correspond to a first period of time (e.g., 30 seconds) and a portion of the first audio signal corresponding to a second period of time (e.g., 1 second) may be referred to as a first portion of the first audio signal or as a second audio signal without departing from the disclosure. Similarly, first audio data may correspond to the first period of time (e.g., 30 seconds) and a portion of the first audio data corresponding to the second period of time (e.g., 1 second) may be referred to as a first portion of the first audio data or second audio data without departing from the disclosure. Audio signals and audio data may be used interchangeably, as well; a first audio signal may correspond to the first period of time (e.g., 30 seconds) and a portion of the first audio signal corresponding to a second period of time (e.g., 1 second) may be referred to as first audio data without departing from the disclosure.
In some examples, the audio data may correspond to audio signals in a time-domain. However, the disclosure is not limited thereto and the device 110 may convert these signals to a subband-domain or a frequency-domain prior to performing additional processing, such as adaptive feedback reduction (AFR) processing, acoustic echo cancellation (AEC), adaptive interference cancellation (AIC), noise reduction (NR) processing, tap detection, and/or the like. For example, the device 110 may convert the time-domain signal to the subband-domain by applying a bandpass filter or other filtering to select a portion of the time-domain signal within a desired frequency range. Additionally or alternatively, the device 110 may convert the time-domain signal to the frequency-domain using a Fast Fourier Transform (FFT) and/or the like.
As used herein, audio signals or audio data (e.g., microphone audio data, or the like) may correspond to a specific range of frequency bands. For example, the audio data may correspond to a human hearing range (e.g., 20 Hz-20 kHz), although the disclosure is not limited thereto.
As used herein, a frequency band (e.g., frequency bin) corresponds to a frequency range having a starting frequency and an ending frequency. Thus, the total frequency range may be divided into a fixed number (e.g., 256, 512, etc.) of frequency ranges, with each frequency range referred to as a frequency band and corresponding to a uniform size. However, the disclosure is not limited thereto and the size of the frequency band may vary without departing from the disclosure.
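For illustration only, the following sketch shows how a time-domain frame might be converted into uniform frequency bins with an FFT, as described above; the 16 kHz sample rate, 512-sample frame, and Hann window are assumptions and are not specified by the disclosure.

```python
# Minimal sketch (not the device's actual implementation): converting a
# time-domain microphone frame into uniform frequency bins with an FFT.
# Sample rate, frame size, and window choice are illustrative assumptions.
import numpy as np

SAMPLE_RATE = 16000   # Hz (assumed)
FRAME_SIZE = 512      # samples per analysis frame (assumed)

def frame_to_bins(time_frame: np.ndarray) -> np.ndarray:
    """Window a time-domain frame and return complex frequency bins."""
    windowed = time_frame * np.hanning(len(time_frame))
    return np.fft.rfft(windowed)          # FRAME_SIZE/2 + 1 uniform bins

# Each bin spans SAMPLE_RATE / FRAME_SIZE Hz (here 31.25 Hz), covering 0 Hz
# up to the Nyquist frequency (8 kHz for a 16 kHz sample rate).
bins = frame_to_bins(np.random.randn(FRAME_SIZE))
print(len(bins), "frequency bins")
```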
The device 110 may include multiple microphones 112 configured to capture sound and pass the resulting audio signal created by the sound to a downstream component, such as an SDB component discussed below. Each individual piece of audio data captured by a microphone may be in a time domain. To isolate audio from a particular direction, the device may compare the audio data (or audio signals related to the audio data, such as audio signals in a sub-band domain) to determine a time difference of detection of a particular segment of audio data. If the audio data for a first microphone includes the segment of audio data earlier in time than the audio data for a second microphone, then the device may determine that the source of the audio that resulted in the segment of audio data is located closer to the first microphone than to the second microphone (the shorter distance resulting in the audio being detected by the first microphone before being detected by the second microphone).
Using such direction isolation techniques, a device 110 may isolate directionality of audio sources. For example, a particular direction may be associated with azimuth angles divided into bins (e.g., 0-45 degrees, 46-90 degrees, and so forth). To isolate audio from a particular direction, the device 110 may apply a variety of audio filters to the output of the microphones where certain audio is boosted while other audio is dampened, to create isolated audio corresponding to a particular direction, which may be referred to as a beam. While in some examples the number of beams may correspond to the number of microphones, the disclosure is not limited thereto and the number of beams may be independent of the number of microphones 112. For example, a two-microphone array may be processed to obtain more than two beams, thus using filters and beamforming techniques to isolate audio from more than two directions. Thus, the number of microphones may be more than, less than, or the same as the number of beams. The beamformer unit of the device may have an adaptive beamformer (ABF) unit/fixed beamformer (FBF) unit processing pipeline for each beam, as explained below.
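As a rough illustration of the direction-isolation idea above, the following delay-and-sum sketch forms several beams from a two-microphone array; the array geometry, sample rate, azimuth bins, and the use of simple integer delays are assumptions, and the device's actual filters and beamforming techniques may differ.

```python
# Illustrative sketch only: delay-and-sum beamforming for a small microphone
# array, forming more beams than microphones. Geometry, sample rate, and
# speed of sound are assumptions; the device's actual filters may differ.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
SAMPLE_RATE = 16000      # Hz (assumed)

def delay_and_sum(mic_signals: np.ndarray, mic_x: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Steer a beam toward azimuth_deg for mics on the x-axis at positions mic_x (meters)."""
    direction = np.cos(np.radians(azimuth_deg))
    beam = np.zeros(mic_signals.shape[1])
    for sig, x in zip(mic_signals, mic_x):
        delay = int(round(x * direction / SPEED_OF_SOUND * SAMPLE_RATE))
        beam += np.roll(sig, -delay)          # align arrivals from the look direction
    return beam / len(mic_x)

# Two microphones, but beams for several azimuth bins (0-45, 46-90, ... degrees).
mics = np.random.randn(2, SAMPLE_RATE)        # 1 s of audio per microphone
positions = np.array([0.0, 0.02])             # 2 cm spacing (assumed)
beams = [delay_and_sum(mics, positions, az) for az in (22.5, 67.5, 112.5, 157.5)]
```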
The device 110 may use various techniques to determine the beam corresponding to the look-direction. For example, the device 110 may use techniques (either in the time domain or in the sub-band domain) such as calculating a signal-to-noise ratio (SNR) for each beam, performing voice activity detection (VAD) on each beam, or the like, although the disclosure is not limited thereto.
After identifying the look-direction associated with the speech, the device 110 may use a FBF unit or other such component to isolate audio coming from the look-direction using techniques known to the art and/or explained herein. For example, the device 110 may boost audio coming from a particular direction, thus increasing the amplitude of audio data corresponding to speech from user relative to other audio captured from other directions. In this manner, noise from diffuse sources that is coming from all the other directions will be dampened relative to the desired audio (e.g., speech from the user) coming from the selected direction.
The device 110 may also operate an adaptive noise canceller (ANC) unit 560 to amplify audio signals from directions other than the direction of an audio source. Those audio signals represent noise signals, so the resulting amplified audio signals from the ABF unit may be referred to as noise reference signals 520, discussed further below. The device 110 may then weight the noise reference signals, for example using filters 522 discussed below. The device may combine the weighted noise reference signals 524 into a combined (weighted) noise reference signal 525. Alternatively, the device may not weight the noise reference signals and may simply combine them into the combined noise reference signal 525 without weighting. The device may then subtract the combined noise reference signal 525 from the amplified first audio signal 532 to obtain a difference 536. The device may then output that difference, which represents the desired output audio signal with the noise removed. The diffuse noise is removed by the FBF unit when determining the signal 532 and the directional noise is removed when the combined noise reference signal 525 is subtracted. The device may also use the difference to create updated weights (for example, for filters 522) that may be used to weight future audio signals. The step-size controller 504 may be used to modulate the rate of adaptation from one weight to an updated weight.
In this manner, noise reference signals are used to adaptively estimate the noise contained in the output signal of the FBF unit using the noise-estimation filters 522. This noise estimate is then subtracted from the FBF unit output signal to obtain the final ABF unit output signal. The ABF unit output signal is also used to adaptively update the coefficients of the noise-estimation filters. Lastly, a robust step-size controller is used to control the rate of adaptation of the noise estimation filters.
As shown in
The microphone outputs 513 may be passed to the FBF unit 540 including the filter and sum unit 530. The FBF unit 540 may be implemented as a robust super-directive beamformer unit, delayed sum beamformer unit, or the like. The FBF unit 540 is presently illustrated as a super-directive beamformer (SDB) unit due to its improved directivity properties. The filter and sum unit 530 takes the audio signals from each of the microphones and boosts the audio signal from the microphone associated with the desired look direction and attenuates signals arriving from other microphones/directions. The filter and sum unit 530 may operate as illustrated in
As illustrated in
Each particular FBF unit may be tuned with filter coefficients to boost audio from one of the particular beams. For example, FBF unit 540-1 may be tuned to boost audio from beam 1, FBF unit 540-2 may be tuned to boost audio from beam 2, and so forth. If the filter block is associated with the particular beam, its beamformer filter coefficient h will be high whereas if the filter block is associated with a different beam, its beamformer filter coefficient h will be lower. For example, for FBF unit 540-7, direction 7, the beamformer filter coefficient h7 for filter 612g may be high while beamformer filter coefficients h1-h6 and h8 may be lower. Thus the filtered audio signal y7 will be comparatively stronger than the filtered audio signals y1-y6 and y8, thus boosting audio from direction 7 relative to the other directions. The filtered audio signals will then be summed together to create the output audio signal Yf 532. Thus, the FBF unit 540 may phase align microphone audio data toward a given direction and add it up, so that signals arriving from a particular direction are reinforced while signals that are not arriving from the look direction are suppressed. The robust FBF coefficients are designed by solving a constrained convex optimization problem and by specifically taking into account the gain and phase mismatch on the microphones.
The individual beamformer filter coefficients may be represented as HBF,m(r), where r=0, . . . , R, and R denotes the number of beamformer filter coefficients in the subband domain. Thus, the output Yf 532 of the filter and sum unit 530 may be represented as the summation of each microphone signal filtered by its beamformer coefficient and summed up across the M microphones:
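In code form, a hedged sketch of this filter-and-sum operation might look like the following; the array shapes, filter length, and random values are purely illustrative and are not the robust coefficients described above.

```python
# Hedged sketch of the filter-and-sum operation described above: each
# microphone's subband signal is filtered by its beamformer coefficients
# H_BF,m(r) and the results are summed across the M microphones. Array
# shapes and values are illustrative, not the actual coefficients.
import numpy as np

M, R, K, N_FRAMES = 3, 4, 128, 100            # mics, taps, subbands, frames (assumed)
H_BF = np.random.randn(M, R) + 1j * np.random.randn(M, R)   # per-mic filter taps
X = np.random.randn(M, K, N_FRAMES) + 1j * np.random.randn(M, K, N_FRAMES)

def filter_and_sum(X, H_BF):
    """Return Yf[k, n] = sum_m sum_r H_BF[m, r] * X[m, k, n - r]."""
    M, K, N = X.shape
    Yf = np.zeros((K, N), dtype=complex)
    for m in range(M):
        for r in range(H_BF.shape[1]):
            Yf[:, r:] += H_BF[m, r] * X[m, :, :N - r]   # shift by r frames
    return Yf

Yf = filter_and_sum(X, H_BF)                  # boosted look-direction signal
```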
Turning once again to
As shown in
The output Z 520 of each nullformer 518 will be a boosted signal corresponding to a non-desired direction. As audio from a non-desired direction may include noise, each signal Z 520 may be referred to as a noise reference signal. Thus, for each channel 1 through P the adaptive noise canceller (ANC) unit 560 calculates a noise reference signal Z 520, namely Z1 520a through ZP 520p. Thus, the noise reference signals are acquired by spatially focusing towards the various noise sources in the environment and away from the desired look-direction. The noise reference signal for channel p may thus be represented as Zp(k,n), where Zp is calculated as follows:
where HNF,m(p,r) represents the nullformer coefficients for reference channel p.
As described above, the coefficients for the nullformer filters 612 are designed to form a spatial null toward the look-direction while focusing on other directions, such as directions of dominant noise sources. The output from the individual nullformers Z1 520a through ZP 520p thus represent the noise from channels 1 through P.
The individual noise reference signals may then be filtered by noise estimation filter blocks 522 configured with weights W to adjust how much each individual channel's noise reference signal should be weighted in the eventual combined noise reference signal Ŷ 525. The noise estimation filters (further discussed below) are selected to isolate the noise to be removed from output Yf 532. The individual channel's weighted noise reference signal ŷ 524 is thus the channel's noise reference signal Z multiplied by the channel's weight W. For example, ŷ1=Z1*W1, ŷ2=Z2*W2, and so forth. Thus, the combined weighted noise estimate Ŷ 525 may be represented as:
where Wp(k, n, l) is the l-th element of Wp(k, n) and l denotes the index for the filter coefficient in the subband domain. The noise estimates of the P reference channels are then added to obtain the overall noise estimate:
The combined weighted noise reference signal Ŷ 525, which represents the estimated noise in the audio signal, may then be subtracted from the FBF unit output Yf 532 to obtain a signal E 536, which represents the error between the combined weighted noise reference signal Ŷ 525 and the FBF unit output Yf 532. That error, E 536, is thus the estimated desired non-noise portion (e.g., target signal portion) of the audio signal and may be the output of the adaptive noise canceller (ANC) unit 560. That error, E 536, may be represented as:
E(k,n)=Y(k,n)−Ŷ(k,n) [5]
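A minimal sketch of this combine-and-subtract step, assuming illustrative shapes for the noise references and weights, is shown below.

```python
# Minimal sketch of the adaptive-noise-canceller combine-and-subtract step:
# each noise reference Z_p is weighted by its filter W_p, the weighted
# references are summed into Y_hat, and Y_hat is subtracted from the FBF
# output Yf to yield the error E (Equation [5]). Shapes are illustrative.
import numpy as np

P, K = 4, 128                                  # noise channels, subbands (assumed)
Yf = np.random.randn(K) + 1j * np.random.randn(K)          # FBF output for one frame
Z = np.random.randn(P, K) + 1j * np.random.randn(P, K)     # noise references Z_1..Z_P
W = np.random.randn(P, K) + 1j * np.random.randn(P, K)     # per-channel filter weights

y_hat_per_channel = W * Z                      # weighted noise reference per channel
Y_hat = y_hat_per_channel.sum(axis=0)          # combined noise estimate
E = Yf - Y_hat                                 # estimated target (noise removed)
```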
As shown in
where Zp(k, n)=[Zp(k, n) Zp(k, n−1) . . . Zp(k, n−L)]^T is the noise estimation vector for the p-th channel, μp(k,n) is the adaptation step-size for the p-th channel, and ε is a regularization factor to avoid indeterministic division. The weights may correspond to how much noise is coming from a particular direction.
As can be seen in Equation [6], the updating of the weights W involves feedback. The weights W are recursively updated by the weight correction term (the second half of the right hand side of Equation [6]), which depends on the adaptation step size, μp(k,n), which is a weighting factor adjustment to be added to the previous weighting factor for the filter to obtain the next weighting factor for the filter (to be applied to the next incoming signal). To ensure that the weights are updated robustly (to avoid, for example, target signal cancellation), the step size μp(k,n) may be modulated according to signal conditions. For example, when the desired signal arrives from the look-direction, the step-size is significantly reduced, thereby slowing down the adaptation process and avoiding unnecessary changes of the weights W. Likewise, when there is no signal activity in the look-direction, the step-size may be increased to a larger value so that weight adaptation continues normally. The step-size may be greater than 0, and may be limited to a maximum value. Thus, the device may be configured to determine when there is an active source (e.g., a speaking user) in the look-direction. The device may perform this determination with a frequency that depends on the adaptation step size.
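The following sketch illustrates a normalized, recursively updated weight of this kind with a step size modulated by look-direction activity; the specific normalization, constants, and modulation policy shown are assumptions rather than the exact update of Equation [6].

```python
# Hedged sketch of the recursive weight update discussed above (NLMS-style):
# the new weight adds a correction term scaled by the adaptation step size
# mu_p and normalized by the noise-reference power plus a regularization
# factor eps. The step-size modulation policy shown is an assumption.
import numpy as np

def update_weight(W_p, Z_p_vec, E, mu_p, eps=1e-8):
    """One recursive update of channel p's filter weights from error E."""
    norm = np.vdot(Z_p_vec, Z_p_vec).real + eps      # avoid indeterministic division
    return W_p + (mu_p / norm) * np.conj(Z_p_vec) * E

def modulate_step_size(look_direction_active, mu_nominal=0.1, mu_max=1.0):
    """Shrink the step size when the desired talker is active in the look-direction."""
    mu = mu_nominal * (0.01 if look_direction_active else 1.0)
    return min(max(mu, 1e-6), mu_max)                # keep step size > 0 and bounded

L = 8
W_p = np.zeros(L, dtype=complex)
Z_p_vec = np.random.randn(L) + 1j * np.random.randn(L)    # [Z_p(n) ... Z_p(n-L)]
E = 0.3 + 0.1j
W_p = update_weight(W_p, Z_p_vec, E, modulate_step_size(look_direction_active=False))
```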
The step-size controller 504 will modulate the rate of adaptation. Although not shown in
The BNR may be computed as:
where kLB denotes the lower bound for the subband range bin and kUB denotes the upper bound for the subband range bin under consideration, and δ is a regularization factor. Further, BYY(k,n) denotes the powers of the fixed beamformer output signal (e.g., output Yf 532) and NZZ,p(k,n) denotes the powers of the p-th nullformer output signals (e.g., the noise reference signals Z1 520a through ZP 520p). The powers may be calculated using first order recursive averaging as shown below:
BYY(k,n)=αBYY(k,n−1)+(1−α)|Y(k,n)|² [8]
NZZ,p(k,n)=αNZZ,p(k,n−1)+(1−α)|Zp(k,n)|² [9]
where α∈[0,1] is a smoothing parameter.
The BNR values may be limited to a minimum and maximum value as follows: BNRp(k, n)∈[BNRmin, BNRmax]; the BNR may then be averaged across the subband bins:
The above value may be smoothed recursively to arrive at the mean BNR value:
where β is a smoothing factor.
The mean BNR value may then be transformed into a scaling factor in the interval of [0,1] using a sigmoid transformation:
υ(n)=γ(BNRp(n)−σ) [12]
where γ and σ are tunable parameters that denote the slope (γ) and point of inflection (σ) for the sigmoid function.
Using Equation [11], the adaptation step-size for subband k and frame-index n is obtained as:
where μo is a nominal step-size. μo may be used as an initial step size, with the scaling factors and processes described above used to modulate the step size during processing.
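Putting the pieces of the step-size controller together, a hedged sketch of the chain from recursive power averaging through the BNR, smoothing, sigmoid transformation, and final step size might look like the following; all constants, and the exact mapping from the sigmoid output to the step size, are illustrative assumptions consistent with the behavior described above (the step size shrinks when the look-direction is active).

```python
# Hedged end-to-end sketch of the step-size controller chain described above:
# recursively averaged powers, beam-to-null ratio (BNR), clamping, smoothing,
# a sigmoid mapping to [0, 1], and the final step size. All constants are
# illustrative assumptions, not tuned values from the disclosure.
import numpy as np

ALPHA, BETA = 0.9, 0.9          # smoothing parameters (assumed)
GAMMA, SIGMA = 2.0, 1.5         # sigmoid slope and inflection point (assumed)
BNR_MIN, BNR_MAX = 0.0, 10.0
MU_NOMINAL = 0.1
DELTA = 1e-8                    # regularization factor

def smooth_power(prev_power, frame_value, alpha=ALPHA):
    """First-order recursive averaging, e.g. B_YY(k,n) = a*B_YY(k,n-1) + (1-a)*|Y|^2."""
    return alpha * prev_power + (1.0 - alpha) * np.abs(frame_value) ** 2

def step_size(B_YY, N_ZZ_p, prev_mean_bnr):
    """Return (mu, updated mean BNR) for one noise channel and frame."""
    bnr = np.clip(B_YY / (N_ZZ_p + DELTA), BNR_MIN, BNR_MAX)     # per-subband BNR
    mean_bnr = BETA * prev_mean_bnr + (1.0 - BETA) * bnr.mean()  # smooth across frames
    xi = 1.0 / (1.0 + np.exp(-GAMMA * (mean_bnr - SIGMA)))       # sigmoid to [0, 1]
    return MU_NOMINAL * (1.0 - xi), mean_bnr   # high BNR (target active) -> small step

B_YY = smooth_power(np.zeros(128), np.random.randn(128))
N_ZZ = smooth_power(np.zeros(128), np.random.randn(128))
mu, mean_bnr = step_size(B_YY, N_ZZ, prev_mean_bnr=0.0)
```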
At a first time period, audio signals from the microphones 112 may be processed as described above using a first set of weights for the filters 522. Then, the error E 536 associated with that first time period may be used to calculate a new set of weights for the filters 522, where the new set of weights is determined using the step size calculations described above. The new set of weights may then be used to process audio signals from the microphones 112 associated with a second time period that occurs after the first time period. Thus, for example, a first filter weight may be applied to a noise reference signal associated with a first audio signal for a first microphone/first direction from the first time period. A new first filter weight may then be calculated using the method above and the new first filter weight may then be applied to a noise reference signal associated with the first audio signal for the first microphone/first direction from the second time period. The same process may be applied to other filter weights and other audio signals from other microphones/directions.
The above processes and calculations may be performed across sub-bands k, across channels p and for audio frames n, as illustrated in the particular calculations and equations.
The estimated non-noise (e.g., output) audio signal E 536 may be processed by a synthesis filterbank 528 which converts the signal 536 into time-domain beamformed audio data Z 550 which may be sent to a downstream component for further operation. As illustrated in
As shown in
The beampattern 802 may exhibit a plurality of lobes, or regions of gain, with gain predominating in a particular direction designated the beampattern direction 804. A main lobe 806 is shown here extending along the beampattern direction 804. A main lobe beam-width 808 is shown, indicating a maximum width of the main lobe 806. In this example, the beampattern 802 also includes side lobes 810, 812, 814, and 816. Opposite the main lobe 806 along the beampattern direction 804 is the back lobe 818.
Disposed around the beampattern 802 are null regions 820. These null regions are areas of attenuation to signals. In some examples, a user may reside within the main lobe 806 and benefit from the gain provided by the beampattern 802, exhibiting an improved signal-to-noise ratio (SNR) compared to a signal acquired without beamforming. In contrast, if the user were to speak from a null region, the resulting audio signal may be significantly reduced. As shown in this illustration, the use of the beampattern provides for gain in signal acquisition compared to non-beamforming. Beamforming also allows for spatial selectivity, effectively allowing the system to “turn a deaf ear” on a signal which is not of interest. Beamforming may result in directional audio signal(s) that may then be processed by other components of the device 110 and/or system 100.
Additionally or alternatively, as the first device 110a and the second device 110b are connected via the first connection 124a, in some examples the neural sidelobe canceller 900 may receive raw microphone audio data 915 generated by first microphones 112a included in the first device 110a and second microphones 112b included in the second device 110b. For example, the neural sidelobe canceller 900 may receive raw microphone audio data 915 corresponding to four microphones 112 (e.g., two microphones 204a/205a from the first device 110a and two microphones 204b/205b from the second device 110b), six microphones 112 (e.g., three microphones 204a/205a/206a from the first device 110a and three microphones 204b/205b/206b from the second device 110b), and/or a combination thereof without departing from the disclosure.
The SDB component 920 may receive the raw microphone audio data 915 and process the raw microphone audio data 915 using a steering vector pointing to the desired enhancement direction. For example, the SDB component 920 may be a data-independent beamformer, with its beam pointing to the desired look direction (e.g., particular azimuth direction and/or elevation), and the SDB component 920 may emphasize or boost audio from the desired look direction to generate directional data 925. In some examples, the SDB component 920 may be configured to enhance sources in front of the user 5, although the disclosure is not limited thereto and the desired look direction may vary without departing from the disclosure. In contrast to fixed beamformers that generate multiple beamformed output signals, the SDB component 920 generates directional data 925 that only corresponds to the look direction (e.g., first direction) and does not include additional beamformed outputs associated with non-target directions. To perform target speech separation, the neural sidelobe canceller 900 assumes that the exact position of the target is unknown but within a directional field of view (FOV) of +/−30° azimuth, and that there may be multiple concurrent targets with small azimuth differences.
In some examples, the SDB component 920 may be configured to enhance sources in front of the user 5, such that the desired look direction (e.g., particular azimuth direction and/or elevation) is fixed at a first position relative to the user 5. Thus, the user 5 may select the desired enhancement direction (e.g., select target speech) by physically moving the user's head, enabling the user 5 to distinguish between different speakers based on the physical orientation of the device 110. However, the disclosure is not limited thereto, and in other examples the device 110 may be configured to receive input data indicating the desired look direction from the third device 122 and/or other devices without departing from the disclosure. For example, the third device 122 may include a user interface that enables the user 5 to dynamically select the desired look direction. Thus, the user 5 may control the desired look direction to select between different speakers without physically moving the user's head, without departing from the disclosure.
The neural sidelobe canceller 900 uses the directional data 925 as directional cues to select between different speakers (e.g., users speaking, speech sources, etc.). For example, the encoder component 930 may receive the raw microphone audio data 915 and the directional data 925 and may combine the raw microphone audio data 915 and the directional data 925 to generate input signals for the separation stage. In some examples, the encoder component 930 may concatenate the directional data 925 to the raw microphone audio data 915 to generate the input signals, although the disclosure is not limited thereto.
As described above, the device 110 may include two or more microphones 112 and may generate the raw microphone audio data 915 using the first number of microphones to generate the first number of input signals (e.g., C input signals). For example, a first microphone 112a may generate first microphone audio data z1(t) in the time-domain, a second microphone may generate second microphone audio data z2(t) in the time-domain, and so on until the final microphone 112C may generate final microphone audio data zC(t). In some examples, a time-domain signal may be represented as microphone audio data z(t), which is comprised of a sequence of individual samples of audio data. Thus, z(t) denotes an individual sample that is associated with a time index t, where T is the total number of audio samples (e.g., total number of time indexes).
While the microphone audio data z(t) is comprised of a plurality of samples, in some examples the device 110 may group a plurality of samples and process them together. For example, the device 110 may group a number of samples together in a frame (e.g., audio frame) to generate microphone audio data z(n). Thus, z(n) corresponds to the time-domain signal and identifies an individual frame (e.g., fixed number of samples s) associated with a frame index k, where K is the total number of audio frames (e.g., total number of frame indexes).
In the example illustrated in
In contrast, the SDB component 920 may process the raw microphone audio data 915 using short frames and may represent the directional data 925 using frame indexes. For example, the SDB component 920 may process 40 audio samples per audio frame, which corresponds to 2.5 ms, although the disclosure is not limited thereto. Thus, the directional data 925 may be represented as a two-dimensional array (1×K), with a first dimension (e.g., 1) corresponding to a single row, and a second dimension (e.g., K) corresponding to the number of columns and indicating a total number of frame indexes.
The encoder component 930 may also process short frames of audio data and may therefore represent the combined input signals using frame indexes. For example, as the raw microphone audio data 915 corresponds to the first number of microphones (e.g., C input signals) and the directional data 925 corresponds to a single input signal, the encoder component 930 may represent the combined input signals as a two-dimensional array (C+1×K), with a first dimension (e.g., C+1) corresponding to a total number of rows, and a second dimension (e.g., K) corresponding to the number of columns and indicating the total number of frame indexes.
The encoder component 930 may be configured to transform the audio frames of the input signals into features for the separation stage. Unlike some multi-channel speech separation approaches, the encoder component 930 does not use spatial features such as normalized cross-correlation, inter-channel phase differences, and/or the like. Instead, the encoder component 930 is fully-trainable and configured to learn the optimal representation of the input signals and develop directional selectivity/blocking, while maintaining the small input frame size. As illustrated in
As the encoder component 930 applies the first number of filters using the (C+1, L) kernel, the encoder component 930 generates first feature data 935 that may represent the features as a three-dimensional array (C+1×K×N), with a first dimension (e.g., C+1) corresponding to a total number of rows (e.g., total number of input signals), a second dimension (e.g., K) corresponding to the number of columns and indicating the total number of frame indexes, and a third dimension (e.g., N) corresponding to a depth and indicating a total number of channels. Thus, the encoder component 930 may map a segment of the input signals to a high-dimensional representation that may be used by the separation stage to estimate mask data corresponding to the target speech. As the neural sidelobe canceller 900 uses an encoder-separation-decoder architecture, all of the components between the encoder component 930 and a decoder component 990 may be considered part of the separation stage.
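A hedged PyTorch sketch of this encoder stage is shown below; the microphone count, frame length, filter count, and tensor layout are illustrative assumptions, and the actual encoder may arrange its dimensions differently.

```python
# Hedged sketch (PyTorch) of the encoder stage described above: the SDB
# output is stacked with the C raw microphone channels, and a 2-D
# convolution with a (C+1, L) kernel, N filters, 50% frame overlap, and a
# ReLU produces a high-dimensional feature representation. Exact shapes and
# hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn

C, L, N = 3, 40, 512                 # mics, frame length (2.5 ms @ 16 kHz), filters
T = 16000                            # one second of audio (assumed)

encoder = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=N,
              kernel_size=(C + 1, L), stride=(1, L // 2), bias=False),
    nn.ReLU(),                       # keep only the positive part of the features
)

raw_mics = torch.randn(1, C, T)                  # raw microphone audio data
sdb_out = torch.randn(1, 1, T)                   # SDB directional data (same length)
stacked = torch.cat([raw_mics, sdb_out], dim=1)  # (1, C+1, T)
features = encoder(stacked.unsqueeze(1))         # (1, N, 1, K) with K ~= 2*T/L - 1
print(features.shape)
```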
The separation stage is configured to predict mask data 975 with which to apply to the output of the encoder component 930 (e.g., first feature data 935) to retrieve clean, anechoic target speech. Using the mask data 975, the neural sidelobe canceller 900 may generate the enhanced target speech signal while cancelling sidelobes, noise, and reverberation.
Inputs to the separation stage are processed through a bottleneck component 940, which may be configured to reduce the number of channels from a first number of channels (e.g., 512) to a second number of channels (e.g., 128). For example, the bottleneck component 940 may be implemented as a 1×1 pointwise convolutional block that is configured to remap N=512 to B=128 channels, although the disclosure is not limited thereto. Thus, the bottleneck component 940 may process the first feature data 935 to generate second feature data that may represent the features as a three-dimensional array (C+1×K×B), with the first dimension (e.g., C+1) and the second dimension (e.g., K) described above, and a third dimension (e.g., B) limiting a total number of channels processed by temporal convolutional network (TCN) blocks 950 and causal-fastformer (C-FF) blocks 960.
As illustrated in
Functionally, the C-FF block 960 is analogous to dot-product attention, but its frame-wise computational complexity is independent of the sequence length. This property is particularly desirable for a low-latency use case with a large number of small frames (K) in the sequence. The causality of the C-FF is achieved by selectively masking the input, so that the module integrates only the most recent 200 frames (e.g., 250 ms) to compute its output at a given time step. The motivation for using such short-sighted self-attention is to reinforce local consistency of speech, which may be disrupted by the long-range relations captured in the TCN blocks 950. The addition of the C-FF blocks 960 does not greatly increase the model size or computational complexity of the neural sidelobe canceller 900, because all transformations performed in the C-FF block 960 are implemented as fully-connected layers and the only other trainable parameter is W (size d).
A single C-FF block 960 follows each TCN block 950 in the separation stage. For example, as the neural sidelobe canceller 900 processes the second feature data using three TCN blocks 950a-950c, the neural sidelobe canceller 900 also includes three C-FF blocks 960a-960c. After processing the second feature data using all three TCN blocks 950a-950c and all three C-FF blocks 960a-960c to generate third feature data, a mask component 970 may process the third feature data to generate mask data 975. For example, the mask component 970 may receive the third feature data output by the third C-FF block 960c and may process the third feature data using a Parametric Rectified Linear Unit (e.g., PReLU) activation layer, a 1×1 pointwise convolutional block that is configured to increase the number of channels, and a Rectified Linear Unit (ReLU) activation layer.
The ReLU activation layer may be represented as:
Such that negative values are ignored and only the positive part of the feature data is passed. Similarly, the PReLU activation layer may be represented as:
where α is a trainable scalar controlling the negative slope of the rectifier. Thus, while the ReLU activation layer ignores negative values by setting them equal to zero, the PReLU activation layer allows a small gradient when the unit is not active and makes the coefficient of leakage a parameter that is learned during training.
While the bottleneck component 940 may be configured to reduce the number of channels (e.g., remap N=512 to B=128 channels), the 1×1 pointwise convolutional block included in the mask component 970 may be configured to increase the number of channels back to the number of filters included in the encoder component 930 and/or the decoder component 990. For example, the 1×1 pointwise convolutional block may remap B=128 to N=512 channels, although the disclosure is not limited thereto. Thus, the mask component 970 may process the third feature data to generate the mask data 975, which may represent the features as a three-dimensional array (C+1×K×N).
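A hedged PyTorch sketch of the bottleneck, mask head, and mask application might look like the following; treating the features as a (batch, channels, frames) tensor is an illustrative simplification of the shapes described above.

```python
# Hedged sketch (PyTorch) of the bottleneck and mask components described
# above: a 1x1 pointwise convolution remaps N=512 to B=128 channels for the
# separation stage, and the mask head applies PReLU, a 1x1 convolution back
# up to N channels, and ReLU to produce a non-negative mask. Treating the
# features as (batch, channels, frames) is an illustrative simplification.
import torch
import torch.nn as nn

N, B, K = 512, 128, 799              # encoder filters, bottleneck channels, frames

bottleneck = nn.Conv1d(N, B, kernel_size=1)      # remap N -> B channels
mask_head = nn.Sequential(
    nn.PReLU(),                                   # f(x) = max(0, x) + a * min(0, x)
    nn.Conv1d(B, N, kernel_size=1),               # remap B -> N channels
    nn.ReLU(),                                    # f(x) = max(0, x): non-negative mask
)

first_feature_data = torch.randn(1, N, K)         # encoder output (assumed layout)
second_feature_data = bottleneck(first_feature_data)
# ... TCN and C-FF blocks would process second_feature_data here ...
mask = mask_head(second_feature_data)             # same shape as first_feature_data
masked_features = first_feature_data * mask       # combiner 980: element-wise product
```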
After generating the mask data 975, the neural sidelobe canceller 900 may apply the mask to the first feature data 935 to generate masked feature data 985, which is a feature representation of the target speech. For example, a combiner component 980 may multiply the first feature data 935 by the mask data 975 to identify portions of the first feature data 935 that correspond to the target speech and remove portions of the first feature data 935 that correspond to non-target speech, acoustic noise, and/or the like. Thus, the masked feature data 985 includes the first number of channels (e.g., N=512), with nonzero portions of the masked feature data 985 representing the target speech (e.g., based on a corresponding filter of the N filters).
The decoder component 990 may receive the masked feature data 985 and may reconstruct frames of the masked encoder output back to the time domain. For example, the decoder component 990 may perform a transpose 1D convolution to generate reconstructed frames and then may perform an overlap and add operation using the reconstructed frames to generate output audio data 995 representing the target speech. The output audio data 995 may be anechoic, with sidelobes, noise, and reverberation cancelled.
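A hedged PyTorch sketch of this decoder stage is shown below; the transposed convolution performs the overlap-and-add implicitly, and the parameter values are assumptions matching the earlier encoder sketch.

```python
# Hedged sketch (PyTorch) of the decoder stage: a 1-D transposed convolution
# with the same frame length and 50% overlap as the encoder maps the masked
# features back to time-domain samples, with the overlap-and-add performed
# implicitly by the transposed convolution. Parameters are assumptions.
import torch
import torch.nn as nn

N, L, K = 512, 40, 799               # filters, frame length, frame count (assumed)

decoder = nn.ConvTranspose1d(in_channels=N, out_channels=1,
                             kernel_size=L, stride=L // 2, bias=False)

masked_features = torch.randn(1, N, K)            # masked encoder output
output_audio = decoder(masked_features)           # (1, 1, T) separated target speech
print(output_audio.shape)                         # T = (K - 1) * L//2 + L samples
```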
The objective of training the end-to-end system is maximizing the scale-invariant source-to-noise ratio (SI-SNR), which is defined as:
where ŝ ∈ ℝ^(1×T) and s ∈ ℝ^(1×T) are the estimated and original clean sources, respectively, and ∥s∥² = ⟨s, s⟩ denotes the signal power. Scale invariance is ensured by normalizing ŝ and s to zero-mean prior to the calculation. In some examples, utterance-level permutation invariant training (uPIT) may be applied during training to address the source permutation problem.
Thus, the neural sidelobe canceller 900 may train the end-to-end system using gated SI-SNR−L1 loss defined as:
where s corresponds to the anechoic target speech, ŝ represents model output, and α is a constant used to balance magnitudes of the two components of the gated loss (e.g., α=100, although the disclosure is not limited thereto). Such defined loss allows the neural sidelobe canceller 900 to handle cases in which there is no target talker to reconstruct (ntarget=0). While the loss function above illustrates an example in which the system is trained to maximize the SI-SNR, the disclosure is not limited thereto and the loss function may maximize other parameters, such as a source-to-distortion ratio (SDR) and/or the like, without departing from the disclosure.
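For reference, a hedged sketch of the SI-SNR computation itself (without the gating or L1 term of the full training loss) might look like the following.

```python
# Hedged sketch of the SI-SNR objective: both signals are normalized to zero
# mean, the estimate is projected onto the clean source to obtain s_target,
# and the ratio of target power to residual power is taken in dB. The gating
# and L1 combination of the full training loss are not reproduced here.
import torch

def si_snr(estimate: torch.Tensor, source: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant source-to-noise ratio in dB for 1-D tensors of length T."""
    estimate = estimate - estimate.mean()          # zero-mean for scale invariance
    source = source - source.mean()
    s_target = (torch.dot(estimate, source) / (torch.dot(source, source) + eps)) * source
    e_noise = estimate - s_target
    ratio = torch.dot(s_target, s_target) / (torch.dot(e_noise, e_noise) + eps)
    return 10.0 * torch.log10(ratio + eps)

s = torch.randn(16000)                             # clean anechoic target (example)
s_hat = s + 0.1 * torch.randn(16000)               # model output (example)
loss = -si_snr(s_hat, s)                           # maximize SI-SNR by minimizing -SI-SNR
```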
Various machine learning techniques may be used to train the neural sidelobe canceller 900 without departing from the disclosure. Models may be trained and operated according to various machine learning techniques. Such techniques may include, for example, neural networks (such as deep neural networks and/or recurrent neural networks), inference engines, trained classifiers, etc. Examples of trained classifiers include conditional random fields (CRF) classifiers, Support Vector Machines (SVMs), neural networks, decision trees, AdaBoost (short for “Adaptive Boosting”) combined with decision trees, and random forests.
In order to apply the machine learning techniques, the machine learning processes themselves need to be trained. Training a machine learning component such as, in this case, one of the first or second models, requires establishing a “ground truth” for the training examples. In machine learning, the term “ground truth” refers to the accuracy of a training set's classification for supervised learning techniques. For example, known types for previous queries may be used as ground truth data for the training set used to train the various components/models. Various techniques may be used to train the models including backpropagation, statistical learning, supervised learning, semi-supervised learning, stochastic learning, stochastic gradient descent, or other known techniques. Thus, many different training examples may be used to train the model(s) discussed herein. Further, as training data is added to, or otherwise changed, new models may be trained to update the models without departing from the disclosure.
As illustrated in
Additionally or alternatively, the neural sidelobe canceller 900 may include only a single C-FF block 960 without departing from the disclosure. As illustrated in
The TCN block 950 is composed of X dilated, causal 1D convolutions implemented as depthwise separable convolution. As illustrated in
In some examples, the TCN block 950 may comprise eight 1-D convolutional blocks (e.g., X=8), such that the final (e.g., eighth) 1-D convolutional block 1030 receives seventh data and has a dilation factor of 128 (e.g., 2^(8−1)=2^7=128), although the disclosure is not limited thereto. The input to each block is zero padded accordingly to ensure that the output length is the same as the input length (e.g., the output data has the same dimensions as the input data). An architecture for each of the 1-D convolutional blocks included in the TCN block 950 is illustrated in
As illustrated in
As described above, the bottleneck component 940 reduces the number of channels for the TCN block 950 by remapping from N channels to B channels (e.g., remapping from N=512 to B=128, although the disclosure is not limited thereto). Thus, the bottleneck component 940 determines the number of channels (e.g., B channels) in the input and residual path of each 1-D convolutional block. For example, for a 1-D convolutional block with H channels and kernel size P, the size of the kernel in the 1×1 convolutional block 1110 and the depthwise convolutional (D-conv) block 1140 should be O ∈ ℝ^(B×H×1) and K ∈ ℝ^(H×P), respectively, and the size of the kernel in the residual path should be L ∈ ℝ^(H×B×1).
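A hedged PyTorch sketch of one such dilated 1-D convolutional block, and of stacking X=8 of them with exponentially increasing dilation, is shown below; normalization layers and any separate skip-connection path are omitted for brevity.

```python
# Hedged sketch (PyTorch) of one dilated 1-D convolutional block of the TCN:
# a 1x1 convolution expands B -> H channels, a dilated depthwise convolution
# with causal (left-only) zero padding covers a growing receptive field, and
# a 1x1 convolution on the residual path maps H -> B channels so the block
# output can be added to its input. Normalization layers are omitted.
import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    def __init__(self, B=128, H=256, P=3, dilation=1):
        super().__init__()
        self.pad = (P - 1) * dilation                  # zero-pad on the left only
        self.pointwise_in = nn.Conv1d(B, H, kernel_size=1)
        self.depthwise = nn.Conv1d(H, H, kernel_size=P,
                                   dilation=dilation, groups=H)  # depthwise separable
        self.prelu = nn.PReLU()
        self.residual_out = nn.Conv1d(H, B, kernel_size=1)

    def forward(self, x):                              # x: (batch, B, frames)
        y = self.prelu(self.pointwise_in(x))
        y = nn.functional.pad(y, (self.pad, 0))        # keep output length == input length
        y = self.prelu(self.depthwise(y))
        return x + self.residual_out(y)                # residual connection

# Stack X = 8 blocks with exponentially increasing dilation (1, 2, 4, ..., 128).
tcn = nn.Sequential(*[CausalConvBlock(dilation=2 ** i) for i in range(8)])
out = tcn(torch.randn(1, 128, 799))
```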
As described above, the C-FF block 960 improves upon previous attention models at least by adding causality. For example, the causality of the C-FF block 960 is achieved by selectively masking the input, so that the C-FF block 960 integrates only the most recent 200 frames (e.g., 250 ms) to compute its output at a given time step. To properly describe the layers and functionality of the C-FF block 960, the following description refers to two previous iterations of the attention model (e.g., a transformer model and a fastformer model).
A first iteration of the attention model (e.g., transformer model) was built upon self-attention, which can effectively model the contexts within a sequence by capturing the interactions between inputs at each pair of positions. For example, self-attention calculates a weighted average of feature representations with the weight proportional to a similarity score between pairs of representations. Formally, an input sequence X ∈ ℝ^(N×d) of sequence length N and depth d (e.g., N tokens of dimension d) is projected using three matrices WQ ∈ ℝ^(d×d), WK ∈ ℝ^(d×d), and WV ∈ ℝ^(d×d) to extract feature representations Q, K, and V. These feature representations are referred to as queries (e.g., query matrix), keys (e.g., key matrix), and values (e.g., value matrix), and the outputs Q, K, V are computed as:
Q=XWQ, K=XWK, V=XWV. [18]
where WQ, WK, and WV are learnable parameters that represent linear transformation parameter matrices. Thus, self-attention can be written as:
where softmax denotes a row-wise softmax normalization function (e.g., the softmax function is applied row-wise to QK^T). Thus, each element in S depends on all other elements in the same row.
Equation [19] implements a specific form of self-attention called softmax attention, where the similarity score is an exponential of the dot product between a query matrix and a key matrix. Given that subscripting a matrix with i returns the i-th row as a vector, a generalized attention equation for any similarity function can be written as:
Equation [20] is equivalent to Equation [19] if the system substitutes the similarity function with
Thus, the transformer model processed three inputs (e.g., a query matrix, a key matrix, and a value matrix) using a sequence length N and a depth d (e.g., hidden dimension in each attention head). However, the computational complexity of the transformer model was quadratic in the sequence length N, which results in extreme inefficiency for long sequences.
A second iteration of the attention model (e.g., fastformer model) uses additive attention to model global contexts and uses element-wise products to model the interaction between each input representation and the global contexts, which can greatly reduce the computational cost while still effectively capturing contextual information. For example, the fastformer model uses an additive attention mechanism to summarize the query sequence into a global query vector, then models the interaction between the global query vector and the attention keys with an element-wise product and summarizes the keys into a global key vector via additive attention. The fastformer model then models the interactions between the global key and the attention values via element-wise product, uses a linear transformation to learn global context-aware attention values, and finally adds them with the attention query to form the final output. Thus, the computational complexity can be reduced to linearity, and the contextual information in the input sequence can be effectively captured.
The fastformer model first transforms the input embedding matrix into the query, key, and value sequences. The input matrix is denoted as X ∈ ℝ^(N×d), where N is the sequence length and d is the hidden dimension. Its subordinate vectors are denoted as [x1, x2, . . . , xN]. Following the transformer model, in each attention head there are three independent linear transformation layers to transform the input matrix into a query matrix, a key matrix, and a value matrix, which are written as Q=[q1, q2, . . . , qN], K=[k1, k2, . . . , kN], and V=[v1, v2, . . . , vN], respectively. However, modeling the contextual information of the input sequence based on the interactions among these three matrices is computationally prohibitive (e.g., its quadratic complexity makes it inefficient in long sequence modeling), so the fastformer model reduces computational complexity by summarizing the attention matrices (e.g., query) before modeling their interactions. Additive attention is a form of attention mechanism that can efficiently summarize important information within a sequence in linear complexity. Thus, the fastformer model uses additive attention to summarize the query matrix into a global query vector q ∈ ℝ^d, which condenses the global contextual information in the attention query. More specifically, the attention weight αi of the i-th query vector is computed as follows:
where wq is a learnable parameter vector. The global attention query vector is computed as follows:
In addition, the fastformer model uses the element-wise product between the global query vector and each key vector to model their interactions and combine them into a global context-aware key matrix. Thus, the i-th vector in this matrix is pi, which is formulated as pi=q*ki, where * denotes an element-wise product. In a similar manner, the additive attention mechanism is used to summarize the global context-aware key matrix, such that the additive attention weight of its i-th vector is computed as:
where wk is the attention parameter vector. The global key vector k ∈ ℝ^d is computed as follows:
Finally, the fastformer model attempts to model the interaction between the attention value matrix and the global key vector for better context modeling. Similar to the query-key interaction modeling, the fastformer model performs an element-wise product between the global key and each value vector to compute a key-value interaction vector ui, which is formulated as ui=k*vi. The fastformer model also applies a linear transformation layer to each key-value interaction vector to learn its hidden representation. The output matrix from this layer is denoted as R=[r1, r2, . . . , rN] ∈ ℝ^(N×d). This matrix is further added together with the query matrix to form the final output of the fastformer model.
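A hedged NumPy sketch of this (non-causal) fastformer-style additive attention for a single head is shown below; the 1/√d scaling inside the additive attention is an assumption carried over from common practice and is not specified above.

```python
# Hedged sketch of the fastformer-style additive attention summarized above:
# additive attention condenses the query matrix into a global query vector,
# an element-wise product with each key vector and a second additive
# attention yield a global key vector, and the key-value interaction is
# linearly transformed and added back to the query. Plain NumPy, one head.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(M, w):
    """Summarize sequence M (N x d) into one d-vector using weights softmax(M @ w / sqrt(d))."""
    d = M.shape[1]
    alpha = softmax(M @ w / np.sqrt(d))        # attention weight per position
    return alpha @ M                            # weighted sum over the sequence

N, d = 200, 64                                  # sequence length, hidden dimension (assumed)
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))                 # input sequence
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
w_q, w_k = rng.standard_normal(d), rng.standard_normal(d)
W_out = rng.standard_normal((d, d))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
q_global = additive_attention(Q, w_q)           # global query vector
P = q_global * K                                # query-key interaction (element-wise)
k_global = additive_attention(P, w_k)           # global key vector
U = k_global * V                                # key-value interaction (element-wise)
output = U @ W_out + Q                          # transform and add back to the query
```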
In addition, the C-FF block 960 can be made causal by masking the attention computation such that the i-th position can be influenced by a position j only if j≤i; namely, a position cannot be influenced by subsequent positions. As a result of the causal masking, along with sharing the same values between the query transformation parameters and the value transformation parameters, Equation [20] can be rewritten as follows:
As illustrated in
As described above with regard to the fastformer, additive attention is a form of attention mechanism that can efficiently summarize important information within a sequence in linear complexity. Thus, the fastformer model uses additive attention to summarize the query matrix into a global query vector q ∈ ℝ^d, which condenses the global contextual information in the attention query. Similarly, the fastformer model determines the element-wise product between the global query vector and each key vector and then uses additive attention to summarize the key matrix into a global key vector k ∈ ℝ^d.
The C-FF block 960 operates in a similar manner, except that the C-FF block 960 generates the global query matrix 1235 and the global key matrix 1255 by applying causal attention (CA) layers 1230/1250, which will be described in greater detail below with regard to
The C-FF block 960 may include a first combiner 1240 configured to perform an element-wise product between the global query matrix 1235 and the key matrix 1225 to generate a first product matrix 1245, denoted as [P1, P2, . . . , PN]. The C-FF block 960 may process the product matrix 1245 using a second CA layer 1250 to generate the global key matrix 1255, denoted as [K1, K2, . . . , KN].
The C-FF block 960 may include a second combiner 1260 configured to perform an element-wise product between the global key matrix 1255 and the query matrix 1215 (e.g., in place of the value vector) to generate a second product matrix (not illustrated). The C-FF block 960 also includes an output transform layer 1270 configured to perform a linear transformation to the second product matrix to learn its hidden representation and generate a transform matrix 1275, denoted as R=[r1, r2, . . . , rN]. Finally, a summing layer 1280 is configured to add the transform matrix 1275 to the input data 1205 to form the output data 1285 of the C-FF block 960, denoted as [y1, y2, . . . , yN].
As described above, the causal fastformer (C-FF) block 960 is analogous to dot-product attention, but its frame-wise computational complexity is independent of the sequence length. This property is particularly desirable for a low-latency use case with a large number of small frames (K) in the sequence. The causality of the C-FF is achieved by selectively masking the input, so that the module integrates only the most recent 200 frames (e.g., 250 ms) to compute its output at a given time step. The motivation for using such short-sighted self-attention is to reinforce local consistency of speech, which may be disrupted by the long-range relations captured in the TCN blocks 950. The addition of the C-FF blocks 960 does not greatly increase the model size or computational complexity of the neural sidelobe canceller 900, because all transformations performed in the C-FF block 960 are implemented as fully-connected layers and the only other trainable parameter is W (size d).
to generate the unnormalized weights 1315, although the disclosure is not limited thereto.
The first CA layer 1230 may include a first combiner 1320 configured to combine the unnormalized weights 1315 and the query matrix 1215 to generate a weighted query matrix 1325. For example, the first combiner 1320 may perform an element-wise product between the unnormalized weights 1315 and the query matrix 1215, although the disclosure is not limited thereto.
The first CA layer 1230 includes a first fixed look-back cumulative sum block 1330 configured to process the weighted query matrix 1325 to generate a numerator 1335, denoted as [N1, N2, . . . , NN]. For example, the first fixed look-back cumulative sum block 1330 may be configured to integrate the most recent 200 frames (e.g., 250 ms) of the weighted query matrix 1325 to compute the numerator 1335. Similarly, the first CA layer 1230 includes a second fixed look-back cumulative sum block 1340 configured to process the unnormalized weights 1315 to generate a denominator 1345, denoted as [D1, D2, . . . , DN]. For example, the second fixed look-back cumulative sum block 1340 may be configured to integrate the most recent 200 frames (e.g., 250 ms) of the unnormalized weights 1315 to compute the denominator 1345.
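A fixed look-back cumulative sum can be obtained from an ordinary cumulative sum by subtracting the values that have fallen out of the window. The NumPy sketch below follows the numerator/denominator structure described above, with illustrative shapes and randomly generated inputs in place of the actual unnormalized weights and weighted query matrix.

import numpy as np

def fixed_lookback_cumsum(x, window=200):
    """Running sum over only the `window` most recent frames (along axis 0)."""
    c = np.cumsum(x, axis=0)
    out = c.copy()
    out[window:] = c[window:] - c[:-window]   # drop contributions older than the window
    return out

rng = np.random.default_rng(0)
weights = np.exp(rng.standard_normal((1000, 1)))            # unnormalized weights (positive)
weighted_query = weights * rng.standard_normal((1000, 128)) # weighted query matrix

numerator = fixed_lookback_cumsum(weighted_query)           # first cumulative sum block
denominator = fixed_lookback_cumsum(weights)                # second cumulative sum block
global_query = numerator * (1.0 / denominator)              # reciprocal + element-wise product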
Finally, a reciprocal layer 1350 may determine a reciprocal of the denominator 1345, and the first CA layer 1230 may include a second combiner 1360 configured to combine the numerator 1335 and the reciprocal of the denominator 1345 to generate the global query matrix 1235. For example, the second combiner 1360 may perform an element-wise product between the numerator 1335 and the reciprocal of the denominator 1345, although the disclosure is not limited thereto. The overall processing performed by the first CA layer 1230 corresponds to the causal attention equation, and may be expressed as:
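A plausible form of this expression, assembled from the unnormalized weights, the two fixed look-back sums, and the reciprocal described above (the notation and the exact definition of the weights α_j are assumptions), is:

\bar{q}_i \;=\; \frac{\sum_{j=\max(1,\, i-199)}^{i} \alpha_j\, q_j}{\sum_{j=\max(1,\, i-199)}^{i} \alpha_j}

where q_j denotes the j-th row of the query matrix 1215, α_j denotes the corresponding unnormalized weight 1315, and both sums run over the fixed look-back window of the 200 most recent frames.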
While
A combiner 980 may combine the first feature data 935 and the mask data 975 to generate masked feature data 985 (e.g., masked encoder output) that represents enhanced target speech. In some examples, the combiner 980 masks the first feature data 935 based on the values represented in the mask data 975. For example, the mask data 975 may have the constraint that m∈[0,1], such that a first value (e.g., m=1) in the mask data 975 passes a corresponding portion of the first feature data 935, a second value (e.g., m=0) in the mask data 975 blocks a corresponding portion of the first feature data 935, and a third value between the second value and the first value (e.g., 0<m<1) in the mask data 975 attenuates a corresponding portion of the first feature data 935 based on the third value.
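For example, the masking may amount to an element-wise multiplication, as in the following sketch (PyTorch, illustrative shapes only):

import torch

features = torch.randn(512, 799)     # first feature data 935 (illustrative shape)
mask = torch.rand(512, 799)          # mask data 975, constrained to m in [0, 1]

masked_features = features * mask    # m=1 passes, m=0 blocks, 0<m<1 attenuates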
The decoder component 990 may receive the masked feature data 985 and may include a 1-D transposed convolutional block 1440 and an overlap and add block 1450 that are collectively configured to process the masked feature data 985 to generate the output audio data 995. Thus, the decoder component 990 may reconstruct frames of the masked encoder output back to the time domain and the reconstructed frames may be overlapped and added to obtain the separated target speech stream with sidelobes, noise, and reverberation cancelled. In some examples, the 1-D transposed convolutional block 1440 may be configured to receive the masked feature data 985 having first dimensions (C+1×K×N) and may generate reconstructed output data having second dimensions (K×L×1). The overlap and add block 1450 may then be configured to process the reconstructed output data to generate the output audio data 995 having third dimensions (1×T×1), although the disclosure is not limited thereto.
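A minimal sketch of such a decoder in PyTorch is shown below; the tensor layout is simplified to (batch, channels, frames), and the overlap-and-add step is folded into the strided transposed convolution rather than implemented as a separate block, so the exact dimensions quoted above do not apply literally.

import torch
import torch.nn as nn

N, L, K = 512, 40, 800                       # feature channels, frame length (samples), frames
deconv = nn.ConvTranspose1d(in_channels=N, out_channels=1,
                            kernel_size=L, stride=L // 2)   # stride L/2 = 50% frame overlap

masked_features = torch.randn(1, N, K)       # masked encoder output
waveform = deconv(masked_features)           # transposed convolution with stride L/2
                                             # overlaps and adds the reconstructed frames
print(waveform.shape)                        # (1, 1, (K - 1) * L // 2 + L) samples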
As most of the components illustrated in
The 2-D convolution block 1520 may be configured to perform 2-D convolution with a (C+1, L) kernel, where L=40 indicates that the audio frame includes 40 audio samples or time indexes (e.g., 2.5 ms audio frame), C+1 reflects that there are C input channels from the raw microphone audio data 915 and one input channel from the SDB component 920, and a stride of (L/2, 0) (e.g., 50% overlap between consecutive frames). The 2-D convolution block 1520 may apply a first number of filters (e.g., N=512), and may be followed by the ReLU activation layer 1530 configured to only pass the positive values as the first feature data 935.
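A sketch of such an encoder in PyTorch is shown below; the number of microphone channels C, the 16 kHz sample rate implied by a 2.5 ms frame of 40 samples, and the (batch, 1, channels, samples) tensor layout are assumptions, and a time-axis stride of L/2 stands in for the (L/2, 0) stride quoted above.

import torch
import torch.nn as nn

C, L, N = 2, 40, 512          # C microphone channels (assumed), frame length, filters
T = 16000                     # one second of audio at an assumed 16 kHz sample rate

encoder = nn.Sequential(
    # The kernel spans all C+1 input channels and L time samples, hopping L/2
    # samples along the time axis (50% overlap between consecutive frames).
    nn.Conv2d(in_channels=1, out_channels=N, kernel_size=(C + 1, L), stride=(1, L // 2)),
    nn.ReLU(),                # pass only the positive values as the feature data
)

x = torch.randn(1, 1, C + 1, T)      # (batch, 1, channels, samples)
features = encoder(x)                # (batch, N, 1, K) with K approximately 2T/L frames
print(features.shape)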
While
The bottleneck component 940 may receive the first feature data 935 and may include a 1×1 (pointwise) convolutional block 1540 configured to reduce the number of channels from the first number of channels (e.g., 512) to the second number of channels (e.g., 128). Thus, the bottleneck component 940 may remap N=512 to B=128 channels, although the disclosure is not limited thereto.
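For example, the channel reduction may be implemented as a pointwise convolution, sketched below in PyTorch with the channel counts given above and an illustrative frame count:

import torch
import torch.nn as nn

bottleneck = nn.Conv1d(in_channels=512, out_channels=128, kernel_size=1)  # N=512 -> B=128

features = torch.randn(1, 512, 799)        # first feature data (illustrative frame count)
reduced = bottleneck(features)             # (1, 128, 799)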
The system 100 may include one or more controllers/processors 1604 that may each include a central processing unit (CPU) for processing data and computer-readable instructions, and a memory 1606 for storing data and instructions. The memory 1606 may include volatile random access memory (RAM), non-volatile read only memory (ROM), non-volatile magnetoresistive (MRAM) and/or other types of memory. The system 100 may also include a data storage component 1608, for storing data and controller/processor-executable instructions (e.g., instructions to perform operations discussed herein). The data storage component 1608 may include one or more non-volatile storage types such as magnetic storage, optical storage, solid-state storage, etc. The system 100 may also be connected to removable or external non-volatile memory and/or storage (such as a removable memory card, memory key drive, networked storage, etc.) through the input/output device interfaces 1602.
Computer instructions for operating the system 100 and its various components may be executed by the controller(s)/processor(s) 1604, using the memory 1606 as temporary “working” storage at runtime. The computer instructions may be stored in a non-transitory manner in non-volatile memory 1606, storage 1608, and/or an external device. Alternatively, some or all of the executable instructions may be embedded in hardware or firmware in addition to or instead of software.
The system may include input/output device interfaces 1602. A variety of components may be connected through the input/output device interfaces 1602, such as the loudspeaker(s) 114/202, the microphone(s) 112/204/205/206, and a media source such as a digital media player (not illustrated). The input/output interfaces 1602 may include A/D converters (not shown) and/or D/A converters (not shown).
The input/output device interfaces 1602 may also include an interface for an external peripheral device connection such as universal serial bus (USB), FireWire, Thunderbolt or other connection protocol. The input/output device interfaces 1602 may also include a connection to one or more networks 199 via an Ethernet port, a wireless local area network (WLAN) (such as WiFi) radio, Bluetooth, and/or wireless network radio, such as a radio capable of communication with a wireless communication network such as a Long Term Evolution (LTE) network, WiMAX network, 3G network, etc. Through the network(s) 199, the system 100 may be distributed across a networked environment.
As illustrated in
Multiple devices may be employed in a single system 100. In such a multi-device system, each of the devices may include different components for performing different aspects of the processes discussed above. The multiple devices may include overlapping components. The components listed in any of the figures herein are exemplary, and may be included in a stand-alone device or may be included, in whole or in part, as a component of a larger device or system. For example, certain components, such as the beamforming components, may be arranged as illustrated or may be arranged in a different manner, or removed entirely and/or joined with other non-illustrated components.
As illustrated in
The concepts disclosed herein may be applied within a number of different devices and computer systems, including, for example, general-purpose computing systems, multimedia set-top boxes, televisions, stereos, radios, server-client computing systems, telephone computing systems, laptop computers, cellular phones, personal digital assistants (PDAs), tablet computers, wearable computing devices (watches, glasses, etc.), other mobile devices, etc.
The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects may be apparent to those of skill in the art. Persons having ordinary skill in the field of digital signal processing and echo cancellation should recognize that components and process steps described herein may be interchangeable with other components or steps, or combinations of components or steps, and still achieve the benefits and advantages of the present disclosure. Moreover, it should be apparent to one skilled in the art that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.
Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure. The computer readable storage medium may be implemented by a volatile computer memory, non-volatile computer memory, hard drive, solid-state memory, flash drive, removable disk and/or other media. In addition, components of the system may be implemented in firmware and/or hardware, such as an acoustic front end (AFE), which comprises, among other things, analog and/or digital filters (e.g., filters configured as firmware to a digital signal processor (DSP)). Some or all of the beamforming component 802 may, for example, be implemented by a digital signal processor (DSP).
Conditional language used herein, such as, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements, and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without other input or prompting, whether these features, elements, and/or steps are included or are to be performed in any particular embodiment. The terms “comprising,” “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list.
Disjunctive language such as the phrase “at least one of X, Y, Z,” unless specifically stated otherwise, is understood with the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present. As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.
This application claims the benefit of priority of U.S. Provisional Patent Application 63/309,895, filed Feb. 14, 2022, and entitled “Beamforming Using an In-Ear Audio Device,” the contents of which are expressly incorporated herein by reference in their entirety.