The invention relates to audio processing, and more particularly, to a microphone system to solve mirror issues and improve microphone directionality.
Beamforming techniques use the time differences between channels that result from the spatial diversity of the microphones to enhance the reception of signals arriving from desired directions and to suppress or eliminate undesired signals arriving from other directions.
Accordingly, what is needed is a microphone system to solve the mirror issue and provide improved microphone directionality. The invention addresses such a need.
In view of the above-mentioned problems, an object of the invention is to provide a microphone system capable of solving mirror issues and improving microphone directionality.
One embodiment of the invention provides a microphone system. The microphone system comprises a microphone array and a processing unit. The microphone array comprises Q microphones that detect sound and generate Q audio signals. The processing unit is configured to perform a set of operations comprising: performing spatial filtering over the Q audio signals using a trained model, based on at least one target beam area (TBA) and coordinates of the Q microphones, to generate a beamformed output signal originated from ω target sound sources inside the at least one TBA, where ω>=0. Here, each TBA is defined by r time delay ranges for r combinations of two microphones out of the Q microphones, where Q>=3 and r>=1. A first number, i.e., a dimension in which the locations of all sound sources are able to be distinguished by the processing unit, increases as a second number, i.e., a dimension of the geometry formed by the Q microphones, increases.
Further scope of the applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, and wherein:
As used herein and in the claims, the term “and/or” includes any and all combinations of one or more of the associated listed items. The use of the terms “a” and “an” and “the” and similar referents in the context of describing the invention are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. Throughout the specification, the same components with the same function are designated with the same reference numerals.
A set of microphone coordinates for the microphone array 210 is defined as M={M1, M2, . . . , MQ}, where Mi=(xi, yi, zi) denotes the coordinates of microphone 21i relative to a reference point (not shown) at the electronic device and 1<=i<=Q. Let S⊆ℝ³ be a set of sound sources and tgi denote the propagation time of sound from a sound source sg to microphone 21i; a location L(sg) of the sound source sg relative to the microphone array 210 is defined by R time delays for R combinations of two microphones out of the Q microphones as follows: L(sg)={(tg1−tg2), (tg1−tg3), . . . , (tg1−tgQ), . . . , (tg(Q-1)−tgQ)}, where ℝ³ denotes a three-dimensional space, 1<=g<=Z, S⊇{s1, . . . , sZ}, Z denotes the number of sound sources, and R=Q!/((Q−2)!×2!). A beam area (BA) is defined by R time delay ranges for the R combinations of two microphones out of the Q microphones as follows: BA={(TS12, TE12), (TS13, TE13), . . . , (TS1Q, TE1Q), . . . , (TS(Q-1)Q, TE(Q-1)Q)}, where TSik and TEik respectively denote a lower limit and an upper limit of the time delay range for the two microphones 21i and 21k, i≠k and 1<=k<=Q. If all the time delays for the location L(sg) of the sound source sg fall within the time delay ranges of the beam area, then the sound source sg is determined to be located inside the beam area BA, or "inside beam" for short. For example, given that Q=3, BA={(−2 ms, 1 ms), (−3 ms, 2 ms), (−2 ms, 0 ms)} and the propagation times from a sound source s1 to the three microphones 211-213 are respectively equal to 1 ms, 2 ms and 3 ms, the location of the sound source s1 would be L(s1)={(t11−t12), (t11−t13), (t12−t13)}={−1 ms, −2 ms, −1 ms}. Since TS12<(t11−t12)<TE12, TS13<(t11−t13)<TE13 and TS23<(t12−t13)<TE23, it is determined that the sound source s1 is located inside the beam area BA.
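The following sketch is illustrative only and not part of the patent disclosure: it computes the time-delay signature L(s) of a sound source from microphone coordinates and tests whether the source lies inside a beam area defined by pairwise time-delay ranges, following the definitions above. The microphone coordinates, source position, delay ranges and speed of sound are assumed example values.

```python
# Illustrative sketch: pairwise time-delay signature and "inside beam" test.
from itertools import combinations
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed (room temperature)

def time_delays(source, mics):
    """Return the R = Q!/((Q-2)!*2!) pairwise time delays (t_i - t_k), i < k."""
    t = [math.dist(source, m) / SPEED_OF_SOUND for m in mics]  # propagation times
    return {(i, k): t[i] - t[k] for i, k in combinations(range(len(mics)), 2)}

def inside_beam(source, mics, beam_area):
    """A source is 'inside beam' if every pairwise delay falls in its range."""
    delays = time_delays(source, mics)
    return all(ts < delays[pair] < te for pair, (ts, te) in beam_area.items())

# Example with Q = 3 microphones (coordinates in metres, chosen arbitrarily).
mics = [(0.00, 0.00, 0.0), (0.05, 0.00, 0.0), (0.10, 0.00, 0.0)]
# Beam area: one (lower, upper) delay range in seconds per microphone pair.
ba = {(0, 1): (0.0, 2e-4), (0, 2): (1e-4, 4e-4), (1, 2): (0.0, 2e-4)}
print(inside_beam((1.0, 0.5, 0.3), mics, ba))  # True for this geometry
```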
Through the specification and claims, the following notations/terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The term “sound source” refers to anything producing audio information, including people, animals, or objects. Moreover, a sound source can be located at any location in three-dimensional space relative to a reference point (e.g., a middle point among the Q microphones 211-21Q) at the electronic device. The term “target beam area (TBA)” refers to a beam area located in desired directions or a desired coordinate range; audio signals from all target sound sources (TSS) inside the TBA need to be preserved or enhanced. The term “cancel beam area (CBA)” refers to a beam area located in undesired directions or an undesired coordinate range; audio signals from all cancel sound sources inside the CBA need to be suppressed or eliminated.
The microphones 211-21Q in the microphone array 210 may be, for example, omnidirectional microphones, bi-directional microphones, directional microphones, or a combination thereof. Please note that when directional or bi-directional microphones are included in the microphone array 210, a circuit designer needs to ensure that the directional or bi-directional microphones are capable of receiving all the audio signals originated from the target sound sources inside the at least one TBA.
As set forth above, the beamformer 220 may perform a spatial filtering operation over the Q audio signals from the microphone array 210, based on at least one TBA, the set M of microphone coordinates and zero, one or two energy losses, to generate a beamformed output signal u[n] originated from ω target sound sources inside the at least one TBA, where ω>=0. However, a microphone array may face a mirror issue due to its microphone geometry. The geometry/layout of the microphone array 210, which assists the beamformer 220 in distinguishing different sound source locations, is divided into three ranks as follows. (1) rank(M)=3: the layout or geometry of the Q microphones 211-21Q forms a three-dimensional (3D) shape (neither collinear nor coplanar), so that each set of time delays in L(sg) received by the Q microphones is unique enough for the beamformer 220 to locate the sound source sg in 3D space. In geometry, a 3D shape is a shape or figure that has three dimensions, such as length, width and height (such as the example of
The maximum distinguishing rank for the capability of the beamformer 220 to distinguish different sound source locations based only on the geometry of the Q microphones 211-21Q is the smaller of the two numbers (Q−1) and 3, where Q>=3. According to the invention, the distinguishing rank (DR) for the capability of the beamformer 220 can be escalated by changing the geometry of the microphone array 210 from a lower dimension to a higher dimension and/or by inserting one or two spacers among the Q microphones (as will be described below).
According to the invention, both the geometry of the microphone array 210 and the number of spacers determine the distinguishing rank (DR) for the capability of the beamformer 220 to distinguish different sound source locations.
For Q=3, the location L(sg) of each sound source sg relative to the microphone array 210 is defined by three time delays for three combinations of two microphones out of the three microphones 211-213. There are five types 3A-3E of layouts of microphones and spacers as follows. (1) Type 3A (DR=1): the three microphones 211-213 in the microphone array 210 form a line along the y axis (i.e., collinear) and no spacer is inserted, as shown in
For Q=4, the location L(sg) of each sound source sg relative to the microphone array 210 is defined by six time delays for six combinations of two microphones out of the four microphones 211-214. There are six types 4A-4F of layouts of microphones and spacers as follows. (1) Type 4A (DR=1): the four microphones 211-214 in the microphone array 210 are arranged collinearly along the y axis and no spacer is inserted, similar to the layout in
Please note that in the examples of
In brief, three or more collinear microphones are used by the beamformer 220 to find locations of sound sources in one dimension (DR=1); with the insertion of one or two spacers, the DR value is escalated from 1 to 2 or 3. Three or more coplanar microphones are used by the beamformer 220 to find locations of sound sources in two dimensions (DR=2); with the insertion of one spacer, the DR value is escalated from 2 to 3. Four or more non-collinear and non-coplanar microphones that form a 3D shape are used by the beamformer 220 to find locations of sound sources in three dimensions (DR=3).
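As a hedged illustration of the rule just summarized, the sketch below estimates the DR value from the rank of the translation-removed microphone coordinate matrix plus the number of inserted spacers, capped at 3; the function name and the rank-based geometry test are assumptions for illustration, not the patent's own implementation.

```python
# Minimal sketch of the distinguishing-rank (DR) rule: DR starts at the
# dimension of the shape the microphones form (1 = collinear, 2 = coplanar,
# 3 = a 3D shape) and each inserted spacer raises it by one, capped at 3.
# Geometry alone can reach at most min(Q - 1, 3).
import numpy as np

def distinguishing_rank(mic_coords, num_spacers=0):
    coords = np.asarray(mic_coords, dtype=float)      # shape (Q, 3)
    centered = coords - coords.mean(axis=0)           # remove translation
    geometry_dim = np.linalg.matrix_rank(centered)    # 1: line, 2: plane, 3: 3D
    return min(geometry_dim + num_spacers, 3)

# Three collinear microphones: DR = 1; one inserted spacer escalates DR to 2.
line = [(0, 0.00, 0), (0, 0.05, 0), (0, 0.10, 0)]
print(distinguishing_rank(line), distinguishing_rank(line, num_spacers=1))
```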
Referring back to
The neural network 760 of the invention may be implemented by any known neural network. Various machine learning techniques associated with supervised learning may be used to train the model of the neural network 760. Supervised learning techniques for training the neural network 760 include, for example and without limitation, stochastic gradient descent (SGD). In the context of the following description, the neural network 760 operates in a supervised setting using a training dataset including multiple training examples, each training example including training input data (such as audio data in each frame of the input audio signals b1[n] to bQ[n] in
As set forth above, there are five types 3A-3E of layouts of the three-microphone array and spacers (Q=3) and six types 4A-4F of layouts of the Q-microphone array and spacers (Q>=4). Please note that a neural network 760 in the beamformer 220T in cooperation with each type of layout needs to be trained individually with corresponding input parameters, because the set M of microphone coordinates of the microphone array 210, the at least one TBA and the energy losses vary according to different implementations. For example, a neural network 760 in the beamformer 220T in cooperation with one of Types 3A, 3C, 4A, 4C and 4F needs to be trained with the set M of microphone coordinates of the microphone array 210, the at least one TBA and a training dataset (described below); a neural network 760 in the beamformer 220T in cooperation with one of Types 3B, 3D, 4B and 4D needs to be trained with the set M of microphone coordinates of the microphone array 210, the at least one TBA, an α-dB energy loss for the spacer 410 and the training dataset; and a neural network 760 in the beamformer 220T in cooperation with one of Types 3E and 4E needs to be trained with the set M of microphone coordinates of the microphone array 210, the at least one TBA, the training dataset, an α-dB energy loss for the spacer 410 and a β-dB energy loss for the spacer 510.
As set forth above, a BA is defined by R time delay ranges for R combinations of two microphones out of the Q microphones in the microphone array 210. Each TBA that is fed to the processor 750 in
In a second option, in which one or more spacers are inserted into the microphone array 210 (such as Types 3B, 4B, 3D, 4D, 3E and 4E), each TBA can be defined by r2 time delay ranges for r2 combinations of two microphones out of the Q microphones, where r2>=1. For example, for Type 3B, each TBA can be defined by one time delay range for one combination of two microphones, namely {(TS13, TE13)}, so that the beamformer 220 can distinguish different locations of first sound sources along the y axis by their corresponding sets of time delays and different locations of second sound sources along the x axis by energy losses. For Type 3D, each TBA can be defined by two time delay ranges for two combinations of two microphones, namely {(TS12, TE12), (TS23, TE23)}, so that the beamformer 220 can distinguish different locations of first sound sources along the x axis and the y axis by their corresponding sets of time delays and different locations of second sound sources along the z axis by energy losses.
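Purely as an illustration of the data involved, the snippet below expresses TBAs as mappings from microphone pairs to (lower, upper) time-delay ranges for the two options described above; the concrete numbers are hypothetical.

```python
# Illustrative only: a TBA as a mapping from microphone pairs (i, k) to
# (lower, upper) time-delay ranges in seconds.
# First option (no spacers, e.g. Q = 3): all R = 3 pairs are specified.
tba_full = {
    (1, 2): (-2e-4, 1e-4),
    (1, 3): (-3e-4, 2e-4),
    (2, 3): (-2e-4, 0.0),
}
# Second option, Type 3B (one spacer): a single pair suffices; the remaining
# axis is resolved from the spacer's energy loss rather than from time delays.
tba_type_3b = {(1, 3): (-3e-4, 2e-4)}
# Second option, Type 3D: two pairs, with the z axis resolved by energy losses.
tba_type_3d = {(1, 2): (-2e-4, 1e-4), (2, 3): (-2e-4, 0.0)}
```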
For purposes of clarity and ease of description,
In an offline phase (prior to the training phase), the processor 750 is configured to respectively collect and store a batch of time-domain single-microphone noise-free (or clean) speech audio data 711a (with or without reverberation in different space scenarios) and a batch of time-domain single-microphone noise audio data 711b in the storage device 710. For the noise audio data 711b, all sound other than the speech being monitored (the primary sound) is collected/recorded, including the sounds of markets, computer fans, crowds, cars, airplanes, construction, keyboard typing, multiple persons speaking, etc.
It is assumed that the whole space (where the microphone system 700T is disposed) minus the at least one TBA leaves a CBA. By executing one of the software programs 713 stored in the storage device 710, such as any well-known simulation tool (e.g., Pyroomacoustics), the processor 750 operates as a data augmentation engine to construct different simulation scenarios involving Z sound sources, Q microphones and different acoustic environments based on the at least one TBA, the set M of microphone coordinates, the two energy losses of α and β dB for the two spacers 410 and 510, the clean speech audio data 711a and the noise audio data 711b. Besides, ω target sound sources are placed inside the at least one TBA and ε cancel sound sources are placed inside the CBA, where ω+ε=Z and ω, ε, Z>=0. The main purpose of the data augmentation engine 750 is to help the neural network 760 generalize, so that the neural network 760 can operate in different acoustic environments. Please note that besides the simulation tools (such as Pyroomacoustics), the software programs 713 may include additional programs (such as an operating system or application programs) necessary to cause the beamformer 220/220T/220t/220P to operate.
Specifically, with Pyroomacoustics, the data augmentation engine 750 respectively transforms the single-microphone clean speech audio data 711a and the single-microphone noise audio data 711b into Q-microphone augmented clean speech audio data and Q-microphone augmented noise audio data, and then mixes the Q-microphone augmented clean speech audio data and the Q-microphone augmented noise audio data to generate and store mixed Q-microphone time-domain augmented audio data 712 in the storage device 710. In particular, the Q-microphone augmented noise audio data is mixed in with the Q-microphone augmented clean speech audio data at different mixing rates to produce the mixed Q-microphone time-domain augmented audio data 712 having a wide range of SNRs. In the training phase, the mixed Q-microphone time-domain augmented audio data 712 are used by the processor 750 as the training input data (i.e., input audio data b1[n]-bQ[n]) for the training examples of the training dataset; correspondingly, clean or noisy time-domain output audio data transformed from a combination of the clean speech audio data 711a and the noise audio data 711b (that are all originated from the ω target sound sources) are used by the processor 750 as the training output data (i.e., h[n]) for the training examples of the training dataset.
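A minimal sketch of this data augmentation step with Pyroomacoustics is given below. The room dimensions, absorption, source and microphone positions, sampling rate and mixing SNR are assumed example values, and simulating the target and cancel sources in separate runs before mixing them at a chosen SNR is one possible reading of the mixing procedure described above, not the patent's own implementation.

```python
# Hedged sketch: render single-channel speech/noise at Q simulated microphones
# with Pyroomacoustics, then mix at a target SNR to form one training example.
import numpy as np
import pyroomacoustics as pra

fs = 16000
clean_speech = np.random.randn(fs * 2)   # stand-ins for data 711a / 711b
noise = np.random.randn(fs * 2)

def simulate(signal, source_pos, mic_locs):
    """Render one single-channel signal at Q microphones in a simulated room."""
    room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                       materials=pra.Material(0.3), max_order=10)
    room.add_source(source_pos, signal=signal)
    room.add_microphone_array(pra.MicrophoneArray(mic_locs, fs))
    room.simulate()
    return room.mic_array.signals                      # shape (Q, n_samples)

mic_locs = np.array([[2.95, 3.00, 3.05],               # 3 x Q coordinates (m)
                     [2.50, 2.50, 2.50],
                     [1.40, 1.40, 1.40]])
speech_q = simulate(clean_speech, [2.0, 3.5, 1.5], mic_locs)   # inside a TBA
noise_q = simulate(noise, [5.5, 1.0, 1.5], mic_locs)           # inside the CBA

# Mix at a target SNR (here 5 dB) to produce one mixed training example 712.
n = min(speech_q.shape[1], noise_q.shape[1])
speech_q, noise_q = speech_q[:, :n], noise_q[:, :n]
gain = np.sqrt(np.sum(speech_q**2) / (np.sum(noise_q**2) * 10 ** (5 / 10)))
mixed_q = speech_q + gain * noise_q
```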
In each magnitude & phase calculation unit 73j, the input audio stream bj[n] is first broken up into frames using a sliding window along the time axis, so that the frames overlap each other to reduce artifacts at the boundary, and then the audio data in each frame in the time domain are transformed by the Fast Fourier transform (FFT) into complex-valued data in the frequency domain, where 1<=j<=Q and n denotes the discrete time index. Assuming that the number of sampling points in each frame (or the FFT size) is N, the time duration of each frame is Td and the frames overlap each other by Td/2, the magnitude & phase calculation unit 73j divides the input stream bj[n] into a plurality of frames and computes the FFT of the audio data in the current frame i of the input audio stream bj[n] to generate a current spectral representation Fj(i) having N complex-valued samples (F1,j(i)-FN,j(i)) with a frequency resolution of fs/N (=1/Td), where 1<=j<=Q, i denotes the frame index of the input/output audio stream bj[n]/u[n]/h[n], fs denotes the sampling frequency of the input audio stream bj[n] and each frame corresponds to a different time interval of the input stream bj[n]. Next, the magnitude & phase calculation unit 73j calculates a magnitude and a phase for each of the N complex-valued samples (F1,j(i), . . . , FN,j(i)), based on its modulus (length) and the arctangent function, to generate a magnitude spectrum (mj(i)=m1,j(i), . . . , mN,j(i)) with N magnitude elements and a phase spectrum (Pj(i)=P1,j(i), . . . , PN,j(i)) with N phase elements for the current spectral representation Fj(i) (=F1,j(i), . . . , FN,j(i)). Then, the inner product block 73 calculates the inner product for each of the N normalized-complex-valued sample pairs in any two phase spectrums Pj(i) and Pk(i) to generate R phase-difference spectrums (pdl(i)=pd1,l(i), . . . , pdN,l(i)), each phase-difference spectrum pdl(i) having N elements, where 1<=k<=Q, j≠k, 1<=l<=R, and there are R combinations of two microphones out of the Q microphones. Finally, the Q magnitude spectrums mj(i), the Q phase spectrums Pj(i) and the R phase-difference spectrums pdl(i) are regarded as a feature vector fv(i) and fed to the neural network 760/760T. In a preferred embodiment, the time duration Td of each frame is about 32 milliseconds (ms). However, this time duration Td is provided by way of example and not limitation of the invention. In actual implementations, other time durations Td may be used.
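The following sketch, offered only as an illustration, extracts the feature vector fv(i) for one frame with NumPy; the frame length, the 50% overlap, and the reading of the "inner product of normalized complex-valued sample pairs" as the cosine of the phase difference are assumptions, not the patent's own implementation.

```python
# Illustrative feature extraction for one frame: Q magnitude spectra,
# Q phase spectra and R phase-difference spectra concatenated into fv(i).
from itertools import combinations
import numpy as np

fs, Td = 16000, 0.032                 # sampling rate and ~32 ms frames (assumed)
N = int(fs * Td)                      # FFT size (sampling points per frame)
hop = N // 2                          # frames overlap by Td/2

def frames(x):
    """Split a 1-D signal into overlapping frames of length N."""
    idx = range(0, len(x) - N + 1, hop)
    return np.stack([x[i:i + N] for i in idx])        # (num_frames, N)

def feature_vector(b, frame_idx):
    """b: (Q, n_samples) multi-microphone audio; returns fv(i) for one frame."""
    spectra = [np.fft.fft(frames(ch)[frame_idx]) for ch in b]   # Q x N complex
    mags = [np.abs(F) for F in spectra]                         # magnitude spectra
    phases = [np.angle(F) for F in spectra]                     # phase spectra
    # R phase-difference spectra: inner product of unit-length sample pairs,
    # <e^{jP_j}, e^{jP_k}> = cos(P_j - P_k), one spectrum per microphone pair.
    pds = [np.cos(phases[j] - phases[k])
           for j, k in combinations(range(len(b)), 2)]
    return np.concatenate(mags + phases + pds)        # (2Q + R) * N features

# Example: Q = 3 channels of random audio, feature vector for frame 0.
b = np.random.randn(3, fs)
fv = feature_vector(b, 0)
print(fv.shape)                       # (4608,) = (2*3 + 3) * 512
```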
In the training phase, the neural network 760 receives the feature vector fv(i), including the Q magnitude spectrums m1(i)-mQ(i), the Q phase spectrums P1(i)-PQ(i) and the R phase-difference spectrums pd1(i)-pdR(i), and then generates corresponding network output data, including N first sample values of the current frame i of a time-domain beamformed output stream u[n]. On the other hand, the training output data (ground truth), paired with the training input data (i.e., Q*N input sample values of the current frames i of the Q training input streams b1[n]-bQ[n]) for the training examples of the training dataset, includes N second sample values of the current frame i of a training output audio stream h[n] and is transmitted to the loss function block 770 by the processor 750. If ω>0 and the neural network 760 is trained to perform the spatial filtering operation only, the training output audio stream h[n] outputted from the processor 750 would be the noisy time-domain output audio data (transformed from a combination of the clean speech audio data 711a and the noise audio data 711b originated from the ω target sound sources). If ω>0 and the neural network 760 is trained to perform spatial filtering and denoising operations, the training output audio stream h[n] outputted from the processor 750 would be the clean time-domain output audio data (transformed from the clean speech audio data 711a originated from the ω target sound sources). If ω=0, the training output audio stream h[n] outputted from the processor 750 would be "zero" time-domain output audio data, i.e., each output sample value would be set to zero.
Then, the loss function block 770 adjusts parameters (e.g., weights) of the neural network 760 based on differences between the network output data and the training output data. In one embodiment, the neural network 760 is implemented by a deep complex U-Net, and correspondingly the loss function implemented in the loss function block 770 is the weighted source-to-distortion ratio (weighted-SDR) loss, disclosed by Choi et al., "Phase-aware speech enhancement with deep complex U-net", a conference paper at ICLR 2019. However, it should be understood that the deep complex U-Net and the weighted-SDR loss have been presented by way of example only, and not limitation of the invention. In actual implementations, any other neural networks and loss functions can be used, and this also falls within the scope of the invention. Finally, the neural network 760 is trained so that the network output data (i.e., the N first sample values in u[n]) produced by the neural network 760 matches the training output data (i.e., the N second sample values in h[n]) as closely as possible when the training input data (i.e., the Q*N input sample values in b1[n]-bQ[n]) paired with the training output data is processed by the neural network 760.
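For illustration only, the sketch below shows one supervised training step in PyTorch with SGD. The stand-in fully connected network and the plain negative-SDR loss are simplifications rather than the deep complex U-Net or the weighted-SDR loss of Choi et al., and all layer sizes and names are assumed.

```python
# Hedged sketch of one training step: network output u is compared against
# the ground-truth frame h[n] and the parameters are updated with SGD.
import torch

N, FEAT = 512, 9 * 512                     # frame size and fv(i) length (assumed)
model = torch.nn.Sequential(               # stand-in for the neural network 760
    torch.nn.Linear(FEAT, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, N))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def neg_sdr(estimate, target, eps=1e-8):
    """Negative source-to-distortion ratio (cosine similarity), batch-averaged."""
    num = torch.sum(estimate * target, dim=-1)
    den = torch.norm(estimate, dim=-1) * torch.norm(target, dim=-1) + eps
    return -(num / den).mean()

def train_step(fv, h):
    """fv: (batch, FEAT) feature vectors; h: (batch, N) training output frames."""
    u = model(fv)                          # network output data for frame i
    loss = neg_sdr(u, h)                   # compare against ground truth h[n]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(8, FEAT), torch.randn(8, N))
```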
The inference phase is divided into a test stage (e.g., the microphone system 700t is tested by an engineer in an R&D department to verify performance) and a practice stage (i.e., the microphone system 700P is ready for the market).
In sum, the higher the dimension of the geometry formed by the Q microphones 211-21Q and the greater the number of spacers, the higher the dimension (i.e., the DR value) in which the locations of sound sources are able to be distinguished by the beamformer 220. Further, the higher the dimension in which the locations of sound sources are able to be distinguished by the beamformer 220, the more precisely a sound source can be located, and thus the better the performance of the spatial filtering, with or without denoising filtering, in the beamformer 220.
While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention should not be limited to the specific construction and arrangement shown and described, since various other modifications may occur to those ordinarily skilled in the art.
This application claims priority under 35 USC 119(e) to U.S. provisional application No. 63/317,078, filed on Mar. 7, 2022, the content of which is incorporated herein by reference in its entirety.
Number | Name | Date | Kind |
---|---|---|---|
9053697 | Park et al. | Jun 2015 | B2 |
11302313 | Li | Apr 2022 | B2 |
20080187152 | Kim | Aug 2008 | A1 |
20080247274 | Seltzer | Oct 2008 | A1 |
20110026732 | Buck | Feb 2011 | A1 |
20150088500 | Conliffe | Mar 2015 | A1 |
20170325020 | Wolff | Nov 2017 | A1 |
20190387311 | Schultz | Dec 2019 | A1 |
20200374624 | Koschak | Nov 2020 | A1 |
20210092548 | McElveen | Mar 2021 | A1 |
20210099796 | Usami | Apr 2021 | A1 |
20210150873 | Shouldice et al. | May 2021 | A1 |
20210219053 | Masnadi-Shirazi | Jul 2021 | A1 |
Number | Date | Country |
---|---|---|
102947878 | Nov 2014 | CN |
3 035 249 | Jun 2016 | EP |
201640422 | Nov 2016 | TW |
201921336 | Jun 2019 | TW |
Entry |
---|
Choi et al., “Phase-aware speech enhancement with deep complex U-net”, conference paper at ICLR 2019, 20 pages, 2019. |
U.S. Pat. No. 9,053,697 B2 is an English language family member of CN 102947878 B. |
U.S. Pat. No. 11,302,313 B2 is an English language family member of TW 201921336 A. |
EP 14382553.7 is an English language family member of TW 201640422 A. |
Number | Date | Country | |
---|---|---|---|
20230283951 A1 | Sep 2023 | US |
Number | Date | Country | |
---|---|---|---|
63317078 | Mar 2022 | US |