The invention relates to audio processing and, more particularly, to a beamforming method and a microphone system in a boomless headset (also called a boomfree headset) able to do away with a boom microphone while still providing high speech quality.
For applications that require speech interaction, a boom microphone headset is often chosen. In a boom microphone headset, the microphone is attached to the end of a boom, allowing it to be positioned precisely in front of or next to the user's mouth. This arrangement provides the most accurate and highest-quality sound available to downstream software. An advantage of a boom microphone headset is that the microphone moves with the user: if the user turns his or her head, the boom microphone remains in position and continuously picks up the user's voice. However, the boom microphone headset has many disadvantages. For example, the boom microphone is usually the easiest part of a headset to break, as it is a flexible piece that, if mishandled, can break off or snap at the boom swivel. Another disadvantage is that the user must continually and manually adjust the boom toward the front of his or her mouth in order to obtain proper recording, which is usually annoying.
Accordingly, what is needed is a microphone system for use in a boomless headset so as to do away with the boom microphone while providing high speech quality. The invention addresses such a need.
In view of the above-mentioned problems, an object of the invention is to provide a microphone system for use in a boomless headset so as to do away with the boom microphone while providing high speech quality.
One embodiment of the invention provides a microphone system applicable to a boomless headset with two earcups. The microphone system comprises a microphone array and a processing unit. The microphone array comprises Q microphones that detect sound and generate Q audio signals. A first microphone and a second microphone of the Q microphones are disposed on different earcups, and a third microphone of the Q microphones is disposed on one of the two earcups and displaced laterally and vertically from one of the first and the second microphones. The processing unit is configured to perform a set of operations comprising: performing spatial filtering over the Q audio signals using a trained model based on an arc line with a vertical distance and a horizontal distance from a first midpoint between the first and the second microphones, a main time delay range for the first and the second microphones and coordinates of the Q microphones to generate a beamformed output signal originating from zero or more target sound sources inside a target beam area (TBA), where Q>=3. The TBA is a collection of intersection planes of multiple surfaces and multiple cones. The multiple surfaces correspond to multiple main time delays within the main time delay range, and angles of the multiple cones are related to multiple intersection points of the multiple surfaces and the arc line. The multiple surfaces extend from the first midpoint, and the multiple cones extend from a second midpoint between the third microphone and the one of the first and the second microphones.
Another embodiment of the invention provides a beamforming method applicable to a boomless headset comprising two earcups and a microphone array. The method comprises: disposing a first microphone and a second microphone of Q microphones in the microphone array on different earcups; disposing a third microphone of the Q microphones on one of the two earcups, wherein the third microphone is displaced laterally and vertically from one of the first and the second microphones; detecting sound by the Q microphones to generate Q audio signals; and, performing spatial filtering over the Q audio signals using a trained model based on an arc line with a vertical distance and a horizontal distance from a first midpoint between the first and the second microphones, a main time delay range for the first and the second microphones and coordinates of the Q microphones to generate a beamformed output signal originating from zero or more target sound sources inside a target beam area (TBA), where Q>=3. The TBA is a collection of intersection planes of multiple surfaces and multiple cones. The multiple surfaces correspond to multiple main time delays within the main time delay range, and angles of the multiple cones are related to multiple intersection points of the multiple surfaces and the arc line. The multiple surfaces extend from the first midpoint, and the multiple cones extend from a second midpoint between the third microphone and the one of the first and the second microphones.
Further scope of the applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, and wherein:
As used herein and in the claims, the term “and/or” includes any and all combinations of one or more of the associated listed items. The use of the terms “a” and “an” and “the” and similar referents in the context of describing the invention are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. Throughout the specification, the same components with the same function are designated with the same reference numerals.
The microphone array 110 comprises Q microphones 111˜11Q configured to detect sound to generate Q audio signals b1[n]˜bQ[n], where Q>=3. The neural network-based beamformer 120 is used to perform a spatial filtering operation, with or without a denoising operation, over the Q audio signals received from the microphone array 110 using a trained model (e.g., a trained neural network 760T in
The Q microphones 111-11Q in the microphone array 110 may be, for example, omnidirectional microphones, bi-directional microphones, directional microphones, or a combination thereof. Please note that when directional or bi-directional microphones are included in the microphone array 110, a circuit designer needs to ensure the directional or bi-directional microphones are capable of receiving all the audio signals originating from all target sound sources (Ta) inside the TBA.
Layout 2A are analogous to those of the two microphones 112 and 113 on the left earcup 220 for Layout 1A as shown in
Referring to
Throughout the specification and claims, the following notations/terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The term “sound source” refers to anything producing audio information, including people, animals, or objects. Moreover, the sound source can be located at any location in three-dimensional (3D) space relative to a reference origin (e.g., the midpoint A1 between the two microphones 111-112) at the boomless headset 200A/B/C/D. The term “target beam area (TBA)” refers to a beam area located in desired directions or a desired coordinate range, and audio signals from all target sound sources (Ta) inside the TBA need to be preserved or enhanced. The term “cancel beam area (CBA)” refers to a beam area located in undesired directions or an undesired coordinate range, and audio signals from all cancel sound sources (Ca) inside the CBA need to be suppressed or eliminated. It is assumed that the whole 3D space (where the microphone system 100 is disposed) minus the TBA leaves the CBA, i.e., the CBA is outside the TBA in 3D space. The term “multi-mic equivalent class” refers to multiple sound sources that have the same time delays relative to multiple microphones but do not have the same locations.
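By way of illustration only, and not limitation, the following Python sketch (using hypothetical microphone coordinates) shows two sound sources at different locations that nevertheless belong to the same two-mic equivalent class, because they produce the same time delay relative to the microphones 111 and 112:

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

# Hypothetical coordinates of microphones 111 and 112 on opposite earcups.
m111 = np.array([-0.08, 0.0, 0.0])
m112 = np.array([ 0.08, 0.0, 0.0])

def delay(p):
    """Time delay (t1 - t2) of a sound source p relative to microphones 111 and 112."""
    return (np.linalg.norm(p - m111) - np.linalg.norm(p - m112)) / C

# Two different locations obtained by rotating one point about the 111-112 axis:
p1 = np.array([0.10, 0.30, 0.00])
p2 = np.array([0.10, 0.00, 0.30])
print(delay(p1), delay(p2))   # identical delays, different locations
```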
A feature of the invention is to arrange the three microphones 111˜113 in specific positions on the two earcups 210 and 220 of a boomless headset 200A/B/C/D to eliminate voices from cancel sound sources located higher than or farther away than a predefined arc line AL (with a predefined vertical distance ht and a predefined horizontal distance dt from the midpoint A1 of the two microphones 111-112 as shown in
A set of microphone coordinates for the microphone array 110 is defined as M={M1, M2, . . . , MQ}, where Mi=(xi, yi, zi) denotes the coordinates of microphone 11i relative to a reference origin (such as the midpoint A1 between the two microphones 111-112) and 1<=i<=Q. Let S⊆ℝ³ denote a set of sound sources and tgi denote a propagation time of sound from a sound source sg to microphone 11i. A location L(sg) of the sound source sg relative to the microphone array 110 is defined by R time delays for R combinations of two microphones out of the Q microphones as follows: L(sg)={(tg1−tg2), (tg1−tg3), . . . , (tg1−tgQ), . . . , (tg(Q−1)−tgQ)}, where ℝ³ denotes a three-dimensional space, 1<=g<=Z, S⊇{s1, . . . , sZ}, Z denotes the number of sound sources, and R=Q!/((Q−2)!×2!).
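By way of illustration only, the following Python sketch shows how the R pairwise time delays that define L(sg) may be computed for Q=3; the microphone coordinates and the sound source position used here are hypothetical and do not correspond to any particular layout of the invention:

```python
import itertools
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def delay_signature(source, mics, c=C):
    """Return the R = Q*(Q-1)/2 pairwise time delays that define L(sg).

    source: (3,) coordinates of a sound source sg
    mics:   (Q, 3) coordinates M1..MQ of the microphone array
    """
    mics = np.asarray(mics, dtype=float)
    # propagation time tgi from the source to each microphone 11i
    t = np.linalg.norm(mics - np.asarray(source, dtype=float), axis=1) / c
    # one delay (tgi - tgj) per combination of two microphones
    return [t[i] - t[j] for i, j in itertools.combinations(range(len(mics)), 2)]

# Hypothetical layout: microphones 111/112 on opposite earcups, 113 below 112.
mics = [(-0.08, 0.0, 0.0),       # 111
        ( 0.08, 0.0, 0.0),       # 112
        ( 0.07, -0.02, -0.05)]   # 113
print(delay_signature((0.0, 0.10, -0.12), mics))   # e.g., a source near the mouth
```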
As set forth above, each two-mic equivalent class refers to a surface, i.e., either a right circular conical surface or a plane. Consequently, a three-mic equivalent class for three microphones 111˜113 is equivalent to the intersection of a first two-mic equivalent class (e.g., a first surface 1Sm in
Each AUX time delay range extends from a core time delay TS23 to an outer time delay TE23m for either each second surface 2Sm or each right circular cone Cm of the microphones 112 and 113. As long as a sound source sg and the microphones 112 and 113 (operating as an endfire array) are collinear (not shown), a core time delay TS23(=tg2−tg3) would be equal to a propagation time tg2 of sound from the sound source sg to the microphone 112 minus a propagation time tg3 of sound from the sound source sg to the microphone 113, where the sound source sg is closer to the microphone 112 than to the microphone 113. Thus, the core time delay TS23 of the AUX time delay range for the microphones 112 and 113 is fixed for all second surfaces 2Sm or all right circular cones Cm. In an alternative embodiment, the core time delay TS23=(−d2/c), where d2 denotes the 3D distance between the two microphones 112 and 113 in
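By way of illustration only, the following Python sketch (with hypothetical coordinates for the microphones 112 and 113) computes the core time delay TS23=(−d2/c) and confirms that a sound source collinear with the two microphones and closer to the microphone 112 yields tg2−tg3 equal to TS23:

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

# Hypothetical coordinates: microphone 113 sits below and behind microphone 112.
m112 = np.array([0.08, 0.0, 0.0])
m113 = np.array([0.07, -0.02, -0.05])

d2 = np.linalg.norm(m112 - m113)   # 3D distance between microphones 112 and 113
ts23 = -d2 / C                     # core time delay TS23 = -d2/c

# A source collinear with 112 and 113, on the side of 112 away from 113.
sg = m112 + 2.0 * (m112 - m113)
tg2 = np.linalg.norm(sg - m112) / C
tg3 = np.linalg.norm(sg - m113) / C
print(ts23, tg2 - tg3)             # the two values match (up to rounding)
```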
Referring back to
Step S602: Randomly generate a point/sound source Px with known coordinates relative to a known reference origin in 3D space by the processor 750.
Step S604: Calculate a main time delay τ12(=tx1−tx2) for the sound source Px relative to the two microphones 111-112 based on a difference of two propagation times tx1 and tx2, coordinates of the sound source Px and the set M of microphone coordinates for the microphone array 110, where tx1 denotes a propagation time of sound from the sound source Px to the microphone 111 and tx2 denotes a propagation time of sound from the sound source Px to the microphone 112.
Step S606: Determine whether TS12<τ12<TE12. If YES, the flow goes to step S608; otherwise, the flow goes to step S618.
Step S608: Calculate coordinates of an intersection point rm of the predefined arc line AL and a first surface 1Sm with the main time delay τ12 so that tx1−tx2=τ12=tr1−tr2, where tr1 denotes a propagation time of sound from the intersection point rm to the microphone 111 and tr2 denotes a propagation time of sound from the intersection point rm to the microphone 112.
Step S610: Calculate an outer time delay TE23m=tr2−tr3 according to a difference of two propagation times tr2 and tr3, the coordinates of the intersection point rm and the set M of microphone coordinates, where tr3 denotes a propagation time of sound from the intersection point rm to the microphone 113.
Step S612: Calculate an AUX time delay τ23(=tx2−tx3) for the sound source Px according to a difference of propagation times tx2 and tx3, coordinates of the sound source Px and the set M of microphone coordinates, where tx3 denotes a propagation time of sound from the sound source Px to the microphone 113.
Step S614: Determine whether the AUX time delay τ23 falls within the AUX time delay range of the core time delay TS23 to the outer time delay TE23m, i.e., determining whether TS23<τ23<TE23m. If YES, the flow goes to step S616; otherwise, the flow goes to step S618.
Step S616: Determine that the sound source Px is located in the TBA and is a target sound source Ta. Then, the flow goes back to step S602.
Step S618: Determine that the sound source Px is located in the CBA and is a cancel sound source Ca. Then, the flow goes back to step S602.
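By way of illustration only, the following Python sketch summarizes steps S602-S618 for one candidate point Px. The helper arc_intersection, which returns the intersection point rm of the predefined arc line AL and the first surface 1Sm for a given main time delay, is assumed rather than specified here, since its closed form depends on the arc-line parameters ht and dt:

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def classify_source(px, m, ts12, te12, ts23, arc_intersection, c=C):
    """Decide whether a point Px lies in the TBA (target) or the CBA (cancel),
    following steps S602-S618 described above.

    px: (3,) coordinates of the candidate sound source Px
    m:  (3, 3) coordinates of microphones 111, 112 and 113 (rows M1, M2, M3)
    ts12, te12: lower/upper limits of the main time delay range for mics 111-112
    ts23: core time delay for mics 112-113 (e.g., -d2/c)
    arc_intersection: assumed helper returning the point rm where the predefined
        arc line AL meets the first surface 1Sm for a given main time delay tau12
    """
    t = np.linalg.norm(np.asarray(m, float) - np.asarray(px, float), axis=1) / c
    tau12 = t[0] - t[1]                        # S604: main time delay tx1 - tx2
    if not (ts12 < tau12 < te12):              # S606
        return "cancel"                        # S618: Px is in the CBA
    rm = np.asarray(arc_intersection(tau12), float)   # S608: intersection of AL and 1Sm
    tr = np.linalg.norm(np.asarray(m, float) - rm, axis=1) / c
    te23m = tr[1] - tr[2]                      # S610: outer time delay tr2 - tr3
    tau23 = t[1] - t[2]                        # S612: AUX time delay tx2 - tx3
    if ts23 < tau23 < te23m:                   # S614
        return "target"                        # S616: Px is in the TBA
    return "cancel"                            # S618
```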
For some cases (Layout 1B and 2B) where the microphone 113 is closer than the microphone 111/112 to the user's mouth as shown in
The neural network 760 of the invention may be implemented by any known neural network. Various machine learning techniques associated with supervised learning may be used to train a model of the neural network 760. Supervised learning techniques used to train the neural network 760 include, for example and without limitation, stochastic gradient descent (SGD). In the context of the following description, the neural network 760 operates in a supervised setting using a training dataset including multiple training examples, each training example including training input data (such as audio data in each frame of input audio signals b1[n] to bQ[n] in
In an offline phase (prior to the training phase), the processor 750 is configured to respectively collect and store a batch of time-domain single-microphone noise-free (or clean) speech audio data (with/without reverberation in different space scenarios) 711a and a batch of time-domain single-microphone noise audio data 711b into the storage device 710. For the noise audio data 711b, all sound other than the speech being monitored (the primary sound) is collected/recorded, including markets, computer fans, crowds, cars, airplanes, construction, keyboard typing, multiple-person speaking, etc. By executing one of the software programs 713 implementing any well-known acoustic simulation tool, such as Pyroomacoustics, stored in the storage device 710, the processor 750 operates as a data augmentation engine to construct different simulation scenarios involving Z sound sources, Q microphones and different acoustic environments based on a main time delay range of a lower limit TS12 to an upper limit TE12 for the two microphones 111-112, the predefined arc line AL with a vertical distance ht and a horizontal distance dt from the midpoint A1, the set M of microphone coordinates for the microphone array 110, the clean speech audio data 711a and the noise audio data 711b. By performing the sound source classifying method in
The main purpose of the data augmentation engine 750 is to help the neural network 760 generalize so that the neural network 760 can operate in different acoustic environments. Please note that besides the acoustic simulation tools (such as Pyroomacoustics) and the classifying method in
Specifically, with Pyroomacoustics, the data augmentation engine 750 respectively transforms the single-microphone clean speech audio data 711a and the single-microphone noise audio data 711b into Q-microphone augmented clean speech audio data and Q-microphone augmented noise audio data according to the set M of microphone coordinates and coordinates of both z1 target sound sources inside the TBA and z2 cancel sound sources inside the CBA, and then mixes the Q-microphone augmented clean speech audio data and the Q-microphone augmented noise audio data to generate and store the mixed Q-microphone time-domain augmented audio data 712 in the storage device 710. In particular, the Q-microphone augmented noise audio data is mixed in with the Q-microphone augmented clean speech audio data at different mixing rates to produce the mixed Q-microphone time-domain augmented audio data 712 having a wide range of SNRs.
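By way of illustration only, the following Python sketch outlines this kind of Q-microphone data augmentation with Pyroomacoustics. The room dimensions, absorption, microphone coordinates, source positions and mixing rate are hypothetical placeholders, and random noise stands in for the stored audio data 711a and 711b:

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
speech = np.random.randn(2 * fs)       # stand-in for a clip of clean speech audio data 711a
noise = 0.5 * np.random.randn(2 * fs)  # stand-in for noise audio data 711b, scaled by one mixing rate

# Hypothetical shoebox room; the headset sits roughly in the middle of it.
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs, materials=pra.Material(0.35), max_order=10)
head = np.array([3.0, 2.5, 1.5])

# Microphone coordinates (set M) as offsets from the headset position; one column per microphone.
mic_offsets = np.array([[-0.08, 0.08,  0.07],
                        [ 0.00, 0.00, -0.02],
                        [ 0.00, 0.00, -0.05]])
room.add_microphone_array(pra.MicrophoneArray(head[:, None] + mic_offsets, fs))

# One target source inside the TBA (near the mouth) and one cancel source elsewhere in the room.
room.add_source(list(head + np.array([0.0, 0.10, -0.12])), signal=speech)
room.add_source([1.0, 4.0, 1.7], signal=noise)

room.simulate()
mixed = room.mic_array.signals   # shape (Q, n_samples): mixed Q-microphone time-domain augmented audio
```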
In the training phase, the mixed Q-microphone time-domain augmented audio data 712 are used by the processor 750 as the training input data (i.e., input audio data b1[n]˜bQ[n]) for the training examples of the training dataset. Correspondingly, clean or noisy time-domain resultant audio data transformed from a combination of the clean speech audio data 711a and the noise audio data 711b according to coordinates of the z1 target sound sources and the set M of microphone coordinates are used by the processor 750 as the training output data (i.e., h[n]) for the training examples of the training dataset. Thus, in the training output data, audio data originating from the z1 target sound sources are preserved and audio data originating from the z2 cancel sound sources are cancelled.
In each magnitude & phase calculation unit 73j, the input audio stream bj[n] is firstly broken up into frames using a sliding window along the time axis so that the frames overlap each other to reduce artifacts at the boundary, and then the audio data in each frame in time domain are transformed by Fast Fourier transform (FFT) into complex-valued data in frequency domain, where 1<=j<=Q and n denotes the discrete time index. Assuming the number of sampling points in each frame (or the FFT size) is N, the time duration of each frame is Td and the frames overlap each other by Td/2, the magnitude & phase calculation unit 73j divides the input stream bj[n] into a plurality of frames and computes the FFT of the audio data in the current frame i of the input audio stream bj[n] to generate a current spectral representation Fj(i) having N complex-valued samples (F1,j(i), . . . , FN,j(i)) with a frequency resolution of fs/N(=1/Td), where 1<=j<=Q, i denotes the frame index of the input/output audio stream bj[n]/u[n]/h[n], fs denotes a sampling frequency of the input audio stream bj[n] and each frame corresponds to a different time interval of the input stream bj[n]. Next, the magnitude & phase calculation unit 73j calculates a magnitude and a phase for each of the N complex-valued samples (F1,j(i), . . . , FN,j(i)) based on its length and the arctangent function to generate a magnitude spectrum (mj(i)=m1,j(i), . . . , mN,j(i)) with N magnitude elements and a phase spectrum (Pj(i)=P1,j(i), . . . , PN,j(i)) with N phase elements for the current spectral representation Fj(i)(=F1,j(i), . . . , FN,j(i)). Then, the inner product block 73 calculates the inner product for each of N normalized-complex-valued sample pairs in any two phase spectrums Pj(i) and Pk(i) to generate R phase-difference spectrums (pdl(i)=pd1,l(i), . . . , pdN,l(i)), each phase-difference spectrum pdl(i) having N elements, where 1<=k<=Q, j≠k, 1<=l<=R, and there are R combinations of two microphones out of the Q microphones. Finally, the Q magnitude spectrums mj(i), the Q phase spectrums Pj(i) and the R phase-difference spectrums pdl(i) are used/regarded as a feature vector fv(i) and fed to the neural network 760/760T. In a preferred embodiment, the time duration Td of each frame is about 32 milliseconds (ms). However, the above time duration Td is provided by way of example and not limitation of the invention. In actual implementations, other time durations Td may be used.
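By way of illustration only, the following Python sketch computes, for a single frame index i, the Q magnitude spectrums, the Q phase spectrums and the R phase-difference spectrums that form the feature vector fv(i). Windowing and 50% overlap are assumed to be handled upstream, and the inner product of two normalized complex-valued samples is interpreted here as its real part, i.e., the cosine of the phase difference:

```python
import itertools
import numpy as np

def frame_features(frames):
    """Compute the feature vector fv(i) for one frame index i.

    frames: (Q, N) array holding the current time-domain frame of each of the
            Q input streams b1[n]..bQ[n] (framing/overlap handled upstream).
    """
    F = np.fft.fft(frames, axis=-1)           # spectral representations Fj(i), N bins each
    mags = np.abs(F)                          # magnitude spectrums mj(i)
    phases = np.angle(F)                      # phase spectrums Pj(i) via the arctangent
    # Inner product of normalized complex-valued sample pairs: for unit-magnitude
    # samples e^{jPj} and e^{jPk}, the real inner product is cos(Pj - Pk).
    phase_diffs = [np.cos(phases[j] - phases[k])
                   for j, k in itertools.combinations(range(frames.shape[0]), 2)]
    return np.concatenate([mags.ravel(), phases.ravel(), np.stack(phase_diffs).ravel()])

# Example with Q=3 streams, fs=16 kHz and a 32 ms frame (N=512), as assumed here.
fv = frame_features(np.random.randn(3, 512))
print(fv.shape)   # (3*512) + (3*512) + (3*512) = 4608 feature elements
```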
In the training phase, the neural network 760 receives the feature vector fv(i) including the Q magnitude spectrums m1(i)˜mQ(i), the Q phase spectrums P1(i)˜PQ(i) and the R phase-difference spectrums pd1(i)˜pdR(i), and then generates corresponding network output data, including N first sample values of the current frame i of a time-domain beamformed output stream u[n]. On the other hand, the training output data (ground truth), paired with the training input data (i.e., Q*N input sample values of the current frames i of the Q training input streams b1[n]˜bQ[n]) for the training examples of the training dataset, includes N second sample values of the current frame i of a training output audio stream h[n] and is transmitted to the loss function block 770 by the processor 750. If z1>0 and the neural network 760 is trained to perform the spatial filtering operation only, the training output audio stream h[n] outputted from the processor 750 would be the noisy time-domain resultant audio data (transformed from a combination of the clean speech audio data 711a and the noise audio data 711b according to coordinates of the z1 target sound sources). If z1>0 and the neural network 760 is trained to perform spatial filtering and denoising operations, the training output audio stream h[n] outputted from the processor 750 would be the clean time-domain resultant audio data (transformed from the clean speech audio data 711a according to coordinates of the z1 target sound sources). If z1=0, the training output audio stream h[n] outputted from the processor 750 would be “zero” time-domain resultant audio data, i.e., each output sample value being set to zero.
Then, the loss function block 770 adjusts parameters (e.g., weights) of the neural network 760 based on differences between the network output data and the training output data. In one embodiment, the neural network 760 is implemented by a deep complex U-Net, and correspondingly the loss function implemented in the loss function block 770 is the weighted source-to-distortion ratio (weighted-SDR) loss, disclosed by Choi et al., “Phase-aware speech enhancement with deep complex U-net”, a conference paper at ICLR 2019. However, it should be understood that the deep complex U-Net and the weighted-SDR loss have been presented by way of example only, and not limitation of the invention. In actual implementations, any other neural networks and loss functions can be used, and this also falls within the scope of the invention. Finally, the neural network 760 is trained so that the network output data (i.e., the N first sample values in u[n]) produced by the neural network 760 matches the training output data (i.e., the N second sample values in h[n]) as closely as possible when the training input data (i.e., the Q*N input sample values in b1[n]˜bQ[n]) paired with the training output data is processed by the neural network 760.
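By way of illustration only, the following Python (PyTorch) sketch shows a weighted-SDR loss in the style of Choi et al.; the exact formulation used by the loss function block 770 is not reproduced here, and the tensor shapes and variable names are assumptions:

```python
import torch

def weighted_sdr_loss(x, y, y_hat, eps=1e-8):
    """Weighted-SDR loss sketched after Choi et al. (ICLR 2019).

    x:     mixed (noisy) time-domain frame at a reference microphone, shape (batch, samples)
    y:     training output data h[n] (ground truth), same shape
    y_hat: network output data u[n] (beamformed estimate), same shape
    """
    def neg_cos(a, b):
        # negative cosine similarity; approaches -1 when a and b are aligned
        return -torch.sum(a * b, dim=-1) / (a.norm(dim=-1) * b.norm(dim=-1) + eps)

    n, n_hat = x - y, x - y_hat                      # true and estimated residual noise
    ey, en = torch.sum(y ** 2, dim=-1), torch.sum(n ** 2, dim=-1)
    alpha = ey / (ey + en + eps)                     # energy-based weighting term
    return torch.mean(alpha * neg_cos(y, y_hat) + (1 - alpha) * neg_cos(n, n_hat))

# Example of one training step (hypothetical names for the network and optimizer):
# loss = weighted_sdr_loss(x_batch, h_batch, net(fv_batch)); loss.backward(); optimizer.step()
```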
The inference phase is divided into a test stage (e.g., the microphone system 700t is tested by an engineer in an R&D department to verify performance) and a practice stage (i.e., the microphone system 700I is ready on the market).
The performance of the microphone system 100 of the invention has been tested and verified according to two test specifications in
While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention should not be limited to the specific construction and arrangement shown and described, since various other modifications may occur to those ordinarily skilled in the art.