The disclosure relates to selecting an output wave beam of a microphone array, and specifically to a method for selecting an output wave beam of a microphone array based on voice existence probability.
A microphone array can perform beamforming in multiple directions. However, due to limitations of output hardware resources or application scenarios, usually only the beam in one direction can be selected as the output signal. Selecting the output wave beam of the microphone array is essentially estimating the direction of the source of the voice signal. Correctly judging the direction of the voice signal maximizes the effect of a beamforming algorithm; conversely, selecting a non-optimal wave beam as the output may greatly reduce the noise suppression effect of the beamforming algorithm. Therefore, in practice, the output wave beam selection mechanism, as a process subsequent to the beamforming algorithm, is of great significance to the research and development of voice signal processing systems using microphone arrays.
The inventor has noticed that while attempts have been made in the prior art to propose different methods for selecting an output wave beam of a microphone array, these existing methods still have at least the following deficiencies:
For example, Chinese Patent Publication No. CN103888861B discloses a method for adjusting the directivity of a microphone array, in which the method first receives voice information, judges the information of the pre-speaker according to the voice information, and determines the direction of the pre-speaker's location according to the judging result. This method requires the speaker's identity information to be stored in advance, so wave beam directivity adjustment cannot be performed for speakers who have not been stored.
For another example, Chinese Patent Application Publication No. CN109119092A discloses a method for switching the directivity of a wave beam based on a microphone array. The method utilizes only the phase delay information between the microphones and the energy information of each beam and cannot distinguish between human voice signals and non-human voice signals; it is therefore susceptible to interference from high-volume unstable noise.
For a further example, Chinese patent application with the Publication No. CN109473118A discloses a dual-channel voice enhancement method, in which the target wave beam is enhanced only according to the existence probability of the sound to be enhanced in the target wave beam, and the wave beam selection is performed based on the ratio of the voice existence probability of each wave beam therein. In practice, this method has the disadvantage of being susceptible to interference from low volume unstable signals.
For yet another example, Chinese Patent Application Publication No. CN108899044A discloses a voice signal processing method, in which the correlation between the voice signals and the content is determined by utilizing the wake word existence probability. Specifically, the voice signals are first input into the wake word engine to obtain the confidence levels of the voice signals output by the wake word engine, and then the voice existence probability and the direction of arrival of the original input signals are calculated. However, before the direction of arrival can be judged, this method relies on the wake word engine to calculate the existence probability of particular words or sentences, which in turn relies on voice recognition technology; the method can therefore only be applied to a voice signal processing system with a wake-up function. In addition, the calculation of the wake word existence probability and the vector operations required by the method increase its computational complexity, making it impractical to implement on resource-constrained devices such as IoT microcontroller units (MCUs).
To sum up, there is a need in the prior art for a method for selecting an output wave beam of a microphone array to solve the above problems in the prior art. It should be understood that the technical problems listed above are only examples rather than limitations of the disclosure, and the disclosure is not limited to technical solutions that simultaneously solve all the above technical problems. The technical solutions of the disclosure may be implemented to solve one or more of the above or other technical problems.
In view of the above problems, the object of the disclosure is to provide a method for selecting an output wave beam of a microphone array, which does not rely on pre-stored speaker information, does not require wake word recognition before recognizing a direction of arrival, and can reduce both the high volume noise interference and low volume unstable signal interference, and has reduced computational complexity.
In one aspect of the disclosure, a method is provided for selecting an output wave beam of a microphone array, the method comprising the following steps: (a) receiving a plurality of sound signals from the microphone array comprising a plurality of microphones, and performing beamforming on the plurality of sound signals to obtain a plurality of wave beams and corresponding wave beam output signals; (b) performing the following operations on each wave beam in the plurality of wave beams: converting the wave beam output signal of a current wave beam from time domain to frequency domain to obtain a frequency spectrum vector and a power spectrum vector of the current wave beam; on the basis of the frequency spectrum vector and the power spectrum vector of the current wave beam, calculating an overall voice signal energy of the current wave beam, wherein the overall voice signal energy is a product of an overall energy and an overall voice existence probability of the current wave beam, wherein the overall energy indicates an energy level of the wave beam output signal of the current wave beam, the overall voice existence probability indicates an existence probability of voice in the wave beam output signal of the current wave beam, and the overall voice existence probability and the overall energy are scalar quantities; and (c) selecting a wave beam with a maximal overall voice signal energy value as an output wave beam.
Optionally, the frequency spectrum vector is obtained by performing Short-Time Fourier Transform (STFT) or Short-Time Discrete Cosine Transform (DCT) on the wave beam output signal of the current wave beam.
Optionally, in step (b), after the frequency spectrum vector and the power spectrum vector of the current wave beam are obtained, the power spectrum vector is updated with the frequency spectrum vector according to the following formula:
$$S_b(f,t)=\alpha_1\,S_b(f,t-1)+(1-\alpha_1)\,|Y_b(f,t)|^2$$
Preferably, α1 is greater than or equal to 0.9 and less than or equal to 0.99.
Optionally, in step (b), before the overall voice signal energy of the current wave beam is calculated based on the frequency spectrum vector and the power spectrum vector of the current wave beam, a local energy minimum value corresponding to each element in the power spectrum vector of the current wave beam is determined.
Optionally, determining the local energy minimum value corresponding to each element in the power spectrum vector of the current wave beam comprises: maintaining two vectors Sb,min and Sb,tmp with the same length as the frequency spectrum vector, and with an initial value of zero;
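Each element of vectors Sb,min and Sb,tmp is updated according to the following formulas:
$$S_{b,\min}(f,t)=\min\{S_{b,\min}(f,t-1),\,S_b(f,t)\}$$
$$S_{b,\mathrm{tmp}}(f,t)=\min\{S_{b,\mathrm{tmp}}(f,t-1),\,S_b(f,t)\}$$
and each time L frames of signals have been processed, that is, whenever t is a multiple of L, Sb,min and Sb,tmp are instead reset according to the following formulas:
$$S_{b,\min}(f,t)=\min\{S_{b,\mathrm{tmp}}(f,t-1),\,S_b(f,t)\}$$
$$S_{b,\mathrm{tmp}}(f,t)=S_b(f,t)$$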
Preferably, L is set such that L frames of signals span 200 milliseconds to 500 milliseconds.
Optionally, the overall energy is obtained according to the following steps: averaging all elements of the power spectrum vector to obtain the overall energy.
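Optionally, averaging all elements of the power spectrum vector to obtain the overall energy comprises: performing weighted averaging on all elements of the power spectrum vector to obtain the overall energy, wherein for each element in the power spectrum vector, if the frequency point corresponding to the element falls in the range of 0-5 kHz, the element is given a weight of 1, otherwise it is given a weight of 0.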
Optionally, the overall voice existence probability is obtained according to the following steps: for each element in a signal power spectrum vector of the current wave beam, calculating a voice existence probability corresponding to the element according to a voice existence probability model, so as to generate a voice existence probability vector of the current wave beam; and updating each element of the voice existence probability vector of the current wave beam according to the following formula:
$$p_b(f,t)=\alpha_2\,p_b(f,t-1)+(1-\alpha_2)\,I(b,f,t)$$
Preferably, α2 is greater than or equal to 0.8 and less than or equal to 0.99.
Optionally, averaging all elements of the voice existence probability vector to obtain the overall voice existence probability comprises: performing weighted averaging on all elements of the voice existence probability vector to obtain the overall voice existence probability, wherein for each element in the voice existence probability vector, if the frequency point corresponding to the element falls in the range of 0-5 kHz, the element is given a weight of 1, otherwise it is given a weight of 0.
Preferably, in step (b), after the overall voice signal energy of the current wave beam is calculated, the overall voice signal energy of the current wave beam is updated according to the following formula:
$$d_b(t)=\alpha_3\,d_b(t-1)+(1-\alpha_3)\,J(b,t)$$
Preferably, α3 is greater than or equal to 0.8 and less than or equal to 0.99.
The solution of the disclosure calculates the overall voice signal energy of each wave beam and selects the output wave beam of the microphone array accordingly. In particular, the overall voice signal energy takes into account both the overall energy of the wave beam and the overall voice existence probability, so the wave beam selection is driven by both the wave beam energy and the voice existence probability. This does not require speaker information to be acquired in advance, overcomes the interference of non-human noises, and does not require any voice recognition before the direction of arrival is recognized. In addition, the overall voice signal energy is a product of scalar quantities, which helps reduce vector calculations and lowers computational complexity.
It should be understood that the foregoing description of the background and summary of the invention is only intended to be illustrative rather than limiting.
The disclosure will be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show exemplary embodiments by way of illustration. It should be understood that the embodiments shown in the accompanying drawings and described hereinafter are only illustrative and not intended to limit the disclosure.
Method 100 shown in
The method 100 further comprises: (b) as shown in steps 104 to 108, performing the following operations on each wave beam in the plurality of wave beams: converting the wave beam output signal of a current wave beam from time domain to frequency domain to obtain a frequency spectrum vector and a power spectrum vector of the current wave beam (step 104); on the basis of the frequency spectrum vector and the power spectrum vector of the current wave beam, calculating an overall voice signal energy of the current wave beam (step 106), wherein the overall voice signal energy is a product of an overall energy and an overall voice existence probability of the current wave beam, wherein the overall energy indicates an energy level of the wave beam output signal of the current wave beam, the overall voice existence probability indicates an existence probability of voice in the wave beam output signal of the current wave beam, and the overall voice existence probability and the overall energy are scalar quantities.
The method further comprises: (c) as shown in step 110, selecting a wave beam with a maximal overall voice signal energy value as an output wave beam.
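The selection rule of steps (b) and (c) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the disclosed implementation: the function name `select_output_beam` and the example numbers are hypothetical, and the per-beam overall energy and overall voice existence probability are assumed to have been computed already.

```python
def select_output_beam(overall_energy, overall_voice_prob):
    """Return (best_index, scores), where scores[b] = energy[b] * probability[b].

    Both arguments hold one scalar per beam (hypothetical inputs).
    """
    if len(overall_energy) != len(overall_voice_prob):
        raise ValueError("one energy and one probability are needed per beam")
    scores = [e * q for e, q in zip(overall_energy, overall_voice_prob)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical scene: beam 0 points at loud non-voice noise, beam 1 at a
# quieter talker, beam 2 at a low-volume signal that looks voice-like.
energy = [9.0, 4.0, 0.5]
voice_prob = [0.1, 0.8, 0.9]
best_beam, scores = select_output_beam(energy, voice_prob)
print(best_beam, scores)
```

In this made-up scene, ranking by energy alone would pick beam 0 and ranking by probability alone would pick beam 2, whereas the product picks beam 1.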
Method 200 begins from step 202, in which the wave beams output by the beamforming algorithm are transformed into the STFT domain, and the power spectrum vector of each wave beam is updated with the frequency spectrum information. Specifically, it is assumed that the beamforming algorithm outputs B wave beams, which are transformed into the Short-Time Fourier Transform (STFT) domain of F points. The output signal of the b-th (b=1, 2, . . . , B) wave beam may then be represented as an F-dimensional frequency spectrum vector Yb in the STFT domain, and the f-th element Yb(f) of the vector Yb represents the frequency information of the signal at the frequency f. The squared modulus of each frequency point of the vector Yb is recursively averaged into the power spectrum vector Sb, which is updated according to the following formula, where t denotes the frame index and α1 is a smoothing parameter (preferably between 0.9 and 0.99):
$$S_b(f,t)=\alpha_1\,S_b(f,t-1)+(1-\alpha_1)\,|Y_b(f,t)|^2$$
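The recursive update of step 202 can be sketched for a single wave beam as follows. This is only an illustrative sketch: the function name `update_power_spectrum`, the three-bin spectrum, and the zero starting value of the power spectrum are assumptions made for the example.

```python
def update_power_spectrum(s_prev, y, alpha1=0.95):
    """One step of S(f,t) = alpha1 * S(f,t-1) + (1 - alpha1) * |Y(f,t)|^2.

    s_prev: previous power spectrum, one float per frequency point.
    y:      current complex spectrum, same length as s_prev.
    """
    if not 0.0 <= alpha1 <= 1.0:
        raise ValueError("alpha1 must lie in [0, 1]")
    if len(s_prev) != len(y):
        raise ValueError("S and Y must have the same length F")
    return [alpha1 * s + (1.0 - alpha1) * abs(v) ** 2 for s, v in zip(s_prev, y)]


# Hypothetical 3-point spectrum held constant for 200 frames.
Y = [3 + 4j, 1 + 0j, 0j]
S = [0.0, 0.0, 0.0]  # assumed starting value
for _ in range(200):
    S = update_power_spectrum(S, Y, alpha1=0.95)
```

With a constant input, each element of `S` converges toward the squared modulus of the corresponding element of `Y`; a value of α1 closer to 1 makes that convergence slower and the estimate smoother.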
In step 204, update the estimate of the local energy minimum value Sb,min of the current wave beam. For example, the local energy minimum value estimate may be updated according to the method 300 shown in
In step 302, maintain two vectors Sb,min and Sb,tmp with a length of F, both initialized to 0 (that is, Sb,min(f,0)=Sb,tmp(f,0)=0 holds for all f).
In step 304, determine whether a next element exists in the power spectrum vector of the current wave beam Sb. If yes, go to step 306; if no, which means that each element of the power spectrum vector of the current wave beam has been processed, go to step 312, and obtain the local minimum energy value corresponding to each element.
In step 306, update the current element corresponding to each frequency point in the following manner:
$$S_{b,\min}(f,t)=\min\{S_{b,\min}(f,t-1),\,S_b(f,t)\}$$
$$S_{b,\mathrm{tmp}}(f,t)=\min\{S_{b,\mathrm{tmp}}(f,t-1),\,S_b(f,t)\}$$
In step 308, judge whether L frames of signals have been processed, that is, whether t is a multiple of L. Each time L frames of signals have been processed, in step 310, reset Sb,min and Sb,tmp in the following manner:
$$S_{b,\min}(f,t)=\min\{S_{b,\mathrm{tmp}}(f,t-1),\,S_b(f,t)\}$$
$$S_{b,\mathrm{tmp}}(f,t)=S_b(f,t)$$
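Steps 302 to 310 can be sketched as follows for one wave beam. This is a minimal sketch under assumptions: the function name `update_local_minimum`, the window length of 3 frames, and the single-bin power track are hypothetical, and frames are assumed to be numbered from t=1.

```python
def update_local_minimum(s_min, s_tmp, s, t, window_len):
    """Advance both minimum trackers by one frame and return the new lists.

    s_min, s_tmp: trackers from frame t-1 (one float per frequency point).
    s:            power spectrum of frame t.
    t:            1-based frame index; window_len plays the role of L.
    """
    if t % window_len == 0:
        # Every L frames: restart from the temporary minimum (step 310).
        new_min = [min(tmp, cur) for tmp, cur in zip(s_tmp, s)]
        new_tmp = list(s)
    else:
        # Ordinary frame: both trackers keep the running minimum (step 306).
        new_min = [min(m, cur) for m, cur in zip(s_min, s)]
        new_tmp = [min(tmp, cur) for tmp, cur in zip(s_tmp, s)]
    return new_min, new_tmp


window_len = 3
power_track = [5.0, 4.0, 6.0, 7.0, 3.0, 8.0, 9.0, 10.0, 11.0]
s_min, s_tmp = [0.0], [0.0]  # zero initial values, as in step 302
history = []
for t, value in enumerate(power_track, start=1):
    s_min, s_tmp = update_local_minimum(s_min, s_tmp, [value], t, window_len)
    history.append(s_min[0])
```

With zero initial values the tracked minimum stays at 0 until the second reset, so in this sketch the estimate only becomes meaningful after 2L frames. After that, the periodic reset lets the minimum rise again when the signal floor rises, which a plain running minimum could not do.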
Returning to
$$p_b(f,t)=\alpha_2\,p_b(f,t-1)+(1-\alpha_2)\,I(b,f,t)$$
The value of function I(b,f) is
It should be understood that step 206 may be implemented using the method of Cohen, I. and Berdugo, B., "Noise estimation by minima controlled recursive averaging for robust speech enhancement," IEEE Signal Processing Letters, 9(1): 12-15, 2002, or its variants, or using other algorithms for estimating the probability of voice signals. In each case, the input to the algorithm is required to be the signal power spectrum Sb, and the output is the voice probability pb between 0 and 1.
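The smoothing of the voice existence probability can be sketched as follows. The indicator used here is an assumption borrowed from the minima controlled recursive averaging approach cited above, in which a frequency point is flagged when its power exceeds its local minimum by a threshold ratio; it is not necessarily the indicator defined in this disclosure. The function name, the threshold `delta`, and the handling of a zero minimum are likewise hypothetical.

```python
def update_voice_probability(p_prev, s, s_min, alpha2=0.9, delta=5.0):
    """One step of p(f,t) = alpha2 * p(f,t-1) + (1 - alpha2) * I(f,t).

    I(f,t) is an assumed ratio test against the local energy minimum.
    """
    p_new = []
    for p, power, floor in zip(p_prev, s, s_min):
        # Assumed indicator: 1 when the power clearly exceeds the local minimum.
        # A zero minimum (tracker not warmed up yet) is treated as "no voice".
        indicator = 1.0 if floor > 0.0 and power > delta * floor else 0.0
        p_new.append(alpha2 * p + (1.0 - alpha2) * indicator)
    return p_new


p = [0.0, 0.0]
s_min = [1.0, 1.0]
for _ in range(30):
    # Point 0 stays 20x above its minimum; point 1 stays close to it.
    p = update_voice_probability(p, [20.0, 1.2], s_min)
```

Because the update is a convex combination of the previous probability and a value in {0, 1}, each element stays between 0 and 1 as long as it starts there.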
In step 208, perform weighted averaging on the voice existence probability vector pb to obtain the overall voice existence probability qb of wave beam b. Frequency points in the range of 0-5 kHz are given a weight of 1, and all other frequency points are given a weight of 0. Using the scalar quantity qb instead of the vector pb in subsequent steps simplifies the calculations. At the same time, since the energy of human voice lies predominantly below 5 kHz, discarding the signals above this frequency can be considered not to affect the final result.
In step 210, perform the same weighted averaging on the power spectrum vector Sb to obtain the overall energy eb of wave beam b. As before, frequency points in the range of 0-5 kHz are given a weight of 1, and all other frequency points are given a weight of 0.
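The band-limited averaging of steps 208 and 210 can be sketched as follows. The sketch assumes a one-sided spectrum in which element k corresponds to the frequency k·fs/N, and it normalizes by the sum of the weights; the function name, the sample rate, and the transform size are hypothetical choices for the example.

```python
def band_average(values, sample_rate_hz, fft_size, max_freq_hz=5000.0):
    """Average the elements whose frequency lies in [0, max_freq_hz].

    Elements inside the band get weight 1, all others weight 0.
    """
    weights = [1.0 if k * sample_rate_hz / fft_size <= max_freq_hz else 0.0
               for k in range(len(values))]
    total_weight = sum(weights)
    if total_weight == 0.0:
        raise ValueError("no frequency point falls inside the band")
    return sum(w * v for w, v in zip(weights, values)) / total_weight


# Assumed setup: 16 kHz sampling, 16-point transform, so the 9 one-sided
# frequency points sit at 0, 1000, ..., 8000 Hz.
p_b = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
S_b = [6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 100.0, 100.0, 100.0]
q_b = band_average(p_b, sample_rate_hz=16000, fft_size=16)
e_b = band_average(S_b, sample_rate_hz=16000, fft_size=16)
```

In this example the strong content above 5 kHz in `S_b` does not contribute to `e_b`, so high-frequency noise cannot inflate the energy of a wave beam.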
In step 212, calculate the overall voice signal energy of the current wave beam. db is defined as the voice signal energy of wave beam b, with an initial value of 0 (i.e., db(0)=0), and is updated for each frame in the following manner:
$$d_b(t)=\alpha_3\,d_b(t-1)+(1-\alpha_3)\,J(b,t)$$
The parameter α3 is between 0 and 1, and the recommended setting is 0.8 to 0.99. The function J(b,t) represents the voice signal energy of the current frame. Consistent with the definition of the overall voice signal energy as the product of the overall energy and the overall voice existence probability, its value is
$$J(b,t)=e_b(t)\,q_b(t)$$
In step 214, determine whether a next wave beam exists. If yes, go back to step 204, and execute steps 204-212 for the next wave beam; if not, go to step 218.
In step 218, the wave beam with the maximal overall voice signal energy is determined and selected as the output wave beam. Specifically, the wave beam b corresponding to the maximum value in the overall voice signal energy set {db} (b=1, 2, . . . , B) is taken as the output wave beam.
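Steps 212 to 218 can be sketched together as follows. The sketch takes the per-frame voice signal energy as the product of the overall energy and the overall voice existence probability, following the definition of the overall voice signal energy given for the method; the function names and the two-beam scenario are hypothetical.

```python
def update_beam_scores(d_prev, energy, voice_prob, alpha3=0.9):
    """One step of d_b(t) = alpha3 * d_b(t-1) + (1 - alpha3) * e_b(t) * q_b(t)."""
    return [alpha3 * d + (1.0 - alpha3) * e * q
            for d, e, q in zip(d_prev, energy, voice_prob)]


def pick_beam(d):
    """Index of the beam with the largest smoothed voice signal energy."""
    return max(range(len(d)), key=d.__getitem__)


d = [0.0, 0.0]  # d_b(0) = 0 for both beams
# Each frame is (e_b per beam, q_b per beam). A talker sits on beam 0 ...
frames = [([4.0, 1.0], [0.8, 0.2])] * 20
# ... followed by a single-frame burst on beam 1.
frames += [([4.0, 20.0], [0.8, 0.9])]
choices = []
for energy, voice_prob in frames:
    d = update_beam_scores(d, energy, voice_prob)
    choices.append(pick_beam(d))
```

In the last frame the instantaneous product favors beam 1, yet the smoothed values still favor beam 0, so a one-frame burst does not switch the output wave beam in this example. A smaller α3 would make the selection react faster at the cost of more switching.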
The above embodiments provide specific operation processes by way of example, but it should be understood that the protection scope of the disclosure is not limited thereto.
While various embodiments of various aspects of the invention have been described for the purpose of the disclosure, it shall not be understood that the teaching of the disclosure is limited to these embodiments. The features disclosed in a specific embodiment are therefore not limited to that embodiment, but may be combined with the features disclosed in different embodiments. Furthermore, it should be understood that the method steps described above may be performed sequentially, performed in parallel, combined into fewer steps, split into more steps, or combined and/or omitted in ways other than those described. Those skilled in the art should appreciate that further optional embodiments are possible and that various changes and modifications may be made to the above components and configurations without departing from the scope defined by the claims of the disclosure.
Number | Date | Country | Kind |
---|---|---|---|
201911097476.0 | Nov 2019 | CN | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2020/128274 | 11/12/2020 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2021/093798 | 5/20/2021 | WO | A |
Number | Date | Country |
---|---|---|
20220399028 A1 | Dec 2022 | US |