The disclosure relates generally to methods and devices for processing audio signals, and more particularly it relates to methods and devices for improving voice quality using signals from an accelerometer sensor and a microphone array.
Bone conduction sensors have been studied and used to improve speech quality in communication devices because of their immunity to ambient noise in acoustically noisy environments. These sensor signals, or bone-conducted signals, however, represent the speech signal well only at low frequencies, unlike regular air-conducted microphones, which capture sound with rich bandwidth for both speech signals and background noise. Therefore, combining a bone-conducted sensor signal with an air-conducted acoustic signal to enhance speech quality is of great interest for communication devices used in noisy environments.
A method and a device for improving voice quality are provided herein. Signals from an accelerometer sensor and a microphone array are used for speech enhancement in wearable devices such as earbuds, neckbands, and glasses. All signals from the accelerometer sensor and the microphone array are processed in the time-frequency domain for speech enhancement.
In an embodiment, a method for improving voice quality is provided herein. The method comprises receiving acoustic signals from a microphone array; receiving sensor signals from an accelerometer sensor; generating, by a beamformer, a speech output signal and a noise output signal according to the acoustic signals; best-estimating the speech output signal according to the sensor signals to generate a best-estimated signal; and generating a mixed signal according to the speech output signal and the best-estimated signal.
According to an embodiment of the invention, the method further comprises removing DC content of the acoustic signals from the microphone array and pre-emphasizing the acoustic signals to generate pre-emphasized acoustic signals; and performing short-term Fourier transform on the pre-emphasized acoustic signals to generate frequency-domain acoustic signals.
According to an embodiment of the invention, the step of generating, by the beamformer, the speech output signal and the noise output signal according to the acoustic signals comprises applying a spatial filter to the frequency-domain acoustic signals to generate the speech output signal and the noise output signal. The speech output signal is steered toward a first direction of a target speech and the noise output signal is steered toward a second direction. The second direction is opposite to the first direction.
According to an embodiment of the invention, the sensor signals comprise an X-axis signal, a Y-axis signal, and a Z-axis signal. The method further comprises removing DC content of the X-axis signal, the Y-axis signal, and the Z-axis signal from the accelerometer sensor and pre-emphasizing the X-axis signal, the Y-axis signal, and the Z-axis signal to generate a pre-emphasized X-axis signal, a pre-emphasized Y-axis signal, and a pre-emphasized Z-axis signal; and performing short-term Fourier transform on the pre-emphasized X-axis signal, the pre-emphasized Y-axis signal, and the pre-emphasized Z-axis signal to generate a frequency-domain X-axis signal, a frequency-domain Y-axis signal, and a frequency-domain Z-axis signal respectively.
According to an embodiment of the invention, the step of best-estimating the speech output signal by the sensor signals to generate a best-estimated signal further comprises applying an adaptive algorithm to the frequency-domain X-axis signal and the speech output signal to generate a first estimated signal; applying the adaptive algorithm to the frequency-domain Y-axis signal and the speech output signal to generate a second estimated signal; applying the adaptive algorithm to the frequency-domain Z-axis signal and the speech output signal to generate a third estimated signal; and selecting one with a maximal amplitude from the first estimated signal, the second estimated signal, and the third estimated signal to generate the best-estimated signal.
According to an embodiment of the invention, the adaptive algorithm is the least mean square (LMS) algorithm, and a mean-square error between the frequency-domain X-axis signal and the speech output signal, a mean-square error between the frequency-domain Y-axis signal and the speech output signal, and a mean-square error between the frequency-domain Z-axis signal and the speech output signal are minimized.
According to another embodiment of the invention, the adaptive algorithm is the least squares (LS) algorithm, and a least-squares error between the frequency-domain X-axis signal and the speech output signal, a least-squares error between the frequency-domain Y-axis signal and the speech output signal, and a least-squares error between the frequency-domain Z-axis signal and the speech output signal are minimized.
According to an embodiment of the invention, the accelerometer sensor has a maximum sensing frequency. The step of generating the mixed signal according to the speech output signal and the best-estimated signal further comprises: when a first frequency range of the mixed signal does not exceed the maximum sensing frequency, selecting the one with the minimal amplitude from the speech output signal and the best-estimated signal to represent the first frequency range of the mixed signal; and when a second frequency range of the mixed signal exceeds the maximum sensing frequency, selecting the speech output signal corresponding to the second frequency range to represent the second frequency range of the mixed signal.
According to an embodiment of the invention, the method further comprises after the mixed signal is generated, cancelling noise in the mixed signal with the noise output signal as a reference via an adaptive algorithm to generate a noise-cancelled mixed signal; suppressing noise in the noise-cancelled mixed signal with the noise output signal as a reference via a speech enhancement algorithm to generate a speech-enhanced signal; converting the speech-enhanced signal into time-domain to generate a time-domain speech-enhanced signal; and performing post-processing on the time-domain speech-enhanced signal to generate a speech signal.
According to an embodiment of the invention, the adaptive algorithm comprises the least mean square (LMS) algorithm and the least squares (LS) algorithm. The speech enhancement algorithm comprises spectral subtraction, the Wiener filter, and minimum mean square error (MMSE) estimation. The post-processing comprises de-emphasis, equalization, and dynamic gain control.
In an embodiment, a device for improving voice quality comprises a microphone array, an accelerometer sensor, a beamformer, and a speech estimator. The accelerometer sensor has a maximum sensing frequency. The beamformer generates a speech output signal and a noise output signal according to acoustic signals from the microphone array. The speech estimator best-estimates the speech output signal according to sensor signals from the accelerometer sensor to generate a best-estimated signal and generates a mixed signal according to the speech output signal and the best-estimated signal.
According to an embodiment of the invention, the device further comprises a first pre-processor and a first STFT analyzer. The first pre-processor removes DC content of the acoustic signals and pre-emphasizes the acoustic signals to generate pre-emphasized acoustic signals. The first STFT analyzer performs short-term Fourier transform on the pre-emphasized acoustic signals to generate frequency-domain acoustic signals.
According to an embodiment of the invention, the beamformer applies a spatial filter to the frequency-domain acoustic signals to generate the speech output signal and the noise output signal. The speech output signal is steered toward a first direction of a target speech and the noise output signal is steered toward a second direction, wherein the second direction is opposite to the first direction.
According to an embodiment of the invention, the sensor signals comprise an X-axis signal, a Y-axis signal, and a Z-axis signal. The device further comprises a second pre-processor and a second STFT analyzer. The second pre-processor removes DC content of the X-axis signal, the Y-axis signal, and the Z-axis signal and pre-emphasizes the X-axis signal, the Y-axis signal, and the Z-axis signal to generate a pre-emphasized X-axis signal, a pre-emphasized Y-axis signal, and a pre-emphasized Z-axis signal. The second STFT analyzer performs short-term Fourier transform on the pre-emphasized X-axis signal, the pre-emphasized Y-axis signal, and the pre-emphasized Z-axis signal to generate a frequency-domain X-axis signal, a frequency-domain Y-axis signal, and a frequency-domain Z-axis signal respectively.
According to an embodiment of the invention, the speech estimator further comprises a first adaptive filter, a second adaptive filter, a third adaptive filter, and a first selector. The first adaptive filter applies an adaptive algorithm to the frequency-domain X-axis signal and the speech output signal to generate a first estimated signal; a difference between the first estimated signal and the speech output signal is minimized. The second adaptive filter applies the adaptive algorithm to the frequency-domain Y-axis signal and the speech output signal to generate a second estimated signal; a difference between the second estimated signal and the speech output signal is minimized. The third adaptive filter applies the adaptive algorithm to the frequency-domain Z-axis signal and the speech output signal to generate a third estimated signal; a difference between the third estimated signal and the speech output signal is minimized. The first selector selects the one with the maximal amplitude from the first estimated signal, the second estimated signal, and the third estimated signal to generate the best-estimated signal.
According to an embodiment of the invention, the adaptive algorithm is the least mean square (LMS) algorithm, and a mean-square error between the frequency-domain X-axis signal and the speech output signal, a mean-square error between the frequency-domain Y-axis signal and the speech output signal, and a mean-square error between the frequency-domain Z-axis signal and the speech output signal are minimized.
According to another embodiment of the invention, the adaptive algorithm is the least squares (LS) algorithm, and a least-squares error between the frequency-domain X-axis signal and the speech output signal, a least-squares error between the frequency-domain Y-axis signal and the speech output signal, and a least-squares error between the frequency-domain Z-axis signal and the speech output signal are minimized.
According to an embodiment of the invention, the speech estimator further comprises a second selector. When a first frequency range of the mixed signal does not exceed the maximum sensing frequency, the second selector selects one with a minimal amplitude from the speech output signal and the best-estimated signal to represent the first frequency range of the mixed signal. When a second frequency range of the mixed signal exceeds the maximum sensing frequency, the second selector selects the speech output signal corresponding to the second frequency range to represent the second frequency range of the mixed signal.
According to an embodiment of the invention, the device further comprises a noise canceller, a noise suppressor, an STFT synthesizer, and a post-processor. The noise canceller cancels noise in the mixed signal with the noise output signal as a reference via an adaptive algorithm to generate a noise-cancelled mixed signal. The noise suppressor suppresses noise in the noise-cancelled mixed signal with the noise output signal as a reference via a speech enhancement algorithm to generate a speech-enhanced signal. The STFT synthesizer converts the speech-enhanced signal into time-domain to generate a time-domain speech-enhanced signal. The post-processor performs post-processing on the time-domain speech-enhanced signal to generate a speech signal.
According to an embodiment of the invention, the adaptive algorithm comprises the least mean square (LMS) algorithm and the least squares (LS) algorithm. The speech enhancement algorithm comprises spectral subtraction, the Wiener filter, and minimum mean square error (MMSE) estimation, wherein the post-processing comprises de-emphasis, equalization, and dynamic gain control.
A detailed description is given in the following embodiments with reference to the accompanying drawings.
The invention can be more fully understood by reading the subsequent detailed description and examples with references made to the accompanying drawings, wherein:
This description is made for the purpose of illustrating the general principles of the invention and should not be taken in a limiting sense. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. The scope of the invention is best determined by reference to the appended claims.
It will be understood that, in the description herein and throughout the claims that follow, although the terms “first,” “second,” etc. may be used to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the embodiments.
It is understood that the following disclosure provides many different embodiments, or examples, for implementing different features of the application. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. Moreover, the formation of a feature on, connected to, and/or coupled to another feature in the present disclosure that follows may include embodiments in which the features are formed in direct contact, and may also include embodiments in which additional features may be formed interposing the features, such that the features may not be in direct contact.
As shown in the accompanying drawings, the device 100 receives the acoustic signals m1(t) and m2(t) from a microphone array 10 and the X-axis sensor signal ax(t), the Y-axis sensor signal ay(t), and the Z-axis sensor signal az(t) from an accelerometer sensor 20, and includes a first pre-processor 101, a first STFT analyzer 102, and a beamformer 103. The first pre-processor 101 removes the DC content of the acoustic signals m1(t) and m2(t) from the microphone array 10 and pre-emphasizes them to generate pre-emphasized acoustic signals m1pe(t) and m2pe(t).
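By way of illustration (not part of the original disclosure), the DC removal and pre-emphasis performed by the first pre-processor 101 can be sketched as a DC-blocking filter followed by a first-order pre-emphasis filter. The coefficient values below (0.97 pre-emphasis, 0.995 DC-blocker pole) are common textbook choices, not values taken from the disclosure.

```python
import numpy as np

def preprocess(x, alpha=0.97, r=0.995):
    """DC-blocking IIR filter followed by first-order pre-emphasis.

    alpha (pre-emphasis coefficient) and r (DC-blocker pole radius)
    are assumed typical values, not parameters from the disclosure.
    """
    x = np.asarray(x, dtype=float)
    # DC blocker: y[n] = x[n] - x[n-1] + r * y[n-1]
    y = np.empty_like(x)
    prev_x, prev_y = 0.0, 0.0
    for n, xn in enumerate(x):
        prev_y = xn - prev_x + r * prev_y
        prev_x = xn
        y[n] = prev_y
    # Pre-emphasis: z[n] = y[n] - alpha * y[n-1]
    z = np.empty_like(y)
    z[0] = y[0]
    z[1:] = y[1:] - alpha * y[:-1]
    return z
```

A constant (pure-DC) input decays toward zero at the output, while high-frequency content is boosted relative to low frequencies.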
The first STFT analyzer 102 performs a short-term Fourier transform to split the pre-emphasized acoustic signals m1pe(t) and m2pe(t) in the time domain into a plurality of frequency bins. According to an embodiment of the invention, the first STFT analyzer 102 performs the short-term Fourier transform by using an overlap-add approach, which performs a DFT on one frame of the signal with a time window overlapping the previous frame. At the output of the first STFT analyzer 102, frequency-domain acoustic signals M1(n, k) and M2(n, k), which are time-frequency representations of the two microphone signals, are obtained, where n represents a time index for one frame of data, k = 1, …, K, and K is the total number of frequency bins split over the frequency bandwidth.
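The overlapped-frame STFT analysis described above can be sketched as follows; the 256-sample frame, 50% hop, and Hann window are illustrative choices, not parameters stated in the disclosure.

```python
import numpy as np

def stft_analyze(x, frame_len=256, hop=128):
    """Split a time-domain signal into overlapped, windowed frames and
    apply the DFT to each, yielding M(n, k) for frame index n and bin k.
    Frame length, hop, and window are assumed values for illustration.
    """
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for n in range(n_frames):
        frame = x[n * hop : n * hop + frame_len] * win
        spec[n] = np.fft.rfft(frame)  # K = frame_len//2 + 1 bins
    return spec
```

For a real input, `rfft` keeps only the non-redundant half of the spectrum, so K = frame_len/2 + 1 bins cover the bandwidth from 0 to the Nyquist frequency.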
For each k, the beamformer 103 applies a spatial filter to the frequency-domain acoustic signals M1(n, k) and M2(n, k) to generate a speech output signal Bs(n, k) and a noise output signal Br(n, k). The speech output signal Bs(n, k) is steered in the direction of a target speech, and the noise output signal Br(n, k) is steered in the opposite direction of the target speech. In other words, the speech output signal Bs(n, k) is speech weighted, and the noise output signal Br(n, k) is noise weighted.
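One simple spatial filter that yields a speech-weighted output steered toward the target and an oppositely steered noise-weighted output is a two-microphone delay-and-subtract (differential) beamformer. The sketch below assumes an endfire pair with 1 cm spacing at 8 kHz; it is only one of many designs the beamformer 103 could use, and none of its parameters come from the disclosure.

```python
import numpy as np

def dual_beams(M1, M2, mic_dist=0.01, fs=8000, frame_len=256, c=343.0):
    """Frequency-domain delay-and-subtract beamformer for a two-mic
    endfire array: Bs places a null toward the rear (speech-weighted),
    Br places a null toward the front (noise-weighted).
    """
    k = np.arange(M1.shape[-1])
    f = k * fs / frame_len                  # bin center frequencies
    tau = mic_dist / c                      # inter-mic travel time
    phase = np.exp(-2j * np.pi * f * tau)   # delay of tau as a phase shift
    Bs = M1 - phase * M2   # nulls a rear source -> keeps target speech
    Br = M2 - phase * M1   # nulls a front source -> keeps rear noise
    return Bs, Br
```

A plane wave from the front reaches mic 1 first and mic 2 a time tau later, so the noise beam Br cancels it exactly while the speech beam Bs passes it.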
The device 100 further includes a second pre-processor 104, a second STFT analyzer 105, and a speech estimator 106.
The second pre-processor 104 removes the DC content of the X-axis sensor signal ax(t), the Y-axis sensor signal ay(t), and the Z-axis sensor signal az(t) and pre-emphasizes the X-axis sensor signal ax(t), the Y-axis sensor signal ay(t), and the Z-axis sensor signal az(t) from the accelerometer sensor 20 to generate a pre-emphasized X-axis signal axpe(t), a pre-emphasized Y-axis signal aype(t), and a pre-emphasized Z-axis signal azpe(t).
The second STFT analyzer 105 performs the short-term Fourier transform on the pre-emphasized X-axis signal axpe(t), the pre-emphasized Y-axis signal aype(t), and the pre-emphasized Z-axis signal azpe(t) to generate a frequency-domain X-axis signal Ax(n, k), a frequency-domain Y-axis signal Ay(n, k), and a frequency-domain Z-axis signal Az(n, k) respectively, for each frequency bin of k at the time index of n.
The speech estimator 106 best-estimates the speech output signal Bs(n, k) by using the frequency-domain X-axis signal Ax(n, k), the frequency-domain Y-axis signal Ay(n, k), and the frequency-domain Z-axis signal Az(n, k) to generate a best-estimated signal, and then generates a mixed signal S1(n, k) according to the speech output signal Bs(n, k) and the best-estimated signal. How to generate the best-estimated signal and the mixed signal S1(n, k) will be explained in the following paragraphs.
As shown in the accompanying drawings, the speech estimator 106 includes a first adaptive filter 210, a second adaptive filter 220, a third adaptive filter 230, and a first selector 240. The first adaptive filter 210 applies an adaptive algorithm to the frequency-domain X-axis signal Ax(n, k) and the speech output signal Bs(n, k) to generate a first estimated signal Rx(n, k) so that a difference between the first estimated signal Rx(n, k) and the speech output signal Bs(n, k) is minimized.
The first estimated signal Rx(n, k) is expressed as Eq. 1, where Wx(n, i), i = 0, …, I−1, are the weights of an FIR filter of order I, which are updated at each time index n for all frequency bins k = 1, …, K.

Rx(n, k) = Σ_{i=0}^{I−1} Wx(n, i) Ax(n−i, k)  (Eq. 1)
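Eq. 1 can be evaluated directly as an inner product over the I most recent accelerometer frames. The sketch below uses a single weight vector shared across all bins for brevity, whereas the disclosure updates weights for every bin k; the names are hypothetical.

```python
import numpy as np

def fir_estimate(W, A_hist):
    """Eq. 1 as code: R(n, k) = sum_i W(n, i) * A(n-i, k).

    W has shape (I,); A_hist holds the I most recent accelerometer
    frames with A_hist[i] = A(n-i, :), shape (I, K). Returns shape (K,).
    """
    return np.tensordot(W, A_hist, axes=(0, 0))
```

With W = [1, 0.5, 0], the estimate for each bin is the current frame plus half of the previous frame.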
The second adaptive filter 220 applies the adaptive algorithm to the frequency-domain Y-axis signal Ay(n, k) and the speech output signal Bs(n, k) to generate a second estimated signal Ry(n, k) so that a difference between the second estimated signal Ry(n, k) and the speech output signal Bs(n, k) is minimized.
The second estimated signal Ry(n, k) is expressed as Eq. 2, where Wy(n, i), i = 0, …, I−1, are the weights of an FIR filter of order I, which are updated at each time index n for all frequency bins k = 1, …, K.

Ry(n, k) = Σ_{i=0}^{I−1} Wy(n, i) Ay(n−i, k)  (Eq. 2)
The third adaptive filter 230 applies the adaptive algorithm to the frequency-domain Z-axis signal Az(n, k) and the speech output signal Bs(n, k) to generate a third estimated signal Rz(n, k) so that a difference between the third estimated signal Rz(n, k) and the speech output signal Bs(n, k) is minimized.
The third estimated signal Rz(n, k) is expressed as Eq. 3, where Wz(n, i), i = 0, …, I−1, are the weights of an FIR filter of order I, which are updated at each time index n for all frequency bins k = 1, …, K.

Rz(n, k) = Σ_{i=0}^{I−1} Wz(n, i) Az(n−i, k)  (Eq. 3)
According to an embodiment of the invention, the adaptive algorithm of the first adaptive filter 210, the second adaptive filter 220, and the third adaptive filter 230 may be the least mean square (LMS) algorithm, so that a mean-square error between the first estimated signal Rx(n, k) and the speech output signal Bs(n, k), a mean-square error between the second estimated signal Ry(n, k) and the speech output signal Bs(n, k), and a mean-square error between the third estimated signal Rz(n, k) and the speech output signal Bs(n, k) are minimized.
According to another embodiment of the invention, the adaptive algorithm of the first adaptive filter 210, the second adaptive filter 220, and the third adaptive filter 230 may be the least squares (LS) algorithm, so that a least-squares error between the first estimated signal Rx(n, k) and the speech output signal Bs(n, k), a least-squares error between the second estimated signal Ry(n, k) and the speech output signal Bs(n, k), and a least-squares error between the third estimated signal Rz(n, k) and the speech output signal Bs(n, k) are minimized.
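A normalized LMS variant of the adaptation described above, for one frequency bin, might look like the following. The step size mu, the regularizer eps, and the normalization itself are assumptions for the sketch; the disclosure only specifies LMS or LS.

```python
import numpy as np

def lms_step(W, A_hist, Bs_k, mu=0.1, eps=1e-8):
    """One normalized-LMS iteration for a single frequency bin:
    drives the FIR estimate R toward the beamformer speech output Bs_k,
    minimizing the mean-square error e = Bs_k - R.
    A_hist[i] holds the accelerometer bin value at frame n-i.
    """
    R = np.dot(W, A_hist)                    # current estimate R(n, k)
    e = Bs_k - R                             # error against speech output
    norm = np.vdot(A_hist, A_hist).real + eps
    W_new = W + mu * e * np.conj(A_hist) / norm
    return W_new, R, e
```

On a noiseless linear relation between the accelerometer bins and the speech output, the error decays toward zero as the weights converge.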
The first selector 240 selects the one with the maximal amplitude from the first estimated signal Rx(n, k), the second estimated signal Ry(n, k), and the third estimated signal Rz(n, k) to generate the best-estimated signal R(n, k), which is expressed as Eq. 4.

R(n, k) = Max{Rx(n, k), Ry(n, k), Rz(n, k)}  (Eq. 4)
As shown in the accompanying drawings, the speech estimator 106 further includes a second selector. According to an embodiment of the invention, the maximum sensing frequency of the accelerometer sensor 20 is the maximum frequency that the accelerometer sensor 20 is able to sense. When a second frequency range of the mixed signal S1(n, k) exceeds the maximum sensing frequency of the accelerometer sensor 20, the second selector selects the speech output signal Bs(n, k) corresponding to the second frequency range to represent the second frequency range of the mixed signal S1(n, k).
The mixed signal S1(n, k) is expressed as Eq. 5, where Min{ } stands for taking the element with the minimal amplitude, and Ks is an integer threshold chosen in practice based on the maximum sensing frequency of the accelerometer being used.

S1(n, k) = Min{R(n, k), Bs(n, k)} for k = 1, …, Ks; S1(n, k) = Bs(n, k) for k = Ks+1, …, K  (Eq. 5)
In other words, the one having the minimal amplitude from the best-estimated signal R(n, k) and the speech output signal Bs(n, k) is selected to represent the mixed signal S1(n, k) when the frequency of the mixed signal S1(n, k) does not exceed the maximum sensing frequency of the accelerometer sensor 20; the speech output signal Bs(n, k) is selected to represent the mixed signal S1(n, k) when the frequency of the mixed signal S1(n, k) exceeds the maximum sensing frequency of the accelerometer sensor 20.
According to an embodiment of the invention, when the frequency of the mixed signal S1(n, k) does not exceed the maximum sensing frequency of the accelerometer sensor 20, the one having the minimal amplitude from the best-estimated signal R(n, k) and the speech output signal Bs(n, k) is selected so that noise from the microphone array 10 can be reduced.
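The band-wise selection described above can be sketched as follows, with Ks denoting the bin index corresponding to the accelerometer's maximum sensing frequency (the 0-based indexing is a convenience of the sketch).

```python
import numpy as np

def mix_bands(R, Bs, Ks):
    """Band-wise mixing per the text: below bin Ks, take whichever of the
    best-estimated signal R(n, k) and the beamformer speech output
    Bs(n, k) has the smaller magnitude (suppressing residual microphone
    noise); at and above Ks, always keep Bs(n, k).
    """
    S1 = Bs.copy()
    low = np.arange(len(Bs)) < Ks          # bins the accelerometer covers
    pick_R = np.abs(R) < np.abs(Bs)        # min-magnitude selection
    S1[low & pick_R] = R[low & pick_R]
    return S1
```

Only bins inside the accelerometer's bandwidth can ever take the bone-conducted estimate; the rest pass the microphone beamformer output through unchanged.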
Referring to the accompanying drawings, the device 100 further includes a noise canceller, a noise suppressor 108, an STFT synthesizer, and a post-processor. The noise canceller cancels noise in the mixed signal S1(n, k) with the noise output signal Br(n, k) as a reference via an adaptive algorithm to generate a noise-cancelled mixed signal S2(n, k).
The noise suppressor 108 suppresses noise in the noise-cancelled mixed signal S2(n, k) with the noise output signal Br(n, k) as a reference via a speech enhancement algorithm to generate a speech-enhanced signal S(n, k). According to some embodiments of the invention, the speech enhancement algorithm includes spectral subtraction, the Wiener filter, and minimum mean square error (MMSE) estimation.
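As one example of the listed enhancement options, magnitude spectral subtraction using the noise beam as the reference might be sketched as follows; the over-subtraction factor and spectral floor are common tuning parameters, not values from the disclosure.

```python
import numpy as np

def spectral_subtract(S2, Br, over=1.0, floor=0.05):
    """Magnitude spectral subtraction with |Br| as the noise estimate.

    over: over-subtraction factor; floor: spectral floor that prevents
    negative magnitudes. The noisy phase is kept, as is conventional.
    """
    mag = np.abs(S2)
    noise = np.abs(Br)
    clean = np.maximum(mag - over * noise, floor * mag)
    return clean * np.exp(1j * np.angle(S2))
```

Bins where the noise estimate exceeds the signal magnitude are clamped to the floor instead of going negative, which limits musical-noise artifacts.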
As shown in the accompanying drawings, the noise canceller includes an adaptive filter 311. The noise-cancelled mixed signal S2(n, k) is expressed as Eq. 6, where U(n, j), j = 0, …, J−1, are the weights of an FIR filter of order J, which are updated at each time index n for all frequency bins k = 1, …, K, and μ is a step size.
S2(n, k) = S1(n, k) − μ Σ_{j=0}^{J−1} U(n, j) Br(n−j, k)  (Eq. 6)
According to an embodiment of the invention, the adaptation of the step size μ in the adaptive filter 311 may be controlled by voice activity in the mixed signal S1(n, k). For example, a smaller value is adopted when the mixed signal S1(n, k) contains mainly speech, and a larger value is used when it contains mainly noise.
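A per-bin sketch of the Eq. 6 canceller with an externally chosen step size might look like the following. Here mu is applied in the weight update in a normalized-LMS arrangement, which is an assumption of the sketch rather than the exact structure of the adaptive filter 311; the caller would pick a small mu during speech and a larger mu during noise, as the text suggests.

```python
import numpy as np

def anc_step(S1_k, Br_hist, U, mu):
    """One update of an Eq. 6-style noise canceller for a single bin:
    S2(n,k) = S1(n,k) - sum_j U(n,j) Br(n-j,k), with U adapted by
    normalized LMS toward the noise that leaks into S1.
    Br_hist[j] holds the noise-beam bin value at frame n-j.
    """
    S2_k = S1_k - np.dot(U, Br_hist)       # subtract filtered reference
    norm = np.vdot(Br_hist, Br_hist).real + 1e-8
    U_new = U + mu * S2_k * np.conj(Br_hist) / norm
    return S2_k, U_new
```

During noise-only frames the residual S2 shrinks as U learns the leakage path from the noise beam into the mixed signal.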
Referring to the accompanying drawings, a method for improving voice quality performed by the device 100 is described. The device 100 receives the acoustic signals m1(t) and m2(t) from the microphone array 10 and the sensor signals ax(t), ay(t), and az(t) from the accelerometer sensor 20.
The beamformer 103 of the device 100 generates a speech output signal Bs(n, k) and a noise output signal Br(n, k) according to the acoustic signals m1(t) and m2(t) (Step S430). The speech estimator 106 best-estimates the speech output signal Bs(n, k) according to the sensor signals ax(t), ay(t), and az(t) to generate a best-estimated signal R(n, k) (Step S440), and generates a mixed signal S1(n, k) according to the speech output signal Bs(n, k) and the best-estimated signal R(n, k) (Step S450).
A method and a device for improving voice quality are provided herein. Signals from an accelerometer sensor and a microphone array are used for speech enhancement in wearable devices such as earbuds, neckbands, and glasses. All signals from the accelerometer sensor and the microphone array are processed in the time-frequency domain for speech enhancement.
Although some embodiments of the present disclosure and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims. For example, it will be readily understood by those skilled in the art that many of the features, functions, processes, and materials described herein may be varied while remaining within the scope of the present disclosure. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the present disclosure, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present disclosure. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.
This application claims the benefit of U.S. Provisional Application No. 63/000,535, filed on Mar. 27, 2020, the entirety of which is incorporated by reference herein.
Published as US 2021/0304779 A1, Sep. 2021, United States.