APPARATUS AND METHOD FOR OWN VOICE SUPPRESSION

Abstract
An own voice suppression apparatus applicable to a hearing aid is disclosed. The own voice suppression apparatus comprises: an air conduction sensor, an own voice indication module and a suppression module. The air conduction sensor is configured to generate an audio signal. The own voice indication module is configured to generate an indication signal according to at least one of user's mouth vibration information and user's voice feature vector comparison result. The suppression module coupled to the air conduction sensor and the own voice indication module is configured to generate an own-voice-suppressed signal according to the indication signal and the audio signal.
Description
BACKGROUND OF THE INVENTION
Field of the Invention

The invention relates to voice signal processing, and more particularly, to an apparatus and method for own voice suppression applicable to a hearing aid.


Description of the Related Art

The aim of a hearing aid is to offer the best clarity and intelligibility in the presence of background noise or competing speech. Since the hearing aid is very close to the user's mouth, the most common complaint of hearing aid users is an abnormally high volume of their own voice while they are speaking. This not only leaves the user feeling irritable or agitated, but also shields environmental sounds. Moreover, there is a potential risk of damaging the hearing of hearing aid users.


YAN disclosed a deep learning voice extraction and noise reduction method that combines bone conduction sensor and microphone signals in China Patent Pub. No. CN 110931031A. A high-pass filtering or frequency band extending operation is performed on an audio signal from a bone conduction sensor to produce a processed signal. Then, both the processed signal and a microphone audio signal are fed into a deep neural network (DNN) module. Finally, the deep neural network module obtains the noise-reduced voice through prediction. Although YAN successfully extracts a target human voice in a complex noise scene and reduces interference noise, YAN fails to deal with the problem of an abnormally high volume of the user's own voice while the user is speaking.


The perception and acceptance of hearing aids is likely to be improved if the volume of the user's own voice can be reduced while the user is speaking.


SUMMARY OF THE INVENTION

In view of the above-mentioned problems, an object of the invention is to provide an own voice suppression apparatus for hearing aid users to improve comfort and speech intelligibility.


One embodiment of the invention provides an own voice suppression apparatus. The own voice suppression apparatus applicable to a hearing aid comprises: an air conduction sensor, an own voice indication module and a suppression module. The air conduction sensor is configured to generate an audio signal. The own voice indication module is configured to generate an indication signal according to at least one of user's mouth vibration information and user's voice feature vector comparison result. The suppression module coupled to the air conduction sensor and the own voice indication module is configured to generate an own-voice-suppressed signal according to the indication signal and the audio signal.


Another embodiment of the invention provides an own voice suppression method. The own voice suppression method, applicable to a hearing aid, comprises: providing an audio signal by an air conduction sensor; generating an indication signal according to at least one of user's mouth vibration information and user's voice feature vector comparison result; and generating an own-voice-suppressed signal according to the audio signal and the indication signal.


Further scope of the applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.





BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, and wherein:



FIG. 1 is a block diagram showing an own voice suppression apparatus according to the invention.



FIG. 2A is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to an embodiment of the invention.



FIG. 2B is a block diagram showing the computing unit 25A according to an embodiment of the invention.



FIG. 3A is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to an embodiment of the invention.



FIG. 3B is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to another embodiment of the invention.



FIG. 3C shows a relationship between a vibration complex-valued sample Xk and a speech complex-valued sample Zk for the same frequency bin k.



FIG. 3D is a block diagram showing the computing unit 25C according to an embodiment of the invention.



FIG. 3E shows a timing diagram of calculating the suppression mask αk(i) according to three (i.e., L=3) average speech power values and three average product complex values of three frequency bins in the invention.



FIG. 3F is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to another embodiment of the invention.



FIG. 3G is a block diagram showing the computing unit 25D according to an embodiment of the invention.



FIG. 4A is a block diagram showing an own voice suppression apparatus with a voice identification module according to an embodiment of the invention.



FIG. 4B is a block diagram showing an own voice suppression apparatus with a voice identification module according to another embodiment of the invention.



FIG. 4C is a block diagram showing a voice identification module according to an embodiment of the invention.



FIG. 4D is a block diagram showing an own voice suppression apparatus with a voice identification module according to another embodiment of the invention.



FIG. 5A is a block diagram showing an own voice suppression apparatus with a voice identification module and a bone conduction sensor according to an embodiment of the invention.



FIG. 5B is a block diagram showing an own voice suppression apparatus with a voice identification module and a bone conduction sensor according to another embodiment of the invention.



FIG. 5C is a block diagram showing the computing unit 25I according to an embodiment of the invention.



FIG. 5D is a block diagram showing an own voice suppression apparatus with a voice identification module and a bone conduction sensor according to another embodiment of the invention.



FIG. 5E is a block diagram showing the computing unit 25J according to an embodiment of the invention.



FIG. 6 shows the waveforms of the audio signal S1, the vibration signal S2 and the own-voice-suppressed signal S3.





DETAILED DESCRIPTION OF THE INVENTION

As used herein and in the claims, the term “and/or” includes any and all combinations of one or more of the associated listed items. The use of the terms “a” and “an” and “the” and similar referents in the context of describing the invention are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. Throughout the specification, the same components and/or components with the same function are designated with the same reference numerals.


A feature of the invention is to use at least one of a bone conduction sensor 231 and a voice identification module 130B to identify/detect in which frequency bins (or passbands) the user's own voice components are located and then to suppress/reduce the user's own voice components according to their respective power levels in the detected frequency bins or passbands, so as to avoid damaging the user's hearing and shielding environmental sounds. Thus, it is likely to improve comfort and speech intelligibility for the hearing aid users.



FIG. 1 is a block diagram showing an own voice suppression apparatus according to the invention. Referring to FIG. 1, an own voice suppression apparatus 10 of the invention, applicable to a hearing aid, includes an air conduction sensor 110, an amplification unit 120, an own voice indication module 130 and a suppression module 150. The air conduction sensor 110 may be implemented by an electret condenser microphone (ECM) or a micro-electro-mechanical system (MEMS) microphone. The air conduction sensor 110 receives both the user's voices/speech/utterances and the environmental sounds to output an audio signal S1.


The amplification unit 120 is configured to increase the magnitude of its input audio signal S1 by a voltage gain to generate an amplified signal Z[n], where n denotes the discrete time index. The own voice indication module 130 generates an indication signal X[n] according to at least one of the user's mouth vibration information (e.g., a vibration signal S2 from a bone conduction sensor 231) and the user's voice feature vector comparison result (e.g., matching scores from a voice identification module 130B). The suppression module 150 calculates a suppression mask according to the amplified signal Z[n] and the indication signal X[n], suppresses the power level of the own voice component contained in the amplified signal Z[n] and generates an own-voice-suppressed signal S3.



FIG. 2A is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to an embodiment of the invention. Referring to FIG. 2A, an own voice suppression apparatus 20A of the invention, applicable to a hearing aid, includes an air conduction sensor 110, a multiplier 120a, an own voice indication module 130A and a suppression module 150A. In this embodiment, the amplification unit 120 in FIG. 1 is implemented by a multiplier 120a and the voltage gain varies according to the magnitude of its input audio signal S1 so that the magnitude of the amplified signal Z[n] falls within a predefined range. The amplification unit 120/120a is optional.


The own voice indication module 130A includes a bone conduction sensor 231 and an own voice reconstruction module 232. The bone conduction sensor 231 may be implemented by a MEMS voice accelerometer. As is well known in the art, a voice accelerometer is configured to measure vibrations caused by speech/voice/mouth movement of the user, particularly at low frequencies, to output a vibration signal S2. The audio signal S1 and the vibration signal S2 may be analog or digital. If the signals S1 and S2 are analog, they may be digitized using techniques well known in the art. It is assumed that the amplified signal Z[n] and the reconstructed signal X[n] need to be digitized before being fed to the suppression module 150A. In general, the human voice/speech spans a range from about 125 Hz to 20 kHz. However, the bandwidth of the vibration signal S2 is normally restricted to a range from 0 to 3 kHz depending on the specification of the bone conduction sensor 231, and thus the vibration signal S2 usually sounds muffled. To solve this problem, the own voice reconstruction module 232 is provided to reconstruct the lost high-frequency components from the vibration signal S2, whose frequency range lies below 3 kHz, by any existing or yet-to-be-developed audio bandwidth extension approach or high frequency reconstruction algorithm, so as to generate a reconstructed signal X[n] with a frequency range extended up to 20 kHz. In an embodiment, the own voice reconstruction module 232 includes a deep neural network (not shown) that extracts feature values from the vibration signal S2 and then reconstructs its high-frequency components to generate a reconstructed signal X[n]. The deep neural network may be one or a combination of a recurrent neural network (RNN) and a convolutional neural network (CNN).


Assume that the noisy speech signal Z[n] can be expressed as Z[n]=v[n]+d[n], where v[n] is the clean speech, d[n] is the additive noise and n denotes the discrete time index. The suppression module 150A includes a computing unit 25A and a real-value multiplier 255. The computing unit 25A calculates a corresponding suppression mask α[n] (i.e., sample by sample) according to the amplified signal Z[n] and the reconstructed signal X[n], where 0<=α[n]<=1. FIG. 2B is a block diagram showing the computing unit 25A according to an embodiment of the invention. Referring to FIG. 2B, the computing unit 25A includes two power smooth units 251 and 252 and a suppression mask calculation unit 253. In order to reduce noise interference, the speech power estimation is done in the power smooth unit 251 by averaging speech power values of the past and the current data samples of the amplified signal Z[n] while the vibration power estimation is done in the power smooth unit 252 by averaging vibration power values of the past and the current data samples of the reconstructed signal X[n], using a smoothing parameter. In one embodiment, the following infinite impulse response (IIR) equations are provided for the two power smooth units 251-252 to obtain an average speech power value ZP[n] for the amplified signal Z[n] and an average vibration power value XP[n] for the reconstructed signal X[n]:






$$ZP[n] = (1-b) \times ZP[n-1] + b \times Z^{2}[n]; \qquad (1)$$

$$XP[n] = (1-b) \times XP[n-1] + b \times X^{2}[n]; \qquad (2)$$


where b is a smoothing parameter whose value is selected in between [0, 1].
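By way of illustration only, the recursive averaging of equations (1) and (2) may be sketched in Python as follows. The function name smooth_power and the starting value of the recursion (taken as zero here) are assumptions made for this sketch; the specification does not fix them.

```python
def smooth_power(samples, b, initial=0.0):
    """Recursive (IIR) power average: P[n] = (1 - b) * P[n-1] + b * s[n]**2."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("smoothing parameter b must lie in [0, 1]")
    power = initial  # assumed value of P[-1]; not specified in the text
    averaged = []
    for s in samples:
        power = (1.0 - b) * power + b * s * s
        averaged.append(power)
    return averaged


# The same routine serves both power smooth units:
# ZP[n] from the amplified signal Z[n], XP[n] from the reconstructed signal X[n].
zp = smooth_power([1.0, 1.0, 1.0], b=0.5)
xp = smooth_power([0.0, 0.0, 0.0], b=0.5)
```

A value of b close to 1 makes the average follow the instantaneous power closely, whereas a value close to 0 gives a slower, smoother estimate.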


According to the disclosure “Single Channel Speech Enhancement: using Wiener Filtering with Recursive Noise estimation”, disclosed by Upadhyay et al., Procedia Computer Science 84 (2016) 22-30, the gain H_wiener(ω) of a Wiener filter with recursive noise estimation is given by:












$$H_{wiener}(\omega) = \frac{P_{SP}(\omega) - P_{NP}(\omega)}{P_{SP}(\omega)}; \qquad (3)$$







where PSP(ω) is the noisy speech power spectrum, PNP(ω) is the noise power spectrum and ω is the frequency bin index. According to the equation (3), the suppression mask calculation unit 253 calculates the suppression mask α[n] for the current sample Z[n] in time domain as follows:











$$\alpha[n] = \frac{ZP[n] - XP[n]}{ZP[n]}, \qquad (4)$$







where 0<=α[n]<=1.


Please note that the above equation (4) is provided by way of example and not as a limitation of the invention. Any other type of equation is applicable to the suppression mask calculation unit 253 as long as it satisfies the inversely proportional relationship between X[n] and α[n]. In brief, the greater the magnitude (or the power value) of X[n], the greater the own voice component contained in Z[n] and thus the smaller the suppression mask α[n] becomes for own voice suppression.


Then, the multiplier 255 is configured to multiply the amplified signal Z[n] by its corresponding suppression mask α[n] (sample by sample) to generate the own-voice-suppressed signal S3[n]. In this manner, the invention avoids abnormally high volume of hearing aids while the user is speaking. However, since the multiplication of the amplified signal Z[n] and the suppression mask α[n] is operated in time domain, it is likely that the environmental sounds as well as the user's voices contained in the amplified signal Z[n] would be suppressed at the same time.
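A minimal sketch of the sample-by-sample operation of the computing unit 25A and the multiplier 255 is given below. The function name, the way the mask is limited to the range 0<=α[n]<=1 (simple clamping) and the handling of a zero average speech power (the mask is set to 1) are assumptions of this sketch rather than features taken from the specification.

```python
def suppress_time_domain(z, x, b=0.1, eps=1e-12):
    """Apply equations (1), (2) and (4) sample by sample and return S3[n]."""
    zp = 0.0  # average speech power ZP[n], assumed to start at zero
    xp = 0.0  # average vibration power XP[n], assumed to start at zero
    s3 = []
    for zn, xn in zip(z, x):
        zp = (1.0 - b) * zp + b * zn * zn          # equation (1)
        xp = (1.0 - b) * xp + b * xn * xn          # equation (2)
        # Equation (4); a mask of 1 is assumed when there is no speech power.
        alpha = (zp - xp) / zp if zp > eps else 1.0
        alpha = min(1.0, max(0.0, alpha))          # keep 0 <= alpha[n] <= 1
        s3.append(alpha * zn)                      # multiplier 255
    return s3


# No vibration (user silent): the amplified signal passes unchanged.
passed = suppress_time_domain([0.5, -0.25, 1.0], [0.0, 0.0, 0.0])
# Vibration as strong as the speech: the amplified signal is fully suppressed.
muted = suppress_time_domain([0.5, -0.25, 1.0], [0.5, -0.25, 1.0])
```

Because a single mask value scales the whole sample, the sketch also shows the drawback noted above: environmental sounds present in Z[n] are attenuated together with the user's own voice.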



FIG. 3A is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to an embodiment of the invention. Referring to FIG. 3A, an own voice suppression apparatus 30 of the invention, applicable to a hearing aid, includes an air conduction sensor 110, a multiplier 120a, an own voice indication module 130A and a suppression module 150B. The suppression module 150B includes a computing unit 25B, Q multipliers 310, two signal splitters 301/303 and a signal synthesizer 302. The signal splitter 301 splits the input signal Z[n] into Q first signal components (Z0˜ZQ−1) and the signal splitter 303 splits the input signal X[n] into Q second signal components (X0˜XQ−1), where Q>=1. Then, the computing unit 25B calculates Q suppression masks (α0˜αQ−1) according to the Q first signal components (Z0˜ZQ−1) and the Q second signal components (X0˜XQ−1). The Q multipliers 310 respectively multiply the Q suppression masks (α0˜αQ−1) by their corresponding first signal components (Z0˜ZQ−1) to generate Q multiplied signals (Y0˜YQ−1). Finally, the signal synthesizer 302 reconstructs the own-voice-suppressed signal S3 in time domain according to the Q multiplied signals (Y0˜YQ−1). The signal splitters 301/303 in FIGS. 3A, 4A and 5A may be implemented by either transformers 301a/303a or analysis filter banks 301b/303b, while the signal synthesizer 302 in FIGS. 3A, 4A and 5A may be implemented by either an inverse transformer 302a or a synthesis filter bank 302b. Please note that the multipliers 310 may be implemented by complex-value multipliers 311 (when the Z0˜ZQ−1 values are complex values) or real-value multipliers 255 (when the Z0˜ZQ−1 values are real values).



FIG. 3B is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to another embodiment of the invention. Compared with FIG. 3A, the signal splitters 301/303 are implemented by transformers 301a/303a while the signal synthesizer 302 is implemented by an inverse transformer 302a. Accordingly, the computing unit 25C calculates N suppression masks αk(i) for N frequency bins according to a current speech spectral representation for a current frame i of the amplified signal Z[n] and a current vibration spectral representation for a current frame i of the reconstructed signal X[n], where 0<=k<=(N−1), N is the length of each frame and i is the current frame index.


The transformers 301a and 303a are implemented to perform a fast Fourier transform (FFT), a short-time Fourier transform (STFT) or a discrete Fourier transform (DFT) over their input signals. Specifically, the transformers 301a and 303a respectively convert audio data of current frames of the signals Z[n] and X[n] in time domain into complex data (Z0˜ZN−1 and X0˜XN−1) in frequency domain. The inverse transformer 302a is used to transform the complex data (Y0˜YN−1) in frequency domain into the audio signal S3 in time domain for each frame. For purpose of clarity and ease of description, hereinafter, the following examples and embodiments will be described with the transformers 301a and 303a performing the FFT operations over each frame of their input signals. Assuming that a number of sampling points (or FFT size) is N and the time duration for each frame is Td, the transformer 303a divides the reconstructed signal X[n] into a plurality of frames and computes the FFT of the current frame i to generate a current vibration spectral representation having N complex-valued samples (X0˜XN−1) with a frequency resolution of fs/N (=1/Td). Here, fs denotes a sampling frequency of the reconstructed signal X[n] and each frame corresponds to a different time interval of the reconstructed signal X[n]. Likewise, the transformer 301a divides the amplified signal Z[n] into a plurality of frames and computes the FFT of the current frame i to generate a current speech spectral representation having N complex-valued samples (Z0˜ZN−1) with a frequency resolution of fs/N. In a preferred embodiment, the time duration Td of each frame is about 8˜32 milliseconds (ms), and successive frames overlap by less than Td, such as by Td/2.
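The framing and transform step may be illustrated by the following sketch. The sampling frequency of 16 kHz, the frame length N=256 (so that Td=16 ms), the 1 kHz test tone and the function names are assumptions chosen for the example. A direct DFT is used instead of an FFT only to keep the sketch free of external libraries (an FFT returns the same N complex-valued samples), and no window function is applied because none is described above.

```python
import cmath
import math


def split_frames(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def dft(frame):
    """Direct O(N^2) DFT of one frame, giving N complex-valued samples."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


FS = 16000    # assumed sampling frequency fs in Hz
N = 256       # assumed frame length (FFT size), so Td = N / FS = 16 ms
HOP = N // 2  # successive frames overlap by Td / 2

# Hypothetical input: a 1 kHz tone lasting 512 samples.
x = [math.sin(2 * math.pi * 1000 * n / FS) for n in range(512)]
frames = split_frames(x, N, HOP)
spectrum = dft(frames[0])        # X_0 .. X_{N-1} of the first frame
resolution_hz = FS / N           # frequency resolution fs / N = 1 / Td
peak_bin = max(range(N // 2), key=lambda k: abs(spectrum[k]))
```

With these assumed values the frequency resolution is 62.5 Hz, so the 1 kHz tone falls exactly on bin k=16.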



FIG. 3C shows a relationship between a vibration complex-valued sample Xk and a speech complex-valued sample Zk for the same frequency bin k. Referring to FIG. 3C, two vectors $\vec{X_k}$ and $\vec{Z_k}$, respectively representing the two complex-valued samples Xk and Zk for the same frequency bin k, point in different directions. A vector $\tau_k\vec{Z_k}$, the projection of $\vec{X_k}$ on $\vec{Z_k}$, represents an own voice component on $\vec{Z_k}$. According to the definition of the linear minimum mean square error (MMSE) estimator (see https://en.wikipedia.org/wiki/Minimum_mean_square_error), we deduce the suppression mask αk for frequency bin k as follows. Since the two vectors ($\vec{X_k}-\tau_k\vec{Z_k}$) and $\vec{Z_k}$ are orthogonal,







$$E\left[(X_k - \tau_k Z_k)(Z_k)^{*}\right] = 0$$

$$E\left[X_k (Z_k)^{*}\right] = \tau_k E\left[Z_k (Z_k)^{*}\right] = \tau_k E\left[|Z_k|^{2}\right]$$

$$\tau_k = \frac{E\left[X_k (Z_k)^{*}\right]}{E\left[|Z_k|^{2}\right]},$$









where E[.] denotes an expectation value.
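The orthogonality argument can be checked numerically with the short sketch below, in which the expectation E[.] is replaced by a plain average over a few observations of one frequency bin. The sample values and the function name are hypothetical and serve only to illustrate the derivation.

```python
def projection_coefficient(xs, zs):
    """Sample estimate of tau_k = E[X_k Z_k*] / E[|Z_k|^2] for one bin k."""
    cross = sum(x * z.conjugate() for x, z in zip(xs, zs)) / len(zs)
    power = sum((z * z.conjugate()).real for z in zs) / len(zs)
    return cross / power


# Hypothetical observations of one frequency bin over four frames.
zs = [1 + 1j, 2 - 1j, -1 + 0.5j, 0.5 + 2j]
xs = [0.5 * z + e for z, e in zip(zs, [0.1j, -0.1, 0.05, -0.05j])]

tau = projection_coefficient(xs, zs)
# The residual X_k - tau_k Z_k is orthogonal to Z_k in the averaged sense.
residual = sum((x - tau * z) * z.conjugate() for x, z in zip(xs, zs)) / len(zs)
```

The averaged product of the residual and the conjugate of Zk vanishes up to rounding error, which is the orthogonality condition from which τk is obtained.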


After the own voice component $\tau_k\vec{Z_k}$ is subtracted from $\vec{Z_k}$, scalars are calculated as follows:







$$Z_k - \tau_k Z_k = Z_k (1 - \tau_k) = Z_k \left(1 - \frac{E\left[X_k (Z_k)^{*}\right]}{E\left[|Z_k|^{2}\right]}\right) = Z_k \times \left(\frac{E\left[|Z_k|^{2}\right] - E\left[X_k (Z_k)^{*}\right]}{E\left[|Z_k|^{2}\right]}\right).$$








Thus, the suppression mask










$$\alpha_k = \frac{E\left[|Z_k|^{2}\right] - E\left[X_k (Z_k)^{*}\right]}{E\left[|Z_k|^{2}\right]}. \qquad (4a)$$








FIG. 3D is a block diagram showing the computing unit 25C according to an embodiment of the invention. Referring to FIG. 3D, the computing unit 25C includes two complex-value multipliers 312, a complex conjugate block 355, two smooth units 351 and 352 and a suppression mask calculation unit 353. According to the current speech spectral representation, the complex-value multiplier 312 multiplies each complex-valued sample $Z_k(i)$ by its complex conjugate $Z_k^{*}(i)$ from the complex conjugate block 355 to generate a product of $|Z_k(i)|^{2}$. The smooth unit 351 first computes the power level $|Z_k(i)|^{2}$ for each frequency bin k to obtain a current speech power spectrum for the current frame i of the amplified signal Z[n] according to the equation: $|Z_k(i)|^{2} = z_{kr}^{2} + z_{ki}^{2}$, where $z_{kr}$ denotes a real part of the complex-valued sample $Z_k(i)$, $z_{ki}$ denotes an imaginary part of the complex-valued sample $Z_k(i)$, and 0<=k<=(N−1). Then, to reduce noise interference, similar to the above equation (1), the following IIR equation (5) is provided for the smooth unit 351 to obtain an average speech power value:





$$\sigma_k^{2}(i) = (1-b) \times \sigma_k^{2}(i-1) + b \times |Z_k(i)|^{2}; \qquad (5)$$


where b is a smoothing parameter whose value is selected in between [0, 1], i is the current frame index and (i−1) is a previous frame index. In other words, $\sigma_k^{2}(i) = E[|Z_k(i)|^{2}]$.


According to the current vibration spectral representation and the current speech spectral representation, the complex-value multiplier 312 multiplies each complex-valued sample Xk(i) by the complex conjugate Z*k(i) from the complex conjugate block 355 to generate a product of Xk(i)Zk(i)*. The smooth unit 352 calculates a product complex value Xk(i)Zk(i)* for each frequency bin k to obtain a current product spectrum for the current frame i of the reconstructed signal X[n], where 0<=k<=N−1. Then, similar to the above equations (2) and (5), to reduce noise interference, the following IIR equation (6) is provided for the smooth unit 352 to obtain an average product complex value:





$$\rho_k(i) = (1-b) \times \rho_k(i-1) + b \times X_k(i)(Z_k(i))^{*}. \qquad (6)$$


In other words, ρk(i)=E[Xk(i)(Zk(i))*].
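The two smooth units 351 and 352 may be sketched per frame as shown below, using Python's built-in complex type for the frequency-domain samples. The function name and the zero starting values of the two recursions are assumptions of this sketch.

```python
def update_statistics(sigma2_prev, rho_prev, Z, X, b):
    """One frame of equations (5) and (6) for all N frequency bins.

    sigma2_prev, rho_prev: averages of frame i-1 (real and complex lists).
    Z, X: complex-valued samples Z_k(i) and X_k(i) of the current frame i.
    """
    sigma2 = [(1.0 - b) * s + b * (z.real ** 2 + z.imag ** 2)   # equation (5)
              for s, z in zip(sigma2_prev, Z)]
    rho = [(1.0 - b) * r + b * x * z.conjugate()                 # equation (6)
           for r, x, z in zip(rho_prev, X, Z)]
    return sigma2, rho


# Hypothetical two-bin frame, with both recursions assumed to start at zero.
Z = [1 + 1j, 2 + 0j]
X = [1 + 0j, 0 + 1j]
sigma2, rho = update_statistics([0.0, 0.0], [0j, 0j], Z, X, b=0.5)
```

Note that the average speech power remains real-valued while the average product value is in general complex-valued.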


Afterward, according to the equations (3) and (4a), the average speech power value $\sigma_k^{2}(i)$ and the average product complex value $\rho_k(i)$, the suppression mask calculation unit 353 calculates the suppression mask $\alpha_k(i)$ for a frequency bin k associated with the current frame i of the amplified signal Z[n] as follows:











$$\alpha_k(i) = \frac{\sigma_k^{2}(i) - \rho_k(i)}{\sigma_k^{2}(i)} = \frac{E\left[|Z_k(i)|^{2}\right] - E\left[X_k(i)(Z_k(i))^{*}\right]}{E\left[|Z_k(i)|^{2}\right]}. \qquad (7)$$







Please note that the outputted samples (Z0(i)˜ZN−1(i)) from the transformer 301a are complex values, so the suppression masks αk(i) are also complex values and hereinafter called “complex masks”.


Next, the N complex-value multipliers 311 respectively multiply the N complex-valued samples Zk(i) by the N suppression masks αk(i) for the N frequency bins to generate N complex-valued samples Yk(i), where 0<=k<=N−1. Finally, the inverse transformer 302a performs IFFT over the N complex-valued samples Y0(i)˜YN−1(i) in frequency domain to generate the own-voice-suppressed signal S3 for the current frame i in time domain. Please note that the above equation (7) is provided by way of example and not as a limitation of the invention. Any other type of equation is applicable to the suppression mask calculation unit 353 as long as it satisfies the inversely proportional relationship between Xk(i) and αk(i). In brief, the greater the magnitude of Xk(i), the greater the own voice component in the frequency bin k of the current speech spectral representation and thus the smaller the suppression mask αk(i) becomes for own voice suppression.
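As a non-limiting illustration, the calculation of the complex masks by equation (7) and their application by the complex-value multipliers 311 may be sketched as follows. The function names and the choice of a unity mask for a bin whose average speech power is zero are assumptions of this sketch; the inverse transform is omitted.

```python
def complex_masks(sigma2, rho, eps=1e-12):
    """Equation (7): alpha_k(i) = (sigma_k^2(i) - rho_k(i)) / sigma_k^2(i)."""
    # A unity mask is assumed for bins without speech power.
    return [(s - r) / s if s > eps else 1.0 + 0j for s, r in zip(sigma2, rho)]


def apply_masks(Z, alpha):
    """Complex-value multipliers 311: Y_k(i) = alpha_k(i) * Z_k(i)."""
    return [a * z for a, z in zip(alpha, Z)]


# Hypothetical averages for two bins: bin 0 contains own voice, bin 1 does not.
sigma2 = [4.0, 2.0]
rho = [2.0 + 0j, 0j]
alpha = complex_masks(sigma2, rho)
Y = apply_masks([2 + 2j, 1 - 1j], alpha)
```

In this example the first bin is attenuated by half while the second bin, which carries no own voice component, passes unchanged, so environmental sounds in other bins are preserved.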


Please note that in equation (7), the suppression mask $\alpha_k(i)$ for a frequency bin k is calculated according to the average speech power value $\sigma_k^{2}(i)$ and the average product complex value $\rho_k(i)$ of the same frequency bin k. In an alternative embodiment, the suppression mask $\alpha_k(i)$ for a frequency bin k is determined according to L average speech power values and L average product complex values of L frequency bins adjacent to the frequency bin k, where L>=1. FIG. 3E shows a timing diagram of calculating the suppression mask $\alpha_k(i)$ for a frequency bin k according to three (i.e., L=3) average speech power values and three average product complex values of three frequency bins adjacent to the frequency bin k in the invention. Referring to FIG. 3E, the whole process of calculating the suppression mask $\alpha_k(i)$ by the computing unit 25C is divided into three phases. In phase one, the smooth unit 351 respectively calculates three average speech power values $\sigma_{k-1}^{2}(i)$, $\sigma_k^{2}(i)$ and $\sigma_{k+1}^{2}(i)$ for three frequency bins (k−1), k and (k+1) according to the equation (5) and the three power levels $|Z_{k-1}(i)|^{2}$, $|Z_k(i)|^{2}$ and $|Z_{k+1}(i)|^{2}$. Meanwhile, the smooth unit 352 respectively calculates the average product complex values $\rho_{k-1}(i)$, $\rho_k(i)$ and $\rho_{k+1}(i)$ for the three frequency bins (k−1), k and (k+1) according to the equation (6) and the three product complex-valued samples $X_{k-1}(i)(Z_{k-1}(i))^{*}$, $X_k(i)(Z_k(i))^{*}$ and $X_{k+1}(i)(Z_{k+1}(i))^{*}$. In phase two, the suppression mask calculation unit 353 calculates: (i) a suppression mask $\alpha_{k-1}(i)$ for a frequency bin (k−1) according to the equation (7), the average speech power value $\sigma_{k-1}^{2}(i)$ and the average product complex value $\rho_{k-1}(i)$; (ii) a suppression mask $\alpha_k(i)$ for a frequency bin k according to the equation (7), the average speech power value $\sigma_k^{2}(i)$ and the average product complex value $\rho_k(i)$; and (iii) a suppression mask $\alpha_{k+1}(i)$ for a frequency bin (k+1) according to the equation (7), the average speech power value $\sigma_{k+1}^{2}(i)$ and the average product complex value $\rho_{k+1}(i)$. In phase three, the suppression mask calculation unit 353 calculates an average value of the three suppression masks ($\alpha_{k-1}(i)$, $\alpha_k(i)$, $\alpha_{k+1}(i)$) of the three frequency bins ((k−1), k, (k+1)) and then updates the suppression mask $\alpha_k(i)$ for the frequency bin k with the average value. Please note that FIG. 3D (i.e., L=1) is a special case of FIG. 3E.
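Phase three of the above process, for L=3, may be sketched as follows. The treatment of the first and last frequency bins, which have only one neighbor, is not described above; averaging over the neighbors that exist is an assumption of this sketch, as is the function name.

```python
def average_adjacent_masks(alpha):
    """Replace each mask by the mean of the masks of bins k-1, k and k+1.

    The averages are taken over the phase-two masks, so an updated mask
    never feeds into the average of its neighbor.
    """
    n = len(alpha)
    averaged = []
    for k in range(n):
        lo, hi = max(0, k - 1), min(n - 1, k + 1)  # assumed edge handling
        window = alpha[lo:hi + 1]
        averaged.append(sum(window) / len(window))
    return averaged


# Hypothetical phase-two masks for five bins (real values used for brevity).
smoothed = average_adjacent_masks([1.0, 0.5, 0.0, 0.5, 1.0])
```

The same routine accepts complex masks, since Python's sum and division operate on complex values as well.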



FIG. 3F is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to another embodiment of the invention. In comparison with the own voice suppression apparatus 30 in FIG. 3A, the signal splitters 301/303 are implemented by analysis filter banks 301b/303b while the signal synthesizer 302 is implemented by a synthesis filter bank 302b and an adder 302c.


Referring to FIG. 3F, the amplified signal Z[n] is decomposed into M speech sub-band signals Z0[n]˜ZM−1[n] by applying M analysis filters of the analysis filter bank 301b with M different passbands. Likewise, the reconstructed signal X[n] is decomposed into M vibration sub-band signals X0[n]˜XM−1[n] by applying M analysis filters of the analysis filter bank 303b with M different passbands. Thus, each of the speech sub-band signals Z0[n]˜ZM−1[n] (in time domain) carries information on the amplified signal Z[n] in a particular frequency band while each of the vibration sub-band signals X0[n]˜XM−1[n] (in time domain) carries information on the reconstructed signal X[n] in a particular frequency band. In an embodiment, the bandwidths of the M passbands of the M analysis filters of the analysis filter bank 301b/303b are equal. In an alternative embodiment, the bandwidths of the M passbands of the M analysis filters of the analysis filter bank 301b/303b are not equal; moreover, the higher the frequency, the wider the bandwidths of the M passbands of the M analysis filters. Then, the M real-value multipliers 255 respectively multiply the M speech sub-band signals Z0[n]˜ZM−1[n] by M suppression masks α0[n]˜αM−1[n] to generate M modified signals B0[n]˜BM−1[n]. Next, M synthesis filters of the synthesis filter bank 302b respectively perform interpolation over the M modified signals B0[n]˜BM−1[n] to generate M interpolated signals. Finally, the M interpolated signals are combined by the adder 302c to reconstruct the own-voice-suppressed signal S3. Referring to FIG. 3G, the computing unit 25D includes two power smooth units 391 and 392 and a suppression mask calculation unit 393. Analogous to the equations (1) and (2), the following IIR equations are provided for the two power smooth units 391-392 to obtain an average speech power value ZPj[n] for the speech sub-band signal Zj[n] and an average vibration power value XPj[n] for the vibration sub-band signal Xj[n]:






$$ZP_j[n] = (1-b) \times ZP_j[n-1] + b \times Z_j^{2}[n]; \qquad (8)$$

$$XP_j[n] = (1-b) \times XP_j[n-1] + b \times X_j^{2}[n]; \qquad (9)$$


where b is a smoothing parameter whose value is selected in between [0, 1], j is the passband index, n is the discrete time index and 0<=j<=(M−1).


Analogous to the equation (4), the suppression mask calculation unit 393 calculates the suppression mask αj[n] for the speech sub-band signal Zj[n] as follows:











$$\alpha_j[n] = \frac{ZP_j[n] - XP_j[n]}{ZP_j[n]}. \qquad (10)$$







Please note that the outputted samples (Z0[n]˜ZM−1[n]) from the filter bank 301b are real values, so the suppression masks αj[n] are also real values and hereinafter called “real masks”. Please note that in equation (10), the suppression mask αj[n] for the speech sub-band signal Zj[n] is calculated according to the average speech power value ZPj[n] and the average vibration power value XPj[n] of the same passband j. In an alternative embodiment, similar to the three-phase process in FIG. 3E, the suppression mask αj[n] for the speech sub-band signal Zj[n] corresponding to the passband j is determined according to L average speech power values of L speech sub-band signals and L average vibration power values of L vibration sub-band signals, where L>=1 and the passbands of the L speech sub-band signals and the L vibration sub-band signals are adjacent to the passband j. For example, if L=3, the computing unit 25D computes three suppression masks (αj−1[n], αj[n] and αj+1[n]) of three speech sub-band signals (Zj−1[n], Zj[n] and Zj+1[n]) with their passbands adjacent to the passband j based on equation (10), three average speech power values of the three speech sub-band signals and three average vibration power values of three vibration sub-band signals (Xj−1[n], Xj[n] and Xj+1[n]), computes an average value of the three suppression masks and then updates the suppression mask αj[n] for the speech sub-band signal Zj[n] with the average value. Please note that the above equation (10) is provided by way of example and not as a limitation of the invention. Any other type of equation is applicable to the suppression mask calculation unit 393 as long as it satisfies the inversely proportional relationship between Xj[n] and αj[n]. In brief, the greater the magnitude (or power value) of Xj[n], the greater the own voice component in the passband j (or the speech sub-band signal Zj[n]) and thus the smaller the suppression mask αj[n] becomes for own voice suppression.
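One time step of the computing unit 25D and the M real-value multipliers 255 may be sketched as follows, assuming that the M sub-band samples Zj[n] and Xj[n] have already been produced by the analysis filter banks. The function name, the clamping of the mask to the range from 0 to 1 and the unity mask used for a passband without speech power are assumptions of this sketch; the synthesis filter bank and the adder are omitted.

```python
def subband_step(zp_prev, xp_prev, z_bands, x_bands, b=0.1, eps=1e-12):
    """Apply equations (8), (9) and (10) to M passbands at time index n."""
    zp, xp, alpha, modified = [], [], [], []
    for zp_j, xp_j, z_j, x_j in zip(zp_prev, xp_prev, z_bands, x_bands):
        zp_j = (1.0 - b) * zp_j + b * z_j * z_j       # equation (8)
        xp_j = (1.0 - b) * xp_j + b * x_j * x_j       # equation (9)
        # Equation (10); a unity mask is assumed when ZP_j[n] is zero.
        a_j = (zp_j - xp_j) / zp_j if zp_j > eps else 1.0
        a_j = min(1.0, max(0.0, a_j))                 # assumed clamping
        zp.append(zp_j)
        xp.append(xp_j)
        alpha.append(a_j)
        modified.append(a_j * z_j)                    # B_j[n] = alpha_j[n] * Z_j[n]
    return zp, xp, alpha, modified


# Hypothetical case with M = 2: own voice only in passband 0.
zp, xp, alpha, B = subband_step([0.0, 0.0], [0.0, 0.0],
                                [1.0, 2.0], [1.0, 0.0], b=0.5)
```

Only the passband in which the vibration power is present is attenuated; the other passband is left untouched.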



FIG. 4A is a block diagram showing an own voice suppression apparatus with a voice identification module according to an embodiment of the invention. In comparison with the own voice suppression apparatus 30 in FIG. 3A, a main difference is that the own voice indication module 130A is replaced with a voice identification module 130B and the signal components (Z0˜ZQ−1) are not fed to the computing unit 25E. The voice identification module 130B receives the amplified signal Z[n] to generate Q matching scores (P0˜PQ−1) corresponding to Q signal components Z0˜ZQ−1. Then, the computing unit 25E calculates Q suppression masks (α0˜αQ−1) according to the Q matching scores (P0˜PQ−1).



FIG. 4B is a block diagram showing an own voice suppression apparatus with a voice identification module according to another embodiment of the invention. In comparison with the own voice suppression apparatus 40 in FIG. 4A, the signal splitter 301 is implemented by a transformer 301a while the signal synthesizer 302 is implemented by an inverse transformer 302a. The voice identification module 130B receives the amplified signal Z[n] to generate N matching scores Pk corresponding to N frequency bins of the current speech spectral representation associated with a current frame i of Z[n], where 0<=k<=(N−1) and N is the length of each frame of the amplified signal Z[n]. Each matching score Pk is bounded between 0 and 1. Thus, if any matching score Pk gets close to 1, it indicates that the magnitude of the user's own voice component gets greater in the frequency bin k; otherwise, if any matching score Pk gets close to 0, it indicates that the magnitude of the user's own voice component gets smaller in this frequency bin k. According to the N matching scores P0˜PN−1, the computing unit 25F calculates a suppression mask αk for each frequency bin k using the following equation:





αk=(1−Pk),   (11)


where 0<=αk<=1. Please note that since Pk is a real number, the suppression mask αk is a real mask.


Please note that the above equation (11) is provided by way of example, but not limitation of the invention. Any other type of equation is applicable to the computing unit 25F as long as it satisfies the inversely proportional relationship between αk and Pk. In brief, the greater the magnitude of Pk, the greater the magnitude of the user's own voice component in the frequency bin k of the current speech spectral representation is, and thus the smaller the suppression mask αk becomes for own voice suppression. In an alternative embodiment, similar to the three-phase process in FIG. 3E, the suppression mask αk(i) for a frequency bin k is determined according to L matching scores of L frequency bins adjacent to the frequency bin k. For example, the computing unit 25F calculates L suppression masks of the L frequency bins adjacent to the frequency bin k according to the L matching scores of the L frequency bins, calculates an average value of the L suppression masks and then updates the suppression mask αk(i) for the frequency bin k with the average value, where L>=1.


The advantage of the voice identification module 130B is that it is capable of identifying in which frequency bins the user's own voice components are located and how strong those components are. With this indication, the user's own voice components in the identified frequency bins can be suppressed precisely while the magnitudes of the sound components in the other frequency bins (representative of environmental sounds) are retained.
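By way of illustration only, the application of the real mask of equation (11) to the frequency bins of one frame may be sketched in Python as follows. The function name and the example values are hypothetical, and the clamping of each score to the range 0 to 1 is an assumption added for robustness of the sketch.

```python
def suppress_bins(spectrum, scores):
    """Scale each frequency bin Z_k by the real mask alpha_k = 1 - P_k."""
    if len(spectrum) != len(scores):
        raise ValueError("need one matching score per frequency bin")
    out = []
    for z_k, p_k in zip(spectrum, scores):
        p_k = min(1.0, max(0.0, p_k))  # assumption: enforce 0 <= P_k <= 1
        out.append((1.0 - p_k) * z_k)  # real mask times complex spectral value
    return out


# Hypothetical frame with N = 4 bins and one matching score per bin.
frame = [2 + 0j, 1 + 1j, 0 - 4j, 3 + 0j]
scores = [0.0, 0.5, 1.0, 0.25]
suppressed = suppress_bins(frame, scores)
```

In this example the bin with a score of 0 passes unchanged and the bin with a score of 1 is removed entirely, while the remaining bins are attenuated in proportion to their scores.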



FIG. 4C is a block diagram showing a voice identification module according to an embodiment of the invention. Referring to FIG. 4C, the voice identification module 130B includes a storage device 42, an audio embedding extraction unit 41 and an embedding match calculation unit 43. The audio embedding extraction unit 41 includes a neural network 410 and an average block 415. The neural network 410 is implemented by a DNN or a long short term memory (LSTM) network. The storage device 42 includes all forms of non-volatile or volatile media and memory devices, such as semiconductor memory devices, magnetic disks, DRAM, or SRAM.


For purpose of clarity and ease of description, the following examples and embodiments are described with the neural network 410 implemented by a DNN. The DNN may be implemented using any known architecture. For example, referring to the disclosure “End-to-End Text-Dependent Speaker Verification” by Heigold et al., 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), the DNN 410 consists of the successive application of several non-linear functions in order to transform the user utterance into a vector; as shown in FIG. 4C, the DNN 410 includes a locally-connected layer 412 and multiple fully connected layers 411. It should be noted that the architecture of the DNN 410 is provided by way of example, but not limitation of the invention. Any other architecture is applicable to the DNN as long as it can transform the user utterance Z[n] into a current feature vector CV. The identification protocol is divided into three stages: training, enrollment and evaluation. In the training stage, a suitable user representation is found from the training utterances. For example, the user representations are a summary of frame-level information, such as feature vectors. After the training stage is completed, the parameters of the DNN 410 are fixed. In the enrollment stage, a user provides multiple utterances, which are used to estimate a user model. Since each utterance generates one feature vector, the feature vectors of the enrollment utterances are averaged by the average block 415 to obtain a user vector UV representative of the user model. Then, the user vector UV is stored in the storage device 42 by the DNN 410. Please note that in the enrollment stage, the embedding match calculation unit 43 is disabled. In the evaluation stage, the average block 415 is disabled. The DNN 410 transforms the user utterance Z[n] into a current feature vector CV.
The embedding match calculation unit 43 retrieves the user vector UV from the storage device 42 and performs cosine similarity between the user vector UV and the current feature vector CV to generate N matching scores Pk for N frequency bins, where 0<=k<=(N−1). If each of the user vector UV and the current feature vector CV has a dimension of N×N1, then the output vector P from the embedding match calculation unit 43 has a dimension of N×1. If N=256 and N1=2048, after performing cosine similarity, the embedding match calculation unit 43 generates an output vector P with 256×1 components Pk, 0<=k<=255. As is well known in the art, cosine similarity is a measure of similarity between two vectors of an inner product space; it is measured by the cosine of the angle between the two vectors and determines whether the two vectors are pointing in roughly the same direction. In this invention, cosine similarity is used to detect how similar the user vector UV and the current feature vector CV are in the frequency bin k, where 0<=k<=(N−1). The more similar (i.e., the closer Pk gets to 1) the two vectors UV and CV are in the frequency bin k, the greater the user's own voice component in the frequency bin k.
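By way of illustration only, the enrollment averaging and the per-bin cosine similarity may be sketched in Python as follows. The sketch treats UV and CV as N×N1 arrays and computes one cosine similarity per row, which is one reading of the dimensions given above. The function names are hypothetical, and clamping negative similarities to 0 is an assumption made here so that each score stays between 0 and 1, since a raw cosine similarity can be negative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def enroll(feature_vectors):
    """Average several N x N1 feature vectors element-wise to obtain the user vector UV."""
    count = len(feature_vectors)
    rows, cols = len(feature_vectors[0]), len(feature_vectors[0][0])
    return [[sum(fv[r][c] for fv in feature_vectors) / count for c in range(cols)]
            for r in range(rows)]


def matching_scores(uv, cv):
    """One score P_k per frequency bin k: row-wise cosine similarity of UV and CV."""
    # Assumption: negative similarities are clamped to 0 so that 0 <= P_k <= 1.
    return [min(1.0, max(0.0, cosine_similarity(u_row, c_row)))
            for u_row, c_row in zip(uv, cv)]


# Toy sizes (N = 2 bins, N1 = 3); the example in the text uses N = 256 and N1 = 2048.
uv = enroll([[[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
             [[3.0, 0.0, 0.0], [0.0, 0.0, 2.0]]])
scores = matching_scores(uv, [[5.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
```

In this toy example the first bin of CV points in the same direction as the first bin of UV, so its score is 1, whereas the second bin only partly overlaps and receives a lower score.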



FIG. 4D is a block diagram showing an own voice suppression apparatus with a voice identification module according to another embodiment of the invention. In comparison with the own voice suppression apparatus 40A in FIG. 4B, a main difference is that the transformer 301a is replaced with an analysis filter bank 301b while the inverse transformer 302a is replaced with a synthesis filter bank 302b. The voice identification module 130B receives the amplified signal Z[n] to generate M matching scores Pj corresponding to the M passbands of the analysis filter bank 301b, where 0<=j<=(M−1). Please note that the frequency ranges of the M passbands of the M analysis filters for the analysis filter bank 301b respectively correspond to the frequency ranges of the M passbands with M matching scores Pj from the voice identification module 130B. In an embodiment, the bandwidths of the M passbands of the M analysis filters are equal. In an alternative embodiment, the bandwidths of the M passbands of the M analysis filters are not equal; moreover, the higher the frequency, the wider the bandwidths of the passbands of the M analysis filters. Each matching score Pj is bounded between 0 and 1. Thus, if any matching score Pj gets close to 1, it indicates that the magnitude of the user's own voice component gets greater in the passband j (or the speech sub-band signal Zj[n]); otherwise, if any matching score Pj gets close to 0, it indicates that the magnitude of the user's own voice component gets smaller in this passband j. According to the M matching scores Pj, the computing unit 25G calculates a suppression mask αj[n] for each passband j (or each speech sub-band signal Zj[n]) according to the following equation:





αj[n]=(1−Pj),   (12)


where 0<=αj[n]<=1 and 0<=j<=(M−1).


Please note that the above equation (12) is provided by way of example, but not limitation of the invention. Any other type of equation is applicable to the computing unit 25G as long as it satisfies the inversely proportional relationship between αj[n] and Pj. In brief, the greater the magnitude of Pj, the greater the magnitude of the user's own voice component in the passband j (or the speech sub-band signal Zj[n]) is, and thus the smaller the suppression mask αj[n] becomes for own voice suppression. In an alternative embodiment, similar to the three-phase process in FIG. 3E, the suppression mask αj[n] for the speech sub-band signal Zj[n] is determined according to L matching scores of L speech sub-band signals with their passbands adjacent to the passband j of the speech sub-band signal Zj[n]. For example, the computing unit 25G calculates L suppression masks of the L speech sub-band signals with their passbands adjacent to the passband j according to the L matching scores of the L speech sub-band signals, calculates an average value of the L suppression masks and then updates the suppression mask αj[n] for the passband j (or the speech sub-band signal Zj[n]) with the average value, where L>=1.



FIG. 5A is a block diagram showing an own voice suppression apparatus with a voice identification module and a bone conduction sensor according to an embodiment of the invention. Referring to FIG. 5A, the own voice suppression apparatus 50 includes the own voice suppression apparatus 30 and the voice identification module 130B. The computing unit 25H calculates Q suppression masks (α0˜αQ−1) according to the Q matching scores (P0˜PQ−1), the Q first signal components (Z0˜ZQ−1) and the Q second signal components (X0˜XQ−1).



FIG. 5B is a block diagram showing an own voice suppression apparatus with a voice identification module and a bone conduction sensor according to another embodiment of the invention. In comparison with the own voice suppression apparatus 50 in FIG. 5A, the signal splitters 301/303 are implemented by transformers 301a/303a while the signal synthesizer 302 is implemented by an inverse transformer 302a.



FIG. 5C is a block diagram showing the computing unit 25I according to an embodiment of the invention. Referring to FIG. 5C, the computing unit 25I includes two complex-value multipliers 312, a complex conjugate block 355, two smooth units 351 and 352 and a suppression mask calculation unit 553. According to the equation (7), the matching score Pk, the average speech power value σk2(i) and the average product complex value ρk(i), the suppression mask calculation unit 553 calculates the suppression mask αk(i) for a frequency bin k in the current speech spectral representation (associated with the current frame i of the amplified signal Z[n]) as follows:











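$$\alpha_k(i) = \frac{(1 - P_k)\left(\sigma_k^2(i) - \rho_k(i)\right)}{\sigma_k^2(i)} = \frac{(1 - P_k)\left(E\left[\left(\left|Z_k(i)\right|\right)^2\right] - E\left[X_k(i)\left(Z_k(i)\right)^*\right]\right)}{E\left[\left(\left|Z_k(i)\right|\right)^2\right]}. \tag{13}$$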







Please note that the above equation (13) is provided by way of example, but not limitation of the invention. Any other type of equation is applicable to the suppression mask calculation unit 553 as long as it satisfies the inversely proportional relationship between Xk(i) and αk(i), and the inversely proportional relationship between Pk and αk(i). In brief, the greater the magnitude of Xk(i) and/or the magnitude of Pk, the greater the own voice component in the frequency bin k of the current speech spectral representation is, and thus the smaller the suppression mask αk(i) becomes for own voice suppression. In an alternative embodiment, similar to the three-phase process in FIG. 3E, the suppression mask αk(i) for a frequency bin k of the current speech spectral representation is determined according to L matching scores, L average speech power values and L average product complex values of L frequency bins adjacent to the frequency bin k. For example, the computing unit 25I calculates L suppression masks of the L frequency bins adjacent to the frequency bin k according to the L matching scores, L average speech power values and L average product complex values of the L frequency bins, calculates an average value of the L suppression masks and then updates the suppression mask αk(i) for the frequency bin k with the average value, where L>=1.
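By way of illustration only, the per-bin calculation of equation (13) may be sketched in Python as follows. The sketch assumes that the two expectations are estimated by first-order recursive (exponential) smoothing with a hypothetical smoothing factor b; the class name, the factor b and the fallback mask of 1 when no speech power has been observed are assumptions made for this sketch and are not taken from the embodiments.

```python
class BinMaskEstimator:
    """Tracks sigma_k^2 and rho_k for one frequency bin k and returns alpha_k(i)."""

    def __init__(self, b=0.9):
        self.b = b          # hypothetical smoothing factor, 0 <= b < 1
        self.sigma2 = 0.0   # running estimate of E[|Z_k(i)|^2]
        self.rho = 0j       # running estimate of E[X_k(i) * conj(Z_k(i))]

    def update(self, z_k, x_k, p_k):
        """Consume one frame's speech bin z_k, vibration bin x_k and matching score p_k."""
        self.sigma2 = self.b * self.sigma2 + (1 - self.b) * abs(z_k) ** 2
        self.rho = self.b * self.rho + (1 - self.b) * x_k * z_k.conjugate()
        if self.sigma2 == 0.0:
            return 1.0 + 0j  # assumption: nothing to suppress when no speech power is seen
        # Equation (13): the result is in general complex because rho is complex.
        return (1.0 - p_k) * (self.sigma2 - self.rho) / self.sigma2


# With b = 0 the estimates equal the current frame's values, which makes the
# arithmetic easy to follow: sigma2 = 4, rho = 2, mask = 0.5 * (4 - 2) / 4.
est = BinMaskEstimator(b=0.0)
mask = est.update(2 + 0j, 1 + 0j, 0.5)
```

When the vibration bin equals the speech bin, the product term equals the speech power and the mask becomes zero regardless of the matching score, which reflects the behaviour described above for a bin dominated by the user's own voice.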



FIG. 5D is a block diagram showing an own voice suppression apparatus with a voice identification module and a bone conduction sensor according to another embodiment of the invention. In comparison with the own voice suppression apparatus 50 in FIG. 5A, the signal splitters 301/303 are implemented by the analysis filter banks 301b/303b while the signal synthesizer 302 is implemented by the synthesis filter bank 302b. FIG. 5E is a block diagram showing the computing unit 25J according to an embodiment of the invention. Referring to FIG. 5E, the computing unit 25J includes two power smooth units 391 and 392 and a suppression mask calculation unit 554. According to the equation (10), the matching score Pj, the average speech power value ZPj[n] and the average vibration power value XPj[n], the suppression mask calculation unit 554 calculates the suppression mask αj[n] for the passband j (or the speech sub-band signal Zj[n]) as follows:












$$\alpha_j[n] = \frac{(1 - P_j)\left(ZP_j[n] - XP_j[n]\right)}{ZP_j[n]}, \tag{14}$$







where 0<=αj[n]<=1, j is the passband index and 0<=j<=(M−1).


Please note that the above equation (14) is provided by way of example, but not limitation of the invention. Any other type of equation is applicable to the suppression mask calculation unit 554 as long as it satisfies the inversely proportional relationship between Xj[n] and αj[n], and the inversely proportional relationship between Pj and αj[n]. In brief, the greater the magnitude (or power value) of Xj[n] and/or the magnitude of Pj, the greater the own voice component in the speech sub-band signal Zj[n] is, and thus the smaller the suppression mask αj[n] becomes for own voice suppression. In an alternative embodiment, similar to the three-phase process in FIG. 3E, the suppression mask αj[n] for the speech sub-band signal Zj[n] is determined according to L matching scores and L average speech power values of L speech sub-band signals and L average vibration power values of L vibration sub-band signals, where the passbands of the L speech sub-band signals and the L vibration sub-band signals are adjacent to the passband j. For example, the computing unit 25J calculates L suppression masks of the L speech sub-band signals with their passbands adjacent to the passband j according to the L matching scores and the L average speech power values of the L speech sub-band signals and the L average vibration power values of the L vibration sub-band signals, computes an average value of the L suppression masks and then updates the suppression mask αj[n] for the passband j (or the speech sub-band signal Zj[n]) with the average value, where L>=1.
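As an illustrative numerical check of equation (14), using hypothetical values that are not taken from the embodiments, suppose ZPj[n]=4, XPj[n]=1 and Pj=0.5:

$$\alpha_j[n] = \frac{(1 - 0.5)(4 - 1)}{4} = 0.375$$

With the same power values and Pj=0, the mask rises to 0.75, which is the value given by the power ratio alone; with XPj[n]=ZPj[n], or with Pj=1, the mask falls to 0. Either indicator alone can therefore drive the mask toward zero, consistent with the two inversely proportional relationships stated above.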


Obviously, the own voice suppression apparatus 50/50A/50B has the best performance in suppressing the user's own voice and retaining the environmental sounds because it is assisted by both the own voice indication module 130A and the voice identification module 130B. FIG. 6 shows a relationship among waveforms of the audio signal S1, the vibration signal S2 and the own-voice-suppressed signal S3 according to an embodiment of the invention. Referring to FIG. 6, in the presence of the user's own voice, the magnitude of the audio signal S1 is abnormally large in comparison with the vibration signal S2, but the magnitude of the own-voice-suppressed signal S3 is significantly reduced after own voice suppression.


The own voice suppression apparatus 10/20/30/30A/30B/40/40A/40B/50/50A/50B according to the invention may be implemented in hardware, software, or a combination of hardware and software (or firmware). An example of a pure hardware solution would be a field programmable gate array (FPGA) design or an application specific integrated circuit (ASIC) design. In an embodiment, the suppression module (150/150A˜150J) and the amplification unit 120/120a are implemented with a first general-purpose processor and a first program memory; the own voice reconstruction module 232 is implemented with a second general-purpose processor and a second program memory. The first program memory stores a first processor-executable program and the second program memory stores a second processor-executable program. When the first processor-executable program is executed by the first general-purpose processor, the first general-purpose processor is configured to function as: the amplification unit 120/120a and the suppression module (150/150A˜150J). When the second processor-executable program is executed by the second general-purpose processor, the second general-purpose processor is configured to function as: the own voice reconstruction module 232.


In an alternative embodiment, the amplification unit 120/120a, the own voice reconstruction module 232 and the suppression module (150/150A˜150J) are implemented with a third general-purpose processor and a third program memory. The third program memory stores a third processor-executable program. When the third processor-executable program is executed by the third general-purpose processor, the third general-purpose processor is configured to function as: the amplification unit 120/120a, the own voice reconstruction module 232 and the suppression module (150/150A˜150J).


While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention should not be limited to the specific construction and arrangement shown and described, since various other modifications may occur to those ordinarily skilled in the art.

Claims
  • 1. An own voice suppression apparatus applicable to a hearing aid, comprising: an air conduction sensor for generating an audio signal; an own voice indication module for generating an indication signal according to at least one of user's mouth vibration information and user's voice feature vector comparison result; and a suppression module coupled to the air conduction sensor and the own voice indication module for generating an own-voice-suppressed signal according to the indication signal and the audio signal.
  • 2. The apparatus according to claim 1, wherein the own voice indication module comprises: a bone conduction sensor for measuring vibrations caused by user's mouth movements to output a vibration signal; and an own voice reconstruction module for reconstructing high-frequency components from the vibration signal to generate a reconstructed signal as a first indication signal.
  • 3. The apparatus according to claim 2, wherein the suppression module comprises: a first computing unit for generating a first suppression mask for each sample of the audio signal in time domain according to the reconstructed signal and the audio signal; and a multiplier for multiplying each first suppression mask by its corresponding sample of the audio signal to generate the own-voice-suppressed signal.
  • 4. The apparatus according to claim 3, wherein the first computing unit comprises: a first suppression mask calculation unit for generating a first suppression mask for a current sample of the audio signal according to an average speech power value of the current sample and previous samples of the audio signal and an average vibration power value of a current sample and previous samples of the reconstructed signal; wherein the first suppression mask α and a magnitude of the current sample of the vibration signal are inversely proportional, and 0<=α<=1.
  • 5. The apparatus according to claim 2, wherein the suppression module comprises: a first signal splitter coupled to the air conduction sensor for splitting the audio signal into Q first signal components; a second signal splitter coupled to the bone conduction sensor for splitting the reconstructed signal into Q second signal components; a second computing unit coupled to first signal splitter for generating Q second suppression masks for the Q first signal components; and Q multipliers coupled between first signal splitter and the second computing unit for respectively multiplying the Q second suppression masks by their corresponding first signal components to generate Q multiplied signals; and a first signal synthesizer coupled to the Q multipliers for reconstructing the own-voice-suppressed signal according to the Q multiplied signals, where Q>=1.
  • 6. The apparatus according to claim 5, wherein the first and the second signal splitters are transformers, and the first signal synthesizer is an inverse transformer, wherein the Q first signal components are Q spectral values in Q frequency bins of a current audio spectral representation corresponding to a current frame of the audio signal and wherein the Q second signal components are Q spectral values in Q frequency bins of a current vibration spectral representation corresponding to a current frame of the reconstructed signal.
  • 7. The apparatus according to claim 6, wherein the second computing unit comprises: a second suppression mask calculation unit for generating L second suppression masks for L frequency bins adjacent to a frequency bin k according to L average speech power values and L average product complex values for the L frequency bins related to the current audio spectral representation and the current vibration spectral representation and for computing an average of the L second suppression masks to generate a second suppression mask for the frequency bin k, where L>=1 and 0<=k<=(Q−1); wherein the second suppression mask for the frequency bin k and a complex value of the frequency bin k in the current vibration spectral representation are inversely proportional when L=1.
  • 8. The apparatus according to claim 5, wherein the first and the second signal splitters are analysis filter banks with Q different passbands, and the first signal synthesizer is a synthesis filter bank, wherein the Q first signal components are Q first sub-band signals in the Q different passbands corresponding to a current sample of the audio signal and wherein the Q second signal components are Q second sub-band signals in the Q different passbands corresponding to a current sample of the reconstructed signal.
  • 9. The apparatus according to claim 8, wherein the second computing unit comprises: a second suppression mask calculation unit for generating L second suppression masks for L first sub-band signals with L passbands adjacent to a passband j according to L average speech power values for the L first sub-band signals and L average vibration power values for L second sub-band signals with the L passbands and for computing an average of the L second suppression masks to generate a second suppression mask α for a first sub-band signal with the passband j, where 0<=α<=1, L>=1 and 0<=j<=(Q−1); wherein the second suppression mask for the first sub-band signal with the passband j and a magnitude of a second sub-band signal with the passband j are inversely proportional when L=1.
  • 10. The apparatus according to claim 5, wherein the own voice indication module further comprises: a voice identification module for generating Q matching scores as a second indication signal for the Q first signal components according to the audio signal.
  • 11. The apparatus according to claim 10, wherein the voice identification module comprises: an audio embedding extraction unit comprising: a neural network configured to transform a user utterance into a feature vector; and an average unit for computing an average of multiple feature vectors transformed from multiple user utterances during an enrollment stage to generate a user vector; a storage device coupled to the average unit for storing the user vector; and an embedding match calculation unit coupled to the neural network and the storage device for performing cosine similarity between the user vector from the storage device and the feature vector from the neural network to generate the Q matching scores corresponding to Q first signal components in an evaluation stage.
  • 12. The apparatus according to claim 10, wherein the first and the second signal splitters are transformers, and the first signal synthesizer is an inverse transformer, wherein the Q first signal components are Q spectral values in Q frequency bins of a current audio spectral representation corresponding to a current frame of the audio signal and wherein the Q second signal components are Q spectral values in Q frequency bins of a current vibration spectral representation corresponding to a current frame of the indication signal.
  • 13. The apparatus according to claim 10, wherein the second computing unit comprises: a second suppression mask calculation unit for generating L second suppression masks for L frequency bins adjacent to a frequency bin k according to L matching scores for the L frequency bins, L average speech power values and L average complex values for the L frequency bins related to the current audio spectral representation and the current vibration spectral representation and for computing an average of the L second suppression masks to generate a second suppression mask for the frequency bin k, where L>=1 and 0<=k<=(Q−1); wherein when L=1, the second suppression mask for the frequency bin k and a complex value of the frequency bin k in the current vibration spectral representation are inversely proportional, and the second suppression mask and a matching score for the frequency bin k are inversely proportional.
  • 14. The apparatus according to claim 10, wherein the first and the second signal splitters are analysis filter banks with Q different passbands, and the first signal synthesizer is a synthesis filter bank, wherein the Q first signal components are Q first sub-band signals in the Q different passbands corresponding to a current sample of the audio signal and wherein the Q second signal components are Q second sub-band signals in the Q different passbands corresponding to a current sample of the reconstructed signal.
  • 15. The apparatus according to claim 14, wherein the second computing unit comprises: a second suppression mask calculation unit for generating L second suppression masks for L first sub-band signals with L passbands adjacent to a passband j according to L matching scores and L average speech power values for the L first sub-band signals and L average vibration power values for L second sub-band signals with the L passbands and for computing an average of the L second suppression masks to generate a second suppression mask α for a first sub-band signal with the passband j, where 0<=α<=1, L>=1 and 0<=j<=(Q−1); wherein when L=1, the second suppression mask and a matching score for the first sub-band signal with the passband j are inversely proportional and the second suppression mask and a magnitude of a second sub-band signal with the passband j are inversely proportional.
  • 16. The apparatus according to claim 1, wherein the own voice indication module comprises: a voice identification module for receiving the audio signal to generate Q matching scores for Q signal components as the indication signal, where Q>=1.
  • 17. The apparatus according to claim 16, wherein the suppression module comprises: a third signal splitter for splitting the audio signal into the Q third signal components; a third computing unit coupled to the third signal splitter for generating Q third suppression masks for the Q third signal components; and Q multipliers coupled to the third signal splitter and the third computing unit for respectively multiplying the Q third suppression masks by their corresponding third signal components to generate Q multiplied signals; and a second signal synthesizer for reconstructing the own-voice-suppressed signal according to the Q multiplied signals.
  • 18. The apparatus according to claim 17, wherein the third signal splitter is a transformer, and the second signal synthesizer is an inverse transformer, wherein the Q third signal components are Q spectral values in Q frequency bins of a current audio spectral representation corresponding to a current frame of the audio signal.
  • 19. The apparatus according to claim 18, wherein the third computing unit comprises: a third suppression mask calculation unit for generating L third suppression masks for L frequency bins adjacent to a frequency bin k according to L matching scores for the L frequency bins and for computing an average of the L suppression masks to generate a third suppression mask α for the frequency bin k, where 0<=α<=1, L>=1 and 0<=k<=(Q−1); wherein when L=1, the third suppression mask and a matching score for the frequency bin k are inversely proportional.
  • 20. The apparatus according to claim 17, wherein the third signal splitter is an analysis filter bank with Q different passbands, and the second signal synthesizer is a synthesis filter bank, wherein the Q third signal components are Q third sub-band signals in the Q different passbands corresponding to a current sample of the audio signal.
  • 21. The apparatus according to claim 20, wherein the third computing unit comprises: a third suppression mask calculation unit for generating L third suppression masks for L passbands adjacent to a passband j according to L matching scores P, for the L passbands and for computing an average of the L suppression masks to generate a third suppression mask α for the passband j, where 0<=α<=1, L>=1 and 0<=j<=(Q−1);wherein when L=1, the third suppression mask and a matching score for the passband j are inversely proportional.
  • 22. The apparatus according to claim 16, wherein the voice identification module comprises: an audio embedding extraction unit comprising: a neural network for transforming a user utterance into a feature vector; andan average unit for computing an average of multiple feature vectors during an enrollment stage to generate a user vector;a storage device coupled to the average unit for storing the user vector; andan embedding match calculation unit coupled to the neural network and the storage device for performing cosine similarity between the user vector from the storage device and the feature vector from the neural network to generate the Q matching scores corresponding to the Q third signal components in an evaluation stage.
  • 23. An own voice suppression method applicable to a hearing aid, comprising: providing an audio signal by an air conduction sensor;generating an indication signal according to at least one of user's mouth vibration information and user's voice feature vector comparison result; andgenerating an own-voice-suppressed signal according to the audio signal and the indication signal.
  • 24. The method according to claim 23, wherein the step of generating the indication signal comprises: measuring vibrations caused by user's mouth movements by a bone conduction sensor to generate a vibration signal; and reconstructing high-frequency components from the vibration signal to generate a reconstructed signal as a first indication signal.
  • 25. The method according to claim 24, wherein the step of generating the own-voice-suppressed signal comprises: generating a first suppression mask for each sample of the audio signal in time domain according to the reconstructed signal and the audio signal; and multiplying each first suppression mask by its corresponding sample of the audio signal to generate the own-voice-suppressed signal.
  • 26. The method according to claim 25, wherein the step of generating the first suppression mask comprises: generating the first suppression mask for a current data sample of the audio signal according to an average speech power value of a current and previous data samples of the audio signal and an average vibration power value of a current and previous data samples of the vibration signal; wherein the first suppression mask α and a power value of the current data sample of the vibration signal are inversely proportional, and 0<=α<=1.
  • 27. The method according to claim 24, wherein the step of generating the own-voice-suppressed signal comprises: splitting the audio signal by a first signal splitter into Q first signal components; splitting the reconstructed signal by a second signal splitter into Q second signal components; generating Q second suppression masks for the Q first signal components; and respectively multiplying the Q second suppression masks by their corresponding first signal components to generate Q multiplied signals; and reconstructing the own-voice-suppressed signal by a first signal synthesizer according to the Q multiplied signals, where Q>=1.
  • 28. The method according to claim 27, wherein the first and the second signal splitters are transformers, and the first signal synthesizer is an inverse transformer, wherein the Q first signal components are Q spectral values in Q frequency bins of a current audio spectral representation corresponding to a current frame of the audio signal and wherein the Q second signal components are Q spectral values in Q frequency bins of a current vibration spectral representation corresponding to a current frame of the reconstructed signal.
  • 29. The method according to claim 28, wherein the step of generating the Q second suppression masks comprises: generating L second suppression masks for L frequency bins adjacent to a frequency bin k according to L average speech power values and L average product complex values for the L frequency bins related to the current audio spectral representation and the current vibration spectral representation; and computing an average of the L second suppression masks to generate a second suppression mask for the frequency bin k, where L>=1 and 0<=k<=(Q−1); wherein the second suppression mask for the frequency bin k and a complex value of the frequency bin k in the current vibration spectral representation are inversely proportional when L=1.
  • 30. The method according to claim 27, wherein the first and the second signal splitters are analysis filter banks with Q different passbands, and the first signal synthesizer is a synthesis filter bank, wherein the Q first signal components are Q first sub-band signals in the Q different passbands corresponding to a current sample of the audio signal and wherein the Q second signal components are Q second sub-band signals in the Q different passbands corresponding to a current sample of the reconstructed signal.
  • 31. The method according to claim 30, wherein the step of generating the Q second suppression masks comprises: generating L second suppression masks for L first sub-band signals with L passbands adjacent to a passband j according to L average speech power values for the L first sub-band signals and L average vibration power values for L second sub-band signals with the L passbands; and computing an average of the L second suppression masks to generate a second suppression mask α for a first sub-band signal with the passband j, where 0<=α<=1, L>=1 and 0<=j<=(Q−1); wherein the second suppression mask for the first sub-band signal with the passband j and a magnitude of a second sub-band signal with the passband j are inversely proportional when L=1.
  • 32. The method according to claim 27, further comprising: generating Q matching scores for the Q first signal components as a second indication signal according to the audio signal.
  • 33. The method according to claim 32, wherein the step of generating the Q matching scores comprises: transforming multiple user utterances into multiple feature vectors using a neural network in an enrollment stage; computing an average of the multiple feature vectors to generate a user vector in the enrollment stage; transforming a user utterance into a feature vector using the neural network in an evaluation stage; and performing cosine similarity between the user vector and the feature vector to generate the Q matching scores corresponding to the Q first signal components in the evaluation stage.
  • 34. The method according to claim 32, wherein the first and the second signal splitters are transformers, and the first signal synthesizer is an inverse transformer, wherein the Q first signal components are Q spectral values in Q frequency bins of a current audio spectral representation corresponding to a current frame of the audio signal and wherein the Q second signal components are Q spectral values in Q frequency bins of a current vibration spectral representation corresponding to the current frame of the indication signal.
  • 35. The method according to claim 32, wherein the step of generating Q second suppression masks comprises: generating L second suppression masks for L frequency bins adjacent to a frequency bin k according to L matching scores for the L frequency bins, L average speech power values and L average complex values for the L frequency bins related to the current audio spectral representation and the current vibration spectral representation; and computing an average of the L second suppression masks to generate a second suppression mask for the frequency bin k, where L>=1 and 0<=k<=(Q−1); wherein when L=1, the second suppression mask for the frequency bin k and a complex value of the frequency bin k in the current vibration spectral representation are inversely proportional, and the second suppression mask and a matching score for the frequency bin k are inversely proportional.
  • 36. The method according to claim 30, wherein the first and the second signal splitters are analysis filter banks with Q different passbands, and the first signal synthesizer is a synthesis filter bank, wherein the Q first signal components are Q first sub-band signals in the Q different passbands corresponding to a current sample of the audio signal and wherein the Q second signal components are Q second sub-band signals in the Q different passbands corresponding to a current sample of the reconstructed signal.
  • 37. The method according to claim 30, wherein the step of generating the Q second suppression masks comprises: generating L second suppression masks for L first sub-band signals with L passbands adjacent to a passband j according to L matching scores and L average speech power values for the L first sub-band signals and L average vibration power values for L second sub-band signals with the L passbands; and computing an average of the L second suppression masks to generate a second suppression mask α for a first sub-band signal with the passband j, where 0<=α<=1, L>=1 and 0<=j<=(Q−1); wherein when L=1, the second suppression mask and a matching score for the first sub-band signal with the passband j are inversely proportional, and the second suppression mask for the first sub-band signal with the passband j and a magnitude of a second sub-band signal with the passband j are inversely proportional.
  • 38. The method according to claim 23, further comprising: generating Q matching scores for Q third signal components as the indication signal according to the audio signal, where Q>=1.
  • 39. The method according to claim 38, wherein the step of generating the own-voice-suppressed signal comprises: splitting the audio signal by a third signal splitter into the Q third signal components; generating Q third suppression masks for the Q third signal components; and respectively multiplying the Q third suppression masks by their corresponding third signal components to generate Q multiplied signals; and reconstructing the own-voice-suppressed signal by a second signal synthesizer according to the Q multiplied signals.
  • 40. The method according to claim 39, wherein the third signal splitter is a transformer, and the second signal synthesizer is an inverse transformer, wherein the Q third signal components are Q spectral values in Q frequency bins of a current audio spectral representation corresponding to a current frame of the audio signal.
  • 41. The method according to claim 40, wherein the step of generating the Q third suppression masks comprises: generating L third suppression masks for L frequency bins adjacent to a frequency bin k according to L matching scores for the L frequency bins; and computing an average of the L third suppression masks to generate a third suppression mask α for the frequency bin k, where 0<=α<=1, L>=1 and 0<=k<=(Q−1); wherein when L=1, the third suppression mask and a matching score for the frequency bin k are inversely proportional.
  • 42. The method according to claim 39, wherein the third signal splitter is an analysis filter bank with Q different passbands, and the second signal synthesizer is a synthesis filter bank, wherein the Q third signal components are Q third sub-band signals in the Q different passbands corresponding to a current sample of the audio signal.
  • 43. The method according to claim 42, wherein the step of generating the Q third suppression masks comprises: generating L third suppression masks for L third sub-band signals with L passbands adjacent to a passband j according to L matching scores for the L third sub-band signals; and computing an average of the L third suppression masks to generate a third suppression mask α for a third sub-band signal with the passband j, where 0<=α<=1, L>=1 and 0<=j<=(Q−1); wherein when L=1, the third suppression mask and a matching score for the third sub-band signal with the passband j are inversely proportional.
  • 44. The method according to claim 38, wherein the step of generating the Q matching scores comprises: transforming multiple user utterances into multiple feature vectors using a neural network in an enrollment stage; computing an average of the multiple feature vectors to generate a user vector in the enrollment stage; transforming a user utterance into a feature vector using the neural network in an evaluation stage; and performing cosine similarity between the user vector and the feature vector to generate the Q matching scores corresponding to the Q third signal components in the evaluation stage.
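The matching-score path recited in claims 38 to 44 can be illustrated with a minimal sketch. It is not the claimed implementation: the function names (`enroll`, `cosine_similarity`, `suppression_masks`, `apply_masks`) are hypothetical, the neural network is omitted so feature vectors are passed in directly, and the Q matching scores are supplied as plain numbers because the claims do not say how per-component scores are derived from the embedding. The claims call the mask and the matching score "inversely proportional" when L=1; the sketch assumes the simple decreasing map α = 1 − P with scores clamped to [0, 1], and assumes the L adjacent bins form a window around bin k that is truncated at the band edges.

```python
import math
from typing import List, Sequence


def enroll(feature_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average several enrollment feature vectors into one user vector."""
    n = len(feature_vectors)
    dim = len(feature_vectors[0])
    return [sum(fv[i] for fv in feature_vectors) / n for i in range(dim)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def suppression_masks(scores: Sequence[float], L: int = 1) -> List[float]:
    """Map Q matching scores to Q masks in [0, 1], averaged over L bins.

    Assumption: the per-bin mask is 1 - P (P clamped to [0, 1]), so a
    higher matching score gives stronger suppression.
    Assumption: the window starts L // 2 bins below k, spans L bins,
    and is truncated at the edges of the band.
    """
    q = len(scores)
    raw = [1.0 - min(max(p, 0.0), 1.0) for p in scores]
    half = L // 2
    masks = []
    for k in range(q):
        lo = max(0, k - half)
        hi = min(q, k - half + L)
        window = raw[lo:hi]
        masks.append(sum(window) / len(window))
    return masks


def apply_masks(components: Sequence[float], masks: Sequence[float]) -> List[float]:
    """Multiply each signal component by its suppression mask."""
    return [c * m for c, m in zip(components, masks)]


if __name__ == "__main__":
    user_vector = enroll([[1.0, 0.0], [0.0, 1.0]])
    print(cosine_similarity(user_vector, [1.0, 1.0]))
    masks = suppression_masks([1.0, 0.0, 0.5], L=1)
    print(apply_masks([2.0, 4.0, 6.0], masks))
```

With L=1 each mask depends only on the score for its own bin, which is the case the claims single out. A larger L smooths the masks across neighboring bins at the cost of less selective suppression.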
CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority under 35 USC 119(e) to U.S. provisional application No. 63/075,310, filed on Sep. 8, 2020, the content of which is incorporated herein by reference in its entirety.

Provisional Applications (1)
Number        Date         Country
63/075,310    Sep. 2020    US