BACKGROUND OF THE INVENTION
Field of the Invention
The invention relates to voice signal processing, and more particularly, to an apparatus and method for own voice suppression applicable to a hearing aid.
Description of the Related Art
The aim of a hearing aid is to offer the best clarity and intelligibility in the presence of background noise or competing speech. Since the hearing aid is very close to the user's mouth, the most common complaint of hearing aid users is that their own voice sounds abnormally loud while they are speaking. This not only leaves the user feeling irritable or agitated, but also masks environmental sounds. Moreover, there is a potential risk of damaging the hearing of hearing aid users.
YAN disclosed a deep learning method of voice extraction and noise reduction that combines bone conduction sensor and microphone signals in China Patent Pub. No. CN 110931031A. A high-pass filtering or frequency band extending operation is performed over an audio signal from a bone conduction sensor to produce a processed signal. Then, both the processed signal and a microphone audio signal are fed into a deep neural network (DNN) module. Finally, the deep neural network module obtains the voice after noise reduction through prediction. Although YAN successfully extracts a target human voice in a complex noise scene and reduces interference noise, YAN fails to deal with the problem of the user's own voice being abnormally loud while the user is speaking.
The perception and acceptance of hearing aids is likely to be improved if the volume of the user's own voice can be reduced while the user is speaking.
SUMMARY OF THE INVENTION
In view of the above-mentioned problems, an object of the invention is to provide an own voice suppression apparatus for hearing aid users to improve comfort and speech intelligibility.
One embodiment of the invention provides an own voice suppression apparatus. The own voice suppression apparatus applicable to a hearing aid comprises: an air conduction sensor, an own voice indication module and a suppression module. The air conduction sensor is configured to generate an audio signal. The own voice indication module is configured to generate an indication signal according to at least one of user's mouth vibration information and user's voice feature vector comparison result. The suppression module coupled to the air conduction sensor and the own voice indication module is configured to generate an own-voice-suppressed signal according to the indication signal and the audio signal.
Another embodiment of the invention provides an own voice suppression method. The own voice suppression method, applicable to a hearing aid, comprises: providing an audio signal by an air conduction sensor; generating an indication signal according to at least one of user's mouth vibration information and user's voice feature vector comparison result; and generating an own-voice-suppressed signal according to the audio signal and the indication signal.
Further scope of the applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, and wherein:
FIG. 1 is a block diagram showing an own voice suppression apparatus according to the invention.
FIG. 2A is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to an embodiment of the invention.
FIG. 2B is a block diagram showing the computing unit 25A according to an embodiment of the invention.
FIG. 3A is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to an embodiment of the invention.
FIG. 3B is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to another embodiment of the invention.
FIG. 3C shows a relationship between an own voice complex-valued sample Xk and a speech complex-valued sample Zk for the same frequency bin k.
FIG. 3D is a block diagram showing the computing unit 25C according to an embodiment of the invention.
FIG. 3E shows a timing diagram of calculating the suppression mask αk(i) according to three (i.e., L=3) average speech power values and three average product complex values of three frequency bins in the invention.
FIG. 3F is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to another embodiment of the invention.
FIG. 3G is a block diagram showing the computing unit 25D according to an embodiment of the invention.
FIG. 4A is a block diagram showing an own voice suppression apparatus with a voice identification module according to an embodiment of the invention.
FIG. 4B is a block diagram showing an own voice suppression apparatus with a voice identification module according to another embodiment of the invention.
FIG. 4C is a block diagram showing a voice identification module according to an embodiment of the invention.
FIG. 4D is a block diagram showing an own voice suppression apparatus with a voice identification module according to another embodiment of the invention.
FIG. 5A is a block diagram showing an own voice suppression apparatus with a voice identification module and a bone conduction sensor according to an embodiment of the invention.
FIG. 5B is a block diagram showing an own voice suppression apparatus with a voice identification module and a bone conduction sensor according to another embodiment of the invention.
FIG. 5C is a block diagram showing the computing unit 25I according to an embodiment of the invention.
FIG. 5D is a block diagram showing an own voice suppression apparatus with a voice identification module and a bone conduction sensor according to another embodiment of the invention.
FIG. 5E is a block diagram showing the computing unit 25J according to an embodiment of the invention.
FIG. 6 shows the waveforms of the audio signal S1, the vibration signal S2 and the own-voice-suppressed signal S3.
DETAILED DESCRIPTION OF THE INVENTION
As used herein and in the claims, the term “and/or” includes any and all combinations of one or more of the associated listed items. The use of the terms “a” and “an” and “the” and similar referents in the context of describing the invention are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. Throughout the specification, the same components and/or components with the same function are designated with the same reference numerals.
A feature of the invention is to use at least one of a bone conduction sensor 231 and a voice identification module 130B to identify/detect in which frequency bins (or passbands) the user's own voice components are located and then suppress/reduce the user's own voice components according to their respective power levels in the detected frequency bins or passbands, so as to avoid damaging the user's hearing and masking environmental sounds. Thus, it is likely to improve comfort and speech intelligibility for the hearing aid users.
FIG. 1 is a block diagram showing an own voice suppression apparatus according to the invention. Referring to FIG. 1, an own voice suppression apparatus 10 of the invention, applicable to a hearing aid, includes an air conduction sensor 110, an amplification unit 120, an own voice indication module 130 and a suppression module 150. The air conduction sensor 110 may be implemented by an electret condenser microphone (ECM) or a micro-electro-mechanical system (MEMS) microphone. The air conduction sensor 110 receives both the user's voices/speech/utterances and the environmental sounds to output an audio signal S1.
The amplification unit 120 is configured to increase the magnitude of its input audio signal S1 by a voltage gain to generate an amplified signal Z[n], where n denotes the discrete time index. The own voice indication module 130 generates an indication signal X[n] according to at least one of user's mouth vibration information (e.g., a vibration signal S2 from a bone conduction sensor 231) and user's voice feature vector comparison result (e.g., matching scores from a voice identification module 130B). The suppression module 150 calculates a suppression mask according to the amplified signal Z[n] and the indication signal X[n], suppresses the power level of the own voice component contained in the amplified signal Z[n] and generates an own-voice-suppressed signal S3.
FIG. 2A is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to an embodiment of the invention. Referring to FIG. 2A, an own voice suppression apparatus 20A of the invention, applicable to a hearing aid, includes an air conduction sensor 110, a multiplier 120a, an own voice indication module 130A and a suppression module 150A. In this embodiment, the amplification unit 120 in FIG. 1 is implemented by a multiplier 120a and the voltage gain varies according to the magnitude of its input audio signal S1 so that the magnitude of the amplified signal Z[n] falls within a predefined range. The amplification unit 120/120a is optional.
The own voice indication module 130A includes a bone conduction sensor 231 and an own voice reconstruction module 232. The bone conduction sensor 231 may be implemented by a MEMS voice accelerometer. As is well known in the art, a voice accelerometer is configured to measure vibrations caused by speech/voice/mouth movement of the user, particularly at low frequencies, to output a vibration signal S2. The audio signal S1 and the vibration signal S2 may be analog or digital. If the signals S1 and S2 are analog, they may be digitized using techniques well known in the art. It is assumed that the amplified signal Z[n] and the reconstructed signal X[n] are digitized before being fed to the suppression module 150A. In general, the human voice/speech spans a range from about 125 Hz to 20 kHz. However, the bandwidth of the vibration signal S2 is normally restricted to a range from 0 to 3 kHz depending on the specification of the bone conduction sensor 231, and thus the vibration signal S2 usually sounds muffled. To solve this problem, the own voice reconstruction module 232 reconstructs the lost high-frequency components from the vibration signal S2, whose frequency range is below 3 kHz, by any existing or yet-to-be-developed audio bandwidth extension approach or high frequency reconstruction algorithm to generate a reconstructed signal X[n] with a frequency range extended up to 20 kHz. In an embodiment, the own voice reconstruction module 232 includes a deep neural network (not shown) that extracts feature values from the vibration signal S2 and then reconstructs its high-frequency components to generate a reconstructed signal X[n]. The deep neural network may be one or a combination of a recurrent neural network (RNN) and a convolutional neural network (CNN).
Assume that the noisy speech signal Z[n] can be expressed as Z[n]=v[n]+d[n], where v[n] is the clean speech, d[n] is the additive noise and n denotes the discrete time index. The suppression module 150A includes a computing unit 25A and a real-value multiplier 255. The computing unit 25A calculates a corresponding suppression mask α[n] (i.e., sample by sample) according to the amplified signal Z[n] and the reconstructed signal X[n], where 0<=α[n]<=1. FIG. 2B is a block diagram showing the computing unit 25A according to an embodiment of the invention. Referring to FIG. 2B, the computing unit 25A includes two power smooth units 251 and 252 and a suppression mask calculation unit 253. In order to reduce noise interference, the speech power estimation is done in the power smooth unit 251 by averaging speech power values of the past and the current data samples of the amplified signal Z[n] while the vibration power estimation is done in the power smooth unit 252 by averaging vibration power values of the past and the current data samples of the reconstructed signal X[n], using a smoothing parameter. In one embodiment, the following infinite impulse response (IIR) equations are provided for the two power smooth units 251-252 to obtain an average speech power value ZP[n] for the amplified signal Z[n] and an average vibration power value XP[n] for the reconstructed signal X[n]:
$$ZP[n] = (1-b) \times ZP[n-1] + b \times Z^{2}[n] \tag{1}$$
$$XP[n] = (1-b) \times XP[n-1] + b \times X^{2}[n] \tag{2}$$
where b is a smoothing parameter whose value is selected from the range [0, 1].
According to the disclosure “Single Channel Speech Enhancement: using Wiener Filtering with Recursive Noise estimation”, disclosed by Upadhyay et al., Procedia Computer Science 84 (2016) 22-30, the gain Hwiener(ω) of a Wiener filter with recursive noise estimation is given by:
$$H_{wiener}(\omega) = \frac{P_{SP}(\omega) - P_{NP}(\omega)}{P_{SP}(\omega)} \tag{3}$$
where PSP(ω) is the noisy speech power spectrum, PNP(ω) is the noise power spectrum and ω is the frequency bin index. According to the equation (3), the suppression mask calculation unit 253 calculates the suppression mask α[n] for the current sample Z[n] in time domain as follows:
$$\alpha[n] = \frac{ZP[n] - XP[n]}{ZP[n]} \tag{4}$$
where 0<=α[n]<=1.
Please note that the above equation (4) is provided by way of example and not as a limitation of the invention. Any other type of equation is applicable to the suppression mask calculation unit 253 as long as it satisfies the inversely proportional relationship between X[n] and α[n]. In brief, the greater the magnitude (or the power value) of X[n], the greater the own voice component contained in Z[n] and thus the smaller the suppression mask α[n] becomes for own voice suppression.
Then, the multiplier 255 is configured to multiply the amplified signal Z[n] by its corresponding suppression mask α[n] (sample by sample) to generate the own-voice-suppressed signal S3[n]. In this manner, the invention avoids abnormally high volume of hearing aids while the user is speaking. However, since the multiplication of the amplified signal Z[n] and the suppression mask α[n] is operated in time domain, it is likely that the environmental sounds as well as the user's voices contained in the amplified signal Z[n] would be suppressed at the same time.
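The sample-by-sample processing of the computing unit 25A and the multiplier 255 can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: it assumes the mask takes the Wiener-like form (ZP[n] − XP[n])/ZP[n], and the function name, the zero initial averages, the default b=0.1 and the small constant eps (which avoids division by zero) are all assumptions introduced here.

```python
def suppress_own_voice(z, x, b=0.1, eps=1e-12):
    """Return own-voice-suppressed samples S3[n] for amplified samples z and
    reconstructed samples x (hypothetical helper, for illustration only)."""
    zp = 0.0  # running average speech power ZP[n-1], assumed to start at 0
    xp = 0.0  # running average vibration power XP[n-1], assumed to start at 0
    out = []
    for z_n, x_n in zip(z, x):
        zp = (1.0 - b) * zp + b * z_n * z_n    # equation (1)
        xp = (1.0 - b) * xp + b * x_n * x_n    # equation (2)
        alpha = (zp - xp) / (zp + eps)         # assumed form of equation (4)
        alpha = min(1.0, max(0.0, alpha))      # keep 0 <= alpha[n] <= 1
        out.append(alpha * z_n)                # multiplier 255
    return out
```

When x is silent the mask stays near 1 and z passes through almost unchanged; when x carries as much power as z the mask drops to 0, which also attenuates any environmental sound present in the same samples.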
FIG. 3A is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to an embodiment of the invention. Referring to FIG. 3A, an own voice suppression apparatus 30 of the invention, applicable to a hearing aid, includes an air conduction sensor 110, a multiplier 120a, an own voice indication module 130A and a suppression module 150B. The suppression module 150B includes a computing unit 25B, Q multipliers 310, two signal splitters 301/303 and a signal synthesizer 302. The signal splitter 301 splits the input signal Z[n] into Q first signal components (Z0˜ZQ−1) and the signal splitter 303 splits the input signal X[n] into Q second signal components (X0˜XQ−1), where Q>=1. Then, the computing unit 25B calculates Q suppression masks (α0˜αQ−1) according to the Q first signal components (Z0˜ZQ−1) and the Q second signal components (X0˜XQ−1). The Q multipliers 310 respectively multiply the Q suppression masks (α0˜αQ−1) by their corresponding first signal components (Z0˜ZQ−1) to generate Q multiplied signals (Y0˜YQ−1). Finally, the signal synthesizer 302 reconstructs the own-voice-suppressed signal S3 in time domain according to the Q multiplied signals (Y0˜YQ−1). The signal splitters 301/303 in FIGS. 3A, 4A and 5A may be implemented by either transformers 301a/303a or analysis filter banks 301b/303b while the signal synthesizer 302 in FIGS. 3A, 4A and 5A may be implemented by either an inverse transformer 302a or a synthesis filter bank 302b. Please note that the multipliers 310 may be implemented by complex-value multipliers 311 (when the Z0˜ZQ−1 values are complex values) or real-value multipliers 255 (when the Z0˜ZQ−1 values are real values).
FIG. 3B is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to another embodiment of the invention. Compared with FIG. 3A, the signal splitters 301/303 are implemented by transformers 301a/303a while the signal synthesizer 302 is implemented by an inverse transformer 302a. Accordingly, the computing unit 25C calculates N suppression masks αk(i) for N frequency bins according to a current speech spectral representation for a current frame i of the amplified signal Z[n] and a current vibration spectral representation for a current frame i of the reconstructed signal X[n], where 0<=k<=(N−1), N is the length of each frame and i is the current frame index.
The transformers 301a and 303a are implemented to perform a fast Fourier transform (FFT), a short-time Fourier transform (STFT) or a discrete Fourier transform (DFT) over their input signals. Specifically, the transformers 301a and 303a respectively convert audio data of current frames of the signals Z[n] and X[n] in time domain into complex data (Z0˜ZN−1 and X0˜XN−1) in frequency domain. The inverse transformer 302a is used to transform the complex data (Y0˜YN−1) in frequency domain into the audio signal S3 in time domain for each frame. For purpose of clarity and ease of description, hereinafter, the following examples and embodiments will be described with the transformers 301a and 303a performing the FFT operations over each frame of their input signals. Assuming that a number of sampling points (or FFT size) is N and the time duration for each frame is Td, the transformer 303a divides the reconstructed signal X[n] into a plurality of frames and computes the FFT of the current frame i to generate a current vibration spectral representation having N complex-valued samples (X0˜XN−1) with a frequency resolution of fs/N(=1/Td). Here, fs denotes a sampling frequency of the reconstructed signal X[n] and each frame corresponds to a different time interval of the reconstructed signal X[n]. Likewise, the transformer 301a divides the amplified signal Z[n] into a plurality of frames and computes the FFT of the current frame i to generate a current speech spectral representation having N complex-valued samples (Z0˜ZN−1) with a frequency resolution of fs/N. In a preferred embodiment, the time duration Td of each frame is about 8˜32 milliseconds (ms), and successive frames overlap by less than Td, such as by Td/2.
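The framing and transform step can be illustrated with a short Python sketch. The values fs = 16 kHz and N = 256 are assumptions chosen only so that Td = N/fs = 16 ms falls inside the 8˜32 ms range stated above; the helper names are hypothetical, no analysis window is applied, and a naive DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath

def split_frames(signal, n, hop):
    """Split a sample list into frames of length n that start every hop samples."""
    return [signal[s:s + n] for s in range(0, len(signal) - n + 1, hop)]

def dft(frame):
    """Naive O(N^2) DFT; a practical transformer would use an FFT routine."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

fs = 16000            # assumed sampling frequency in Hz
n = 256               # assumed FFT size N, so Td = N / fs = 16 ms
hop = n // 2          # successive frames overlap by Td/2
resolution = fs / n   # frequency resolution fs/N = 1/Td, here 62.5 Hz
```

Each frame returned by split_frames would be passed to dft (or an FFT) to obtain the N complex-valued samples for the current frame i.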
FIG. 3C shows a relationship between a vibration complex-valued sample Xk and a speech complex-valued sample Zk for the same frequency bin k. Referring to FIG. 3C, two vectors $\vec{X_k}$ and $\vec{Z_k}$, respectively representing two complex-valued samples Xk and Zk for the same frequency bin k, point in different directions. A vector $\tau_k\vec{Z_k}$, the projection of $\vec{X_k}$ on $\vec{Z_k}$, represents an own voice component on $\vec{Z_k}$. According to the definition of the linear minimum mean square error (MMSE) estimator (see https://en.wikipedia.org/wiki/Minimum_mean_square_error), we deduce the suppression mask αk for frequency bin k as follows. Since the two vectors $(\vec{X_k}-\tau_k\vec{Z_k})$ and $\vec{Z_k}$ are orthogonal,
$$E\left[(X_k-\tau_k Z_k)(Z_k)^*\right]=0 \;\Rightarrow\; E\left[X_k(Z_k)^*\right]=\tau_k E\left[Z_k(Z_k)^*\right]=\tau_k E\left[|Z_k|^2\right] \;\Rightarrow\; \tau_k=\frac{E\left[X_k(Z_k)^*\right]}{E\left[|Z_k|^2\right]}$$
where E[.] denotes an expectation value.
After the own voice component $\tau_k\vec{Z_k}$ is subtracted from $\vec{Z_k}$, scalars are calculated as follows:
$$\vec{Z_k}-\tau_k\vec{Z_k}=(1-\tau_k)\vec{Z_k}=\frac{E\left[|Z_k|^2\right]-E\left[X_k(Z_k)^*\right]}{E\left[|Z_k|^2\right]}\,\vec{Z_k}$$
Thus, the suppression mask
$$\alpha_k=1-\tau_k=\frac{E\left[|Z_k|^2\right]-E\left[X_k(Z_k)^*\right]}{E\left[|Z_k|^2\right]} \tag{4a}$$
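As a numerical sanity check of the orthogonality argument, the following Python sketch estimates the projection coefficient from a handful of paired samples of one frequency bin, assuming the standard linear MMSE result τk = E[Xk(Zk)*]/E[|Zk|²] with the expectation replaced by a sample mean. The sample values and the function name are made up for illustration.

```python
def projection_coefficient(xs, zs):
    """Estimate tau_k = E[X Z*] / E[|Z|^2] from paired complex samples of one bin."""
    cross = sum(x * z.conjugate() for x, z in zip(xs, zs)) / len(zs)
    power = sum(abs(z) ** 2 for z in zs) / len(zs)
    return cross / power

xs = [0.5 + 0.2j, -0.3 + 0.4j, 0.1 - 0.6j]   # made-up vibration samples X_k
zs = [1.0 + 0.5j, -0.8 + 0.9j, 0.4 - 1.1j]   # made-up speech samples Z_k
tau = projection_coefficient(xs, zs)

# The residual (X - tau*Z) should be orthogonal to Z, so this sum is close to 0.
residual = sum((x - tau * z) * z.conjugate() for x, z in zip(xs, zs))
alpha = 1 - tau   # corresponding suppression mask for this bin
```

Because τk is complex in general, the resulting mask 1 − τk is complex as well.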
FIG. 3D is a block diagram showing the computing unit 25C according to an embodiment of the invention. Referring to FIG. 3D, the computing unit 25C includes two complex-value multipliers 312, a complex conjugate block 355, two smooth units 351 and 352 and a suppression mask calculation unit 353. According to the current speech spectral representation, one complex-value multiplier 312 multiplies each complex-valued sample Zk(i) by its complex conjugate Z*k(i) from the complex conjugate block 355 to generate a product $|Z_k(i)|^2$. The smooth unit 351 first obtains the power level $|Z_k(i)|^2$ for each frequency bin k to form a current speech power spectrum for the current frame i of the amplified signal Z[n] according to the equation: $|Z_k(i)|^2 = z_{kr}^2 + z_{ki}^2$, where zkr denotes the real part of the complex-valued sample Zk(i), zki denotes the imaginary part of the complex-valued sample Zk(i), and 0<=k<=(N−1). Then, to reduce noise interference, similar to the above equation (1), the following IIR equation (5) is provided for the smooth unit 351 to obtain an average speech power value:
$$\sigma_k^2(i) = (1-b) \times \sigma_k^2(i-1) + b \times |Z_k(i)|^2 \tag{5}$$
where b is a smoothing parameter whose value is selected from the range [0, 1], i is the current frame index and (i−1) is a previous frame index. In other words, $\sigma_k^2(i) = E[|Z_k(i)|^2]$.
According to the current vibration spectral representation and the current speech spectral representation, the complex-value multiplier 312 multiplies each complex-valued sample Xk(i) by the complex conjugate Z*k(i) from the complex conjugate block 355 to generate a product of Xk(i)Zk(i)*. The smooth unit 352 calculates a product complex value Xk(i)Zk(i)* for each frequency bin k to obtain a current product spectrum for the current frame i of the reconstructed signal X[n], where 0<=k<=N−1. Then, similar to the above equations (2) and (5), to reduce noise interference, the following IIR equation (6) is provided for the smooth unit 352 to obtain an average product complex value:
$$\rho_k(i) = (1-b) \times \rho_k(i-1) + b \times X_k(i)\,(Z_k(i))^* \tag{6}$$
In other words, $\rho_k(i) = E[X_k(i)\,(Z_k(i))^*]$.
Afterward, according to the equations (3) and (4a), the average speech power value $\sigma_k^2(i)$ and the average product complex value ρk(i), the suppression mask calculation unit 353 calculates the suppression mask αk(i) for a frequency bin k associated with the current frame i of the amplified signal Z[n] as follows:
$$\alpha_k(i) = \frac{\sigma_k^2(i) - \rho_k(i)}{\sigma_k^2(i)} \tag{7}$$
Please note that the outputted samples (Z0(i)˜ZN−1(i)) from the transformer 301a are complex values, so the suppression masks αk(i) are also complex values and are hereinafter called “complex masks”.
Next, the N complex-value multipliers 311 respectively multiply the N complex-valued samples Zk(i) by the N suppression masks αk(i) for the N frequency bins to generate N complex-valued samples Yk(i), where 0<=k<=N−1. Finally, the inverse transformer 302a performs IFFT over the N complex-valued samples Y0(i)˜YN−1(i) in frequency domain to generate the own-voice-suppressed signal S3 for the current frame i in time domain. Please note that the above equation (7) is provided by way of example and not as a limitation of the invention. Any other type of equation is applicable to the suppression mask calculation unit 353 as long as it satisfies the inversely proportional relationship between Xk(i) and αk(i). In brief, the greater the magnitude of Xk(i), the greater the own voice component in the frequency bin k of the current speech spectral representation is and thus the smaller the suppression mask αk(i) becomes for own voice suppression.
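The per-frame processing of the computing unit 25C and the multipliers 311 can be sketched in Python as below. It is an illustration under stated assumptions: the mask is taken as (σk²(i) − ρk(i))/σk²(i), the running averages are stored in caller-owned lists that start at zero, and the function name, the default b=0.1 and the guard constant eps are introduced here and do not come from the specification.

```python
def update_masks(z_bins, x_bins, sigma2, rho, b=0.1, eps=1e-12):
    """Update the per-bin averages in place for one frame and return
    (complex masks, masked spectrum Y). Hypothetical helper."""
    masks, y_bins = [], []
    for k, (z_k, x_k) in enumerate(zip(z_bins, x_bins)):
        sigma2[k] = (1 - b) * sigma2[k] + b * abs(z_k) ** 2      # equation (5)
        rho[k] = (1 - b) * rho[k] + b * x_k * z_k.conjugate()    # equation (6)
        alpha_k = (sigma2[k] - rho[k]) / (sigma2[k] + eps)       # assumed form of equation (7)
        masks.append(alpha_k)
        y_bins.append(alpha_k * z_k)                             # multiplier 311
    return masks, y_bins
```

The returned Y samples would then be fed to an inverse transform to obtain the time-domain frame. Only bins where the vibration spectrum correlates with the speech spectrum are attenuated, which is the advantage over the single time-domain mask.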
Please note that in equation (7), the suppression mask αk(i) for a frequency bin k is calculated according to the average speech power value $\sigma_k^2(i)$ and the average product complex value ρk(i) of the same frequency bin k. In an alternative embodiment, the suppression mask αk(i) for a frequency bin k is determined according to L average speech power values and L average product complex values of L frequency bins adjacent to the frequency bin k, where L>=1. FIG. 3E shows a timing diagram of calculating the suppression mask αk(i) for a frequency bin k according to three (i.e., L=3) average speech power values and three average product complex values of three frequency bins adjacent to the frequency bin k in the invention. Referring to FIG. 3E, the whole process of calculating the suppression mask αk(i) by the computing unit 25C is divided into three phases. In phase one, the smooth unit 351 respectively calculates three average speech power values $\sigma_{k-1}^2(i)$, $\sigma_k^2(i)$ and $\sigma_{k+1}^2(i)$ for three frequency bins (k−1), k and (k+1) according to the equation (5) and the three power levels $|Z_{k-1}(i)|^2$, $|Z_k(i)|^2$ and $|Z_{k+1}(i)|^2$. Meanwhile, the smooth unit 352 respectively calculates the average product complex values $\rho_{k-1}(i)$, $\rho_k(i)$ and $\rho_{k+1}(i)$ for three frequency bins (k−1), k and (k+1) according to the equation (6) and the three product complex-valued samples $X_{k-1}(i)(Z_{k-1}(i))^*$, $X_k(i)(Z_k(i))^*$ and $X_{k+1}(i)(Z_{k+1}(i))^*$.
In phase two, the suppression mask calculation unit 353 calculates: (i) a suppression mask $\alpha_{k-1}(i)$ for a frequency bin (k−1) according to the equation (7), the average speech power value $\sigma_{k-1}^2(i)$ and the average product complex value $\rho_{k-1}(i)$; (ii) a suppression mask $\alpha_k(i)$ for a frequency bin k according to the equation (7), the average speech power value $\sigma_k^2(i)$ and the average product complex value $\rho_k(i)$; and (iii) a suppression mask $\alpha_{k+1}(i)$ for a frequency bin (k+1) according to the equation (7), the average speech power value $\sigma_{k+1}^2(i)$ and the average product complex value $\rho_{k+1}(i)$. In phase three, the suppression mask calculation unit 353 calculates an average value of the three suppression masks ($\alpha_{k-1}(i)$, $\alpha_k(i)$, $\alpha_{k+1}(i)$) of the three frequency bins ((k−1), k, (k+1)) and then updates the suppression mask αk(i) for the frequency bin k with the average value. Please note that FIG. 3D (i.e., L=1) is a special case of FIG. 3E.
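Phase three of this process, averaging the masks of adjacent frequency bins, can be sketched as follows. The function name is hypothetical, L is assumed to be odd so that the window is centred on bin k, and shrinking the window at the first and last bins is an assumption, since the text does not say how the edges are treated.

```python
def average_adjacent(masks, L=3):
    """Replace each mask by the mean of the L masks centred on its bin (L odd)."""
    half = L // 2
    n = len(masks)
    smoothed = []
    for k in range(n):
        # Assumption: at the spectrum edges, average only the bins that exist.
        lo, hi = max(0, k - half), min(n, k + half + 1)
        window = masks[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

With L=1 the function returns the masks unchanged, matching the remark that FIG. 3D is the special case L=1. The same helper works for real masks and for complex masks, since Python sums either type.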
FIG. 3F is a block diagram showing an own voice suppression apparatus with a bone conduction sensor according to another embodiment of the invention. In comparison with the own voice suppression apparatus 30 in FIG. 3A, the signal splitters 301/303 are implemented by analysis filter banks 301b/303b while the signal synthesizer 302 is implemented by a synthesis filter bank 302b and an adder 302c.
Referring to FIG. 3F, the amplified signal Z[n] is decomposed into M speech sub-band signals Z0[n]˜ZM−1[n] by applying M analysis filters of the analysis filter bank 301b with M different passbands. Likewise, the reconstructed signal X[n] is decomposed into M vibration sub-band signals X0[n]˜XM−1[n] by applying M analysis filters of the analysis filter bank 303b with M different passbands. Thus, each of the speech sub-band signals Z0[n]˜ZM−1[n] (in time domain) carries information on the amplified signal Z[n] in a particular frequency band while each of the vibration sub-band signals X0[n]˜XM−1[n] (in time domain) carries information on the reconstructed signal X[n] in a particular frequency band. In an embodiment, the bandwidths of the M passbands of the M analysis filters of the analysis filter bank 301b/303b are equal. In an alternative embodiment, the bandwidths of the M passbands of the M analysis filters of the analysis filter bank 301b/303b are not equal; moreover, the higher the frequency, the wider the bandwidths of the M passbands of the M analysis filters. Then, the M real-value multipliers 255 respectively multiply M speech sub-band signals Z0[n]˜ZM−1[n] by M suppression masks α0[n]˜αM−1[n] to generate M modified signals B0[n]˜BM−1[n]. Next, M synthesis filters of the synthesis filter bank 302b respectively perform interpolation over the M modified signals B0[n]˜BM−1[n] to generate M interpolated signals. Finally, the M interpolated signals are combined by the adder 302c to reconstruct the own-voice-suppressed signal S3.
Referring to FIG. 3G, the computing unit 25D includes two power smooth units 391 and 392 and a suppression mask calculation unit 393. Analogous to the equations (1) and (2), the following IIR equations are provided for the two power smooth units 391-392 to obtain an average speech power value ZPj[n] for the speech sub-band signal Zj[n] and an average vibration power value XPj[n] for the vibration sub-band signal Xj[n]:
$$ZP_j[n] = (1-b) \times ZP_j[n-1] + b \times Z_j^{2}[n] \tag{8}$$
$$XP_j[n] = (1-b) \times XP_j[n-1] + b \times X_j^{2}[n] \tag{9}$$
where b is a smoothing parameter whose value is selected in between [0, 1], j is the passband index, n is the discrete time index and 0<=j<=(M−1).
Analogous to the equation (4), the suppression mask calculation unit 393 calculates the suppression mask αj[n] for the speech sub-band signal Zj[n] as follows:
$$\alpha_j[n] = \frac{ZP_j[n] - XP_j[n]}{ZP_j[n]} \tag{10}$$
where 0<=αj[n]<=1.
Please note that the outputted samples (Z0[n]˜ZM−1[n]) from the filter bank 301b are real values, so the suppression masks αj[n] are also real values and are hereinafter called “real masks”. Please note that in equation (10), the suppression mask αj[n] for the speech sub-band signal Zj[n] is calculated according to the average speech power value ZPj[n] and the average vibration power value XPj[n] of the same passband j. In an alternative embodiment, similar to the three-phase process in FIG. 3E, the suppression mask αj[n] for the speech sub-band signal Zj[n] corresponding to the passband j is determined according to L average speech power values of L speech sub-band signals and L average vibration power values of L vibration sub-band signals, where L>=1 and the passbands of the L speech sub-band signals and the L vibration sub-band signals are adjacent to the passband j. For example, if L=3, the computing unit 25D computes three suppression masks (αj−1[n], αj[n] and αj+1[n]) of three speech sub-band signals (Zj−1[n], Zj[n] and Zj+1[n]) with their passbands adjacent to the passband j based on equation (10), three average speech power values of the three speech sub-band signals and three average vibration power values of three vibration sub-band signals (Xj−1[n], Xj[n] and Xj+1[n]), computes an average value of the three suppression masks and then updates the suppression mask αj[n] for the speech sub-band signal Zj[n] with the average value. Please note that the above equation (10) is provided by way of example and not as a limitation of the invention. Any other type of equation is applicable to the suppression mask calculation unit 393 as long as it satisfies the inversely proportional relationship between Xj[n] and αj[n]. In brief, the greater the magnitude (or power value) of Xj[n], the greater the own voice component in the passband j (or the speech sub-band signal Zj[n]) is and thus the smaller the suppression mask αj[n] becomes for own voice suppression.
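For the filter bank variant, the per-passband state kept by the computing unit 25D can be sketched as a small Python class that processes one time index at a time. The analysis and synthesis filter banks themselves are not shown; the class name, the zero initial averages, the default b=0.1, the guard constant eps and the mask form (ZPj[n] − XPj[n])/ZPj[n] are assumptions made for illustration.

```python
class SubbandMasker:
    """Keeps running powers per passband and returns masked sub-band samples."""

    def __init__(self, m, b=0.1, eps=1e-12):
        self.b = b
        self.eps = eps
        self.zp = [0.0] * m   # ZP_j[n-1] for each passband j
        self.xp = [0.0] * m   # XP_j[n-1] for each passband j

    def step(self, z_bands, x_bands):
        """Process one time index n; inputs hold M real samples Z_j[n] and X_j[n]."""
        modified = []
        for j, (z, x) in enumerate(zip(z_bands, x_bands)):
            self.zp[j] = (1 - self.b) * self.zp[j] + self.b * z * z   # equation (8)
            self.xp[j] = (1 - self.b) * self.xp[j] + self.b * x * x   # equation (9)
            alpha = (self.zp[j] - self.xp[j]) / (self.zp[j] + self.eps)
            alpha = min(1.0, max(0.0, alpha))   # real mask kept in [0, 1]
            modified.append(alpha * z)          # modified sample B_j[n]
        return modified
```

The modified samples would then go to the synthesis filters and the adder to rebuild the own-voice-suppressed signal.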
FIG. 4A is a block diagram showing an own voice suppression apparatus with a voice identification module according to an embodiment of the invention. In comparison with the own voice suppression apparatus 30 in FIG. 3A, a main difference is that the own voice indication module 130A is replaced with a voice identification module 130B and the signal components (Z0˜ZQ−1) are not fed to the computing unit 25E. The voice identification module 130B receives the amplified signal Z[n] to generate Q matching scores (P0˜PQ−1) corresponding to Q signal components Z0˜ZQ−1. Then, the computing unit 25E calculates Q suppression masks (α0˜αQ−1) according to the Q matching scores (P0˜PQ−1).
FIG. 4B is a block diagram showing an own voice suppression apparatus with a voice identification module according to another embodiment of the invention. In comparison with the own voice suppression apparatus 40 in FIG. 4A, the signal splitter 301 is implemented by a transformer 301a while the signal synthesizer 302 is implemented by an inverse transformer 302a. The voice identification module 130B receives the amplified signal Z[n] to generate N matching scores Pk corresponding to N frequency bins of the current speech spectral representation associated with a current frame i of Z[n], where 0<=k<=(N−1) and N is the length of each frame of the amplified signal Z[n]. Each matching score Pk is bounded between 0 and 1. Thus, if any matching score Pk gets close to 1, it indicates that the magnitude of the user's own voice component gets greater in the frequency bin k; otherwise, if any matching score Pk gets close to 0, it indicates that the magnitude of the user's own voice component gets smaller in this frequency bin k. According to the N matching scores P0˜PN−1, the computing unit 25F calculates a suppression mask αk for each frequency bin k using the following equation:
αk=(1−Pk), (11)
where 0<=αk<=1. Please note that since Pk is a real number, the suppression mask αk is a real mask.
Please note that the above equation (11) is provided by way of example, and not limitation of the invention. Any other type of equation is applicable to the computing unit 25F as long as it satisfies the inversely proportional relationship between αk and Pk. In brief, the greater the magnitude of Pk, the greater the magnitude of the user's own voice component in the frequency bin k of the current speech spectral representation is and thus the smaller the suppression mask αk becomes for own voice suppression. In an alternative embodiment, similar to the three-phase process in FIG. 3E, the suppression mask αk(i) for a frequency bin k is determined according to L matching scores of L frequency bins adjacent to the frequency bin k. For example, the computing unit 25F calculates L suppression masks of the L frequency bins adjacent to the frequency bin k according to the L matching scores of the L frequency bins, calculates an average value of the L suppression masks and then updates the suppression mask αk(i) for the frequency bin k with the average value, where L>=1.
The advantage of the voice identification module 130B is that it is capable of identifying in which frequency bins the user's own voice components are located and how strong the user's own voice components are. With this indication, the user's own voice components in the identified frequency bins can be suppressed precisely while the magnitudes of the sound components in the other frequency bins (representative of environmental sounds) are retained.
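As a minimal sketch of equation (11) and of the averaging alternative described above, the following Python fragment computes αk=(1−Pk) for each frequency bin and then optionally replaces each mask with the mean of the masks of L adjacent bins. The function names, the restriction to an odd L with a window centered on bin k, and the truncation of the window at the first and last bins are assumptions made for illustration only; the description does not specify how the edge bins are handled.

```python
def suppression_masks(scores):
    """Equation (11): alpha_k = 1 - P_k for each frequency bin k."""
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each matching score must lie in [0, 1]")
    return [1.0 - p for p in scores]


def smooth_masks(masks, L=3):
    """Replace each mask with the mean of the L masks centered on bin k.

    Assumptions (not from the description): L is odd, and the window is
    truncated at the first and last bins instead of being padded.
    """
    if L < 1 or L % 2 == 0:
        raise ValueError("this sketch expects a positive odd L")
    half = L // 2
    smoothed = []
    for k in range(len(masks)):
        window = masks[max(0, k - half):min(len(masks), k + half + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical matching scores for four frequency bins.
scores = [0.0, 0.25, 1.0, 0.75]
masks = suppression_masks(scores)    # one mask per bin, per equation (11)
smoothed = smooth_masks(masks, L=3)  # averaged over adjacent bins
```

With L=1 the smoothing step leaves the masks unchanged, which corresponds to using equation (11) alone.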
FIG. 4C is a block diagram showing a voice identification module according to an embodiment of the invention. Referring to FIG. 4C, the voice identification module 130B includes a storage device 42, an audio embedding extraction unit 41 and an embedding match calculation unit 43. The audio embedding extraction unit 41 includes a neural network 410 and an average block 415. The neural network 410 is implemented by a DNN or a long short term memory (LSTM) network. The storage device 42 includes all forms of non-volatile or volatile media and memory devices, such as semiconductor memory devices, magnetic disks, DRAM, or SRAM.
For purpose of clarity and ease of description, hereinafter, the following examples and embodiments will be described with the neural network 410 implemented by a DNN. The DNN may be implemented using any known architecture. For example, referring to the disclosure “End-to-End Text-Dependent Speaker Verification” by Heigold et al., 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), the DNN 410 consists of the successive application of several non-linear functions in order to transform the user utterance into a vector; as shown in FIG. 4C, the DNN 410 includes a locally-connected layer 412 and multiple fully connected layers 411. It should be noted that the architecture of the DNN 410 is provided by way of example, and not limitation of the invention. Any other architecture is applicable to the DNN as long as it can transform the user utterance Z[n] into a current feature vector CV. The identification protocol is divided into three stages: training, enrollment and evaluation. In the training stage, a suitable user representation is found from the training utterances. For example, the user representations are a summary of frame-level information, such as feature vectors. After the training stage is completed, the parameters of the DNN 410 are fixed. In the enrollment stage, a user provides multiple utterances, which are used to estimate a user model. Because each utterance generates one feature vector, the feature vectors of the enrollment utterances are averaged by the average block 415 to obtain a user vector UV representative of the user model. Then, the user vector UV is stored in the storage device 42 by the DNN 410. Please note that in the enrollment stage, the embedding match calculation unit 43 is disabled. In the evaluation stage, the average block 415 is disabled. The DNN 410 transforms the user utterance Z[n] into a current feature vector CV.
The embedding match calculation unit 43 retrieves the user vector UV from the storage device 42 and performs cosine similarity between the user vector UV and the current feature vector CV to generate N matching scores Pk for N frequency bins, where 0<=k<=(N−1). If each of the user vector UV and the current feature vector CV has a dimension of N×N1, then the output vector P from the embedding match calculation unit 43 has a dimension of N×1. If N=256 and N1=2048, after performing cosine similarity, the embedding match calculation unit 43 generates an output vector P with 256×1 components Pk, 0<=k<=255. As is well known in the art, cosine similarity is a measure of similarity between two vectors of an inner product space; it is measured by the cosine of the angle between the two vectors and determines whether the two vectors are pointing in roughly the same direction. In this invention, cosine similarity is used to detect how similar the user vector UV and the current feature vector CV are in the frequency bin k, where 0<=k<=(N−1). The more similar (i.e., Pk gets close to 1) the two vectors UV and CV are in the frequency bin k, the greater the user's own voice component in the frequency bin k.
FIG. 4D is a block diagram showing an own voice suppression apparatus with a voice identification module according to another embodiment of the invention. In comparison with the own voice suppression apparatus 40A in FIG. 4B, a main difference is that the transformer 301a is replaced with an analysis filter bank 301b while the inverse transformer 302a is replaced with a synthesis filter bank 302b. The voice identification module 130B receives the amplified signal Z[n] to generate M matching scores Pj corresponding to the M passbands of the analysis filter bank 301b, where 0<=j<=(M−1). Please note that the frequency ranges of the M passbands of the M analysis filters for the analysis filter bank 301b respectively correspond to the frequency ranges of the M passbands with M matching scores Pj from the voice identification module 130B. In an embodiment, the bandwidths of the M passbands of the M analysis filters are equal. In an alternative embodiment, the bandwidths of the M passbands of the M analysis filters are not equal; moreover, the higher the frequency, the wider the bandwidths of the passbands of the M analysis filters. Each matching score Pj is bounded between 0 and 1. Thus, if any matching score Pj gets close to 1, it indicates that the magnitude of the user's own voice component gets greater in the passband j (or the speech sub-band signal Zj[n]); otherwise, if any matching score Pj gets close to 0, it indicates that the magnitude of the user's own voice component gets smaller in this passband j. According to the M matching scores Pj, the computing unit 25G calculates a suppression mask αj[n] for each passband j (or each speech sub-band signal Zj[n]) according to the following equation:
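The enrollment and evaluation stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the average block 415 or the embedding match calculation unit 43: the function names are hypothetical, each feature vector is represented as N rows of N1 values, the cosine similarity is taken row by row so that row k yields the score Pk, and negative similarities are clamped to 0 so that each score stays within the stated range of 0 to 1.

```python
import math


def average_vectors(vectors):
    """Enrollment: element-wise mean of the per-utterance feature vectors."""
    count = len(vectors)
    rows, cols = len(vectors[0]), len(vectors[0][0])
    return [[sum(v[r][c] for v in vectors) / count for c in range(cols)]
            for r in range(rows)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero row counts as "no match"
    return dot / (norm_a * norm_b)


def matching_scores(uv, cv):
    """Evaluation: one score P_k per row k of UV and CV.

    Assumption: results are clamped to [0, 1]; the description only states
    that each score is bounded between 0 and 1.
    """
    return [min(1.0, max(0.0, cosine_similarity(u, c)))
            for u, c in zip(uv, cv)]


# Hypothetical toy data with N=2 frequency bins and N1=2 values per bin.
enrollment = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 4.0]],
]
uv = average_vectors(enrollment)  # user vector UV
cv = [[5.0, 0.0], [1.0, 0.0]]     # current feature vector CV
scores = matching_scores(uv, cv)
```

In this toy example, row 0 of UV and CV point in the same direction and row 1 of UV and CV are orthogonal, so the first score indicates a strong own voice component and the second indicates none.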
αj[n]=(1−Pj), (12)
where 0<=αj[n]<=1 and 0<=j<=(M−1).
Please note that the above equation (12) is provided by way of example, and not limitation of the invention. Any other type of equation is applicable to the computing unit 25G as long as it satisfies the inversely proportional relationship between αj[n] and Pj. In brief, the greater the magnitude of Pj, the greater the magnitude of the user's own voice component in the passband j (or the speech sub-band signal Zj[n]) is and thus the smaller the suppression mask αj[n] becomes for own voice suppression. In an alternative embodiment, similar to the three-phase process in FIG. 3E, the suppression mask αj[n] for the speech sub-band signal Zj[n] is determined according to L matching scores of L speech sub-band signals with their passbands adjacent to the passband j of the speech sub-band signal Zj[n]. For example, the computing unit 25G calculates L suppression masks of the L speech sub-band signals with their passbands adjacent to the passband j according to the L matching scores of the L speech sub-band signals, calculates an average value of the L suppression masks and then updates the suppression mask αj[n] for the passband j (or the speech sub-band signal Zj[n]) with the average value, where L>=1.
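For the alternative embodiment of FIG. 4D in which the bandwidths of the M passbands are not equal and widen with frequency, one possible layout is a geometric spacing of the passband edges. The following sketch is only an illustration under that assumption; the description does not prescribe a spacing rule, and the function name and the frequency limits of 100 Hz and 1600 Hz are hypothetical.

```python
def passband_edges(num_bands, f_low, f_high):
    """Return num_bands + 1 edge frequencies with geometric spacing.

    Assumption: geometric spacing is just one way to make the higher
    passbands wider than the lower ones.
    """
    if num_bands < 1 or not 0.0 < f_low < f_high:
        raise ValueError("need num_bands >= 1 and 0 < f_low < f_high")
    ratio = f_high / f_low
    return [f_low * ratio ** (j / num_bands) for j in range(num_bands + 1)]


# Hypothetical example: M=4 passbands between 100 Hz and 1600 Hz.
edges = passband_edges(4, 100.0, 1600.0)
widths = [high - low for low, high in zip(edges, edges[1:])]
```

Each passband j then spans the interval from edges[j] to edges[j+1], and one matching score Pj and one suppression mask αj[n] are associated with that interval.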
FIG. 5A is a block diagram showing an own voice suppression apparatus with a voice identification module and a bone conduction sensor according to an embodiment of the invention. Referring to FIG. 5A, the own voice suppression apparatus 50 includes the own voice suppression apparatus 30 and the voice identification module 130B. The computing unit 25H calculates Q suppression masks (α0˜αQ−1) according to the Q matching scores (P0˜PQ−1), the Q first signal components (Z0˜ZQ−1) and the Q second signal components (X0˜XQ−1).
FIG. 5B is a block diagram showing an own voice suppression apparatus with a voice identification module and a bone conduction sensor according to another embodiment of the invention. In comparison with the own voice suppression apparatus 50 in FIG. 5A, the signal splitters 301/303 are implemented by transformers 301a/303a while the signal synthesizer 302 is implemented by an inverse transformer 302a.
FIG. 5C is a block diagram showing the computing unit 25I according to an embodiment of the invention. Referring to FIG. 5C, the computing unit 25I includes two complex-value multipliers 312, a complex conjugate block 355, two smooth units 351 and 352 and a suppression mask calculation unit 553. According to the equation (7), the matching score Pk, the average speech power value σk²(i) and the average product complex value ρk(i), the suppression mask calculation unit 553 calculates the suppression mask αk(i) for a frequency bin k in the current speech spectral representation (associated with the current frame i of the amplified signal Z[n]) as follows:
Please note that the above equation (13) is provided by way of example, and not limitation of the invention. Any other type of equation is applicable to the suppression mask calculation unit 553 as long as it satisfies the inversely proportional relationship between Xk(i) and αk(i), and the inversely proportional relationship between Pk and αk(i). In brief, the greater the magnitude of Xk(i) and/or the magnitude of Pk, the greater the own voice component in the frequency bin k of the current speech spectral representation is and thus the smaller the suppression mask αk(i) becomes for own voice suppression. In an alternative embodiment, similar to the three-phase process in FIG. 3E, the suppression mask αk(i) for a frequency bin k of the current speech spectral representation is determined according to L matching scores, L average speech power values and L average product complex values of L frequency bins adjacent to the frequency bin k. For example, the computing unit 25I calculates L suppression masks of the L frequency bins adjacent to the frequency bin k according to the L matching scores, L average speech power values and L average product complex values of the L frequency bins, calculates an average value of the L suppression masks and then updates the suppression mask αk(i) for the frequency bin k with the average value, where L>=1.
FIG. 5D is a block diagram showing an own voice suppression apparatus with a voice identification module and a bone conduction sensor according to another embodiment of the invention. In comparison with the own voice suppression apparatus 50 in FIG. 5A, the signal splitters 301/303 are implemented by the analysis filter banks 301b/303b while the signal synthesizer 302 is implemented by the synthesis filter bank 302b. FIG. 5E is a block diagram showing the computing unit 25J according to an embodiment of the invention. Referring to FIG. 5E, the computing unit 25J includes two power smooth units 391 and 392 and a suppression mask calculation unit 554. According to the equation (10), the matching score Pj, the average speech power value ZPj[n] and the average vibration power value XPj[n], the suppression mask calculation unit 554 calculates the suppression mask αj[n] for the passband j (or the speech sub-band signal Zj[n]) as follows:
where 0<=αj[n]<=1, j is the passband index and 0<=j<=(M−1).
Please note that the above equation (14) is provided by way of example, and not limitation of the invention. Any other type of equation is applicable to the suppression mask calculation unit 554 as long as it satisfies the inversely proportional relationship between Xj[n] and αj[n], and the inversely proportional relationship between Pj and αj[n]. In brief, the greater the magnitude (or power value) of Xj[n] and/or the magnitude of Pj, the greater the own voice component in the speech sub-band signal Zj[n] is and thus the smaller the suppression mask αj[n] becomes for own voice suppression. In an alternative embodiment, similar to the three-phase process in FIG. 3E, the suppression mask αj[n] for the speech sub-band signal Zj[n] is determined according to L matching scores and L average speech power values of L speech sub-band signals and L average vibration power values of L vibration sub-band signals, where the passbands of the L speech sub-band signals and the L vibration sub-band signals are adjacent to the passband j. For example, the computing unit 25J calculates L suppression masks of the L speech sub-band signals with their passbands adjacent to the passband j according to the L matching scores and the L average speech power values of the L speech sub-band signals and the L average vibration power values of the L vibration sub-band signals, computes an average value of the L suppression masks and then updates the suppression mask αj[n] for the passband j (or the speech sub-band signal Zj[n]) with the average value, where L>=1.
The own voice suppression apparatus 50/50A/50B has the best performance in suppressing the user's own voice and retaining the environmental sounds because it is assisted by both the own voice indication module 130A and the voice identification module 130B. FIG. 6 shows a relationship among waveforms of the audio signal S1, the vibration signal S2 and the own-voice-suppressed signal S3 according to an embodiment of the invention. Referring to FIG. 6, in the presence of the user's own voice, the magnitude of the audio signal S1 is abnormally large in comparison with the vibration signal S2, but the magnitude of the own-voice-suppressed signal S3 is significantly reduced after own voice suppression.
The own voice suppression apparatus 10/20/30/30A/30B/40/40A/40B/50/50A/50B according to the invention may be hardware, software, or a combination of hardware and software (or firmware). An example of a pure hardware solution would be a field programmable gate array (FPGA) design or an application specific integrated circuit (ASIC) design. In an embodiment, the suppression module (150/150A˜150J) and the amplification unit 120/120a are implemented with a first general-purpose processor and a first program memory; the own voice reconstruction module 232 is implemented with a second general-purpose processor and a second program memory. The first program memory stores a first processor-executable program and the second program memory stores a second processor-executable program. When the first processor-executable program is executed by the first general-purpose processor, the first general-purpose processor is configured to function as: the amplification unit 120/120a and the suppression module (150/150A˜150J). When the second processor-executable program is executed by the second general-purpose processor, the second general-purpose processor is configured to function as: the own voice reconstruction module 232.
In an alternative embodiment, the amplification unit 120/120a, the own voice reconstruction module 232 and the suppression module (150/150A˜150J) are implemented with a third general-purpose processor and a third program memory. The third program memory stores a third processor-executable program. When the third processor-executable program is executed by the third general-purpose processor, the third general-purpose processor is configured to function as: the amplification unit 120/120a, the own voice reconstruction module 232 and the suppression module (150/150A˜150J).
While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention should not be limited to the specific construction and arrangement shown and described, since various other modifications may occur to those ordinarily skilled in the art.