The present application claims priority to Chinese application No. 201910117712.4, filed on Feb. 15, 2019, which is incorporated by reference in its entirety.
The present application relates to the field of speech processing technology, and in particular, to a speech enhancement method and apparatus, a device and a storage medium.
Speech enhancement is an important part of speech signal processing. By enhancing speech signals, the clarity, intelligibility and comfort of the speech in a noisy environment can be improved, thereby improving the human auditory perception effect. In a speech processing system, before processing various speech signals, it is often necessary to perform speech enhancement processing first, thereby reducing the influence of noise on the speech processing system.
At present, the combination of a non-air conduction speech sensor and an air conduction speech sensor is generally used to improve speech quality. A voiced/unvoiced segment is determined according to the non-air conduction speech sensor and the determined voiced segment is applied to the air conduction speech sensor to extract the speech signals therein.
However, the high frequency speech signals of the non-air conduction speech sensor are easily interfered with by high frequency noise, resulting in a serious loss of the speech signal in the high frequency part and thereby degrading the quality of the output speech signal.
The present disclosure provides a speech enhancement method and apparatus, a device and a storage medium, which can adaptively adjust a fusion coefficient of speech signals of a non-air conduction speech sensor and an air conduction speech sensor according to environment noise, thereby improving the signal quality after speech fusion, and improving the effect of speech enhancement.
In a first aspect, an embodiment of the present disclosure provides a speech enhancement method, including:
Optionally, acquiring a first speech signal and a second speech signal includes:
Optionally, obtaining a signal to noise ratio of the first speech signal includes:
Optionally, after obtaining a signal to noise ratio of the first speech signal, the method further includes:
Optionally, determining, according to the signal to noise ratio of the first speech signal, a cutoff frequency of a first filter corresponding to the first speech signal, and a cutoff frequency of a second filter corresponding to the second speech signal includes:
Optionally, determining, according to the signal to noise ratio of the first speech signal, a fusion coefficient of filtered signals corresponding to the first speech signal and the second speech signal includes:
Optionally, performing, according to the fusion coefficient, speech fusion processing on the filtered signals corresponding to the first speech signal and the second speech signal to obtain an enhanced speech signal includes:
In a second aspect, an embodiment of the present disclosure provides a speech enhancement apparatus, including:
Optionally, the acquiring module is specifically configured to:
Optionally, the obtaining module is specifically configured to:
Optionally, the apparatus further includes:
Optionally, the filtering module is specifically configured to:
Optionally, the determining module is specifically configured to:
Optionally, the fusion module is specifically configured to:
where: s is the enhanced speech signal after the speech fusion, sac is the filtered signal corresponding to the first speech signal, sbc is the filtered signal corresponding to the second speech signal, and k is the fusion coefficient.
In a third aspect, an embodiment of the present disclosure provides a speech enhancement device, including: a signal processor and a memory; where the memory has an algorithm program stored therein, and the signal processor is configured to call the algorithm program in the memory to perform the speech enhancement method of any one of the items in the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer readable storage medium, including: program instructions, which, when running on a computer, cause the computer to execute the program instructions to implement the speech enhancement method of any one of the items in the first aspect.
The speech enhancement method and apparatus, the device and the storage medium provided by the present disclosure acquire a first speech signal and a second speech signal; obtain a signal to noise ratio of the first speech signal; determine, according to the signal to noise ratio of the first speech signal, a fusion coefficient of filtered signals corresponding to the first speech signal and the second speech signal; and perform, according to the fusion coefficient, speech fusion processing on the filtered signals corresponding to the first speech signal and the second speech signal to obtain an enhanced speech signal. In this way, the fusion coefficient of the speech signals of the non-air conduction speech sensor and the air conduction speech sensor is adaptively adjusted according to the environment noise, thereby improving the signal quality after speech fusion and the effect of speech enhancement.
In order to illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art will be briefly described below. Obviously, the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained according to these drawings by those skilled in the art without inventive efforts.
Through the above drawings, specific embodiments of the present disclosure have been shown, which will be described in more detail later. The drawings and the text descriptions are not intended to limit the scope of the conception of the present disclosure in any way, but rather to illustrate the concepts mentioned in the present disclosure for those skilled in the art by referring to the specific embodiments.
In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments. It is obvious that the described embodiments are only a part of the embodiments of the present disclosure, rather than all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without inventive effort fall within the scope of the present disclosure.
The terms “first”, “second”, “third”, “fourth”, etc. (if present) in the description, claims and accompanying drawings described above of the present disclosure are used to distinguish similar objects and not necessarily used to describe a specific order or an order of priority. It should be understood that the data so used is interchangeable where appropriate, so that the embodiments of the present disclosure described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms “comprising” and “including” and any variants thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units that are clearly listed, but may include other steps or units that are not clearly listed or inherent to such process, method, product or device.
The technical solutions of the present disclosure will be described in detail below with specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be described in some embodiments.
Speech enhancement is an important part of speech signal processing. By enhancing speech signals, the clarity, intelligibility and comfort of the speech in a noisy environment can be improved, thereby improving the human auditory perception effect. In a speech processing system, before processing various speech signals, it is often necessary to perform speech enhancement processing first, thereby reducing the influence of noise on the speech processing system.
At present, the combination of a non-air conduction speech sensor and an air conduction speech sensor is generally used to improve speech quality. A voiced/unvoiced segment is determined according to the non-air conduction speech sensor and the determined voiced segment is applied to the air conduction speech sensor to extract the speech signals therein. This is to make use of the fact that when noise exists, the speech via the air conduction speech sensor has a messy and irregular spectrum, while the speech via the bone conduction sensor has a characteristic that it has complete low-frequency signal and clean spectrum, and it is not easily affected by external noise.
However, the performance of the existing traditional single-channel noise reduction relies heavily on the accuracy of noise estimation. Too large a noise estimate is likely to cause speech loss and residual music noise, while too small a noise estimate leaves serious residual noise and affects the intelligibility of speech. An existing practice is that, according to the characteristics of bone conduction speech, the low frequency part of the speech of the non-air conduction sensor is used to replace the noise-contaminated low frequency part of the speech of the air conduction sensor and is superimposed with the high frequency part of the speech of the air conduction sensor to resynthesize a speech signal. In this practice, the high frequency part of the speech of the air conduction sensor is also subject to severe noise interference, and it is difficult to obtain high quality speech. In addition, the existing fusion of bone conduction speech and air conduction speech does not consider the influence of the signal to noise ratio (SNR), and the fusion coefficient is fixed, so it is difficult to adapt to the environment. Moreover, although modeling the mapping between the speech via the bone conduction sensor and the clean and noisy speech via the air conduction sensor achieves a good effect, building the model is complex and the resource overhead of the algorithm is too large, which is not conducive to adoption in wearable devices.
The present disclosure provides a speech enhancement method, which can adaptively adjust the fusion coefficient of the bone conduction speech and the air conduction speech according to the SNR of the environment noise. This method avoids the dependence on noise estimation in single-channel speech enhancement, can adapt to changes in the environment noise and to scenes where the high frequency of the air conduction speech is subject to severe noise interference, and can eliminate background noise and residual music noise well. The speech enhancement method provided by the present disclosure can be applied to the field of speech signal processing technology, and is applicable to products for low power speech enhancement, speech recognition, or speech interaction, which include but are not limited to earphones, hearing aids, mobile phones, wearable devices, smart homes, etc.
In a specific implementation process,
Using the above method, the fusion coefficient of the speech signals of the non-air conduction speech sensor and the air conduction speech sensor is adaptively adjusted according to the environment noise, thereby improving the signal quality after speech fusion and the effect of speech enhancement.
The technical solutions of the present disclosure and how the technical solutions of the present application solve the above technical problems will be described in detail below with reference to specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be described in some embodiments. The embodiments of the present disclosure will be described below with reference to the accompanying drawings.
S101, acquiring a first speech signal and a second speech signal.
In the embodiment, the first speech signal is acquired through an air conduction speech sensor, and a second speech signal is acquired through a non-air conduction speech sensor; where the non-air conduction speech sensor includes a bone conduction speech sensor, and the air conduction speech sensor includes a microphone.
S102, obtaining a signal to noise ratio of the first speech signal.
In the embodiment, the first speech signal is preprocessed to obtain a preprocessed signal; Fourier transform processing is performed on the preprocessed signal to obtain a corresponding frequency domain signal; a noise power of the frequency domain signal is estimated, and the signal to noise ratio of the first speech signal is obtained based on the noise power.
Specifically, the first speech signal acquired through the air conduction speech sensor is first preprocessed, mainly including pre-emphasis processing (filtering out low frequency components and enhancing high frequency speech components) and overlap windowing processing, to avoid sudden changes caused by the overlap between signal frames. Then, Fourier transform processing converts the time domain signal into the frequency domain to obtain the frequency domain signal of the first speech signal. Next, through noise power estimation, the air conduction noise signal is estimated as accurately as possible; for example, the minimum value tracking method, the time recursive averaging algorithm, or the histogram-based algorithm can be used for noise estimation. Finally, the signal to noise ratio of the air conduction speech signal is calculated based on the estimated noise. There are many methods for calculating the signal to noise ratio, such as calculating the signal to noise ratio per frame, or calculating the a priori signal to noise ratio by the decision-directed method.
In the embodiment, the sampling rate of the input data stream is Fs=8000 Hz, and the length of the data to be processed is generally between 8 ms and 30 ms. In the embodiment, the data to be processed is 64 points superimposed with 64 points of the previous frame, so the system algorithm actually processes 128 points at a time. First, pre-emphasis processing needs to be performed on the original data to boost the high-frequency components of the speech, and there are many pre-emphasis methods. The specific operation of the embodiment is:
ŷac(n)=yac(n)−αyac(n−1),
where α is a smoothing factor, the value of which is 0.98, yac(n−1) is the air conduction speech signal at the time of n−1 before preprocessing, yac(n) is the air conduction speech signal at the time of n before preprocessing, ŷac(n) is the air conduction speech signal at the time of n after preprocessing, and n is the nth moment.
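As an illustrative sketch (not part of the claimed embodiment), the pre-emphasis operation above can be implemented as follows; the function and parameter names are chosen for illustration only:

```python
import numpy as np

def pre_emphasis(y, alpha=0.98):
    """First-order pre-emphasis: y_hat(n) = y(n) - alpha * y(n - 1)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.empty_like(y)
    y_hat[0] = y[0]                     # no previous sample at n = 0
    y_hat[1:] = y[1:] - alpha * y[:-1]
    return y_hat
```

This boosts the high-frequency components of the frame before the windowing and FFT processing described below.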
The window function in the preprocessing must be a power-preserving map, that is, the sum of the squares of the windows of the overlapping portions of the speech signal must be 1, as shown below.

w²(n)+w²(n+M)=1,

where w²(n) is the square of the value of the window function at the nth point, w²(n+M) is the square of the value of the window function at the (n+M)th point, N is the number of points for FFT processing, whose value in the present disclosure is 128, and the frame length M is 64. The window function can be chosen as a rectangular window, a Hamming window, a Hanning window, a Gaussian window or the like according to different application scenarios, and can be selected flexibly in the actual design. The embodiment adopts a Kaiser window with a 50% overlap.
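The power-preserving constraint can be checked numerically. The sketch below uses a square-root periodic Hann window, which satisfies the constraint exactly at 50% overlap; the Kaiser window adopted in the embodiment would additionally need a normalization step to meet it. The sizes N=128 and M=64 follow the embodiment:

```python
import numpy as np

N, M = 128, 64  # FFT size and frame shift from the embodiment

# Square-root periodic Hann window: w^2(n) + w^2(n + M) = 1 holds exactly
# at 50% overlap, so the overlap-added frames preserve signal power.
n = np.arange(N)
w = np.sqrt(0.5 * (1.0 - np.cos(2.0 * np.pi * n / N)))

power_sum = w[:M] ** 2 + w[M:] ** 2   # should be all ones
```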
Since the noise estimation and the signal to noise ratio calculation of the present disclosure are both performed in the frequency domain, the pre-emphasized signal is windowed, and the windowed data is transformed into the frequency domain by the FFT.
where k represents the index of the spectral point, w(n) is the window function, yw(n, m) is the air conduction speech signal at time n after the mth frame of speech is multiplied by the window function, and Yac(m) is the spectrum of the mth frame of the air conduction speech signal after the FFT.
Classical noise estimation methods mainly include the minimum tracking algorithm, the time recursive averaging algorithm, and the histogram-based algorithm. In the embodiment, the minima controlled recursive averaging (MCRA) algorithm is adopted according to actual needs, and the specific practice is as follows:
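The embodiment's exact MCRA formulas are not reproduced in the text above; as a simplified sketch, an MCRA-style update tracks a spectral minimum and recursively averages the noisy power only in bins judged to contain no speech. The threshold `delta` and smoothing constant `alpha_d` are illustrative values:

```python
import numpy as np

def mcra_noise_update(noise_psd, power, s_min, alpha_d=0.95, delta=5.0):
    """One simplified MCRA-style per-frame noise update.

    noise_psd: previous per-bin noise power estimate
    power:     smoothed noisy power spectrum of the current frame
    s_min:     tracked per-bin minimum of the smoothed power
    """
    # Speech is judged present where the power is well above the tracked minimum
    speech_present = power / np.maximum(s_min, 1e-12) > delta
    # Recursive averaging only where speech is judged absent
    return np.where(speech_present,
                    noise_psd,
                    alpha_d * noise_psd + (1.0 - alpha_d) * power)
```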
The embodiment needs to calculate the a priori signal to noise ratio ξ(λ,k) at the frequency point k of each frame of speech and the signal to noise ratio SNR(λ) of the whole frame. The calculation of ξ(λ,k) mainly adopts an improved decision-directed method, and the specific practice is as follows:
The calculation formula of the signal to noise ratio SNR(λ) of the whole frame is as follows:
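The embodiment's exact formulas are not reproduced in the text above; as an illustrative sketch, a whole-frame SNR in dB and a decision-directed a priori SNR of the standard form ξ = β·|Ŝ(λ−1)|²/σn² + (1−β)·max(γ−1, 0) could be computed as follows (the smoothing constant β = 0.98 is an assumed value):

```python
import numpy as np

def frame_snr_db(signal_power, noise_power):
    """Whole-frame SNR in dB from per-bin signal and noise powers."""
    s = np.sum(signal_power)
    n = max(np.sum(noise_power), 1e-12)   # guard against division by zero
    return 10.0 * np.log10(s / n)

def decision_directed_xi(prev_clean_power, noise_psd, post_snr, beta=0.98):
    """Decision-directed a priori SNR per frequency bin (assumed standard
    form): xi = beta * |S(l-1)|^2 / sigma_n^2 + (1-beta) * max(gamma-1, 0)."""
    return (beta * prev_clean_power / np.maximum(noise_psd, 1e-12)
            + (1.0 - beta) * np.maximum(post_snr - 1.0, 0.0))
```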
S103, determining, according to the signal to noise ratio of the first speech signal, a fusion coefficient of filtered signals corresponding to the first speech signal and the second speech signal.
In the embodiment, a solution model of the fusion coefficient is constructed as follows:
kλ=γkλ−1+(1−γ)f(SNR),
where f(SNR)=0.5·tanh(0.025·SNR)+0.5,
kλ=max[0,f(SNR)] or kλ=min[f(SNR),1],
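The solution model above can be sketched as follows; the smoothing constant γ = 0.9 is an assumed value, and the final min/max step clamps the coefficient to [0, 1] as in the model:

```python
import math

def fusion_coefficient(k_prev, snr_db, gamma=0.9):
    """Recursive fusion-coefficient update:
    k = gamma * k_prev + (1 - gamma) * f(SNR),
    with f(SNR) = 0.5 * tanh(0.025 * SNR) + 0.5 mapping SNR into (0, 1)."""
    f = 0.5 * math.tanh(0.025 * snr_db) + 0.5
    k = gamma * k_prev + (1.0 - gamma) * f
    return min(max(k, 0.0), 1.0)   # clamp to [0, 1]
```

A higher SNR drives f(SNR) toward 1, giving the air conduction signal more weight in the subsequent fusion, while a low SNR drives it toward 0; the recursion on the previous coefficient smooths the adaptation over frames.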
S104, performing, according to the fusion coefficient, speech fusion processing on the filtered signals corresponding to the first speech signal and the second speech signal to obtain an enhanced speech signal.
In the embodiment, speech fusion processing is performed on the filtered signals corresponding to the first speech signal and the second speech signal by using a preset speech fusion algorithm; where a calculation formula of the preset speech fusion algorithm is as follows:
s=sbc+k·sac,
The embodiment acquires a first speech signal and a second speech signal; obtains a signal to noise ratio of the first speech signal; determines, according to the signal to noise ratio of the first speech signal, a fusion coefficient of filtered signals corresponding to the first speech signal and the second speech signal; and performs, according to the fusion coefficient, speech fusion processing on the filtered signals corresponding to the first speech signal and the second speech signal to obtain an enhanced speech signal. In this way, the fusion coefficient of the speech signals of the non-air conduction speech sensor and the air conduction speech sensor is adaptively adjusted according to the environment noise, thereby improving the signal quality after speech fusion and the effect of speech enhancement.
S201, acquiring a first speech signal and a second speech signal.
S202, obtaining a signal to noise ratio of the first speech signal.
For the specific implementation process and technical principles of the steps S201 to S202 in this embodiment, refer to the related descriptions in the steps S101 to S102 in the method shown in
S203, obtaining, according to the signal to noise ratio of the first speech signal, a first filtered signal and a second filtered signal.
In the embodiment, a cutoff frequency of a first filter corresponding to the first speech signal and a cutoff frequency of a second filter corresponding to the second speech signal are determined according to the signal to noise ratio of the first speech signal; filtering processing is performed on the first speech signal through the first filter to obtain a first filtered signal, and filtering processing is performed on the second speech signal through the second filter to obtain a second filtered signal.
In an alternative implementation, a priori signal to noise ratio of each frame of speech of the first speech signal is obtained; the number of frequency points at which the priori signal to noise ratio continuously increases is determined in a preset frequency range; and the cutoff frequencies of the first filter and the second filter are calculated and obtained according to the number of frequency points, a sampling frequency of the first speech signal, and a number of sampling points of Fourier transform.
Specifically, the cutoff frequencies of the high pass filter and the low pass filter are adaptively adjusted by the priori signal to noise ratio ξ(λ,k) of each frame of speech. The specific processing flow is as follows:
First, the low frequency part of ξ(λ,k), namely the points with k≤⌈2000·N/fs⌉, is selected. Then, the slope between adjacent points of ξ(λ,k) is calculated. Then, the number of frequency points k at which the slope continuously increases is selected, or the number of frequency points k at which the a priori signal to noise ratio continuously increases is found.
fcl=min[k·fs/N+200,2000],
fch=max[k·fs/N−200,800],
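The adaptive cutoff selection can be sketched as follows. The rule used here for picking the crossover bin k (the first bin where the a priori SNR stops increasing within the sub-2 kHz range) is a simplified reading of the text, and the helper name is chosen for illustration:

```python
import numpy as np

def adaptive_cutoffs(xi_frame, fs=8000, N=128):
    """Derive the low-pass cutoff f_cl = min(k*fs/N + 200, 2000) and the
    high-pass cutoff f_ch = max(k*fs/N - 200, 800) from the a priori SNR
    curve xi_frame of one frame."""
    k_max = int(np.ceil(2000 * N / fs))          # consider bins up to 2 kHz
    xi_low = np.asarray(xi_frame[:k_max + 1], dtype=float)
    slope = np.diff(xi_low)
    # first bin where the a priori SNR stops increasing
    flat = np.where(slope <= 0)[0]
    k = int(flat[0]) if flat.size else k_max
    f_cl = min(k * fs / N + 200, 2000)
    f_ch = max(k * fs / N - 200, 800)
    return f_cl, f_ch
```

The ±200 Hz offsets make the pass bands of the two filters overlap around the crossover bin, which appears intended to avoid losing spectral content at the splice between the bone conduction and air conduction bands.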
S204, determining, according to the signal to noise ratio of the first speech signal, a fusion coefficient of filtered signals corresponding to the first speech signal and the second speech signal.
S205, performing, according to the fusion coefficient, speech fusion processing on the filtered signals corresponding to the first speech signal and the second speech signal to obtain an enhanced speech signal.
For the specific implementation process and technical principles of the steps S204 to S205 in the embodiment, refer to the related descriptions in the steps S103 to S104 in the method shown in
The embodiment acquires a first speech signal and a second speech signal; obtains a signal to noise ratio of the first speech signal; determines, according to the signal to noise ratio of the first speech signal, a fusion coefficient of filtered signals corresponding to the first speech signal and the second speech signal; and performs, according to the fusion coefficient, speech fusion processing on the filtered signals corresponding to the first speech signal and the second speech signal to obtain an enhanced speech signal. In this way, the fusion coefficient of the speech signals of the non-air conduction speech sensor and the air conduction speech sensor is adaptively adjusted according to the environment noise, thereby improving the signal quality after speech fusion and the effect of speech enhancement.
In addition, the embodiment can further determine, according to the signal to noise ratio of the first speech signal, a cutoff frequency of a first filter corresponding to the first speech signal and a cutoff frequency of a second filter corresponding to the second speech signal; perform filtering processing on the first speech signal through the first filter to obtain a first filtered signal, and perform filtering processing on the second speech signal through the second filter to obtain a second filtered signal. Thereby the signal quality after speech fusion is improved, and the effect of speech enhancement is improved.
Optionally, the acquiring module 31 is specifically configured to:
Optionally, the obtaining module 32 is specifically configured to:
Optionally, the determining module 33 is specifically configured to:
Optionally, the fusion module 34 is specifically configured to:
where s is the enhanced speech signal after the speech fusion, sac is the filtered signal corresponding to the first speech signal, sbc is the filtered signal corresponding to the second speech signal, and k is the fusion coefficient.
The speech enhancement apparatus of the embodiment can perform the technical solution in the method shown in
The embodiment acquires a first speech signal and a second speech signal; obtains a signal to noise ratio of the first speech signal; determines, according to the signal to noise ratio of the first speech signal, a fusion coefficient of filtered signals corresponding to the first speech signal and the second speech signal; and performs, according to the fusion coefficient, speech fusion processing on the filtered signals corresponding to the first speech signal and the second speech signal to obtain an enhanced speech signal. In this way, the fusion coefficient of the speech signals of the non-air conduction speech sensor and the air conduction speech sensor is adaptively adjusted according to the environment noise, thereby improving the signal quality after speech fusion and the effect of speech enhancement.
Optionally, the filtering module 35 is specifically configured to:
The speech enhancement apparatus of the embodiment can perform the technical solutions in the methods shown in
The embodiment acquires a first speech signal and a second speech signal; obtains a signal to noise ratio of the first speech signal; determines, according to the signal to noise ratio of the first speech signal, a fusion coefficient of filtered signals corresponding to the first speech signal and the second speech signal; and performs, according to the fusion coefficient, speech fusion processing on the filtered signals corresponding to the first speech signal and the second speech signal to obtain an enhanced speech signal. In this way, the fusion coefficient of the speech signals of the non-air conduction speech sensor and the air conduction speech sensor is adaptively adjusted according to the environment noise, thereby improving the signal quality after speech fusion and the effect of speech enhancement.
In addition, the embodiment can further determine, according to the signal to noise ratio of the first speech signal, a cutoff frequency of a first filter corresponding to the first speech signal and a cutoff frequency of a second filter corresponding to the second speech signal; perform filtering processing on the first speech signal through the first filter to obtain a first filtered signal, and perform filtering processing on the second speech signal through the second filter to obtain a second filtered signal. Thereby the signal quality after speech fusion is improved, and the effect of speech enhancement is improved.
The signal processor 41 is configured to execute the executable instructions stored in the memory to implement various steps in the method involved in the above embodiments.
For details, refer to the related descriptions in the foregoing method embodiments.
Optionally, the memory 42 may be either stand-alone or integrated with the signal processor 41.
When the memory 42 is a device independent of the signal processor 41, the speech enhancement device 40 may further include:
The speech enhancement device in the embodiment can perform the methods shown in
In addition, the embodiment of the present application further provides a computer readable storage medium, where computer execution instructions are stored therein, and when at least one signal processor of a user equipment executes the computer execution instructions, the user equipment performs the foregoing various possible methods.
The computer readable storage medium includes a computer storage medium and a communication medium, where the communication medium includes any medium that facilitates the transfer of a computer program from one location to another. The storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. An exemplary storage medium is coupled to a processor, such that the processor can read information from the storage medium and can write information to the storage medium. Of course, the storage medium may also be a part of the processor. The processor and the storage medium may be located in an application specific integrated circuit (ASIC). In addition, the application specific integrated circuit can be located in a user equipment. Of course, the processor and the storage medium may also reside as discrete components in a communication device.
Those skilled in the art will understand that all or part of the steps to implement the various method embodiments described above may be accomplished by hardware related to program instructions. The aforementioned program may be stored in a computer readable storage medium. The program, when executed, performs the steps included in the foregoing various method embodiments; and the foregoing storage medium includes various media that can store program codes, such as a ROM, a RAM, a magnetic disk, or an optical disk.
Other embodiments of the present disclosure will be apparent to those skilled in the art after considering the specification and practicing the disclosure disclosed here. The present disclosure is intended to cover any variations, uses, or adaptive changes of the present disclosure, which are in accordance with the general principles of the present disclosure and include common general knowledge or conventional technical means in the art that are not disclosed in the present disclosure. The specification and embodiments are deemed to be exemplary only and the true scope and spirit of the present disclosure is indicated by the claims below.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and can be subject to various modifications and changes without deviating from its scope. The scope of the present disclosure is limited only by the attached claims.
Number | Date | Country | Kind |
---|---|---|---|
201910117712.4 | Feb 2019 | CN | national |
Number | Name | Date | Kind |
---|---|---|---|
20090164212 | Chan | Jun 2009 | A1 |
20130046535 | Parikh | Feb 2013 | A1 |
20180277135 | Ali | Sep 2018 | A1 |
Number | Date | Country |
---|---|---|
101685638 | Mar 2010 | CN |
101807404 | Aug 2010 | CN |
102347027 | Feb 2012 | CN |
105632512 | Jun 2016 | CN |
109102822 | Dec 2018 | CN |
2458586 | May 2012 | EP |
WO2017190219 | Nov 2017 | WO |
Entry |
---|
First Office Action of the prior Chinese application. |
“Research on Digital Hearing Aid Speech Enhancement Algorithm”, Proceedings of the 37th Chinese Control Conference, Jul. 25-27, 2018, Wuhan, China. |
“Speech enhancement based on harmonic reconstruction filter used in digital hearing aids”, Chinese Journal of Electron Devices, vol. 41, No. 6, Dec. 2018. |
First Office Action of parallel EPO application No. 19204922.9. |
XP 2311265A. |
XP 11531021A IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, No. 12, Dec. 2013. |
Number | Date | Country | |
---|---|---|---|
20200265857 A1 | Aug 2020 | US |