This application claims the benefit of China application Serial No. 202310067077.X, filed on Jan. 16, 2023, the subject matter of which is incorporated herein by reference.
The present invention generally relates to signal processing, and, more particularly, to a speech enhancement method and a processing circuit for performing the speech enhancement method.
Speech enhancement (SE), an important technology in voice calls, uses algorithms to suppress noise (including steady noise and non-steady noise) to improve voice quality. The effect of noise suppression directly determines the effect of speech enhancement. Therefore, the present invention provides a device and method to improve the effect of noise suppression (i.e., improve the effect of speech enhancement).
In view of the issues of the prior art, an object of the present invention is to provide a speech enhancement method and a processing circuit for performing the speech enhancement method, so as to improve the effect of noise suppression.
According to one aspect of the present invention, a processing circuit is provided. The processing circuit processes a to-be-processed signal to generate a target signal. The processing circuit executes a plurality of program codes or program instructions to perform the following steps: performing Fourier transform on the to-be-processed signal to generate a spectral signal of the to-be-processed signal; performing a first noise reduction processing on the spectral signal to obtain a first intermediate signal; performing a noise analysis on the first intermediate signal to obtain a noise feature; performing a second noise reduction processing on the first intermediate signal to generate a second intermediate signal when the noise feature does not satisfy a target condition; and performing inverse Fourier transform on the second intermediate signal to generate the target signal. The first noise reduction processing is different from the second noise reduction processing.
According to another aspect of the present invention, a speech enhancement method is provided. The speech enhancement method processes a to-be-processed signal to generate a target signal and includes the following steps: performing Fourier transform on the to-be-processed signal to generate a spectral signal of the to-be-processed signal; performing a first noise reduction processing on the spectral signal to obtain a first intermediate signal; performing a noise analysis on the first intermediate signal to obtain a noise feature; performing a second noise reduction processing on the first intermediate signal to generate a second intermediate signal when the noise feature does not satisfy a target condition; and performing inverse Fourier transform on the second intermediate signal to generate the target signal. The first noise reduction processing is different from the second noise reduction processing.
According to still another aspect of the present invention, a speech enhancement method is provided. The speech enhancement method processes a to-be-processed signal to generate a target signal and includes the following steps: performing Fourier transform on the to-be-processed signal to generate a spectral signal of the to-be-processed signal; performing a first noise reduction processing on the spectral signal to obtain a first intermediate signal; performing a second noise reduction processing on the first intermediate signal to generate a second intermediate signal; and performing inverse Fourier transform on the second intermediate signal to generate the target signal; wherein the first noise reduction processing is different from the second noise reduction processing.
The technical means embodied in the embodiments of the present invention can solve at least one of the problems of the prior art. Therefore, compared to the prior art, the present invention can improve the effect of noise suppression.
These and other objectives of the present invention will no doubt become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiments with reference to the various figures and drawings.
The following description uses terms customary in this technical field. If any term is defined in this specification, such term should be interpreted accordingly. In addition, the connection between objects or events in the below-described embodiments can be direct or indirect, provided that these embodiments remain practicable under such connection. Said “indirect” means that an intermediate object or a physical space exists between the objects, or that an intermediate event or a time interval exists between the events.
The disclosure herein includes a speech enhancement method and a processing circuit for performing the speech enhancement method. Because some or all elements of the processing circuit may be known, the details of such elements are omitted to the extent that they are unrelated to the features of this disclosure, and this omission does not violate the written description and enablement requirements. Some or all of the processes of the speech enhancement method may be implemented by software and/or firmware and can be performed by the processing circuit. A person having ordinary skill in the art can choose components or steps equivalent to those described in this specification to carry out the present invention, which means that the scope of this invention is not limited to the embodiments in the specification.
The input device 130 is used to input an analog input signal ASin (e.g., a speech signal) to the chip 110. The input device 130 may be a microphone.
The ADC 115 is used to convert the analog input signal ASin into a digital signal D1.
The audio transmission circuit 111 is used to receive a digital input signal DSin through a digital signal transceiver circuit (including but not limited to a wired network module, a wireless network module, a Bluetooth module, etc.).
The audio processing circuit 114 is used to perform audio processing on the digital input signal DSin or the digital signal D1 to generate the to-be-processed signal SN. In some embodiments, the audio processing circuit 114 may include a pulse density modulation (PDM) to pulse-code modulation (PCM) circuit, a resampling circuit, a filter circuit, and a digital programmable gain amplifier (DPGA). The PDM to PCM circuit is used to convert a PDM signal into a PCM signal. The resampling circuit is used to convert the high-sampling-rate PCM signal into a low-sampling-rate PCM signal. The filter circuit is used to filter out high-frequency components and DC components. The DPGA is used to adjust the gain of the filtered signal.
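The front-end chain above can be illustrated with a minimal numerical sketch. The specification does not disclose the internal designs of these circuits, so the boxcar PDM decimator, first-order DC-blocking filter, and dB-scaled gain below are illustrative assumptions only, not the claimed implementations.

```python
import numpy as np

def pdm_to_pcm(pdm_bits, decimation=64):
    """Convert a 1-bit PDM stream to PCM by boxcar decimation.
    Real designs use CIC/FIR filter chains; this is a minimal stand-in."""
    n = len(pdm_bits) // decimation * decimation
    frames = pdm_bits[:n].reshape(-1, decimation)
    # Map bits {0, 1} to {-1, +1}, then average each frame into one PCM sample.
    return (2.0 * frames - 1.0).mean(axis=1)

def remove_dc(x, alpha=0.995):
    """Hypothetical DC-blocking filter: y[n] = x[n] - x[n-1] + alpha*y[n-1]."""
    y = np.zeros_like(x, dtype=float)
    prev_x = 0.0
    for n, xn in enumerate(x):
        y[n] = xn - prev_x + (alpha * y[n - 1] if n else 0.0)
        prev_x = xn
    return y

def dpga(x, gain_db):
    """Digital programmable gain amplifier: apply a gain specified in dB."""
    return x * 10.0 ** (gain_db / 20.0)
```

In practice the resampling step would sit between the PDM-to-PCM stage and the filter; it is omitted here since its design (filter order, rates) is not given in the text.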
In some embodiments, the chip 110 further includes a direct memory access (DMA) circuit. The DMA circuit stores the to-be-processed signal SN generated by the audio processing circuit 114 in the memory 120, and reads the to-be-processed signal SN from the memory 120 and then provides it to the processing circuit 112.
The processing circuit 112 is used to perform speech enhancement processing on the to-be-processed signal SN to generate a target signal SE (i.e., a noise-suppressed (speech-enhanced) signal). The processing circuit 112 can perform speech enhancement processing by executing program instructions and/or codes stored in the memory 120.
The processor 112_a may be a general-purpose processor capable of executing programs, such as a central processing unit, a microprocessor, a microprocessor unit, a digital signal processor, an application-specific integrated circuit (ASIC), or an equivalent circuit thereof. The auxiliary processor 112_b can be a special-purpose processor capable of executing programs, such as an intelligence processing unit (IPU), a neural-network processing unit (NPU), or a graphics processing unit (GPU). The processor 112_a cooperates with the auxiliary processor 112_b to perform speech enhancement processing. That is to say, the chip 110 can use the execution capability of the auxiliary processor 112_b to accelerate the overall speech enhancement processing (i.e., to improve the overall performance of the chip 110).
In an alternative embodiment, the chip 110 may include only the processor 112_a, but not the auxiliary processor 112_b. In this case, the processor 112_a is responsible for all speech enhancement processing.
The audio processing circuit 114 performs audio processing on the target signal SE to generate a digital signal D2. The digital signal D2 can be outputted through the audio transmission circuit 111, or converted into an analog output signal ASout by the DAC 116 and then outputted to the output device 140. The output device 140 may be a speaker.
Reference is made to
Step S210: performing Fourier transform (such as short-time Fourier transform (STFT)) on the to-be-processed signal SN to generate a spectral signal MG of the to-be-processed signal SN.
Step S220: performing a first noise reduction processing on the spectral signal MG to generate a first intermediate signal MM.
Step S230: performing noise analysis based on the spectral signal MG and/or the first intermediate signal MM to obtain a noise feature.
Step S240: determining whether the noise feature satisfies a preset condition. If YES, the flow proceeds to step S250; if NO, the flow proceeds to step S260 and step S270.
Step S250: performing inverse Fourier transform (such as inverse short-time Fourier transform (ISTFT)) on the first intermediate signal MM to generate the target signal SE.
Step S260: performing a second noise reduction processing on the first intermediate signal MM to generate a second intermediate signal SR.
Step S270: performing inverse Fourier transform on the second intermediate signal SR to generate the target signal SE.
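The flow of steps S210 through S270 can be sketched end to end as follows. The four module callables (`first_nr`, `second_nr`, `noise_feature`, `target_ok`) are hypothetical stand-ins supplied by the caller for the processing described in the text; the STFT framing parameters are likewise assumptions, not values disclosed in the specification.

```python
import numpy as np

def speech_enhance(sn, first_nr, second_nr, noise_feature, target_ok,
                   n_fft=512, hop=256):
    """Two-stage enhancement flow of steps S210-S270 (sketch)."""
    # S210: STFT of the to-be-processed signal SN -> spectral signal MG.
    frames = np.lib.stride_tricks.sliding_window_view(sn, n_fft)[::hop]
    mg = np.fft.rfft(frames * np.hanning(n_fft), axis=-1)
    # S220: first (deep-learning-based) noise reduction -> MM.
    mm = first_nr(mg)
    # S230/S240: noise analysis; keep MM if the feature meets the condition,
    # otherwise apply the second noise reduction (S260) to obtain SR.
    spec = mm if target_ok(noise_feature(mg, mm)) else second_nr(mm)
    # S250/S270: inverse STFT with overlap-add -> target signal SE.
    out = np.zeros(len(frames) * hop + n_fft)
    for i, f in enumerate(np.fft.irfft(spec, n=n_fft, axis=-1)):
        out[i * hop:i * hop + n_fft] += f
    return out
```

The single `spec` selection mirrors the S240 branch: when the noise feature already satisfies the condition, the second stage is skipped entirely.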
The implementation details of
Reference is made to
The Fourier transform module 310 corresponds to step S210 in
The deep learning-based speech enhancement module 320 corresponds to step S220 in
The feature extraction module 322 is used to extract the speech feature FT of the spectral signal MG. The speech feature FT may be the amplitude spectrum of the spectral signal MG. In some embodiments, the deep learning model 324 includes a one-dimensional convolutional layer, a recurrent neural network layer, a linear layer, and an activation layer. The deep learning model 324 calculates a mask MK according to the speech feature FT. The multiplication circuit 326 suppresses a specific frequency spectrum by multiplying the spectral signal MG with the mask MK. In some embodiments, the mask MK includes multiple “1”s and “0”s; the spectrum corresponding to “1” is preserved while the spectrum corresponding to “0” is suppressed.
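The mask-and-multiply operation of the deep learning-based speech enhancement module 320 can be sketched as below. The actual mask comes from the deep learning model 324 (convolutional, recurrent, linear, and activation layers); the energy-threshold stand-in here is a hypothetical substitute used only so the example is self-contained.

```python
import numpy as np

def apply_mask(mg, mask):
    """Multiplication circuit 326: bins where the mask is 0 are suppressed,
    bins where it is 1 are preserved (a trained model may also emit soft
    values in [0, 1])."""
    return mg * mask

def binary_mask_from_model(speech_feature, threshold=0.5):
    """Hypothetical stand-in for deep learning model 324: a simple
    normalized-energy threshold replaces the conv/RNN/linear/activation stack."""
    p = speech_feature / (speech_feature.max() + 1e-12)
    return (p >= threshold).astype(float)

mg = np.array([0.1, 2.0, 0.05, 3.0])   # toy spectral signal MG
ft = np.abs(mg)                        # speech feature FT = amplitude spectrum
mk = binary_mask_from_model(ft)        # mask MK of 1s and 0s
mm = apply_mask(mg, mk)                # first intermediate signal MM
```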
With respect to training the deep learning model 324, people having ordinary skill in the art know how to provide various input signals and corresponding output signals to the deep learning-based speech enhancement module 320, so the training details are omitted for brevity.
The determination module 330 corresponds to step S230 and step S240 in
The signal processing-based speech enhancement module 340 corresponds to step S260 in
The speech activity detection module 342 is used to perform speech activity detection on the first intermediate signal MM to generate a detection result DR. In some specific embodiments, the detection result DR includes the probability of speech presence at each frequency point. The noise estimation module 344 estimates the amplitude spectrum SS of the residual noise of the first intermediate signal MM according to the detection result DR. The suppression gain calculation module 346 calculates the suppression gain GS according to the first intermediate signal MM and the amplitude spectrum SS. The multiplication circuit 348 multiplies the first intermediate signal MM by the suppression gain GS to generate the second intermediate signal SR.
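The chain of modules 342 through 348 can be sketched as a single function. The three example callables are hypothetical minimal instances (a median-energy VAD, a probability-weighted noise estimate, and a clipped subtraction gain) chosen only so the data flow DR → SS → GS → SR is concrete; they are not the disclosed implementations.

```python
import numpy as np

def second_stage(mm, vad, estimate_noise, gain_fn):
    """Signal processing-based enhancement chain 342-348 (sketch)."""
    dr = vad(mm)                 # detection result DR: speech prob. per bin
    ss = estimate_noise(mm, dr)  # residual-noise amplitude spectrum SS
    gs = gain_fn(mm, ss)         # suppression gain GS
    return mm * gs               # second intermediate signal SR (circuit 348)

# Hypothetical minimal instances of the three modules:
vad = lambda mm: (np.abs(mm) > np.median(np.abs(mm))).astype(float)
noise = lambda mm, dr: np.abs(mm) * (1.0 - dr)   # noise where speech is absent
gain = lambda mm, ss: np.clip(1.0 - ss / (np.abs(mm) + 1e-12), 0.1, 1.0)
```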
In some embodiments, the noise estimation module 344 estimates the amplitude spectrum SS of the residual noise of the first intermediate signal MM based on the following equations. In the following equations, Y represents the first intermediate signal MM, {tilde over (λ)}d represents the amplitude spectrum SS of the residual noise, Sf is the amplitude spectrum after frequency domain smoothing, b(i) is the frequency domain smoothing factor, w is the frequency domain smoothing window length, S is the amplitude spectrum after time domain smoothing, αs is the time domain smoothing factor, k is the frequency point, and l is the index of the speech frame.
Firstly, the corresponding smooth amplitude spectrum S is calculated for the first intermediate signal MM (i.e., the spectrum Y after deep-learning speech enhancement) based on equations (1)-(2).
Next, local minimum tracking is calculated based on equations (3) to (5), where Smin is the global minimum and Stmp is the local minimum. Equation (3) is for initialization, equation (4) is for tracking the local minimum and global minimum, and equation (5) is for updating the tracking results.
Then, the signal-to-noise ratio (SNR) and speech presence judgment are calculated based on equations (6)-(7), where I is the speech presence judgment result, and “1” and “0” indicate speech presence and absence, respectively.
Then, the speech presence probability is updated based on equation (8).
Then, the smoothing factor is calculated based on equation (9).
Finally, the amplitude spectrum of the noise is updated based on equation (10).
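The sequence of equations (1) through (10) follows the familiar minima-controlled recursive-averaging pattern and can be sketched as below. Since the equations themselves appear only in the drawings, every constant and update rule here is an illustrative assumption; the periodic local-minimum swap of equation (5) is also omitted for brevity.

```python
import numpy as np

def estimate_noise_mcra(Y, w=1, a_s=0.8, a_d=0.95, a_p=0.2, delta=2.0):
    """Minima-controlled recursive noise estimate tracing the steps of
    equations (1)-(10); constants are illustrative assumptions."""
    P = np.abs(Y) ** 2                 # power of each frame of Y (signal MM)
    n_bins = P.shape[1]
    S = np.zeros(n_bins)
    Smin = np.full(n_bins, np.inf)     # global minimum, initialized (eq. (3))
    p = np.zeros(n_bins)               # speech presence probability
    lam = P[0].copy()                  # noise power estimate, initialized
    for l in range(P.shape[0]):
        # (1) frequency-domain smoothing with a (2w+1)-point window
        Sf = np.convolve(P[l], np.ones(2 * w + 1) / (2 * w + 1), mode="same")
        # (2) time-domain smoothing
        S = a_s * S + (1 - a_s) * Sf
        # (4) minimum tracking (eq. (5)'s periodic swap omitted)
        Smin = np.minimum(Smin, S)
        # (6)-(7) ratio test giving the speech-presence judgment I
        I = (S / (Smin + 1e-12) > delta).astype(float)
        # (8) speech presence probability update
        p = a_p * p + (1 - a_p) * I
        # (9) probability-driven smoothing factor and (10) noise update
        a_tilde = a_d + (1 - a_d) * p
        lam = a_tilde * lam + (1 - a_tilde) * P[l]
    return np.sqrt(lam)                # amplitude spectrum SS of residual noise
```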
In some embodiments, the suppression gain calculation module 346 calculates the suppression gain Âk based on equation (11).
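Equation (11) is likewise shown only in the drawings; a Wiener-style gain with a spectral floor is assumed below purely for illustration of how module 346 could map MM and SS to a per-bin gain.

```python
import numpy as np

def suppression_gain(mm, ss, floor=0.05):
    """Sketch of a gain in the spirit of equation (11): a Wiener-style
    attenuation derived from the a posteriori SNR, with a spectral floor
    to limit musical noise. The exact disclosed formula may differ."""
    snr_post = (np.abs(mm) ** 2) / (ss ** 2 + 1e-12)  # a posteriori SNR
    g = np.maximum(1.0 - 1.0 / np.maximum(snr_post, 1e-12), 0.0)
    return np.maximum(g, floor)                       # suppression gain GS
```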
Reference is made to
Step S230 includes sub-step S410: calculating the SNR of the to-be-processed signal SN based on the spectral signal MG and the first intermediate signal MM. The SNR is the aforementioned noise feature. More specifically, the processing circuit 112 calculates the SNR according to equation (12).
In some embodiments, the SNR may also be replaced by a scale-invariant source-to-artifact ratio (SI-SAR) or a scale-invariant signal-to-distortion ratio (SI-SDR).
Step S240 includes sub-step S420: determining whether the SNR is greater than a threshold. The threshold can be determined by the user based on experience and/or the current application environment. If YES (meaning that the quality of the first intermediate signal MM is good enough), the flow proceeds to step S250; if NO, the flow proceeds to step S260 to perform the second noise reduction processing.
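Sub-steps S410 and S420 can be sketched as follows. Equation (12) appears only in the drawings, so treating MG − MM as the removed noise and comparing the energies in dB is an assumed formulation; the threshold value is likewise a placeholder for the user-chosen one.

```python
import numpy as np

def snr_feature(mg, mm, eps=1e-12):
    """S410 sketch: treat MG - MM as the noise removed by the first stage
    and compute an energy ratio in dB. The disclosed eq. (12) may differ."""
    noise = mg - mm
    return 10.0 * np.log10((np.sum(np.abs(mm) ** 2) + eps) /
                           (np.sum(np.abs(noise) ** 2) + eps))

def needs_second_stage(mg, mm, threshold_db=15.0):
    """S420: skip the second noise reduction when the SNR already exceeds
    the (experience- or environment-chosen) threshold."""
    return snr_feature(mg, mm) <= threshold_db
```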
Reference is made to
Step S230 includes sub-step S510: calculating the steady noise based on the first intermediate signal MM. The steady noise is the aforementioned noise feature. The steady noise refers to steady sounds in the background (e.g., constant noise such as the sound of wind, air conditioners running, etc.). The steady noise of the first intermediate signal MM can be calculated by performing spectrum analysis on the first intermediate signal MM. Spectrum analysis techniques are well known to people having ordinary skill in the art, so the details are omitted for brevity.
Step S240 includes sub-step S520: determining whether the amplitude of the steady noise is smaller than a threshold. If YES (meaning that the steady noise of the first intermediate signal MM is small enough), the flow proceeds to step S250; if NO, the flow proceeds to step S260 to perform the second noise reduction processing.
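One way sub-steps S510 and S520 could be realized is sketched below, using the per-bin minimum amplitude over time as a stationarity proxy; the specification does not fix a particular spectrum-analysis technique or threshold, so both are assumptions here.

```python
import numpy as np

def steady_noise_level(mm_frames):
    """S510 sketch: estimate steady (stationary) noise as the per-bin
    minimum amplitude across frames, a common stationary-floor proxy."""
    return np.min(np.abs(mm_frames), axis=0)   # shape: (n_bins,)

def steady_noise_small_enough(mm_frames, threshold=0.1):
    """S520: compare the steady-noise amplitude against a threshold."""
    return np.mean(steady_noise_level(mm_frames)) < threshold
```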
Reference is made to
The difference between the embodiment of
Reference is made to
In the first embodiment of the determination module 730, the processing circuit 112 performs determination according to the SNR of the to-be-processed signal SN. Reference can be made to the embodiment of
Reference is made to
Step S230 includes sub-step S810: calculating the non-steady noise based on the first intermediate signal MM. The non-steady noise is the aforementioned noise feature. The non-steady noise refers to sudden sounds in the background (e.g., instantaneous sounds, such as the sound of door closing, object falling to the ground, etc.). The non-steady noise of the first intermediate signal MM can be calculated by performing spectrum analysis on the first intermediate signal MM.
Step S240 includes sub-step S820: determining whether the amplitude of the non-steady noise is smaller than a threshold. If YES (meaning that the non-steady noise of the first intermediate signal MM is small enough), the flow proceeds to step S250; if NO, the flow proceeds to step S260 to perform the second noise reduction processing.
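Sub-steps S810 and S820 can be sketched analogously, here using spectral flux (frame-to-frame amplitude change) as an assumed proxy for sudden, non-stationary noise; the disclosed spectrum analysis and threshold may differ.

```python
import numpy as np

def nonsteady_noise_level(mm_frames):
    """S810 sketch: the peak frame-to-frame amplitude change (spectral
    flux) serves as a proxy for transient, non-steady noise."""
    mag = np.abs(mm_frames)
    flux = np.abs(np.diff(mag, axis=0))   # change between adjacent frames
    return flux.max()

def nonsteady_noise_small_enough(mm_frames, threshold=0.5):
    """S820: compare the non-steady noise amplitude against a threshold."""
    return nonsteady_noise_level(mm_frames) < threshold
```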
Reference is made to
The difference between the embodiment of
In the embodiment of
In the embodiment of
As far as the training of the deep learning model 324 is concerned, the embodiment of
The embodiment of
The aforementioned descriptions represent merely the preferred embodiments of the present invention, without any intention to limit the scope of the present invention thereto. Various equivalent changes, alterations, or modifications based on the claims of the present invention are all consequently viewed as being embraced by the scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---
202310067077.X | Jan 2023 | CN | national |