The present disclosure relates to systems and methods for sound signal processing, and relates more particularly to enhancement of speech signal(s) captured by at least one input device.
When a speech signal is acquired from one or more far field microphone(s) (microphones configured to capture sound from distances of, e.g., 5 feet or more from the sound source), the speech signal is often corrupted by additive noise and by the convolutive effects of room reverberation. This corruption is a particular concern for artificial-intelligence (AI)-based automatic speech recognition (ASR) system applications, e.g., for meeting transcription. In such applications, it is desirable to provide a technique for signal enhancement, thereby enabling the ASR system to operate in a robust manner.
In an example case involving signal processing within a neural beamforming structure (multi-channel input, single-channel output), signal processing problems include de-reverberation, de-noising and spatial filtering (beamforming). Among the currently-known beamformers, a Minimum Variance Distortionless Response (MVDR) beamformer targets de-noising. An example MVDR beamformer includes a filter-and-sum beamformer with the ability to place nulls towards competing speakers or noise. Another currently-known beamformer is a neural beamformer, which uses a neural network to optimize the beamformer parameters and can also be trained to target word error rate (WER) reduction. The neural beamformer is typically composed of a neural network estimating MVDR parameters (and thus also targets de-noising). In addition, there are known de-reverberation methods such as Channel Shortening (CS) and Weighted Prediction Error (WPE), which have been deployed as part of MVDR beamformers. Furthermore, there is a class of “convolutional beamformers” that jointly optimize the WPE and MVDR constraints. However, there is no known solution that jointly optimizes de-reverberation, de-noising and spatial filtering in an integrated manner.
Similarly, in an example case involving signal processing for a single-channel input, single-channel output system, there is currently no known method that jointly optimizes de-reverberation and de-noising in an integrated manner.
Therefore, a need exists for a solution that jointly optimizes at least de-reverberation and de-noising, and that additionally includes spatial filtering in the joint optimization in the case of a multi-channel input, single-channel output system.
According to an example embodiment of the present disclosure, a method for jointly optimizing the objectives of de-reverberation and de-noising (also referred to as noise reduction) with a neural-network-based approach is provided.
According to an example embodiment of the present disclosure, the objective of spatial filtering (also known as beamforming) is jointly optimized with the objectives of de-reverberation and de-noising (also referred to as noise reduction), using a neural-network-based approach.
According to an example embodiment of the present disclosure, a method for jointly optimizing de-reverberation, spatial filtering and de-noising for a multi-channel input, single-channel output (MISO) system is provided, which method utilizes a combination of signal quality and automatic speech recognition (ASR)-based losses for the optimization criteria.
According to an example embodiment of the present disclosure, a method for jointly optimizing de-reverberation and de-noising for a single-channel input, single-channel output (SISO) system is provided.
According to an example embodiment of the present disclosure, for the MISO system, the following are performed: i) neural delay-and-sum beamforming, ii) channel-shortening-based de-reverberation, and iii) mask-based noise reduction.
According to an example embodiment of the present disclosure, for the MISO system, the following are performed: i) CS filter estimation and noise reduction mask (NRM) estimation are performed by a CS filter estimation component using information from the spectra of all of the multiple channel inputs to configure a single CS filter and a single NRM; ii) phase shift estimation is performed (e.g., in parallel with CS filter and NRM estimation); iii) phase alignment is performed after the phase shift estimation; iv) a weight-and-sum operation is performed next; and then v) a single channel shortening (CS) filter and, optionally, a single noise-reduction mask (NRM) can be applied to the output of the weight-and-sum operation.
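The ordering of steps iii) through v) above can be illustrated with a minimal NumPy sketch. This is not the disclosed implementation: the function name `miso_single_cs`, the array shapes, and the modeling of the CS filter as a per-bin complex gain in the short-time Fourier transform (STFT) domain are all assumptions made for illustration. In the disclosure, the phase shifts, the CS filter and the NRM are estimated by neural-network components; here they are simply passed in as arrays.

```python
import numpy as np

def miso_single_cs(Y, phase_shift, weights, cs_gain, nr_mask=None):
    """Hypothetical sketch of the single-CS-filter MISO chain.

    Y           : (C, T, F) complex STFT of the C microphone channels
    phase_shift : (C, F) estimated per-channel phase offsets in radians
    weights     : (C,) channel weights for the weight-and-sum step
    cs_gain     : (T, F) complex channel-shortening gain (assumed per-bin)
    nr_mask     : optional (T, F) real-valued noise-reduction mask
    """
    # iii) phase alignment: undo each channel's estimated phase offset
    aligned = Y * np.exp(-1j * phase_shift)[:, None, :]
    # iv) weight-and-sum across channels, giving a single (T, F) signal
    beamformed = np.tensordot(weights, aligned, axes=(0, 0))
    # v) apply the single CS filter, then (optionally) the single NRM
    out = beamformed * cs_gain
    if nr_mask is not None:
        out = out * nr_mask
    return out
```

With equal weights, a unit CS gain and no mask, the sketch reduces to a plain delay-and-sum beamformer, which recovers the source exactly when the channels differ only by the supplied phase offsets.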
According to an example embodiment of the present disclosure, for the MISO system, the following are performed: i) a CS filter estimation component uses information from the spectra of all of the multiple channel inputs to configure corresponding multiple CS filters and a single NRM; ii) phase shift estimation is performed (e.g., in parallel with CS filter and NRM estimation); iii) phase alignment is performed after the phase shift estimation; iv) the output of the phase alignment is applied to the multiple CS filters; and v) a weight-and-sum operation is performed on the output of the multiple CS filters, the output of which weight-and-sum operation is a single channel signal that can be further processed by the single NRM and/or a voice activity detection (VAD) estimation component.
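The multiple-CS-filter variant differs from the single-filter variant mainly in where de-reverberation sits relative to the weight-and-sum operation. The sketch below is a hedged illustration only: `miso_multi_cs`, the array shapes, and the treatment of each CS filter as a per-bin complex gain are assumptions, and the neural estimators of the disclosure are replaced by arrays supplied by the caller.

```python
import numpy as np

def miso_multi_cs(Y, phase_shift, cs_gains, weights, nr_mask=None):
    """Hypothetical sketch of the multiple-CS-filter MISO chain.

    Y           : (C, T, F) complex STFT of the C microphone channels
    phase_shift : (C, F) estimated per-channel phase offsets in radians
    cs_gains    : (C, T, F) or broadcastable per-channel CS gains (assumed per-bin)
    weights     : (C,) channel weights for the weight-and-sum step
    nr_mask     : optional (T, F) real-valued noise-reduction mask
    """
    # iii) phase alignment of every channel
    aligned = Y * np.exp(-1j * phase_shift)[:, None, :]
    # iv) one CS filter per channel, applied before beamforming
    dereverbed = aligned * cs_gains
    # v) weight-and-sum of the de-reverberated channels -> (T, F)
    beamformed = np.einsum("c,ctf->tf", weights, dereverbed)
    # the single NRM, if used, acts on the single-channel result
    if nr_mask is not None:
        beamformed = beamformed * nr_mask
    return beamformed
```

Applying the CS filters per channel lets each filter compensate a different room response, at the cost of estimating C filters instead of one.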
According to an example embodiment of the present disclosure, for the SISO system, the following are performed: i) noise reduction is performed explicitly using a time-frequency (T-F) mask; ii) de-reverberation is performed in the form of channel shortening (e.g., by applying a CS filter); and iii) voice activity estimated from the T-F mask (voice activity detection (VAD)) is used to determine the amount of speech present in a context window.
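The disclosure does not state how voice activity is derived from the T-F mask. As one plausible reading, the sketch below averages the mask over frequency to obtain a per-frame speech-presence score and then averages that score over a context window of frames. The function name `vad_from_mask`, the averaging rule, and the edge-padding at the signal boundaries are all assumptions made for illustration.

```python
import numpy as np

def vad_from_mask(mask, context=5):
    """Hypothetical speech-presence estimate from a T-F mask.

    mask    : (T, F) real-valued mask with entries in [0, 1]
    context : odd number of frames in the context window
    Returns a (T,) array: mean mask value over frequency, averaged
    over `context` frames centered on each frame.
    """
    if context % 2 == 0:
        raise ValueError("context must be odd so the window is centered")
    # per-frame score: fraction of the spectrum the mask keeps
    frame_score = mask.mean(axis=1)
    # repeat the edge frames so the output keeps length T
    padded = np.pad(frame_score, context // 2, mode="edge")
    kernel = np.full(context, 1.0 / context)
    return np.convolve(padded, kernel, mode="valid")
```

A value near 0 indicates a window dominated by non-speech, in which a downstream component could, for example, attenuate or bypass the CS filter.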
According to an example embodiment of the present disclosure, for the SISO system, noise reduction is performed explicitly to find the reverberant-only signal before performing channel shortening.
According to an example embodiment of the present disclosure, for the SISO system, after performing channel shortening to produce estimated de-reverberated and noisy speech, noise reduction is performed on the estimated de-reverberated and noisy speech.
According to an example embodiment of the present disclosure, for the SISO system, the multiplicative factors for channel shortening and noise reduction are estimated jointly as one filter, whereby noise reduction is performed implicitly in combination with the channel shortening filter.
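One way to see why the two multiplicative factors can be folded into a single filter is to assume the per-bin signal model Y = XR + N used elsewhere in this disclosure, reading Y as the observed spectrum, X as the clean speech spectrum, R as the reverberation response and N as the additive noise. Under that assumed reading, with M denoting the noise-reduction mask and G a hypothetical symbol for the combined factor:

```latex
\begin{aligned}
X &= \frac{Y_{\mathrm{REV}}}{R} = \frac{Y M}{R} = Y\,G, \\
G &= \frac{M}{R} = \frac{1}{R}\left(1 - \frac{N}{Y}\right).
\end{aligned}
```

Estimating G directly means that neither M nor R needs to be produced as a separate output, which is the sense in which the noise reduction becomes implicit.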
According to an example embodiment of the present disclosure, for the SISO system, noise reduction is performed implicitly in combination with the channel shortening filter and VAD estimation.
According to an example embodiment of the present disclosure, for the SISO system, noise reduction is performed implicitly in combination with the channel shortening filter and a set of non-intrusive measures (NIM) including, e.g., reverberation time (“T60”), clarity index (“C50”), direct-to-reverberant ratio (DRR), and signal-to-noise ratio (SNR).
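The disclosure names the non-intrusive measures but does not define them or say how they are estimated. For orientation only, the sketch below computes common textbook reference values for C50, DRR and SNR from quantities that a non-intrusive estimator would not have access to (the room impulse response and the separated speech and noise); such values could serve as training targets. The function names, the use of the impulse-response peak as the direct-path arrival, and the 2.5 ms direct-path half-window are assumptions.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero for degenerate inputs

def clarity_c50_db(rir, fs):
    """C50: energy in the first 50 ms after the direct path vs. the rest."""
    onset = int(np.argmax(np.abs(rir)))      # assumed direct-path arrival
    split = onset + int(round(0.050 * fs))
    early = np.sum(rir[onset:split] ** 2)
    late = np.sum(rir[split:] ** 2)
    return 10.0 * np.log10(early / (late + EPS))

def drr_db(rir, fs, half_window_s=0.0025):
    """DRR: energy in a short window around the direct path vs. what follows."""
    onset = int(np.argmax(np.abs(rir)))
    half = int(round(half_window_s * fs))
    lo, hi = max(0, onset - half), onset + half + 1
    direct = np.sum(rir[lo:hi] ** 2)
    reverberant = np.sum(rir[hi:] ** 2)
    return 10.0 * np.log10(direct / (reverberant + EPS))

def snr_db(speech, noise):
    """SNR: speech energy over noise energy, in dB."""
    return 10.0 * np.log10(np.sum(speech ** 2) / (np.sum(noise ** 2) + EPS))
```

Reverberation time (T60) is omitted here because it is usually obtained by fitting the decay of the backward-integrated impulse response, which needs more choices than a short sketch can justify.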
Continuing with
Continuing with
For the example MISO systems shown in
As shown in
$$Y = XR + N$$
$$Y - N = XR = Y_{\mathrm{REV}}$$
$$Y_{\mathrm{REV}} = YM, \quad \text{where } M = 1 - N/Y$$
$$X = Y_{\mathrm{REV}}/R$$
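The relations above can be checked numerically. The sketch below is an oracle illustration, not the disclosed method: it assumes that N and R are known exactly, whereas in practice the mask and the CS filter would be estimated. The function name `mask_then_shorten` and the per-bin complex arrays are assumptions.

```python
import numpy as np

def mask_then_shorten(Y, N, R):
    """Oracle check of the mask-first ordering.

    Y, N, R : complex arrays of equal shape (observed spectrum,
              additive noise, reverberation response), assumed known.
    Returns the recovered clean spectrum X.
    """
    # noise-reduction mask M = 1 - N/Y removes the additive term
    M = 1.0 - N / Y
    # reverberant-only signal Y_REV = Y * M = X * R
    Y_rev = Y * M
    # channel shortening divides out the reverberation response
    return Y_rev / R
```

For example, building Y from random X, R and N and passing the same N and R back in returns X to within floating-point error.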
In the example embodiment of
Continuing with
As shown in
$$Y = XR + N$$
$$Y_{\mathrm{NOISY}} = Y/R$$
$$X = Y_{\mathrm{NOISY}} - N/R$$
$$X = Y_{\mathrm{NOISY}}M, \quad \text{where } M = 1 - N/(Y_{\mathrm{NOISY}}R)$$
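The shortening-first ordering can be checked in the same oracle fashion. The sketch assumes N and R are known exactly, which is not the case in practice; `shorten_then_mask` is a hypothetical name. It also returns the mask so that its value can be compared with 1 − N/Y, which it must equal because the product of the de-reverberated noisy signal and R is Y.

```python
import numpy as np

def shorten_then_mask(Y, N, R):
    """Oracle check of the shortening-first ordering.

    Y, N, R : complex arrays of equal shape (observed spectrum,
              additive noise, reverberation response), assumed known.
    Returns (X, M): the recovered clean spectrum and the mask used.
    """
    # channel shortening first: de-reverberated but still noisy
    Y_noisy = Y / R
    # mask that removes the (now filtered) noise term N/R
    M = 1.0 - N / (Y_noisy * R)
    return Y_noisy * M, M
```

Although the ideal mask takes the same value in both orderings, the signal it is applied to differs, so a learned mask estimator would see different inputs in the two embodiments.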
Continuing with
As shown in
As shown in
The present disclosure provides several embodiments of an architecture for jointly optimizing at least de-reverberation and noise reduction. In the case of multiple microphone input, the example embodiments provide an improvement over the known convolutional beamformers by enabling full optimization for, e.g., an ASR application. This is possible due to the neural network structure employed for the de-reverberation and noise reduction front end components, allowing for end-to-end optimization (e.g., with a WER loss component). In addition, the disclosed example embodiments for jointly optimizing de-reverberation and de-noising differ from the known approaches in that the disclosed example embodiments utilize a channel shortening system model as opposed to an MVDR/WPE system model, for example. Moreover, the disclosed example embodiments utilize a delay-and-sum structure for beamforming, instead of the MVDR or minimum power distortion-less response (MPDR) filter and sum structure for beamforming.
Similarly, in the case of a single microphone input, the example embodiments provide an improvement over the known approaches by providing a novel structure of channel shortening and mask estimation for jointly performing de-reverberation and de-noising with criteria for fully optimizing, e.g., ASR. In addition, the VAD estimation is performed jointly with the optimization process. Incorporating the VAD estimation is important because it allows the system to respond to non-speech regions, where attempting de-reverberation can result in unwanted artifacts.
The present disclosure provides a first example of a method of performing at least de-reverberation and noise-reduction of an input sound signal of at least one input channel, comprising: performing, using at least one filter element, at least one of de-reverberation and noise-reduction of the input sound signal to generate a clean output sound signal; and determining, by a non-intrusive measure (NIM) estimation element, at least one non-intrusive measure (NIM) from the sound signal, wherein the at least one NIM includes at least one of voice activity detection (VAD) posterior, reverberation time, clarity index, direct-to-reverberant ratio (DRR), and signal-to-noise ratio (SNR); wherein the de-reverberation is achieved by applying at least one channel shortening (CS) filter component of the at least one filter element.
The present disclosure provides a second example method based on the above-discussed first example method, in which second example method: the noise reduction is performed in combination with the de-reverberation by the channel shortening (CS) filter component; and the de-reverberation is achieved by applying the at least one channel shortening (CS) filter component of the at least one filter element in conjunction with the at least one NIM.
The present disclosure provides a third example method based on the above-discussed first example method, in which third example method: a VAD estimation element is used as the NIM estimation element, and the VAD posterior is used as the at least one NIM.
The present disclosure provides a fourth example method based on the above-discussed first example method, the fourth example method further comprising: estimating a time-frequency (T-F) mask based on one of the input sound signal or a sound signal derived from the input sound signal, and wherein the noise-reduction is achieved by applying the T-F mask.
The present disclosure provides a fifth example method based on the above-discussed fourth example method, in which fifth example method the at least one CS filter component and the T-F mask are optimized jointly.
The present disclosure provides a sixth example method based on the above-discussed fourth example method, in which sixth example method a noise-reduced sound signal is produced by applying the T-F mask, and wherein the at least one CS filter component is applied to the noise-reduced sound signal to achieve de-reverberation and produce a clean output signal.
The present disclosure provides a seventh example method based on the above-discussed fourth example method, in which seventh example method the at least one CS filter component is applied to the input sound signal to produce a de-reverberated sound signal; and the T-F mask is applied to the de-reverberated sound signal to achieve noise-reduction and produce a clean output signal.
The present disclosure provides an eighth example method based on the above-discussed first example method, in which eighth example method multiple input channels are provided for capturing multiple input sound signals, the eighth example method further comprising: performing, by a phase alignment module, phase alignment of the multiple input sound signals to produce phase-aligned multiple sound signals.
The present disclosure provides a ninth example method based on the above-discussed eighth example method, the ninth example method further comprising: performing, by a weight-and-sum module, a weighted delay-and-sum beamforming of the phase-aligned multiple sound signals to produce a beamformed signal; wherein at least one of i) a single filter element is applied to perform at least one of de-reverberation and noise-reduction of the beamformed signal to produce the clean output sound signal, and ii) at least one voice activity detection (VAD) posterior is determined based on the clean output sound signal.
The present disclosure provides a tenth example method based on the above-discussed eighth example method, in which tenth example method multiple CS filter components and a single noise-reduction mask are provided, the tenth example method further comprising: applying the multiple CS filter components to the phase-aligned multiple sound signals to produce de-reverberated multiple sound signals; performing, by a weight-and-sum module, a weighted delay-and-sum beamforming of the de-reverberated multiple sound signals to produce a beamformed signal; and at least one of i) applying the single noise-reduction mask to the beamformed signal to produce the clean output sound signal, and ii) determining at least one voice activity detection (VAD) posterior based at least in part on the clean output sound signal.
The present disclosure provides a first example system for performing at least de-reverberation and noise-reduction of an input sound signal of at least one input channel, comprising: at least one filter element configured to perform at least one of de-reverberation and noise-reduction of the input sound signal to generate a clean output sound signal; and a non-intrusive measure (NIM) estimation element configured to determine at least one non-intrusive measure (NIM) from the sound signal, wherein the at least one NIM includes at least one of voice activity detection (VAD) posterior, reverberation time, clarity index, direct-to-reverberant ratio (DRR), and signal-to-noise ratio (SNR); wherein the de-reverberation is achieved by applying at least one channel shortening (CS) filter component of the at least one filter element.
The present disclosure provides a second example system based on the above-discussed first example system, in which second example system: the noise reduction is performed in combination with the de-reverberation by the channel shortening (CS) filter component; and the de-reverberation is achieved by applying the at least one channel shortening (CS) filter component of the at least one filter element in conjunction with the at least one NIM.
The present disclosure provides a third example system based on the above-discussed first example system, in which third example system a VAD estimation element is used as the NIM estimation element, and the VAD posterior is used as the at least one NIM.
The present disclosure provides a fourth example system based on the above-discussed first example system, in which fourth example system a time-frequency (T-F) mask is estimated based on one of the input sound signal or a sound signal derived from the input sound signal, and the noise-reduction is achieved by applying the T-F mask.
The present disclosure provides a fifth example system based on the above-discussed fourth example system, in which fifth example system the at least one CS filter component and the T-F mask are optimized jointly.
The present disclosure provides a sixth example system based on the above-discussed fourth example system, in which sixth example system a noise-reduced sound signal is produced by applying the T-F mask, and the at least one CS filter component is applied to the noise-reduced sound signal to achieve de-reverberation and produce a clean output signal.
The present disclosure provides a seventh example system based on the above-discussed fourth example system, in which seventh example system: the at least one CS filter component is applied to the input sound signal to produce a de-reverberated sound signal; and the T-F mask is applied to the de-reverberated sound signal to achieve noise-reduction and produce a clean output signal.
The present disclosure provides an eighth example system based on the above-discussed first example system, in which eighth example system multiple input channels are provided for capturing multiple input sound signals, the eighth example system further comprising: a phase alignment module configured to perform phase alignment of the multiple input sound signals to produce phase-aligned multiple sound signals.
The present disclosure provides a ninth example system based on the above-discussed eighth example system, the ninth example system further comprising: a weight-and-sum module configured to perform a weighted delay-and-sum beamforming of the phase-aligned multiple sound signals to produce a beamformed signal; wherein at least one of i) a single filter element is applied to perform at least one of de-reverberation and noise-reduction of the beamformed signal to produce the clean output sound signal, and ii) at least one voice activity detection (VAD) posterior is determined based on the clean output sound signal.
The present disclosure provides a tenth example system based on the above-discussed eighth example system, the tenth example system further comprising: a weight-and-sum module configured to perform a weighted delay-and-sum beamforming; wherein multiple CS filter components and a single noise-reduction mask are provided; the multiple CS filter components are applied to the phase-aligned multiple sound signals to produce de-reverberated multiple sound signals; the weight-and-sum module performs a weighted delay-and-sum beamforming of the de-reverberated multiple sound signals to produce a beamformed signal; and at least one of i) the single noise-reduction mask is applied to the beamformed signal to produce the clean output sound signal, and ii) at least one voice activity detection (VAD) posterior is determined based at least in part on the clean output sound signal.
Number | Date | Country |
---|---|---|
20230267944 A1 | Aug 2023 | US |