The present disclosure relates generally to speech enhancement and, more particularly, to reduction of reverberation in multiple signals (e.g., multichannel system) originating from a noisy, reverberant environment.
When speaking into an audio device—such as a smartphone, tablet, or laptop—from even a short distance (as opposed to speaking directly into the microphone), reflections of the speech signal can traverse various paths to the microphone of the device. These reflections of the signal (e.g., reverberations) can make the speech unintelligible. The effects of reverberation are often more noticeable in relatively empty environments that lack objects, such as furniture and people, to absorb the sound reflections. The quality of VoIP (voice over internet protocol) calls and the performance of many microphone array processing techniques—such as sound source localization, beam forming, and automatic speech recognition (ASR) used, e.g., for spoken commands and voicemail transcription—are generally degraded in reverberant environments.
A number of existing reverberation reduction methods suffer from a lack of processing speed (e.g., due to the computational complexity of the methods) and an excess of memory consumption that make them impractical for real-time (e.g., "on-line") use in applications such as speech command recognition, voicemail transcription, and VoIP communication. For applications involving processing of signals from microphone arrays—such as sound source localization, reducing noise and interference in Multiple Input Multiple Output (MIMO) applications, beam forming, and automatic speech recognition—performance generally increases with the number of microphones used, yet existing de-reverberation methods typically do not produce the same number of de-reverberated signals as there are microphones in the array, limiting their applicability. Thus, there is a continued need in the art for faster, more memory-efficient, MIMO-capable de-reverberation solutions for audio signal processing.
Systems and methods for Multiple Input Multiple Output (MIMO) audio signal processing are described herein. In various embodiments, systems and methods of adaptive de-reverberation are disclosed that use a least mean squares (LMS) filter that has improved convergence over conventional LMS filters, making embodiments practical for reducing the effects of reverberation for use in many portable audio devices, such as smartphones, tablets, and televisions, for applications like speech (e.g., command) recognition, voicemail transcription, and communication in general.
In one embodiment, a frequency-dependent adaptive step size is employed to speed up the convergence of the LMS filter process, such that the process arrives at its solution in fewer computational steps than a conventional LMS filter. In one embodiment, the improved convergence is achieved while retaining the computational efficiency, in terms of low memory consumption, that is characteristic of LMS filter methods compared to some other adaptive filtering methods. In one embodiment, controlling the updates of the prediction filter of the LMS method using voice activity detection under highly non-stationary acoustic channel conditions improves the performance of the de-reverberation method under such conditions.
In one or more embodiments, systems and methods provide processing of multichannel audio signals from a plurality of microphones, each microphone corresponding to one of a plurality of channels, to produce de-reverberated enhanced output signals with the same number of de-reverberated signals as microphones.
One or more embodiments disclose a method including a subband analysis to transform the multichannel audio signals on each channel from the time domain to under-sampled K-subband frequency domain signals, wherein K is the number of frequency bins, each frequency bin corresponding to one of K subbands; buffering, with a delay, to store for each channel a number Lk of frames for each frequency bin; estimating, in an online (e.g., real-time) manner, a prediction filter at each frame using an adaptive method for online convergence; performing a linear filtering on the K-subband frequency domain signals using the estimated prediction filter; and applying a subband synthesis to reconstruct the K-subband frequency domain signals to time-domain signals on the plurality of channels.
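By way of illustration only, the following Python sketch shows one way the subband analysis and delayed Lk-frame buffering described above could be realized, assuming an STFT as the subband transform; the function name, the default values of K, Lk, and the delay, and the use of SciPy are assumptions of this sketch rather than details taken from the disclosure.

```python
# Hedged sketch: STFT-based subband analysis plus a delayed Lk-frame buffer.
from scipy.signal import stft

def subband_frames(x, K=512, Lk=10, delay=2):
    """x: (M, N) multichannel time-domain signal. Yields, per frame l, the
    current subband frame (shape (M, K//2 + 1)) together with the Lk
    buffered past frames whose most recent member is `delay` frames old."""
    _, _, X = stft(x, nperseg=K)                 # X: (M, K//2 + 1, L) frames
    L = X.shape[-1]
    for l in range(Lk + delay, L):
        current = X[:, :, l]
        past = X[:, :, l - delay - Lk + 1 : l - delay + 1]  # (M, K//2+1, Lk)
        yield current, past
```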
The method may further include estimating a variance σ(l,k) of the frequency-domain signals for each frame and frequency bin, and following the linear filtering, applying a nonlinear filtering using the estimated variance to reduce residual reverberation and noise after the linear filtering. Estimating the variance may comprise estimating a variance of reflections, a reverberation component variance, and a noise variance.
In various embodiments, the method may further include estimating the variance of reflections using a previously estimated prediction filter, estimating the reverberation component variance using a fixed exponentially decaying weighting function with a tuning parameter to optimize the prediction filter per application, and estimating the noise variance using single-microphone noise variance estimation for each channel. The method may further include performing linear filtering under control of a tuning parameter to adjust an amount of de-reverberation. In one embodiment, the adaptive method comprises using a least mean squares (LMS) process to estimate the prediction filter at each frame independently for each frequency bin, and using an adaptive step-size estimator that improves a convergence rate of the LMS process compared to using a fixed step-size estimator. The method may further comprise using voice activity detection to control the updates of the prediction filter under noisy conditions.
In various embodiments, an audio signal processing system comprises a hardware system processor and a non-transitory system memory, including: a subband analysis module operable to transform a multichannel audio signal from a plurality of microphones, each microphone corresponding to one of a plurality of channels, from the time domain to the frequency domain as subband frames having a number K of frequency bins, each frequency bin corresponding to one of K subbands of a plurality of under-sampled K-subband frequency domain signals; a buffer, having a delay, operable to store for each channel a number of subband frames for each frequency bin; a prediction filter estimator operable to estimate, in an online manner, a prediction filter at each subband frame using an adaptive method; a linear filter operable to apply the estimated prediction filter to a current subband frame; and a subband synthesizer operable to reconstruct the K-subband frequency domain signals from the current subband frame into a number of time-domain de-reverberated enhanced output signals on the plurality of channels, wherein the number of time-domain de-reverberated signals is the same as the number of microphones.
In various embodiments, the system may further include a variance estimator operable to estimate a variance of the K-subband frequency-domain signals for each frame and frequency bin, and a nonlinear filter operable to apply a nonlinear filter based on the estimated variance following the linear filtering of the current subband frame. The variance estimator may be further operable to estimate a variance of early reflections, a reverberation component variance, and a noise variance.
In various embodiments, the prediction filter is further operable to use a least mean squares (LMS) process to estimate the prediction filter at each frame independently for each frequency bin. The system may also include an adaptive step-size estimator that improves a convergence rate of LMS compared to using a fixed step-size estimator. The system may also include a voice activity detector to control the update of the prediction filter.
In one embodiment, the linear filter is operable to operate under control of a tuning parameter that adjusts an amount of de-reverberation applied by the estimated prediction filter to the current subband frame. In one embodiment, estimating the variance of early reflections comprises using a previously estimated prediction filter, estimating the reverberation component variance comprises using a fixed exponentially decaying weighting function with a tuning parameter, and estimating the noise variance comprises using single-microphone noise variance estimation for each channel.
In various embodiments, a system includes a non-transitory memory storing one or more subband frames and one or more hardware processors in communication with the memory and operable to execute instructions to cause the system to perform operations. The system may be operable to perform operations comprising estimating a prediction filter online at each subband frame using an adaptive method of least mean squares (LMS) estimation, performing a linear filtering on the subband frames using the estimated prediction filter, and applying a subband synthesis to reconstruct the subband frames into time-domain signals on a plurality of channels.
In various embodiments, the system is further operable to use an adaptive step-size estimator based on values of a gradient of a cost function or an adaptive step-size estimator that varies inversely to an average of values of a gradient of a cost function.
The scope of the invention is defined by the claims, which are incorporated into this section by reference. A more complete understanding of embodiments of the invention will be afforded to those skilled in the art, as well as a realization of additional advantages thereof, by a consideration of the following detailed description of one or more embodiments. Reference will be made to the appended sheets of drawings that will first be described briefly.
Embodiments of the present disclosure and their advantages are best understood by referring to the detailed description that follows. It should be appreciated that like reference numerals are used to identify like elements illustrated in one or more of the figures.
Embodiments of adaptive de-reverberation systems and methods are disclosed. In various embodiments, an adaptive de-reverberation system uses a least mean squares (LMS) filter that achieves improved convergence over conventional LMS filters, making the embodiments practical for reducing the effects of reverberation in many portable audio devices, such as smartphones, tablets, and televisions, for applications like speech (e.g., command) recognition, voicemail transcription, and communication in general. In one embodiment, a frequency-dependent adaptive step size is employed to speed up the convergence of the LMS filter process, meaning that the process arrives at its solution in fewer computational steps than a conventional LMS filter. In another embodiment, an inventive process of controlling the updates of the prediction filter of the LMS method under highly non-stationary acoustic channel conditions improves the performance of the de-reverberation method under such conditions.
In various embodiments, the improved convergence is achieved while retaining the computational efficiency, in terms of low memory consumption, that is characteristic of LMS filter methods compared to some other filter methods. For example, LMS methods can have a much lower cost in terms of memory consumption because they do not require a correlation matrix, as is required by other methods such as recursive least squares (RLS) filter and Kalman filter methods. But LMS methods generally have a lower convergence rate than more advanced methods like Kalman filtering and RLS filtering. Embodiments thus provide an LMS filter with a speed of convergence that is closer to that of comparable Kalman filtering and RLS filtering, but with a comparatively reduced memory consumption cost. For example, embodiments feature a new adaptive de-reverberation using an LMS method that, unlike RLS and Kalman filter methods, does not require a correlation matrix, so the memory consumption is much lower.
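To make the memory comparison concrete, consider a purely illustrative configuration (the numbers are assumptions, not taken from the disclosure) with M = 4 microphones and L_k = 20 filter taps per frequency bin, so that the stacked prediction filter has n = L_k·M = 80 complex coefficients per bin. An LMS filter stores essentially the filter itself, on the order of n = 80 complex values per bin, whereas RLS and Kalman methods additionally maintain an n×n (inverse) correlation matrix, on the order of n² = 6,400 complex values per bin: an 80-fold increase, repeated for each of the K frequency bins.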
The adaptive de-reverberation using an LMS filter according to one or more embodiments of this disclosure, by providing an LMS filter with a speed of convergence that is closer to that of comparable Kalman filtering and RLS filtering but with memory consumption cost that is reduced by comparison, improves the technology of audio signal processing used by many types of devices including smartphones, tablets, televisions, personal computers, and embedded devices such as car computers and audio codecs used in phones and other communication devices.
One application of de-reverberation is for speech enhancement in a noisy, reverberant environment. Such speech enhancement can be difficult to achieve because of various intrinsic properties of the speech signals, the noise signals, and the acoustic channel. For example, (i) speech signals are colored (e.g., the signal power varies depending on frequency) and non-stationary (e.g., statistical properties, such as average volume of the speech signal, change over time), (ii) noise signals (e.g., the environmental noise) can change dramatically over time, and (iii) the impulse response of an acoustic channel (e.g., room acoustics) is usually very long (e.g., enhancing the effect of reverberation) and has non-minimum phase (e.g., there is no direct inversion for the impulse response).
Conventional techniques for de-reverberation processing are typically application-specific in a way that limits or precludes their real-time or on-line use for audio devices and audio processing found in, for example, VoIP, hearing aids, smartphones, tablets, televisions, laptops, videoconferencing, and other embedded devices (processors) used in products such as appliances and automobiles. For example, the respective computational complexity for each technique may cause it to be impractical for real-time, on-line processing.
A number of other examples of limitations of the prior art techniques for de-reverberation processing are as follows. The memory consumption of many of the techniques is high and not suitable for embedded devices, which require memory-efficient techniques due to constraints on memory in such devices. In a real-world environment, the reverberant speech signals are usually contaminated with non-stationary additive background noise (e.g., non-constant or disruptive noise) that can greatly deteriorate the performance of de-reverberation techniques that do not explicitly consider the non-stationary noise in their model. Many prior art de-reverberation methods are batch approaches (e.g., imposing or incurring a delay or latency between input and output) that require a considerable amount of input data to provide good performance results; in many applications, however, such as VoIP and hearing aids, there should not be any latency. Many prior art de-reverberation techniques do not produce the same number of de-reverberated signals as microphones, contrary to the requirements of many microphone array processing techniques for which performance increases with the number of microphones. Many prior art de-reverberation techniques do not conserve the time differences of arrival (TDOAs) at (multiple) microphone positions, contrary to the requirements of many source localization techniques that are explicitly or implicitly based on time differences of arrival at the microphone positions. Many prior art de-reverberation techniques require knowledge (e.g., an input or configuration) of the number of sound sources, which is required because it is often difficult to estimate the correct number of sources with blind processing.
Embodiments as described herein provide qualities and features that address the above limitations, making them useful for a great variety of different applications. For example, processes that implement the embodiments can be designed to be memory efficient and speed efficient, requiring, for example, less memory and less computation in order to be able to run with no latency (e.g., perform in real time), which makes the embodiments desirable for applications like VoIP.
De-reverberation according to one or more embodiments of the present disclosure is robust to non-stationary noise, performs well in high-reverb conditions with high reverberation time, can be both single-channel and multi-channel, and can be adapted for the case of more than one source. In one embodiment, by skipping the nonlinear filtering part of the method (which is used to further reduce noise and residual reverberation after the linear filtering), the processing can be converted into linear processing, which may be essential for some applications requiring linearity. In one embodiment, an adaptive filter for de-reverberation takes additive background noise into account, adaptively estimating the power spectral density (PSD) of the noise to adaptively estimate the prediction filter to provide real-time performance for on-line use.
The Multiple Input Multiple Output (MIMO) feature of one or more embodiments provides several capabilities, including ready integration into other modules for performing noise reduction or source localization. In one embodiment, a blind method—e.g., one that recovers a set of source signals from a set of mixed signals, without aid of information about the source signals or their mixing process—uses multi-channel input signals to shorten the room impulse response (RIR) between a set of sources of unknown number and the microphones. The method uses subband-domain multi-channel linear prediction filters and estimates the filter for each frequency band independently. One notable capability of the method is that it can conserve time differences of arrival (TDOA) at microphone positions as well as the linear relationship between sources and microphones; such capability may be required for subsequent processing for localization and for reducing noise and interference. In addition, the method can yield as many de-reverberated signals as microphones by estimating the prediction filter for each microphone separately.
In real-world environments, a recorded speech signal is noisy, and this noise can degrade speech intelligibility for VoIP applications and decrease the speech recognition performance of devices such as phones and laptops. When microphone arrays (e.g., microphone array 101) are employed instead of a single microphone, it is easier to address interference noise using beam forming methods that can exploit spatial diversity to better detect or extract desired source signals and to suppress unwanted interference. Beam forming methods represent a class of multichannel signal processing methods that perform spatial filtering, pointing a beam of increased sensitivity toward desired source locations while suppressing signals originating from all other locations. For these beam forming methods, the noise suppression is sufficient only when the signal source is close to the microphones (the near-field scenario); the problem can be more severe when the distance between source and microphones is greater.
System 100 may include a subband analysis (subband decomposition) module 110 connected to a number of input audio signal sources, such as microphones, e.g., microphone array 101, or other transducer or signal processor devices, each source corresponding to a channel, to receive time domain audio signals 102 for each channel. Subband analysis module 110 may transform the time-domain audio signals 102 into subband frames 112 in the frequency domain. Subband frames 112 may be provided to buffer 120 with delay that stores the last Lk subband frames 112 for each channel, where Lk is further described below.
Buffer 120 may provide the frequency domain subband frames 112 to variance estimator 130. Variance estimator 130 may estimate the variance of the current subband frame 112 as each subband frame 112 becomes current. The variance of a subband frame 112 may be used for prediction filter estimation and nonlinear filtering. The estimated variances 132 may be provided from the variance estimator 130 to prediction filter estimator 140.
Buffer 120 also may provide the frequency domain subband frames 112 to prediction filter estimator 140. Prediction filter estimator 140 may receive the variance 132 of the current subband frame 112 from variance estimator 130. Prediction filter estimator 140 may implement a fast-converging, adaptive, online (e.g., real-time) prediction filter estimation. A voice activity detector (VAD) 145 may receive the subband frames 112 as input and provide an output 136 to prediction filter estimator 140, thereby providing control over prediction filter estimator 140 in noisy environments. Linear filter 150 may apply the prediction filter estimation from prediction filter estimator 140 to subband frames 112 to reduce most of the reverberation from the source signal. Nonlinear filter 160 may be applied to the output of linear filter 150 to reduce the residual reverberation and noise. Synthesizer 170 may be applied to the output of nonlinear filter 160, transforming the enhanced subband frequency domain signals to time domain signals.
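For illustration, the following Python sketch wires the stages of system 100 together end to end, using an STFT pair for subband analysis and synthesis and a plain normalized-LMS placeholder for the prediction filter update. The adaptive step size, VAD control, and nonlinear filter described below are omitted here; all names and default values are assumptions of this sketch, not details from the disclosure.

```python
# Hedged end-to-end sketch of the system-100 chain (analysis -> buffer ->
# prediction filter -> linear filter -> synthesis); NLMS placeholder update.
import numpy as np
from scipy.signal import stft, istft

def dereverb_mimo(x, fs, K=512, Lk=10, delay=2, eta=0.05, eps=1e-6):
    """x: (M, N) multichannel time-domain input; returns de-reverberated
    time-domain signals, one per microphone (MIMO: M in, M out)."""
    _, _, X = stft(x, fs=fs, nperseg=K)          # (M, B, L) subband frames
    M, B, L = X.shape
    Z = X.copy()                                 # de-reverberated estimates
    g = np.zeros((M, B, M * Lk), dtype=complex)  # one filter per channel/bin
    for l in range(Lk + delay, L):
        past = X[:, :, l - delay - Lk + 1 : l - delay + 1]  # delayed buffer
        for b in range(B):                       # each bin is independent
            xt = past[:, b, :].reshape(-1)       # stacked past frames, all chans
            p = np.vdot(xt, xt).real + eps       # buffer power (normalization)
            for i in range(M):
                e = X[i, b, l] - np.vdot(g[i, b], xt)    # linear filtering
                g[i, b] += (eta / p) * np.conj(e) * xt   # LMS-style update
                Z[i, b, l] = e
    _, z = istft(Z, fs=fs, nperseg=K)            # subband synthesis per channel
    return z
```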
The observed subband signal X_i(l,k), for channel i, frame l, and frequency bin k, may be modeled as the sum of an early-reflection component Z_i(l,k) (also referred to as the direct-path or clean speech signal), a late-reverberation component, and additive noise.
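The model equation itself is not reproduced in this text. As an assumption for illustration, a common multichannel linear-prediction form consistent with the surrounding description is:

X_i(l,k) = Z_i(l,k) + R_i(l,k) + V_i(l,k), with R_i(l,k) ≈ g_i(k)^H x̃(l−Δ,k),

where R_i(l,k) is the late reverberation predicted linearly from the buffered past frames, V_i(l,k) is additive noise, Δ is a prediction delay, and x̃(l−Δ,k) stacks the L_k delayed past frames of all M channels (the delay Δ and the stacking convention are assumptions of this sketch).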
The total variance may be modeled as the sum σ(l,k) = σ_c(l,k) + σ_r(l,k) + σ_ν(l,k), where σ_c(l,k), σ_r(l,k), and σ_ν(l,k) are the variances, respectively, of the early reflections (also referred to as "clean speech"), the reverberation component, and the noise (equations (2)). The variance σ_i = σ(l,k) is assumed to be identical for each of the i channels, hence the subscript i is suppressed. As seen in equations (2), it is assumed that the early reflections and the noise have zero mean. The variance of early reflections σ_c(l,k) may be approximated, starting from zero values, using a previously estimated prediction filter.
As further disclosed in the co-pending application, the reverberation component variance σ_r(l,k) is estimated using fixed weights. The noise variance σ_ν(l,k) may be estimated using an efficient real-time single-channel method, and the noise variance estimates may be averaged over all the channels to obtain a single value for the noise variance σ_ν(l,k).
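As a rough illustration of the variance model, the sketch below combines the three components for one channel and frequency bin. The exponential decay weight gamma, the recursive noise tracker, and all names are assumptions standing in for the co-pending application's estimators, not the disclosure's actual equations.

```python
# Hedged sketch: per-frame, per-bin variance sigma = sigma_c + sigma_r + sigma_v.
import numpy as np

def estimate_variance(x_past, e_prev, sigma_v, gamma=0.6, alpha_n=0.98):
    """x_past:  (Lk,) past subband samples for one channel/bin (newest first)
    e_prev:  previous linear-filter output, used as an early-reflection proxy
    sigma_v: running noise-variance estimate for this bin (updated here)"""
    sigma_c = np.abs(e_prev) ** 2                  # early-reflection variance
    w = gamma ** np.arange(1, len(x_past) + 1)     # fixed exponential decay
    sigma_r = np.sum(w * np.abs(x_past) ** 2)      # reverberation variance
    # Placeholder single-channel noise tracker: slow recursive smoothing that
    # can drop quickly when the instantaneous power falls (illustrative only).
    p = np.abs(x_past[0]) ** 2
    sigma_v = min(alpha_n * sigma_v + (1.0 - alpha_n) * p, p)
    return sigma_c + sigma_r + sigma_v, sigma_v
```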
The recursive least squares (RLS) method has been used to adaptively estimate the optimum prediction filter in an online manner (e.g., in real time for online applications). Despite its efficiency and fast convergence, the RLS method requires a correlation matrix; in the multi-channel case with long prediction filters, which are important for capturing long correlations, it therefore cannot be deployed on embedded devices with memory restrictions. Also, the RLS method converges fast and deep, so that when the RIR changes due to speaker or source movement, it requires a longer time to converge to the new filters. The RLS-based solution is thus impractical for many applications that have memory limitations and changing environments.
According to one embodiment, a novel method based on least mean squares (LMS) estimation is used. In general, an LMS-based method does not have as fast a convergence rate as RLS, so a conventional LMS method cannot be used in time-varying environments. The novel method according to one embodiment calculates an adaptive step-size for the LMS solution to make its convergence comparable to that of RLS, while the LMS solution requires far less memory and can also react faster to sudden changes.
Using the adaptive LMS-based solution, the mean in equations (4) can be rewritten in vector form (equations (5)), where g_i(k) is the prediction filter for frequency band k and the i-th channel, and (⋅)* denotes the complex conjugate.
As disclosed in the co-pending application, the cost function can be simplified as given in equations (6).
In order to estimate g_i^{(l)}(k) in an online manner for the l-th frame, the filter should be initialized with zero values for all frequencies and channels, and the gradient ∇(L(X_i(l,k))) of the cost function given in equations (6), which is a vector of L_k·M values, should be computed. The update rule using the LMS method can be written as follows.
g_i^{(l)}(k) = g_i^{(l−1)}(k) − η∇(L(X_i(l,k))) (7),
where η is a fixed step-size and g_i^{(l)}(k) denotes the prediction filter at the l-th frame. The gradient ∇(L(X_i(l,k))) of the cost function in equations (6) may now be computed (equations (8)).
Although η is referred to here as a fixed step-size for purposes of illustrating the example, the step-size η need not be fixed and can be adaptively determined, based on values of the gradient, for example, in order to improve the performance of the LMS methods.
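To make the update rule of equation (7) concrete, a minimal per-bin implementation might look as follows. The variance normalization of the error follows the variance-normalized cost suggested by the surrounding text and is an assumption of this sketch, as are the function name and defaults.

```python
# Hedged sketch of the fixed-step LMS update of equation (7) for one
# channel i and frequency bin k.
import numpy as np

def lms_update(g, x_cur, x_buf, eta=0.01, sigma=1.0):
    """g:     (Lk*M,) current prediction filter g_i(k)
    x_cur: current subband sample X_i(l, k)
    x_buf: (Lk*M,) stacked delayed past frames across all channels
    sigma: variance estimate used to normalize the error (assumption)"""
    e = x_cur - np.vdot(g, x_buf)          # prediction error (de-reverbed)
    grad = -np.conj(e) * x_buf / sigma     # gradient of the normalized error
    return g - eta * grad, e               # equation (7): g <- g - eta * grad
```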
At act 403, the prediction filter is initialized to zero and the gradient of the cost function with respect to the prediction filter is computed. Equation (7) with an adaptive step-size η(l,k) can be rewritten as:
g_i^{(l)}(k) = g_i^{(l−1)}(k) − η(l,k)∇(L(X_i(l,k))) (9).
At act 404, the adaptive step-size η(l,k) is computed by dividing a sufficiently low base step-size (i.e., η_0) by a running average of the magnitudes of recent gradients (the smoothed root mean square (RMS) average of gradient magnitudes). Updating the prediction filter using the estimated gradient and the adaptive step-size proceeds at act 405. In the case of a large smoothed RMS average of gradients, the total value of the step-size will be low to avoid divergence; likewise, when the smoothed RMS average of gradients becomes small, the step-size will be increased to speed up the convergence.
At act 404, to compute the smoothed RMS average of gradients, a buffer G_i^{(l)}(k) of K values (corresponding to the number of frequency bands) for each channel i may store the values and may be initialized to zero. Each smoothed RMS average gradient G_i^{(l)}(k) may be updated as given in equations (10),
where ρ is a smoothing factor close to one and (⋅)^H denotes the conjugate (Hermitian) transpose.
The adaptive step-size η(l,k) can be calculated as given in equations (11),
where ε is a small value on the order of 10⁻⁶ to avoid division by zero, and η_0 is the fixed (initial) step-size.
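Equations (10) and (11) are not reproduced in this text. From the description of acts 404 and 405 (a smoothed RMS average of gradient magnitudes dividing a fixed base step-size), a plausible form, offered here as an assumption in the style of RMS-normalized step sizes, is:

G_i^{(l)}(k) = ρ·G_i^{(l−1)}(k) + (1 − ρ)·∇(L(X_i(l,k)))^H ∇(L(X_i(l,k))) (cf. (10)),

η(l,k) = η_0 / (√(G_i^{(l)}(k)) + ε) (cf. (11)),

so that large recent gradients shrink the step-size to avoid divergence, while small recent gradients enlarge it to speed up convergence.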
At act 405, the prediction filter is updated as given in (9) using (8), (10) and (11).
At act 406, the optimal filter weights may be passed to linear filter 150 and used to perform linear filtering of the subband frames 112, which are also passed to linear filter 150.
In the case of noisy conditions, the prediction filter (e.g., g_i^{(l)}(k)) may not concentrate only on reverberation; it may also target the relatively stationary noise. In that case, the prediction filter, if unmodified from the above description, will be estimated to reduce both the stationary noise and the reverberation. In some applications, however, it is not desirable to let the prediction filter be estimated to cancel the noise, as it is mainly designed to reduce the reverberation. In addition, in very non-stationary noisy conditions, the prediction filter may try to track the noise, which can change quite fast and will not allow the LMS method to converge, ultimately decreasing its de-reverberation performance.
To improve the performance of the LMS method in that case, method 500 supervises the LMS filter adaptation using an external voice activity detector (e.g., VAD 145). For example, the VAD 145 may be configured to produce a probability value between 0 and 1 that the target speech is active in frame l; the probability value is denoted by w(l) in the following equations. The cost function (see equations (6)) is modified by weighting each frame's contribution by w(l) (equations (12)).
This modified cost function leads to a corresponding modification of the gradient computation (equations (13)); a plausible form of both is sketched below.
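Equations (12) and (13) are not reproduced in this text; assuming the frame-wise weighting just described, one consistent form is:

L_w(X_i(l,k)) = w(l)·L(X_i(l,k)) (cf. (12)),

∇(L_w(X_i(l,k))) = w(l)·∇(L(X_i(l,k))) (cf. (13)).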
Because the values of w(l) are less than 1.0, equations (13) show that method 500 can decrease the amount of the update (see, e.g., equation (7)) in noisy frames, or even skip them if the values of w(l) are very small. Thus, using the modified cost function and gradient at act 504, method 500 may compute the prediction filter while controlling its updates to compensate for noisy environments.
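In code, the VAD supervision amounts to scaling (or skipping) the LMS update by the speech probability w(l); the skip threshold w_min below is an illustrative assumption, not a value from the disclosure.

```python
# Hedged sketch: LMS update supervised by a VAD speech probability w in [0, 1].
import numpy as np

def lms_update_vad(g, x_cur, x_buf, w, eta=0.01, sigma=1.0, w_min=0.05):
    e = x_cur - np.vdot(g, x_buf)           # de-reverberated estimate
    if w < w_min:                           # very noisy frame: skip the update
        return g, e
    grad = -w * np.conj(e) * x_buf / sigma  # VAD-weighted gradient, eq. (13)
    return g - eta * grad, e
```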
At act 505, the optimal filter weights may be passed to linear filter 150 and used to perform linear filtering of the subband frames 112, which are also passed to linear filter 150.
For some applications like ASR or VoIP, performance may be enhanced by performing operations to limit the amount of reverberation reduction by a parameter. At act 604, the prediction filter may be applied at linear filter 150 based on one or more parameters determined for controlling the amount of reduction of reverberation. At act 605, linear filter 150 may perform the linear filtering under control of the one or more parameters. For example, linear filtering may be performed by linear filter 150 using a tuning parameter α that controls the amount of de-reverberation, together with a smoothing factor β close to one and a small value ε_r (e.g., 10⁻⁶) to avoid division by zero.
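A minimal way to realize such control, sketched here as an assumption since the exact linear-filtering equations are not reproduced in this text, is to subtract only a fraction α of the predicted reverberation; the smoothing factor β and floor ε_r would enter any power normalization applied to the prediction, which is omitted from this sketch.

```python
# Hedged sketch: tunable amount of de-reverberation via tuning parameter alpha.
import numpy as np

def linear_filter_controlled(x_cur, x_buf, g, alpha=0.8):
    r = np.vdot(g, x_buf)        # predicted late reverberation g^H x_buf
    return x_cur - alpha * r     # alpha = 1: full removal; alpha = 0: bypass
```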
Following application of nonlinear filter 160, the enhanced speech spectrum for each band (e.g., Z_i(l,k)) may be transformed from the frequency domain to the time domain by applying subband synthesis to produce time-domain outputs z_i[n] (i = 1 ... M), where M is the number of microphones. For example, as described above, nonlinear filter 160 may be applied to the output of linear filter 150 to reduce the residual reverberation and noise, and synthesizer 170 may be applied to the output of nonlinear filter 160, transforming the enhanced subband frequency domain signals to time domain signals.
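For illustration, one common shape for such a nonlinear post-filter is a Wiener-like spectral gain built from the estimated variances; the disclosure's exact nonlinear filter is not reproduced here, and the gain floor g_min is an illustrative assumption.

```python
# Hedged sketch: Wiener-like gain suppressing residual reverberation and noise.
def nonlinear_filter(z_lin, sigma_c, sigma_r, sigma_v, g_min=0.1, eps=1e-12):
    gain = sigma_c / (sigma_c + sigma_r + sigma_v + eps)  # gain in [0, 1]
    return max(gain, g_min) * z_lin   # floor the gain to limit distortion
```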
As discussed, the various techniques provided herein may be implemented by one or more systems, which may include, in some embodiments, one or more subsystems and related components thereof. One example is system 700, described below.
As shown, system 700 includes one or more audio inputs 710 which may include, for example, an array of spatially distributed microphones configured to receive sound from an environment of interest. Analog audio input signals provided by audio inputs 710 are converted to digital audio input signals by one or more analog-to-digital (A/D) converters 715. The digital audio input signals provided by analog-to-digital converters 715 are received by a processing system 720.
As shown, processing system 720 includes a processor 725, a memory 730, a network interface 740, a display 745, and user controls 750. Processor 725 may be implemented as one or more microprocessors, microcontrollers, application specific integrated circuits (ASIC), programmable logic devices (PLD)—e.g., field programmable gate arrays (FPGA), complex programmable logic devices (CPLD), field programmable systems on a chip (FPSC), or other types of programmable devices—codecs, or other processing devices.
In some embodiments, processor 725 may execute machine readable instructions (e.g., software, firmware, or other instructions) stored in memory 730. In this regard, processor 725 may perform any of the various operations, processes, and techniques described herein. For example, in some embodiments, the various processes and subsystems described herein (e.g., system 100, methods 400, 500, and 600) may be effectively implemented by processor 725 executing appropriate instructions. In other embodiments, processor 725 may be replaced or supplemented with dedicated hardware components to perform any desired combination of the various techniques described herein.
Memory 730 may be implemented as a machine readable medium storing various machine readable instructions and data. For example, in some embodiments, memory 730 may store an operating system 732 and one or more applications 734 as machine readable instructions that may be read and executed by processor 725 to perform the various techniques described herein. Memory 730 may also store data 736 used by operating system 732 or applications 734. In some embodiments, memory 730 may be implemented as non-volatile memory (e.g., flash memory, hard drive, solid state drive, or other non-transitory machine readable media), volatile memory, or combinations thereof.
Network interface 740 may be implemented as one or more wired network interfaces (e.g., Ethernet) or wireless interfaces (e.g., WiFi, Bluetooth, cellular, infrared, radio) for communication over appropriate networks. For example, in some embodiments, the various techniques described herein may be performed in a distributed manner with multiple processing systems 720.
Display 745 presents information to the user of system 700. In various embodiments, display 745 may be implemented, for example, as a liquid crystal display (LCD) or an organic light emitting diode (OLED) display. User controls 750 receive user input to operate system 700 (e.g., to provide user-defined parameters as discussed or to select operations performed by system 700). In various embodiments, user controls 750 may be implemented as one or more physical buttons, keyboards, levers, joysticks, mice, or other physical transducers, graphical user interface (GUI) inputs, or other controls. In some embodiments, user controls 750 may be integrated with display 745 as a touchscreen, for example.
Processing system 720 provides digital audio output signals that are converted to analog audio output signals by one or more digital-to-analog (D/A) converters 755. The analog audio output signals are provided to one or more audio output devices 760, such as one or more speakers. Thus, system 700 may be used to process audio signals in accordance with the various techniques described herein to provide improved output audio signals and improved speech recognition performance.
Where applicable, various embodiments provided by the present disclosure may be implemented using hardware, software, or combinations of hardware and software. Also, where applicable, the various hardware components and/or software components set forth herein may be combined into composite components comprising software, hardware, and/or both without departing from the spirit of the present disclosure. Where applicable, the various hardware components and/or software components set forth herein may be separated into sub-components comprising software, hardware, or both without departing from the scope of the present disclosure. In addition, where applicable, it is contemplated that software components may be implemented as hardware components and vice-versa.
Software, in accordance with the present disclosure, such as program code and/or data, may be stored on one or more computer readable mediums. It is also contemplated that software identified herein may be implemented using one or more general purpose or specific purpose computers and/or computer systems, networked and/or otherwise. Where applicable, the ordering of various steps described herein may be changed, combined into composite steps, and/or separated into sub-steps to provide features described herein.
The foregoing disclosure is not intended to limit the present disclosure to the precise forms or particular fields of use disclosed. As such, it is contemplated that various alternate embodiments and/or modifications to the present disclosure, whether explicitly described or implied herein, are possible in light of the disclosure. Having thus described embodiments of the present disclosure, persons of ordinary skill in the art will recognize that changes may be made in form and detail without departing from the scope of the present disclosure. Thus, the present disclosure is limited only by the claims.
This application claims the benefit of and priority to U.S. Provisional Patent Application No. 62/438,848 filed Dec. 23, 2016, and entitled “MULTIPLE INPUT MULTIPLE OUTPUT (MIMO) AUDIO SIGNAL PROCESSING FOR SPEECH DE-REVERBERATION,” which is incorporated herein by reference in its entirety.