An embodiment of the invention relates generally to a system and method of speech enhancement using a deep neural network-based combined signal.
Currently, a number of consumer electronic devices are adapted to receive speech from a near-end talker (or environment) via microphone ports, transmit this signal to a far-end device, and concurrently output audio signals, including a far-end talker, that are received from a far-end device. While the typical example is a portable telecommunications device (mobile telephone), with the advent of Voice over IP (VoIP), desktop computers, laptop computers and tablet computers may also be used to perform voice communications.
When using these electronic devices, the user also has the option of using the speakerphone mode, at-ear handset mode, or a headset to receive his speech. However, a common complaint with any of these modes of operation is that the speech captured by the microphone port or the headset includes environmental noise, such as wind noise, secondary speakers in the background, or other background noises. This environmental noise often renders the user's speech unintelligible and thus, degrades the quality of the voice communication.
The embodiments of the invention are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment of the invention in this disclosure are not necessarily to the same embodiment, and they mean at least one. In the drawings:
In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown to avoid obscuring the understanding of this description.
In the description, certain terminology is used to describe features of the invention. For example, in certain situations, the terms “component,” “unit,” “module,” and “logic” are representative of hardware and/or software configured to perform one or more functions. For instance, examples of “hardware” include, but are not limited or restricted to an integrated circuit such as a processor (e.g., a digital signal processor, microprocessor, application specific integrated circuit, a micro-controller, etc.). Of course, the hardware may be alternatively implemented as a finite state machine or even combinatorial logic. An example of “software” includes executable code in the form of an application, an applet, a routine or even a series of instructions. The software may be stored in any type of machine-readable medium.
While not shown, the electronic device 10 may also be used with a headset that includes a pair of earbuds and a headset wire. The user may place one or both of the earbuds into his ears and the microphones in the headset may receive his speech. The headset 100 in
The microphone 120 may be an air interface sound pickup device that converts sound into an electrical signal. As the near-end user is using the electronic device 10 to transmit his speech, ambient noise may also be present. The microphone 120 therefore captures the near-end user's speech as well as the ambient noise around the electronic device 10; that is, the microphone 120 may receive at least one of: a near-end talker signal or an ambient near-end noise signal. The microphone 120 generates and transmits an acoustic signal.
The accelerometer 130 may be a sensing device that measures proper acceleration in three directions (X, Y, and Z) or in only one or two directions. When the user is generating voiced speech, the vibrations of the user's vocal cords are filtered by the vocal tract and cause vibrations in the bones of the user's head, which are detected by the accelerometer 130. In other embodiments, an inertial sensor, a force sensor or a position, orientation and movement sensor may be used in lieu of the accelerometer 130. The accelerometer 130 generates accelerometer audio signals (e.g., accelerometer signals), which may be band-limited microphone-like audio signals. For instance, in one embodiment, while the acoustic microphone 120 captures the full band, the accelerometer 130 may be sensitive to (and capture) frequencies between 20 Hz and 800 Hz. Similar to the microphone 120, the accelerometer 130 may also capture the near-end user's speech and the ambient noise around the electronic device 10. Thus, the accelerometer 130 receives at least one of: the near-end talker signal or the ambient near-end noise signal. The accelerometer 130 generates and transmits an accelerometer signal.
In one embodiment, the accelerometer signals being generated by the accelerometer 130 may provide a strong output signal during the near-end user's speech while not providing a strong output signal during ambient background noise. Accordingly, the accelerometer 130 provides additional information to the information provided by the microphone 120. However, the accelerometer signal may fail to capture the room impulse response, and the accelerometer 130 may also produce many artifacts, especially in wind and handling noise.
While not shown, in one embodiment, a beamformer may also be included in system 200 to receive the acoustic signals from a plurality of microphones 120 and create beams which can be steered to a given direction by emphasizing and deemphasizing selected microphones 120. Similarly, the beams can also exhibit or provide nulls in other given directions. Accordingly, the beamforming process, also referred to as spatial filtering, may be a signal processing technique using the acoustic signals from the microphones 120 for directional sound reception.
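As an illustration of the spatial filtering described above, the following is a minimal delay-and-sum sketch. It is not taken from this disclosure: the function name `delay_and_sum`, the integer sample delays, and the circular shift used for alignment are all assumptions chosen to keep the example short.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Align each microphone channel by an integer sample delay, then average.

    signals: array of shape (num_mics, num_samples)
    delays:  integer steering delays in samples, one per microphone
             (hypothetical values; a real system would derive them from
             the array geometry and the desired look direction)
    """
    signals = np.asarray(signals, dtype=float)
    aligned = np.empty_like(signals)
    for m, d in enumerate(delays):
        # np.roll shifts circularly, which is acceptable only for this toy example.
        aligned[m] = np.roll(signals[m], -int(d))
    return aligned.mean(axis=0)

# Toy usage: the same pulse reaches microphone 1 two samples after microphone 0.
pulse = np.zeros(16)
pulse[4] = 1.0
mics = np.stack([pulse, np.roll(pulse, 2)])
beam = delay_and_sum(mics, delays=[0, 2])        # steered at the source
unsteered = delay_and_sum(mics, delays=[0, 0])   # no alignment
```

With the matching delays the two channels add coherently and the pulse is preserved; without alignment the peak is halved, which is the emphasis and de-emphasis effect the paragraph refers to.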
When the power of the environmental noise is above a given threshold or when wind noise is detected in the microphone 120, the acoustic signals captured by the microphone 120 may not be adequate. Accordingly, in one embodiment of the invention, rather than only using the acoustic signal from the microphone 120, the system 200 includes a neural network 140 that receives both the acoustic signal from the microphone 120 and the accelerometer signal from the accelerometer 130 to generate a neural network-based combined signal. This neural network-based combined signal is a speech reference signal.
Current spectral blenders introduce artifacts due to stitching and combining the accelerometer signal and the acoustic signal. Accordingly, rather than performing spectral mixing of the output signals of the accelerometer 130 and the acoustic signals received from the microphone 120, the neural network 140 is trained offline to provide spatial localization of features, weight sharing and subsampling of hidden units. The offline training uses a training accelerometer signal from the accelerometer 130 and a training acoustic signal from the microphone 120, which are correlated and generated during clean speech segments.
The training accelerometer signals and training acoustic signals that are correlated during clean speech segments are used to train the neural network 140. In one embodiment, training signals include (i) 12 accelerometer energy bins and 64 bins of noisy input signals and (ii) 64 bins of clean microphone (acoustic) signals. The neural network 140 trains on these two time-frequency distributions, i.e., speech distributions and correlated accelerometer distributions. In one embodiment, a plurality of training accelerometer signals and a plurality of training acoustic signals are used to train the neural network 140 offline.
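The bin counts given above can be made concrete with a short sketch of how one training pair might be assembled. Concatenating the 12 accelerometer energy bins with the 64 noisy bins into a single 76-element input is an assumption made for illustration; the disclosure does not state how the two inputs are arranged. The helper name `make_training_pair` is likewise hypothetical.

```python
import numpy as np

NUM_ACCEL_BINS = 12  # accelerometer energy bins (from the text)
NUM_MIC_BINS = 64    # noisy-input and clean-target bins (from the text)

def make_training_pair(accel_energy, noisy_mic, clean_mic):
    """Return (inputs, targets) for a batch of frames.

    Assumption: the network input is the per-frame concatenation of the
    accelerometer bins and the noisy microphone bins.
    """
    accel_energy = np.asarray(accel_energy, dtype=float)
    noisy_mic = np.asarray(noisy_mic, dtype=float)
    clean_mic = np.asarray(clean_mic, dtype=float)
    if accel_energy.shape[-1] != NUM_ACCEL_BINS:
        raise ValueError("expected 12 accelerometer energy bins per frame")
    if noisy_mic.shape[-1] != NUM_MIC_BINS or clean_mic.shape[-1] != NUM_MIC_BINS:
        raise ValueError("expected 64 microphone bins per frame")
    inputs = np.concatenate([accel_energy, noisy_mic], axis=-1)
    return inputs, clean_mic

# Toy usage with random placeholder data (not real measurements).
rng = np.random.default_rng(0)
num_frames = 5
x, y = make_training_pair(
    rng.random((num_frames, NUM_ACCEL_BINS)),
    rng.random((num_frames, NUM_MIC_BINS)),
    rng.random((num_frames, NUM_MIC_BINS)),
)
```

Here `x` has shape (5, 76) and `y` has shape (5, 64), so each frame maps 76 input values to 64 clean-speech target values.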
In one embodiment, offline training of the neural network 140 may include exciting the accelerometer 130 and the microphone 120 using a training accelerometer signal and a training acoustic signal, respectively. The neural network 140 may select speech included in the training accelerometer signal and in the training acoustic signal and spatially localize the speech by setting a weight parameter in the neural network 140 based on the selected speech included in the training accelerometer signal and in the training acoustic signal.
Once the neural network 140 is trained offline, the neural network 140 may be used to generate the speech reference signal. The neural network 140 is, for example, a multilayer perceptron (MLP) neural network or a convolutional deep neural network (CDNN). The neural network 140 may also be a convolutional auto-encoder.
A typical deep neural network mapping function can be described by an equation of the following form:
$$X[n,k]_{i+1} = f\left(X[n,k]_{i}\,W_{i} + b_{i}\right) \tag{1}$$
Here ƒ denotes the nonlinear activation functions of the network (e.g., sigmoid, tanh, or ReLU), applied across multiple layers of connections, where the subscript i indexes the layer. W_i is the weight matrix and b_i is the bias for layer i. X[n,k] is the input to the network, i.e., X[n,k]_0 = X[n,k].
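Equation (1) can be sketched as a simple loop over layers. This is a minimal illustration only: the layer sizes, the random weights, and the choice of a sigmoid activation are assumptions, not values from this disclosure.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, one of the nonlinearities named in the text."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activation=np.tanh):
    """Apply X_{i+1} = f(X_i W_i + b_i) once per layer."""
    for W, b in zip(weights, biases):
        x = activation(x @ W + b)
    return x

# Hypothetical layer sizes: 76 inputs, one hidden layer of 32, 64 outputs.
rng = np.random.default_rng(0)
layer_sizes = [76, 32, 64]
weights = [
    0.1 * rng.standard_normal((n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

frames = rng.random((5, 76))  # X_0: five input frames of placeholder data
out = forward(frames, weights, biases, activation=sigmoid)
```

Each row of `frames` plays the role of X[n,k]_0 for one time index n, and `out` is the output of the final layer.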
In the CDNN embodiment, the input layer to the neural network 140 is a 2D map that includes spectrograms of the accelerometer signal and the microphone signals, with time on the x-axis and frequency on the y-axis. Feature maps are generated by convolving a section of the input layer with a kernel (K) using:
$$S[i,j] = (K * I)(i,j) = \sum_{m}\sum_{n} I[i-m,\,j-n]\,K[m,n] \tag{2}$$
S[i,j] is the output of this layer for one kernel (K).
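Equation (2) can be written directly as a small routine. The sketch below computes the "full" convolution with implicit zero padding, which is an assumption; the disclosure does not specify how edges are handled. The name `conv2d_full` is hypothetical.

```python
import numpy as np

def conv2d_full(I, K):
    """Compute S[i, j] = sum_m sum_n I[i - m, j - n] * K[m, n].

    Indices of I that fall outside the input are treated as zero, so the
    output has shape (H + kh - 1, W + kw - 1).
    """
    I = np.asarray(I, dtype=float)
    K = np.asarray(K, dtype=float)
    H, W = I.shape
    kh, kw = K.shape
    S = np.zeros((H + kh - 1, W + kw - 1))
    for m in range(kh):
        for n in range(kw):
            # Each kernel tap adds a shifted, scaled copy of the input.
            S[m:m + H, n:n + W] += K[m, n] * I
    return S

# Toy usage: a 2x3 "spectrogram" patch and a 2x2 diagonal-difference kernel.
I = np.arange(6).reshape(2, 3)
K = np.array([[1, 0], [0, -1]])
S = conv2d_full(I, K)
```

The double loop runs over kernel taps rather than output positions, which keeps the code short while remaining term-by-term identical to the double sum.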
The advantages of using a CDNN include (i) the sparse interactions needed in a CDNN, (ii) the ability to use the same parameters for more than one function in the network (i.e., parameter sharing), and (iii) the tight preservation of the geometric properties of the spectrum through the network (i.e., equivariant representations), which results from the special connections mapping each layer to a similar region of the spectral map.
In one embodiment, the neural network 140 maps two spectral plots, accelerometer and microphone, to clean output signals. The transformation can be viewed as a convolutional auto-encoding. Nonlinear principal component analysis (PCA)-like parameters make up the center of the neural network 140.
In one embodiment, the neural network 140 is a CDNN able to learn a nonlinear mapping function between the two transducers, along with the latent phonetic structures, which is similar to a bandwidth extension, needed for reconstructing the high frequency phones.
In one embodiment, the neural network 140 is a CDNN that is initialized using Restricted Boltzmann Machine (RBM) training. Thereafter, a suitable amount of training data at various signal-to-noise ratios (SNRs) is used to train the CDNN. In one embodiment, the input layer of the CDNN is fed magnitude spectra (and derivative signals) of the accelerometer signal and acoustic signal. The target signal to the CDNN during the training process may be the magnitude spectrum of the clean speech. While operating in the magnitude spectrum domain can greatly reduce the computational complexity of training and operating a CDNN, in another embodiment the input and output signals of the CDNN can include the real and imaginary parts of the complex spectra.
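For readers unfamiliar with the magnitude-spectrum domain, the following sketch shows one common way to obtain magnitude spectra from a time-domain signal. The Hann window, the frame length of 128 samples, the hop of 64 samples, and the 8 kHz sampling rate are all assumptions for illustration; none of them is specified in this disclosure.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=128, hop=64):
    """Return |STFT| with shape (num_frames, frame_len // 2 + 1)."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    # Discarding the phase leaves only the magnitude spectrum per frame.
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy usage: a 1 kHz tone sampled at an assumed rate of 8 kHz.
fs = 8000
t = np.arange(1024) / fs
tone = np.sin(2 * np.pi * 1000 * t)
spec = magnitude_spectrogram(tone)
```

With these settings each frequency bin is 62.5 Hz wide, so the tone falls in bin 16 of every frame. Keeping the complex output of `np.fft.rfft` instead of its absolute value would give the real and imaginary parts mentioned as the alternative input.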
Referring back to
The speech suppressor 150 receives the speech reference signal from the neural network 140 and the acoustic signal from the microphone 120 and generates a noise reference signal using spectral subtraction. The noise reference signal may be a noise spectral estimate.
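The spectral subtraction step can be illustrated with a minimal sketch. It assumes that subtraction is done on magnitude spectra and that negative results are clipped to a floor; both choices, and the name `noise_reference`, are assumptions rather than details from this disclosure.

```python
import numpy as np

def noise_reference(mic_mag, speech_mag, floor=0.0):
    """Estimate the noise magnitude per bin as max(|Y| - |S|, floor).

    mic_mag:    magnitude spectrum of the acoustic (microphone) signal
    speech_mag: magnitude spectrum of the speech reference signal
    """
    mic_mag = np.asarray(mic_mag, dtype=float)
    speech_mag = np.asarray(speech_mag, dtype=float)
    # Clipping prevents negative magnitudes when the speech estimate
    # exceeds the microphone magnitude in a bin.
    return np.maximum(mic_mag - speech_mag, floor)

# Toy usage with made-up magnitudes for four frequency bins.
mic = np.array([1.0, 0.8, 0.5, 0.2])
speech = np.array([0.7, 0.9, 0.1, 0.2])
noise = noise_reference(mic, speech)
```

In this toy case the second bin is clipped to zero because the speech reference is larger than the microphone magnitude there.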
A typical speech suppressor could be described with the following equation:
where I_n( ) is the modified Bessel function (MBF) of order n and where ν_k is defined as follows:
The function ζ_k is the a priori signal-to-noise ratio (SNR) and the function γ_k is the a posteriori SNR. They are given by:
where the a priori signal-to-noise ratio is computed using the clean speech estimated from the output of the DNN, i.e., X[n,k]_N, where N denotes the final layer. Note that in the EM-type noise suppressor, if used for speech suppression, X[n,k]_N plays the role of the unwanted "noise signal". In the speech suppressor, the noise power is computed directly from the microphone signal. The speech suppressor, as the name implies, removes speech from the microphone signal and outputs a signal dominated by background noise.
The output of the speech suppressor is fed into a multichannel noise suppressor described with the following equation:
where I_n( ) is the modified Bessel function (MBF) of order n and where ν_k is defined as follows:
The function ζ_k is the a priori signal-to-noise ratio (SNR) and the function γ_k is the a posteriori SNR. They are given by:
In this noise suppression stage, the a priori SNR is computed using the clean speech signal as estimated by the DNN, i.e., X[n,k]_N, and the noise estimate output by the speech suppressor.
The noise suppressor 160 receives the acoustic signal from the microphone 120, the noise reference signal from the speech suppressor 150, and the speech reference signal from the neural network 140 and generates an enhanced speech signal. In one embodiment, the noise reference signal is fed into a noise suppressor based on the Ephraim and Malah suppression rule, which is optimal in the minimum mean-square error sense and yields a colorless residual error. In some embodiments, the noise suppressor 160 is a multi-channel noise suppressor. In this embodiment, since the noise removal is carried out with a multi-channel noise suppressor, artifacts of spectral blending are never introduced.
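The role of the two SNR quantities can be illustrated with a simplified sketch. It uses the common textbook definitions (a priori SNR as the ratio of speech power to noise power, a posteriori SNR as the ratio of observed power to noise power) and, for brevity, substitutes a Wiener gain for the Ephraim and Malah rule. Both the definitions as coded and the substituted gain are assumptions and do not reproduce the equations of this disclosure.

```python
import numpy as np

def snr_estimates(mic_mag, speech_mag, noise_mag, eps=1e-12):
    """Return (a priori SNR, a posteriori SNR) per frequency bin.

    speech_mag stands in for the DNN clean-speech estimate and noise_mag
    for the noise reference from the speech suppressor.
    """
    noise_pow = np.square(np.asarray(noise_mag, dtype=float)) + eps
    a_priori = np.square(np.asarray(speech_mag, dtype=float)) / noise_pow
    a_posteriori = np.square(np.asarray(mic_mag, dtype=float)) / noise_pow
    return a_priori, a_posteriori

def wiener_gain(a_priori):
    """Simplified stand-in gain; NOT the Ephraim and Malah rule."""
    return a_priori / (1.0 + a_priori)

# Toy usage with made-up magnitudes for three frequency bins.
mic = np.array([2.0, 1.0, 1.0])
speech = np.array([2.0, 1.0, 0.0])
noise = np.array([1.0, 1.0, 1.0])
a_priori, a_posteriori = snr_estimates(mic, speech, noise)
enhanced = wiener_gain(a_priori) * mic
```

Bins where the speech estimate dominates keep most of their magnitude, while the bin with no estimated speech is fully attenuated. The main signal path remains the microphone signal; the speech and noise references only shape the gain.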
The SNR detector 170 receives the enhanced speech signal from the noise suppressor 160, the noise reference signal from the speech suppressor 150 and the acoustic signal from the microphone 120 to generate an SNR information signal.
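One simple way such an SNR figure might be computed is sketched below. The disclosure does not give the detector's formula, so the ratio of enhanced-speech energy to noise-reference energy in decibels is purely an assumption, as is the name `snr_db`; the sketch also ignores the acoustic signal input for brevity.

```python
import numpy as np

def snr_db(enhanced_mag, noise_mag, eps=1e-12):
    """Return 10*log10(speech energy / noise energy) for one frame."""
    speech_pow = np.sum(np.square(np.asarray(enhanced_mag, dtype=float)))
    noise_pow = np.sum(np.square(np.asarray(noise_mag, dtype=float)))
    # eps guards against division by zero and log of zero.
    return 10.0 * np.log10((speech_pow + eps) / (noise_pow + eps))

# Toy usage: speech energy 100 times the noise energy.
value = snr_db(np.array([10.0, 0.0]), np.array([1.0, 0.0]))
```

For the toy input the result is approximately 20 dB.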
The neural network training unit 180 receives the SNR information signal from the SNR detector 170, generates an update signal based on the SNR information signal, and transmits the update signal to the neural network 140 to cause updates to the weight parameter in the neural network 140. In one embodiment, the neural network training unit 180 causes in-the-field weight updates to the neural network.
In
Given that the systems 200, 300, 400, in
Moreover, accelerometer 130 related artifacts are also suppressed due to nonlinear mapping of accelerometer signals into noise spectrum and further, when the noise suppressor 160 is a multi-channel noise suppressor. The accelerometer-microphone misadjustments in gain and impulse response are also removed, since the accelerometer 130 is being used as a more robust speech detector rather than as a better speech source, and the main signal path is the acoustic signal from the microphone 120. The decision to combine the accelerometer signal as a speech reference or in turn noise reference is trained into the neural network 140 (e.g., CDNN), which further requires minimal manual adjustments (user/developer level tunings).
The following embodiments of the invention may be described as a process, which is usually depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed. A process may correspond to a method, a procedure, etc.
The method 500 starts at Block 501 by training a neural network offline. In one embodiment, training the neural network offline includes: (i) exciting at least one accelerometer and at least one microphone using a training accelerometer signal and a training acoustic signal, respectively. The training accelerometer signal and the training acoustic signal are correlated during clean speech segments. Training the neural network offline also includes (ii) selecting speech included in the training accelerometer signal and in the training acoustic signal, and (iii) spatially localizing the speech by setting a weight parameter in the neural network based on the selected speech included in the training accelerometer signal and in the training acoustic signal. At Block 502, the neural network that has been trained offline generates a speech reference signal based on an accelerometer signal from the at least one accelerometer and an acoustic signal received from the at least one microphone. In one embodiment, the neural network generates the speech reference signal based on the weight parameter set in the neural network. The neural network provides spatial localization of features, weight sharing and subsampling of hidden units. In one embodiment, the speech reference signal includes at least one of: speech presence probabilities, artificial speech or artificial speech magnitude.
At Block 503, a speech suppressor generates a noise reference signal using spectral subtraction of the speech reference signal from the acoustic signal. At Block 504, a noise suppressor generates an enhanced speech signal using the acoustic signal, the noise reference signal, and the speech reference signal.
In one embodiment, the neural network may be updated in-the-field. In this embodiment, an SNR detector generates an SNR information signal using the enhanced speech signal, the noise reference signal, and the acoustic signal. A neural network training unit generates an update signal based on the SNR information signal and transmits the update signal to the neural network. The neural network may update the weight parameter based on the update signal. In one embodiment, the neural network training unit causes in-the-field weight updates to the neural network.
Keeping the above points in mind,
An embodiment of the invention may be a machine-readable medium having stored thereon instructions which program a processor to perform some or all of the operations described above. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), such as Compact Disc Read-Only Memory (CD-ROMs), Read-Only Memory (ROMs), Random Access Memory (RAM), and Erasable Programmable Read-Only Memory (EPROM). In other embodiments, some of these operations might be performed by specific hardware components that contain hardwired logic. Those operations might alternatively be performed by any combination of programmable computer components and fixed hardware circuit components.
While the invention has been described in terms of several embodiments, those of ordinary skill in the art will recognize that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting. There are numerous other variations to different aspects of the invention described above, which in the interest of conciseness have not been provided in detail. Accordingly, other embodiments are within the scope of the claims.
Number | Date | Country | |
---|---|---|---|
20180033449 A1 | Feb 2018 | US |