The present application is based on and claims the benefit of priority to Chinese Application No. 201910344914.2, filed Apr. 26, 2019, the entire content of which is incorporated herein by reference.
The present disclosure relates to systems and methods for audio signal processing, and more particularly, to systems and methods for enhancing an audio signal by reconstructing the component signals separated from that audio signal.
Speech recognition technologies have recently been applied in many areas. Compared to earlier applications such as automated telephone systems and medical dictation software, recent applications of speech recognition have changed the way people interact with their devices, homes, and cars.
To obtain a satisfactory speech recognition result, it is essential to have a high-quality audio signal as the input to a speech recognition system. In the real world, however, an acquired audio signal is usually a mixture of signals from multiple audio sources. For example, a speech recognition system may receive a mixed audio signal including human speech and environmental noise. The speech signal typically comes from a point audio source, while the noise comes from diffuse sound sources, e.g., natural sources such as echo, wind, and waves, as well as unnatural sound sources. To enhance the quality of the audio signal, separating the speech signal from the noise is desirable.
Blind source separation (BSS) is a technique for separating specific sources from a sound mixture without prior information, e.g., signal statistics or source locations. For example, independent component analysis (ICA) is one of the most commonly used BSS methods. Nonnegative matrix factorization (NMF), on the other hand, is a popular dimension-reduction technique employed for the non-subtractive, parts-based representation of nonnegative data such as speech magnitude or power spectra. In particular, multi-channel nonnegative matrix factorization (MNMF) was developed to use spatial covariance to model the mixing conditions of the recording environment. Furthermore, post-processing is often deployed after multi-channel speech enhancement to further reduce interference. Conventional post-processing methods include single-channel-based methods and adaptive-filter-based methods.
However, while these conventional separation and post-processing methods yield good performance in point-source separation, they are often insufficient to suppress diffuse interference, so techniques to enhance audio signals corrupted by diffuse sources need to be improved. Reducing diffuse noise in an audio signal and improving speech perceptual quality can greatly increase the accuracy of speech recognition results.
Embodiments of the disclosure address the above problems by providing methods and systems for enhancing audio signals.
Embodiments of the disclosure provide a system for enhancing audio signals. The system may include a communication interface configured to receive multi-channel audio signals acquired from a common signal source. The system may further include at least one processor. The at least one processor may be configured to separate the multi-channel audio signals into a first audio signal and a second audio signal in a time domain. The at least one processor may be further configured to decompose the first audio signal and the second audio signal in a frequency domain to obtain first decomposition data and second decomposition data, respectively. The at least one processor may be also configured to estimate a noise component in the frequency domain based on the first decomposition data and the second decomposition data. The at least one processor may be additionally configured to enhance the first audio signal based on the estimated noise component. The system may also include a speaker configured to output the enhanced first audio signal.
Embodiments of the disclosure also provide a method for enhancing audio signals. The method may include receiving, by a communication interface, multi-channel audio signals acquired from a common signal source. The method may further include separating, by at least one processor, the multi-channel audio signals into a first audio signal and a second audio signal in a time domain. The method may also include decomposing, by the at least one processor, the first audio signal and the second audio signal in a frequency domain to obtain first decomposition data and second decomposition data, respectively. The method may additionally include estimating, by the at least one processor, a noise component in the frequency domain based on the first decomposition data and the second decomposition data. The method may also include enhancing, by the at least one processor, the first audio signal based on the estimated noise component.
Embodiments of the disclosure further provide a non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more processors, cause the one or more processors to perform a method for enhancing audio signals. The method may include receiving multi-channel audio signals acquired from a common signal source. The method may further include separating the multi-channel audio signals into a first audio signal and a second audio signal in a time domain. The method may also include decomposing the first audio signal and the second audio signal in a frequency domain to obtain first decomposition data and second decomposition data, respectively. The method may additionally include estimating a noise component in the frequency domain based on the first decomposition data and the second decomposition data. The method may also include enhancing the first audio signal based on the estimated noise component.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
In some embodiments, an audio processing system and method are disclosed to reduce interference after multi-channel speech enhancement (MSE) algorithms, including but not limited to MNMF. For example, MNMF may be performed to separate the inputs into separated speech and interference channels. Speech and interference basis matrices are obtained from the corresponding channels. First, the speech component is removed from the interference bases in order to prevent speech distortion. The interference bases are then used to reconstruct the MNMF-separated speech spectra under multiplicative update (MU) rules, where only the activation matrix is updated. Since the interference bases exclude the speech component, a large distance between the reconstructed and the original speech spectra should exist in regions where speech energy is concentrated, such as harmonics or unvoiced speech.
Consistent with the present disclosure, acquisition device 110 may acquire audio signals from audio source 101. In some embodiments, audio source 101 may be a person who gives a speech in a noisy environment, a speaker that plays a speech, an audio book, a news broadcast, or a song in the noisy environment, etc. In some embodiments, acquisition device 110 may be a microphone device, a sound recorder, or the like. In some embodiments, acquisition device 110 may be a standalone audio receiving device or part of another device, such as a mobile phone, a wearable device, a headphone, a vehicle, a surveillance system, etc.
In some embodiments, acquisition device 110 may be configured to receive multi-channel signals, including, e.g., a first-channel signal 103 of a first channel and a second-channel signal 105 of a second channel. For example, acquisition device 110 may include two or more acquisition channels, or two or more individual acquisition units. In some embodiments, the audio signal of each channel includes human speech and diffuse noise. Server 120 may receive the multi-channel audio signals from acquisition device 110, and then reduce the noise in the audio signals and enhance their quality. Server 120 may transform and decompose the two audio signals to obtain an enhanced speech signal based on an estimated noise component.
In some embodiments, as shown in
Communication interface 102 may send data to and receive data from components such as speaker 130 and acquisition device 110 via communication cables, a Wireless Local Area Network (WLAN), a Wide Area Network (WAN), wireless networks such as radio waves, a cellular network, and/or a local or short-range wireless network (e.g., Bluetooth™), or other communication methods. In some embodiments, communication interface 102 can be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection. As another example, communication interface 102 can be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links can also be implemented by communication interface 102. In such an implementation, communication interface 102 can send and receive electrical, electromagnetic or optical signals that carry digital data streams representing various types of information via a network.
Consistent with some embodiments, communication interface 102 may receive multi-channel audio data such as first-channel signal 103 and second-channel signal 105 of two channels acquired by acquisition device 110.
Communication interface 102 may further provide the received data to storage 108 for storage or to processor 104 for processing. Communication interface 102 may also receive an enhanced audio signal generated by processor 104, and provide the enhanced audio signal to a local speaker or any remote speaker (e.g., speaker 130) via a network.
Processor 104 may include any appropriate type of general-purpose or special-purpose microprocessor, digital signal processor, or microcontroller. Processor 104 may be configured as a separate processor module dedicated to enhancing audio signals. Alternatively, processor 104 may be configured as a shared processor module for performing other functions unrelated to audio signal enhancement.
As shown in
Consistent with some embodiments, signal separation unit 142 may be configured to separate the multi-channel audio signals (e.g., first-channel signal 103 and second-channel signal 105) into a first audio signal and a second audio signal. In some embodiments, a blind source separation (BSS) method may be performed for separating the speech and interference channel signals. Blind source separation is a technique for separating specific sources from a sound mixture without prior information, e.g., signal statistics or source locations.
In some embodiments, a multi-channel nonnegative matrix factorization (MNMF) algorithm is employed for the blind source separation. MNMF utilizes a spatial covariance to model the mixing conditions of the recording environment. Under an assumption of instantaneous mixing in the frequency domain, MNMF with a rank-1 spatial model can be implemented for the separation task. For example, as shown in
By utilizing information between channels, MNMF clusters the decomposed bases into specific sources in a blind situation. In some embodiments, using the rank-1 MNMF algorithm, most of the speech component goes to the separated speech channel. However, the separation between speech and noise is not complete. In particular, rank-1 MNMF leaves residual interference in the separated speech channel, and some speech component may leak into the separated interference channel. As a result, the speech channel signal may consist mainly of the speech signal but also include some noise, while the interference channel signal may consist largely of noise but include a small amount of speech. That is, in general, the speech signal ratio of the speech channel signal is higher than that of the interference channel signal. Consistent with some embodiments, a first speech signal ratio of the speech channel signal is higher than a first threshold, a second speech signal ratio of the interference channel signal is lower than a second threshold, and the second threshold is smaller than the first threshold. It is contemplated that other blind source separation methods may also be used to separate the multi-channel audio signals and achieve the same or similar separation results.
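As one hedged illustration of the separation stage (not the patent's own code), rank-1 MNMF is known to be algorithmically equivalent to independent low-rank matrix analysis (ILRMA), an implementation of which ships with the open-source pyroomacoustics package. The file name, STFT size, and iteration counts below are assumptions for illustration only:

```python
import numpy as np
from scipy.io import wavfile
import pyroomacoustics as pra
from pyroomacoustics.transform import stft

# Load a two-channel recording (hypothetical file name).
fs, audio = wavfile.read("mixture_stereo.wav")   # audio: (n_samples, 2)
audio = audio.astype(np.float64)

# STFT analysis per channel: X has shape (n_frames, n_freq, n_channels).
L, hop = 2048, 1024
X = stft.analysis(audio, L=L, hop=hop)

# ILRMA (equivalent to rank-1 MNMF) blindly clusters NMF bases into sources.
Y = pra.bss.ilrma(X, n_iter=30, n_components=4, proj_back=True)

# Back to the time domain: one column per separated source.
y = stft.synthesis(Y, L=L, hop=hop)
```

Which separated output is the speech channel is not known a priori; in practice it can be identified afterward, e.g., via the speech signal ratios discussed above.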
In some embodiments, the remaining units of processor 104, including NMF decomposition unit 144, noise estimation unit 146 and signal enhancing unit 148, may implement postprocessing module 160 of
NMF decomposition unit 144 may be further configured to decompose each Fourier-transformed audio signal using NMF to obtain an NMF basis matrix and an activation matrix. The NMF algorithm is a dimension-reduction technique that aims to factorize a nonnegative matrix X ∈ R^{I×J} into a product of two nonnegative matrices, X ≈ TV, where T ∈ R^{I×b} contains the spectral bases and V ∈ R^{b×J} contains the temporal activations; I and J denote the numbers of frequency bins and time frames, respectively, and b is the number of basis vectors. Typically, b(I+J) < I×J. T and V are chosen to minimize some divergence metric d(X, TV).
Consistent with some embodiments, each basis matrix T may consist of a speech basis matrix T_s and a noise basis matrix T_n, i.e., T = [T_s T_n], while the corresponding activation matrix V = [V_s V_n]′, with ′ denoting matrix transpose. In a training stage, T_s and T_n can be trained separately with clean speech and noise data, respectively. At the speech enhancement stage, the basis matrix may be fixed and only the activation matrix is updated. In some embodiments, once the algorithm converges, an optimal spectral gain G, i.e., a Wiener gain, may be determined based on the speech and noise estimates derived from the NMF analysis, e.g., according to equation (1).
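Equation (1) itself is not reproduced in this text. In standard NMF-based speech enhancement, the Wiener-type gain is commonly written in the following elementwise form, which presumably corresponds to equation (1); this is a reconstruction based on common practice, not the patent's own equation:

$$G = \frac{T_s V_s}{T_s V_s + T_n V_n}$$

where the division is elementwise and the enhanced spectrum is obtained by multiplying G elementwise with the noisy spectrum X.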
In some embodiments, the basis and activation matrices may be updated in an MU procedure according to some cost function. For example, three special instances of the β-divergence may be applied as metrics in the MU rules: the Euclidean distance (β = 2), the Kullback-Leibler (KL) divergence (β = 1), and the Itakura-Saito (IS) divergence (β = 0). In some embodiments, the MU rules update the basis matrix T and the activation matrix V alternately according to equations (2) and (3).
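Equations (2) and (3) are likewise not reproduced here. The standard β-divergence multiplicative updates, which match the description above and presumably correspond to equations (2) and (3), are:

$$T \leftarrow T \odot \frac{\big(X \odot (TV)^{\beta-2}\big)\,V'}{(TV)^{\beta-1}\,V'} \qquad (2)$$

$$V \leftarrow V \odot \frac{T'\big(X \odot (TV)^{\beta-2}\big)}{T'\,(TV)^{\beta-1}} \qquad (3)$$

where ⊙, the fraction bar, and the powers are elementwise. Because every factor in each update is nonnegative, T and V remain nonnegative throughout the iterations.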
For example, as shown in
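As a concrete illustration of this decomposition, below is a minimal numpy sketch of the KL-divergence (β = 1) multiplicative updates applied to a magnitude spectrogram. The matrix sizes, iteration count, and random initialization are assumptions, and the spectrogram here is a placeholder:

```python
import numpy as np

def nmf_kl(X, b=16, n_iter=100, eps=1e-12, rng=None):
    """Factorize a nonnegative spectrogram X (I x J) as X ~= T @ V
    using Kullback-Leibler multiplicative updates (beta = 1)."""
    rng = np.random.default_rng(rng)
    I, J = X.shape
    T = rng.random((I, b)) + eps     # spectral bases
    V = rng.random((b, J)) + eps     # temporal activations
    ones = np.ones_like(X)
    for _ in range(n_iter):
        R = T @ V + eps
        T *= ((X / R) @ V.T) / (ones @ V.T + eps)   # update bases
        R = T @ V + eps
        V *= (T.T @ (X / R)) / (T.T @ ones + eps)   # update activations
    return T, V

# Example: decompose the magnitude spectrum of the separated speech channel.
X_speech = np.abs(np.random.randn(513, 200))   # placeholder for |STFT|
T1, V1 = nmf_kl(X_speech, b=16)
```

The same routine would be applied to the separated interference channel to obtain the second NMF basis matrix.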
Noise estimation unit 146 may be configured to obtain modified NMF interference bases in the frequency domain based on the first decomposition data (e.g., the NMF speech bases) and the second decomposition data (e.g., the NMF interference bases). In some embodiments, a third NMF basis matrix corresponding to a noise signal is generated based on a first NMF basis matrix and a second NMF basis matrix.
Generally, the basis matrix represents the frequency structure of the signal (e.g., the harmonics of speech). In the separated speech channel, the speech-related bases are expected to have larger values. Frequency sub-bands are labeled as speech if the corresponding speech-related basis values exceed some pre-defined threshold. Accordingly, elements of the first NMF basis matrix exceeding a third threshold are considered attributable to a speech component.
The corresponding elements of the second NMF basis matrix are then substituted with a predetermined value. The overwritten second NMF basis matrix is saved as a third NMF basis matrix. In some embodiments, in the separated interference channel, the NMF basis matrix elements within the frequency sub-bands labeled above can be set to zero. For example, as shown in
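A minimal sketch of this masking step, assuming the threshold is applied elementwise and the predetermined value is zero (both consistent with the description above):

```python
import numpy as np

def mask_interference_bases(T1, T2, threshold):
    """Zero out elements of the interference bases T2 wherever the
    corresponding speech-channel bases T1 indicate speech energy."""
    speech_mask = T1 >= threshold   # elements attributable to speech
    T3 = T2.copy()
    T3[speech_mask] = 0.0           # predetermined value: 0
    return T3
```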
Noise estimation unit 146 may be further configured to obtain an estimated noise component (e.g., the reconstructed speech spectrum). In some embodiments, the third NMF basis matrix may be used to reconstruct the first audio signal, by implementing, e.g., module 164 in
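A sketch of the reconstruction under MU rules with the bases held fixed; the KL divergence and the iteration count are assumptions, and only the activation matrix is updated, as described above:

```python
import numpy as np

def reconstruct_with_fixed_bases(X, T3, n_iter=50, eps=1e-12, rng=None):
    """Reconstruct spectrum X (I x J) from fixed bases T3 (I x b) by
    updating only the activations V under KL multiplicative updates."""
    rng = np.random.default_rng(rng)
    V = rng.random((T3.shape[1], X.shape[1])) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        R = T3 @ V + eps
        V *= (T3.T @ (X / R)) / (T3.T @ ones + eps)
    return T3 @ V   # reconstructed, noise-dominated spectrum X_hat
```

Because T3 excludes the speech-labeled bins, the reconstruction cannot track the speech-dominated regions of X, which is what produces the large distances discussed above.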
In some embodiments, speech signal enhancing unit 148 may be configured to calculate the Euclidean distances between elements of the Fourier-transformed first audio signal and the corresponding elements of the estimated noise component in the frequency domain. For example, speech signal enhancing unit 148 may implement a module 165 of
In some embodiments, speech signal enhancing unit 148 may be further configured to adjust the elements of the Fourier-transformed first audio signal by gains determined based on the respective Euclidean distances. For example, speech signal enhancing unit 148 may implement a module 166 of
d_{i,j} = |X_{i,j} − X̂_{i,j}|   (5)

d_{i,j} = d_{i,j} / ‖d‖   (6)

where X_{i,j} denotes an element of the Fourier-transformed first audio signal, X̂_{i,j} denotes the corresponding element of the estimated noise component, and equation (6) normalizes the distances.
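A sketch of equations (5) and (6) together with one plausible distance-to-gain mapping; the sigmoid parameters and the rescaling before the sigmoid are assumptions rather than values given by the disclosure:

```python
import numpy as np

def distance_gain(X, X_hat, alpha=10.0, delta=0.5, eps=1e-12):
    """Per-element Euclidean distance between the original spectrum X and
    the reconstruction X_hat, normalized, then mapped to a gain in [0, 1]."""
    d = np.abs(X - X_hat)                  # equation (5)
    d = d / (np.linalg.norm(d) + eps)      # equation (6): normalization
    d = d / (d.max() + eps)                # rescale to [0, 1] (assumption)
    return 1.0 / (1.0 + np.exp(-alpha * (d - delta)))  # sigmoid-like gain
```

Bins where the reconstruction fails to match the original spectrum (large d, i.e., speech-dominated regions) receive gains near 1, while well-reconstructed, noise-dominated bins are attenuated.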
In some embodiments, speech signal enhancing unit 148 may perform an inverse Fourier transform on the adjusted Fourier-transformed first audio signal to obtain an enhanced audio signal in the time domain.
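A sketch of this final step using scipy's STFT pair; the sampling rate, window length, and placeholder signals are assumptions, and the gain matrix would come from the preceding steps:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
x = np.random.randn(2 * fs)                    # placeholder separated speech signal
f, t, X = stft(x, fs=fs, nperseg=512)          # Fourier transform of the first audio signal
gain = np.ones_like(np.abs(X))                 # placeholder gains from the previous step
X_enh = gain * X                               # adjust elements by the gains
_, x_enh = istft(X_enh, fs=fs, nperseg=512)    # enhanced signal in the time domain
```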
Memory 106 and storage 108 may include any appropriate type of mass storage provided to store any type of information that processor 104 may need to operate. Memory 106 and storage 108 may be a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other type of storage device or tangible (i.e., non-transitory) computer-readable medium including, but not limited to, a ROM, a flash memory, a dynamic RAM, and a static RAM. Memory 106 and/or storage 108 may be configured to store one or more computer programs that may be executed by processor 104 to perform noise reducing and audio signal enhancing functions disclosed herein. For example, memory 106 and/or storage 108 may be configured to store program(s) that may be executed by processor 104 to enhance an audio signal acquired from an audio source.
Memory 106 and/or storage 108 may be further configured to store information and data used by processor 104. For instance, memory 106 and/or storage 108 may be configured to store the various types of data (e.g., audio signals, metadata, etc.) acquired by acquisition device 110. Memory 106 and/or storage 108 may also store intermediate data such as machine learning models, thresholds, and parameters, etc. The various types of data may be stored permanently, removed periodically, or disregarded immediately after each audio signal is processed.
Speaker 130 may be configured to output an enhanced audio signal received from communication interface 102. Speaker 130 may be connected to a speech recognition system as an audio input device. In some embodiments, speaker 130 may be a standalone audio output device or part of another device, such as a mobile phone, a wearable device, a headphone, a vehicle, a surveillance system, etc.
In step S202, a multi-channel audio signal is received from acquisition device 110. In some embodiments, acquisition device 110 may include at least two acquisition channels, or at least two individual acquisition units, to acquire multi-channel audio signals, such as first-channel signal 103 and second-channel signal 105. For example, a speech may be acquired by acquisition device 110 through different microphones in a noisy stadium environment. In some embodiments, both channel signals 103 and 105 are mixtures of speech signals and environmental noise signals. The audio information acquired through multiple channels can later be utilized for blind source separation. Acquisition device 110 sends first-channel signal 103 and second-channel signal 105 to communication interface 102.
In step S204, processor 104 uses a blind source separation method to separate the multi-channel audio signals acquired from audio source 101. In some embodiments, multi-channel NMF (MNMF), a natural extension of the simple NMF method to multi-channel signals, may be used to separate the multi-channel audio signals. By utilizing information between channels (e.g., first-channel signal 103 and second-channel signal 105), MNMF can cluster the decomposed bases into specific sources in the blind situation. As shown in the example of
Referring back to
If NMF decomposition unit 144 shown in
Referring back to
For example, a first NMF basis matrix T1 and a second NMF basis matrix T2 may be 3-by-3 matrices, e.g., T1 = [a11 a12 a13; a21 a22 a23; a31 a32 a33] and T2 = [b11 b12 b13; b21 b22 b23; b31 b32 b33], where a13 is the element of T1 in the first row and third column, and a22 is the element of T1 in the second row and second column. Suppose the values of a13 and a22 exceed the third threshold, while the other elements of T1 do not, and the predetermined value is set to 0. The third NMF basis matrix is then T3 = [b11 b12 0; b21 0 b23; b31 b32 b33]. In step S406 shown in
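The example can be checked numerically with a trivial sketch (the values below are arbitrary and merely reproduce the pattern above):

```python
import numpy as np

T1 = np.array([[1.0, 2.0, 9.0],
               [3.0, 8.0, 1.0],
               [2.0, 1.0, 3.0]])          # a13 = 9 and a22 = 8 exceed the threshold
T2 = np.arange(1.0, 10.0).reshape(3, 3)   # b11 .. b33
threshold = 5.0                           # the "third threshold"
T3 = np.where(T1 >= threshold, 0.0, T2)
# T3 equals T2 except at positions (1,3) and (2,2), which are set to 0.
```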
In some special cases, the separated noise channel contains pure noise and does not include any speech signal. In this case, the second NMF basis matrix is used directly as the third NMF basis matrix to estimate the noise component.
Referring back to
Consistent with some embodiments, in step S504, gains are calculated based on the Euclidean distances. In some embodiments, the gains are linearly proportional to the respective Euclidean distances. For example, normalization can be used to obtain gains from the Euclidean distances such that the value of each gain is between 0 and 1. In some embodiments, a sigmoid-like activation function is used to convert each distance into a gain ranging in [0, 1].
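The disclosure does not specify the exact activation function. One common sigmoid-like form, with α and δ as assumed tuning parameters, is:

$$g_{i,j} = \frac{1}{1 + e^{-\alpha\,(d_{i,j} - \delta)}}$$

so that a large normalized distance (a speech-dominated time-frequency bin) maps to a gain near 1, preserving speech, while a small distance maps to a gain near 0, suppressing the interference.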
In step S506, elements of the Fourier-transformed first audio signal generated in step S302 as shown in
Referring back to
Another aspect of the disclosure is directed to a non-transitory computer-readable medium storing instructions which, when executed, cause one or more processors to perform the methods, as discussed above. The computer-readable medium may include volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage devices. For example, the computer-readable medium may be the storage device or the memory module having the computer instructions stored thereon, as disclosed. In some embodiments, the computer-readable medium may be a disc or a flash drive having the computer instructions stored thereon.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed system and related methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and related methods.
It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.