The present disclosure relates to attenuation and/or removal of noise in an audio signal.
In a speech enhancement system, a digital signal processor (DSP) receives an input signal including samples of an analog audio signal. The analog audio signal may be a speech signal. The input signal includes noise and thus is referred to as a “noisy speech” signal with noisy speech samples. The DSP processes the noisy speech signal to attenuate the noise and outputs a “cleaned” speech signal with a reduced amount of noise as compared to the input signal. Attenuation of the noise is a challenging problem because there is no side information included in the input signal defining the speech and/or noise. The only available information is the received noisy speech samples.
Traditional methods exist for attenuating the noise in a noisy speech signal. These methods, however, introduce and/or result in output of “music noise”. Music noise does not necessarily refer to noise of a music signal, but rather refers to a “music-like” sounding noise that is within a narrow frequency band. The music noise is included in cleaned speech signals that are output as a result of performing these traditional methods. The music noise can be heard by a listener and may annoy the listener.
As an example, samples of an input signal can be divided into overlapping frames, and an a priori signal-to-noise ratio (SNR) ξ(k,l) and an a posteriori SNR γ(k,l) may be determined, where: ξ(k,l) is the a priori SNR of the input signal; γ(k,l) is the a posteriori (or instantaneous) SNR of the input signal; l is a frame index that identifies a particular one of the frames; and k is a frequency bin (or range) index that identifies a frequency range of a short time Fourier transform (STFT) of the input signal. The a priori SNR ξ(k,l) is a ratio of a power level of a clean speech signal (or frequency amplitude of speech) to a power level of noise (or frequency amplitude of noise). The a posteriori SNR γ(k,l) is a ratio of a squared magnitude of an observed noisy speech signal to a power level of the noise. Both the a priori SNR ξ(k,l) and the a posteriori SNR γ(k,l) may be computed for each frequency bin of the input signal. The a priori SNR ξ(k,l) may be determined using equation 1, where λX(k,l) is an estimated a priori variance of amplitude of speech of the STFT of the input signal and λN(k,l) is an estimated a priori variance of noise of the STFT of the input signal.
The a posteriori SNR γ(k,l) may be determined using equation 2, where R(k,l) is an amplitude of noisy speech of the STFT of the input signal.
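As a non-limiting illustration, the two ratios can be sketched in Python directly from the prose definitions above. Equations 1 and 2 are not reproduced in this text, so the sketch assumes only that ξ(k,l) is λX(k,l) divided by λN(k,l) and that γ(k,l) is the squared amplitude R(k,l) divided by λN(k,l). The function names and the sample values are hypothetical.

```python
def a_priori_snr(lambda_x, lambda_n):
    """xi(k,l): speech variance over noise variance, one value per frequency bin."""
    return [lx / ln for lx, ln in zip(lambda_x, lambda_n)]


def a_posteriori_snr(r, lambda_n):
    """gamma(k,l): squared noisy amplitude over noise variance, one value per bin."""
    return [(rk * rk) / ln for rk, ln in zip(r, lambda_n)]


# Hypothetical values for three frequency bins of a single frame.
lambda_x = [4.0, 0.5, 0.0]   # assumed speech variance estimates
lambda_n = [1.0, 1.0, 2.0]   # assumed noise variance estimates
r = [2.0, 1.0, 1.0]          # assumed noisy speech amplitudes

xi = a_priori_snr(lambda_x, lambda_n)
gamma = a_posteriori_snr(r, lambda_n)
```

In practice λX(k,l) is not directly observable, which is why the a priori SNR is estimated rather than computed from a known speech variance.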
For each k and l, a gain G is calculated as a function of ξ(k,l) and γ(k,l). The gain G is multiplied by R(k,l) to provide an estimate of an amplitude of clean speech Â(k,l). Each gain value may be greater than or equal to 0 and less than or equal to 1. Values of the gain G are calculated based on ξ(k,l) and γ(k,l), such that frequency bands (or bins) of speech are kept and frequency bands (or bins) of noise are attenuated. An inverse fast Fourier transform (IFFT) of the amplitude of clean speech Â(k,l) is performed to provide time domain samples of the cleaned speech. The cleaned speech refers to the noisy speech portion of the STFT of the input signal that is cleaned (i.e. the noise has been attenuated).
For example, when ξ(k,l) is high, amplitude of speech for the corresponding frequency is high and little noise exists (i.e. amplitude of noise is low). For this condition, the gain G is set close to 1 (or 0 dB) to maintain amplitude of the speech. As a result, the amplitude of clean speech Â(k,l) is set approximately equal to R(k,l). As another example, when ξ(k,l) is low, amplitude of speech for the corresponding frequency is low and strong noise exists (i.e. amplitude of noise is high). For this condition, the gain G is set close to 0 to attenuate the noise. As a result, the amplitude of the clean speech Â(k,l) is set close to 0.
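The limiting behavior described above can be illustrated with a short Python sketch. The gain rule used here, G = ξ/(1+ξ), is a Wiener-type rule chosen only as an assumed stand-in; the gain of the present disclosure is a function of both ξ(k,l) and γ(k,l) and is not reproduced in this text. The function names are hypothetical.

```python
def stand_in_gain(xi):
    """Assumed Wiener-type gain: approaches 1 for large xi and 0 for small xi."""
    return xi / (1.0 + xi)


def estimate_clean_amplitude(r, xi):
    """Scale the noisy amplitude R(k,l) by the gain to estimate the clean amplitude."""
    return stand_in_gain(xi) * r


high_snr = estimate_clean_amplitude(r=2.0, xi=100.0)  # gain near 1, amplitude preserved
low_snr = estimate_clean_amplitude(r=2.0, xi=0.01)    # gain near 0, amplitude suppressed
```

With this stand-in the gain always lies in the range from 0 to 1, consistent with the range stated above for the gain G.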
The a priori signal-to-noise ratio (SNR) ξ(k,l) may be estimated using equation 3, where α is a constant between 0 and 1 and P(k,l) is an operator, which may be expressed by equation 4.
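Equations 3 and 4 are not reproduced in this text. As a hedged sketch, the following Python code assumes the decision-directed form described in the Ephraim and Malah reference listed in this document: the operator is taken to be the a posteriori SNR minus 1, clamped at 0, and the a priori SNR is taken to be a weighted sum of the previous frame's estimate and the operator. The use of a single noise variance value for both terms, the default value of α, and the function names are assumptions.

```python
def operator_with_subtraction(r, lambda_n):
    """Assumed form of P(k,l): (R^2 / lambda_N) - 1, clamped at 0."""
    gamma = (r * r) / lambda_n
    return max(gamma - 1.0, 0.0)


def decision_directed_xi(a_prev, lambda_n, p, alpha=0.98):
    """Assumed form of xi(k,l): alpha * A_prev^2 / lambda_N + (1 - alpha) * P."""
    return alpha * (a_prev * a_prev) / lambda_n + (1.0 - alpha) * p


# A bin whose squared amplitude is below the noise variance yields an operator of 0.
p_noise = operator_with_subtraction(r=0.5, lambda_n=1.0)
# A neighboring bin slightly above the noise variance yields a nonzero operator.
p_peak = operator_with_subtraction(r=2.0, lambda_n=1.0)
xi = decision_directed_xi(a_prev=1.0, lambda_n=1.0, p=p_peak, alpha=0.5)
```

The clamp at 0 is the discontinuity of interest: adjacent bins can alternate between 0 and a nonzero value, which is how isolated peaks arise in P(k,l).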
As illustrated in
A low value of the a priori SNR ξ(k,l) can lead to a gain that is much smaller than 1 (e.g., close to 0 and greater than or equal to 0). A high value of the a priori SNR ξ(k,l) leads to a gain close to 1 and less than or equal to 1. As a result, the estimated speech amplitude Â(k,l), which is the gain multiplied by the amplitude of noisy speech R(k,l), has isolated peaks at the frequency bins where P(k,l) has isolated peaks. This is shown in
R(k,l)² and λN(k,l) are at a similar average level for the above-stated frame designated by box 14. This is because content of the frame designated by box 14 is mostly noise. For this reason, R(k,l)² is the instantaneous noise level. λN(k,l) is an estimated smoothed noise level or, as stated above, the estimated a priori variance of noise. The fact that R(k,l)² has a similar average level as λN(k,l) indicates λN(k,l) is estimated correctly.
A system is provided and includes a first gain module, an operator module, an a priori module, an a posteriori module, and a second gain module. The first gain module is configured to apply a non-linear function to generate a gain signal based on (i) an amplitude of a first speech signal, and (ii) an estimated a priori variance of noise contained in the first speech signal. The operator module is configured to generate an operator based on (i) the gain signal, and (ii) the estimated a priori variance of noise. The a priori module is configured to determine an a priori signal-to-noise ratio based on the operator. The a posteriori module is configured to determine an a posteriori signal-to-noise ratio based on (i) the amplitude of the first speech signal, and (ii) the estimated a priori variance of noise. The second gain module is configured to: determine a gain value based on (i) the a priori signal-to-noise ratio, and (ii) the a posteriori signal-to-noise ratio, and generate, based on (i) the amplitude of the first speech signal and (ii) the gain value, a second speech signal that corresponds to an estimate of an amplitude of the first speech signal, where the second speech signal is substantially void of music noise.
In other features, a method is provided and includes: applying a non-linear function to generate a gain signal based on (i) an amplitude of a first speech signal, and (ii) an estimated a priori variance of noise included in the first speech signal; generating an operator based on (i) the gain signal, and (ii) the estimated a priori variance of noise; determining an a priori signal-to-noise ratio based on the operator; and determining an a posteriori signal-to-noise ratio based on (i) the amplitude of the first speech signal, and (ii) the estimated a priori variance of noise. The method further includes: determining a gain value based on (i) the a priori signal-to-noise ratio, and (ii) the a posteriori signal-to-noise ratio; and based on (i) the amplitude of the first speech signal, and (ii) the gain value, generating a second speech signal that corresponds to an estimate of an amplitude of the first speech signal, where the second speech signal is substantially void of music noise.
Further areas of applicability of the present disclosure will become apparent from the detailed description, the claims and the drawings. The detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the disclosure.
In the drawings, reference numbers may be reused to identify similar and/or identical elements.
In review of
The larger the value of s, the fewer isolated peaks in P(k,l). However, as long as isolated peaks exist in P(k,l), music noise is produced. With fewer isolated peaks, the music noise is more narrowly banded and as a result can be more annoying to a listener. To completely eliminate the isolated peaks, s must be increased to a large value such that R(k,l)²<s·λN(k,l) for all values of k. This requires a large value of s, since R(k,l) is instantaneous (not smoothed). Referring now to the example noisy speech signal 12 of
As another example,
The network device 52 may include: a control module 70 with a speech estimation module 72; a physical layer (PHY) module 74, a medium access control (MAC) module 76, a microphone 78, a speaker 80 and a memory 82. The speech estimation module 72 receives a noisy speech signal, attenuates noise in the noisy speech signal and eliminates and/or prevents generation of music noise with minimal or no speech distortion. The noisy speech signal may be received by the network device 52 from the network device 54 via the network 60 or by the network device 52 directly from the network device 56. The noisy speech signal may be received via an antenna 84 at the PHY module 74 and forwarded to the control module 70 via the MAC module 76. As an alternative, the noisy speech signal may be generated based on an analog audio signal detected by the microphone 78. The noisy speech signal may be generated by the microphone 78 and provided from the microphone 78 to the control module 70.
The speech estimation module 72 provides an estimated speech amplitude signal Â(k,l) (sometimes referred to as an estimated clean speech signal) based on the noisy speech signal. The speech estimation module 72 may perform an inverse fast Fourier transform (IFFT) and a digital-to-analog (D/A) conversion of the estimated speech amplitude signal Â(k,l) to provide an output signal. The output signal may be provided to the speaker 80 for playout or may be transmitted back to one of the network devices 54, 56 via the modules 74, 76 and the antenna 84.
An audio (or noisy speech) signal may be originated at the network device 52 via the microphone 78 and/or accessed from the memory 82 and passed through the speech estimation module 72. The resultant signal generated by the speech estimation module 72 corresponding to the audio signal may be played out on the speaker 80 and/or transmitted to the network devices 54, 56 via the modules 74, 76 and the antenna 84.
Referring now also to
The speech estimation module 72 may include a fast Fourier transform (FFT) module 110, an amplitude module 112, a noise module 114, an attenuation/gain module 116, a squaring module 117, a divider module 118, an a priori SNR module 120, an a posteriori (or instantaneous) SNR module 122, a second gain module 124, and an IFFT module 126. Modules 116, 117, 118 may be included in and/or implemented as a single non-linear function module. Modules 117 and 118 may be included in and/or implemented as a single operator module. Operation of the modules 110, 112, 114, 116, 117, 118, 120, 122, 124 and 126 is described with respect to the method of
The systems disclosed herein may be operated using numerous methods; an example method is illustrated in
The method may begin at 150. At 152, the FFT module 110 may perform a fast Fourier transform on a received and/or accessed audio (or noisy speech) signal y(t) to provide a digital noisy speech signal Yk, where t is time and k is a frequency bin index. At 154, the amplitude module 112 may determine amplitudes of the digital noisy speech signal Yk and generate a noisy speech amplitude signal R(k,l). The noisy speech amplitude signal R(k,l) may be generated as the amplitude of the complex digital noisy speech signal Yk. At 156, the noise module 114 determines an estimated a priori variance of noise λN(k,l) based on the digital noisy speech signal Yk.
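Tasks 152 and 154 can be sketched in Python as follows, under stated assumptions. The sketch splits samples into overlapping frames, transforms each frame, and takes the magnitude of each complex bin to obtain R(k,l). A direct discrete Fourier transform stands in for the FFT so that only the standard library is needed, no analysis window is applied, and the frame length, hop size, and function names are hypothetical.

```python
import cmath


def split_frames(samples, frame_len, hop):
    """Divide samples into overlapping frames of frame_len, advancing by hop."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def dft(frame):
    """Direct DFT of one frame (stand-in for the FFT of task 152)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * i / n) for i, x in enumerate(frame))
        for k in range(n)
    ]


def amplitudes(spectrum):
    """R(k,l): magnitude of each complex frequency bin (task 154)."""
    return [abs(y) for y in spectrum]


samples = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
frames = split_frames(samples, frame_len=4, hop=2)   # 50 percent overlap
r = [amplitudes(dft(frame)) for frame in frames]     # r[l][k] corresponds to R(k,l)
```

The noise variance estimate of task 156 is not sketched here because the present text does not specify how the noise module 114 computes it.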
Tasks 158 and 160 may be performed according to equation 6, where g[ ] is a non-linear attenuation/gain function with inputs R(k,l) and λN(k,l).
At 158, the attenuation/gain (or first function) module 116 generates an attenuated/gain signal ag(k,l) based on the noisy speech amplitude signal R(k,l) and the estimated a priori variance of noise λN(k,l). The attenuated/gain signal ag(k,l) is the result of the non-linear attenuation/gain function g[ ] and may be generated according to the following rule:
At 160, the squaring (or second function) module 117 squares the output ag(k,l) to provide ag(k,l)². At 162, the divider (or third function) module 118 divides ag(k,l)² by λN(k,l) to provide P(k,l) of equation 6.
By using the above-described rule and equation 6, music noise is eliminated by avoiding creation of isolated peaks. Note that equation 6 does not include the subtractions in equations 4 and/or 5. Since speech energy is greater than noise energy, if R(k,l)²>>λN(k,l), then the corresponding signal energy is most likely speech energy, not noise energy. For this reason, the signal is not modified. In other words, the output ag(k,l) is equal to R(k,l). Otherwise, the likelihood of the signal energy being speech decreases and the likelihood of the signal energy being noise increases with decreasing R(k,l). For this reason, a reduced amount of gain and/or an attenuated P(k,l) is generated leading to a reduced amount of noise. When R(k,l)² is about the same as (e.g., within a predetermined amount of) λN(k,l) or is less than λN(k,l), then R(k,l) is most likely noise and is heavily attenuated. This reduces noise and also aids in preventing formation of isolated peaks.
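The behavior described above can be illustrated with one possible non-linear function, sketched in Python. The specific rule of the disclosure is not reproduced in this text, so the function below is only an assumed example that matches the stated properties: the amplitude passes unchanged when R(k,l)² is well above λN(k,l), it is attenuated progressively as R(k,l) decreases, and the resulting P(k,l) stays above 0. The threshold `hi`, the exponent `power`, the lower bound `floor`, and the function names are hypothetical.

```python
import math


def g(r, lambda_n, hi=4.0, power=1.0, floor=1e-3):
    """Hypothetical attenuation/gain function returning ag(k,l)."""
    ratio = (r * r) / (hi * lambda_n)
    if ratio >= 1.0:
        return r  # likely speech: leave the amplitude unmodified
    # Likely noise: attenuate more heavily as R decreases, never reaching 0.
    return max(r * ratio ** power, floor * math.sqrt(lambda_n))


def operator_p(r, lambda_n):
    """P(k,l) = ag(k,l)^2 / lambda_N(k,l), with no subtraction and no clamp at 0."""
    ag = g(r, lambda_n)
    return (ag * ag) / lambda_n


p_speech = operator_p(3.0, 1.0)   # R^2 well above the noise variance
p_border = operator_p(1.0, 1.0)   # R^2 equal to the noise variance
p_silent = operator_p(0.0, 1.0)   # no energy in the bin
```

Because this example function is continuous at the threshold and has no hard clamp at 0, neighboring bins on either side of the noise level receive similar, small values of P(k,l) rather than alternating between 0 and a large value.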
Isolated peaks are formed because of discontinuities associated with, for example, equation 4. This is because at one particular frequency bin, when R(k,l)²<λN(k,l), equation 4 results in P(k,l) being equal to 0, but at a next frequency bin, when R(k+1,l)²>λN(k+1,l), equation 4 provides a nonzero large value for
In the proposed algorithm, because of feature 3 of the above-stated rule associated with equation 6, P(k,l)>0. Also, because of feature 2 of the above-stated rule, P(k+1,l) may be a heavily attenuated value. For these reasons, an isolated peak that would result in music noise is not created.
There are numerous possible non-linear attenuation/gain functions that may be used for g[ ].
At 164, the a priori SNR module (or first SNR module) 120 determines the a priori SNR ξ(k,l) based on P(k,l), λN(k,l), and a previous amplitude Â(k,l−1). The previous amplitude Â(k,l−1) may be generated by the gain module 124 for a previous frame of the received and/or accessed speech signal. At 166, the a posteriori SNR module (or second SNR module) 122 may determine the a posteriori SNR γ(k,l) based on R(k,l) and λN(k,l).
At 168, the gain (or second gain) module 124 may generate an estimated speech amplitude signal Â(k,l) as a function of ξ(k,l) and/or γ(k,l). As an example, equations 7-10 may be used to generate the estimated speech amplitude signal Â(k,l), where v is a parameter defined by equation 7 and G is gain applied to R(k,l).
The estimated speech amplitude signal Â(k,l) may be provided from the gain module 124 to the IFFT module 126. Values of the gain G may be greater than or equal to 0 and less than or equal to 1. The values of the gain G are set to attenuate noise and maintain amplitudes of speech. At 170, the IFFT module 126 performs an IFFT of the estimated speech amplitude signal Â(k,l) to provide an output signal, which may be provided to the D/A converter 102. The method may end at 172.
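The sequence of tasks 152 through 170 can be sketched end to end for a single frame in Python. This is a minimal sketch under several assumptions, each of which departs from or fills in for material not reproduced in this text: a direct DFT stands in for the FFT and IFFT, the non-linear function `g` is a hypothetical example, the a priori SNR uses an assumed decision-directed form, a Wiener-type gain stands in for equations 7-10 (so the a posteriori SNR of task 166 is not used by the stand-in), and the phase of the noisy spectrum is reused when returning to the time domain.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct DFT or inverse DFT (stand-in for the FFT/IFFT modules)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(v * cmath.exp(sign * 2j * cmath.pi * k * i / n) for i, v in enumerate(x))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def g(r, lambda_n, hi=4.0):
    """Hypothetical non-linear attenuation/gain function (tasks 158)."""
    ratio = (r * r) / (hi * lambda_n)
    return r if ratio >= 1.0 else max(r * ratio, 1e-3 * math.sqrt(lambda_n))


def enhance_frame(frame, lambda_n, a_prev, alpha=0.98):
    """Return (time-domain samples, estimated amplitudes) for one frame."""
    spectrum = dft(frame)                                   # task 152
    r = [abs(y) for y in spectrum]                          # task 154
    a_hat = []
    for k, rk in enumerate(r):
        p = g(rk, lambda_n[k]) ** 2 / lambda_n[k]           # tasks 158-162
        xi = alpha * a_prev[k] ** 2 / lambda_n[k] + (1.0 - alpha) * p  # task 164 (assumed form)
        gain = xi / (1.0 + xi)                              # task 168 (stand-in gain)
        a_hat.append(gain * rk)
    # Assumption: recombine the estimated amplitudes with the noisy phase.
    cleaned = [a * cmath.exp(1j * cmath.phase(y)) for a, y in zip(a_hat, spectrum)]
    samples = [v.real for v in dft(cleaned, inverse=True)]  # task 170
    return samples, a_hat


frame = [1.0, 0.0, -1.0, 0.0]
samples, a_hat = enhance_frame(frame, lambda_n=[1.0] * 4, a_prev=[0.0] * 4)
```

The returned `a_hat` would serve as the previous amplitude Â(k,l−1) when the next frame is processed.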
The above-described tasks are meant to be illustrative examples; the tasks may be performed sequentially, synchronously, simultaneously, continuously, during overlapping time periods or in a different order depending upon the application. Also, any of the tasks may be omitted or skipped depending on the implementation and/or sequence of events. For example, tasks 152 and/or 170 may be skipped.
By applying the non-linear attenuation/gain functions described above to provide an operator P(k,l), the subsequent determination of a priori SNR ξ(k,l) and the generation of the estimated clean speech signal Â(k,l) do not introduce music noise. For example, by applying the non-linear attenuation/gain function of
By applying the non-linear attenuation/gain function of
As can be seen in
The wireless communications described in the present disclosure can be conducted in full or partial compliance with IEEE standard 802.11-2012, IEEE standard 802.16-2009, IEEE standard 802.20-2008, and/or Bluetooth Core Specification v4.0. In various implementations, Bluetooth Core Specification v4.0 may be modified by one or more of Bluetooth Core Specification Addendums 2, 3, or 4. In various implementations, IEEE 802.11-2012 may be supplemented by draft IEEE standard 802.11ac, draft IEEE standard 802.11ad, and/or draft IEEE standard 802.11ah.
The foregoing description is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. The broad teachings of the disclosure can be implemented in a variety of forms. Therefore, while this disclosure includes particular examples, the true scope of the disclosure should not be so limited since other modifications will become apparent upon a study of the drawings, the specification, and the following claims. It should be understood that one or more steps within a method may be executed in different order (or concurrently) without altering the principles of the present disclosure. Further, although each of the embodiments is described above as having certain features, any one or more of those features described with respect to any embodiment of the disclosure can be implemented in and/or combined with features of any of the other embodiments, even if that combination is not explicitly described. In other words, the described embodiments are not mutually exclusive, and permutations of one or more embodiments with one another remain within the scope of this disclosure.
Spatial and functional relationships between elements (for example, between modules, circuit elements, semiconductor layers, etc.) are described using various terms, including “connected,” “engaged,” “coupled,” “adjacent,” “next to,” “on top of,” “above,” “below,” and “disposed.” Unless explicitly described as being “direct,” when a relationship between first and second elements is described in the above disclosure, that relationship can be a direct relationship where no other intervening elements are present between the first and second elements, but can also be an indirect relationship where one or more intervening elements are present (either spatially or functionally) between the first and second elements. As used herein, the phrase at least one of A, B, and C should be construed to mean a logical (A OR B OR C), using a non-exclusive logical OR, and should not be construed to mean “at least one of A, at least one of B, and at least one of C.”
In this application, including the definitions below, the term “module” or the term “controller” may be replaced with the term “circuit.” The term “module” refers to or includes: an Application Specific Integrated Circuit (ASIC); a digital, analog, or mixed analog/digital discrete circuit; a digital, analog, or mixed analog/digital integrated circuit; a combinational logic circuit; a field programmable gate array (FPGA); a processor circuit (shared, dedicated, or group) that executes code; a memory circuit (shared, dedicated, or group) that stores code executed by the processor circuit; other suitable hardware components that provide the described functionality; or a combination of some or all of the above, such as in a system-on-chip.
The module may include one or more interface circuits. In some examples, the interface circuits may include wired or wireless interfaces that are connected to a local area network (LAN), the Internet, a wide area network (WAN), or combinations thereof. The functionality of any given module of the present disclosure may be distributed among multiple modules that are connected via interface circuits. For example, multiple modules may allow load balancing. In a further example, a server (also known as remote, or cloud) module may accomplish some functionality on behalf of a client module.
The term code, as used above, may include software, firmware, and/or microcode, and may refer to programs, routines, functions, classes, data structures, and/or objects. The term shared processor circuit encompasses a single processor circuit that executes some or all code from multiple modules. The term group processor circuit encompasses a processor circuit that, in combination with additional processor circuits, executes some or all code from one or more modules. References to multiple processor circuits encompass multiple processor circuits on discrete dies, multiple processor circuits on a single die, multiple cores of a single processor circuit, multiple threads of a single processor circuit, or a combination of the above. The term shared memory circuit encompasses a single memory circuit that stores some or all code from multiple modules. The term group memory circuit encompasses a memory circuit that, in combination with additional memories, stores some or all code from one or more modules.
The term memory circuit is a subset of the term computer-readable medium. The term computer-readable medium, as used herein, does not encompass transitory electrical or electromagnetic signals propagating through a medium (such as on a carrier wave); the term computer-readable medium may therefore be considered tangible and non-transitory. Non-limiting examples of a non-transitory, tangible computer-readable medium are nonvolatile memory circuits (such as a flash memory circuit, an erasable programmable read-only memory circuit, or a mask read-only memory circuit), volatile memory circuits (such as a static random access memory circuit or a dynamic random access memory circuit), magnetic storage media (such as an analog or digital magnetic tape or a hard disk drive), and optical storage media (such as a CD, a DVD, or a Blu-ray Disc).
The apparatuses and methods described in this application may be partially or fully implemented by a special purpose computer created by configuring a general purpose computer to execute one or more particular functions embodied in computer programs. The functional blocks, flowchart components, and other elements described above serve as software specifications, which can be translated into the computer programs by the routine work of a skilled technician or programmer.
The computer programs include processor-executable instructions that are stored on at least one non-transitory, tangible computer-readable medium. The computer programs may also include or rely on stored data. The computer programs may encompass a basic input/output system (BIOS) that interacts with hardware of the special purpose computer, device drivers that interact with particular devices of the special purpose computer, one or more operating systems, user applications, background services, background applications, etc.
The computer programs may include: (i) descriptive text to be parsed, such as HTML (hypertext markup language) or XML (extensible markup language), (ii) assembly code, (iii) object code generated from source code by a compiler, (iv) source code for execution by an interpreter, (v) source code for compilation and execution by a just-in-time compiler, etc. As examples only, source code may be written using syntax from languages including C, C++, C#, Objective C, Haskell, Go, SQL, R, Lisp, Java®, Fortran, Perl, Pascal, Curl, OCaml, Javascript®, HTML5, Ada, ASP (active server pages), PHP, Scala, Eiffel, Smalltalk, Erlang, Ruby, Flash®, Visual Basic®, Lua, and Python®.
None of the elements recited in the claims are intended to be a means-plus-function element within the meaning of 35 U.S.C. § 112(f) unless an element is expressly recited using the phrase “means for,” or in the case of a method claim using the phrases “operation for” or “step for.”
This application claims the benefit of U.S. Provisional Application No. 62/045,367, filed on Sep. 3, 2014. The entire disclosure of the application referenced above is incorporated herein by reference.
Number | Name | Date | Kind |
---|---|---|---|
9130643 | Chen | Sep 2015 | B2 |
9437212 | Jain | Sep 2016 | B1 |
9626987 | Matsuo | Apr 2017 | B2 |
20020002455 | Accardi | Jan 2002 | A1 |
20080082328 | Lee | Apr 2008 | A1 |
20080167866 | Hetherington | Jul 2008 | A1 |
20090177468 | Yu | Jul 2009 | A1 |
20090310796 | Seydoux | Dec 2009 | A1 |
20100076769 | Yu | Mar 2010 | A1 |
20110305345 | Bouchard | Dec 2011 | A1 |
20120057711 | Makino | Mar 2012 | A1 |
Number | Date | Country |
---|---|---|
WO-2005114656 | Dec 2005 | WO |
Entry |
---|
U.S. Appl. No. 14/546,552, filed Nov. 18, 2014, Kapil Jain. |
Y. Ephraim and D. Malah; “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator”; IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-32, No. 6; Dec. 1984; pp. 1109-1121. |
Kapil Jain; “Speech Enhancement Using a Mathematically Efficient Spectral Amplitude Estimator”; Oct. 22, 2013; 20 pages. |
IEEE Std. 802.11-2012; IEEE Standard for Information technology—Telecommunications and information exchange between systems Local and metropolitan area networks—Specific requirements; Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications; IEEE Computer Society; Sponsored by the LAN/MAN Standards Committee; Mar. 29, 2012; 2793 pages. |
802.16-2009 IEEE Standard for Local and Metropolitan area networks; Part 16: Air Interface for Broadband Wireless Access Systems; IEEE Computer Society and the IEEE Microwave Theory and Techniques Society; Sponsored by the LAN/MAN Standard Committee; May 29, 2009; 2082 pages. |
IEEE Std 802.20-2008; IEEE Standard for Local and metropolitan area networks; Part 20: Air Interface for Mobile Broadband Wireless Access Systems Supporting Vehicular Mobility—Physical and Media Access Control Layer Specification; IEEE Computer Society; Sponsored by the LAN/MAN Standards Committee; Aug. 29, 2008; 1032 pages. |
“Specification of the Bluetooth System” Master Table of Contents & Compliance Requirements—Covered Core Package version: 4.0; Jun. 30, 2010; 2302 pages. |
IEEE P802.11ac / D2.0; Draft Standard for Information Technology—Telecommunications and information exchange between systems—Local and metropolitan area networks—Specific requirements; Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications; Amendment 4: Enhancements for Very High Throughput for Operation in Bands below 6 GHz; Prepared by the 802.11 Working Group of the 802 Committee; Jan. 2012; 359 pages. |
IEEE P802.11ad / D5.0 (Draft Amendment based on IEEE P802.11REVmb D10.0) (Amendment to IEEE 802.11REVmb D10.0 as amended by IEEE 802.11ae D5.0 and IEEE 802.11aa D6.0); Draft Standard for Information Technology—Telecommunications and Information Exchange Between Systems—Local and Metropolitan Area Networks—Specific Requirements; Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications—Amendment 3: Enhancements for Very High Throughput in the 60 GHz Band; Sponsor IEEE 802.11 Committee of the IEEE Computer Society; Sep. 2011; 601 pages. |
IEEE P802.11ah / D1.0 (Amendment to IEEE Std 802.11REVmc / D1.1, IEEE Std 802.11ac / D5.0 and IEEE Std 802.11af / D3.0) Draft Standard for Information technology—Telecommunications and information exchange between systems Local and metropolitan area networks—Specific requirements; Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications; Amendment 6: Sub 1 GHz License Exempt Operation; Prepared by the 802.11 Working Group of the LAN/MAN Standards Committee of the IEEE Computer Society; Oct. 2013; 394 pages. |
Nakai Shunsuke et al.; “Theoretical Analysis of Biased MMSE Short-Time Spectral Amplitude Estimator and Its Extension to Musical-Noise-Free Speech Enhancement”; 2014 4th Joint Workshop on Hands-Free Speech Communication and Microphone Arrays (HSCMA); IEEE; May 12, 2014; pp. 122-126. |
International Search Report and Written Opinion for PCT Application No. PCT/US2015/046979 dated Dec. 15, 2015; 13 pages. |
Number | Date | Country | |
---|---|---|---|
20160064010 A1 | Mar 2016 | US |
Number | Date | Country | |
---|---|---|---|
62045367 | Sep 2014 | US |