The present disclosure is generally related to technologies used for suppressing residual noise from preprocessed audio signals. More specifically, for a preprocessed audio signal that includes portions of speech, the disclosed technologies are used for suppressing residual noise from portions of the preprocessed audio signal between the portions of speech without distorting the speech portions.
A microphone of an audio receiver, e.g., of a mobile device, can receive (i) a speech signal (or simply speech) that arrives at the audio receiver along a “speech direction”, from where a user of the mobile device is expected to speak, and (ii) ambient noise along other directions, (in large part) different from the speech direction. Typically, the speech includes utterances separated by silence. As such, the microphone provides to the audio receiver an audio signal that includes portions of noisy speech (corresponding to a combination of the utterances and ambient noise) separated by portions of ambient noise (corresponding only to the ambient noise that “fills” the silence between the utterances). The audio receiver can use conventional technologies for suppressing the ambient noise from the audio signal without distorting the speech, thus forming a “speech beam” that appears to have been received at the audio receiver along the speech direction. The speech beam, referred to here as a preprocessed audio signal, includes portions of speech (corresponding to a combination of the utterances and suppressed ambient noise) separated by portions of residual noise (corresponding only to the suppressed ambient noise). Although the speech included in the input audio signal can be reproduced in the portions of speech of the preprocessed audio signal with minor distortion, such that the speech distortion is hardly noticeable when a user listens to the preprocessed audio signal, the portions of residual noise of the preprocessed audio signal may sound too loud for the user.
In this disclosure, technologies are described that can be used, for a preprocessed audio signal that includes portions of speech separated by portions of residual noise, to suppress the preprocessed audio signal over the portions of residual noise without distorting the portions of speech.
One aspect of the disclosure can be implemented as a method that includes determining a preprocessed audio signal by removing some noise from an input audio signal. Here, portions of the preprocessed audio signal that include speech are separated by portions of the preprocessed audio signal that include residual noise. Additionally, the method includes determining an amplified signal by suppressing the preprocessed audio signal over the portions that include residual noise, and maintaining the preprocessed audio signal over the portions that include speech.
Implementations can include one or more of the following features. In some implementations, the method can include determining the portions of the preprocessed audio signal that include residual noise as corresponding to times when an envelope of the preprocessed audio signal is less than or equal to a first threshold signal; and determining the portions of the preprocessed signal that include speech as corresponding to times when the envelope of the preprocessed audio signal is larger than the first threshold signal.
In some cases, a value of the first threshold signal can be in a range from 5% to 20% of a maximum value of the envelope of the preprocessed audio signal. In some cases, the method can include setting a gain signal for controlling gain of an amplifier used on the preprocessed audio signal to (i) a value equal to a maximum gain value for the portions of the preprocessed audio signal that include speech, and (ii) at least one value smaller than the maximum gain value and larger than or equal to a threshold ratio for the portions of the preprocessed audio signal that include residual noise. For example, a value of the threshold ratio can be from 1% to 5% of a maximum value of the maximum gain value.
In some cases, the method can include determining a filtered signal using a nonlinear filter on the preprocessed audio signal; and determining the first threshold signal as the filtered signal biased by a bias factor, and a second threshold signal as the first threshold signal biased by a threshold ratio. Values of the gain signal for the portions of the preprocessed audio signal that include residual noise can include (i) a ratio of the envelope of the preprocessed audio signal to the first threshold signal, when the envelope of the preprocessed audio signal is larger than or equal to the second threshold signal, and (ii) a ratio of the second threshold signal to the first threshold signal, when the envelope of the preprocessed audio signal is smaller than the second threshold signal. For example, the bias factor can be in a range from 5% to 20% of a maximum value of the envelope of the preprocessed audio signal. Also, the determining of the filtered signal using the nonlinear filter on the preprocessed audio signal can include using a low pass filter having a cutoff frequency on a magnitude of the preprocessed audio signal; limiting an increase of the filtered signal to a positive value of an envelope limit when the filtered signal increases by more than the positive value of the envelope limit; and limiting a decrease of the filtered signal to a negative value of the envelope limit when the filtered signal decreases by more than the negative value of the envelope limit.
In some cases, the method can include determining the envelope of the preprocessed audio signal by (i) using a low pass filter having a cutoff frequency on a magnitude of the preprocessed audio signal when the envelope of the preprocessed audio signal increases, and (ii) scaling the envelope of the preprocessed audio signal by a release time when the envelope of the preprocessed audio signal decreases.
In some implementations, the input audio signal can include speech and ambient noise. In such case, the method can include obtaining (i) the portions of the preprocessed audio signal that include speech based on the removing of some noise from portions of the input audio signal that include both the speech and the ambient noise, and (ii) the portions of the preprocessed audio signal that include residual noise based on the removing of some noise from portions of the input audio signal that include only the ambient noise.
Another aspect of the disclosure can be implemented as a signal processing system that includes an amplifier to determine an amplified signal from a preprocessed audio signal and based on a gain signal. The preprocessed audio signal includes portions of speech separated by portions of residual noise. Additionally, the signal processing system includes a gain suppressor to (i) determine the portions of residual noise of the preprocessed audio signal as corresponding to times when an envelope of the preprocessed audio signal is at most equal to a first threshold signal; (ii) determine the portions of speech of the preprocessed audio signal as corresponding to times when the envelope of the preprocessed audio signal is larger than the first threshold signal; and (iii) set the gain signal to (1) a value equal to a maximum gain value for the portions of speech of the preprocessed audio signal, and (2) at least one value smaller than the maximum gain value and larger than or equal to a threshold ratio for the portions of residual noise of the preprocessed audio signal.
Implementations can include one or more of the following features. In some implementations, a value of the first threshold signal can be in a range from 5% to 20% of a maximum value of the envelope of the preprocessed audio signal. In some implementations, a value of the threshold ratio can be in a range from 1% to 5% of a maximum value of the maximum gain value.
In some implementations, the signal processing system can include a nonlinear filter to determine a filtered signal from the preprocessed audio signal; and a threshold generator to generate (i) the first threshold signal as the filtered signal weighted by a bias factor, and (ii) a second threshold signal as the first threshold signal weighted by the threshold ratio. Here, the at least one value of the gain signal for the portions of residual noise of the preprocessed audio signal can include (1) a ratio of the envelope of the preprocessed audio signal to the first threshold signal, when the envelope of the preprocessed audio signal is larger than or equal to the second threshold signal, and (2) a ratio of the second threshold signal to the first threshold signal, when the envelope of the preprocessed audio signal is smaller than the second threshold signal. In some cases, the bias factor can be in a range from 5% to 20% of a maximum value of the envelope of the preprocessed audio signal. In some cases, to determine the filtered signal, the nonlinear filter can low pass filter, based on a first cutoff frequency, a magnitude of the preprocessed audio signal; and limit an increase of the filtered signal to a positive value of an envelope limit, when the filtered signal increases by more than the positive value of the envelope limit, and limit a decrease of the filtered signal to a negative value of the envelope limit, when the filtered signal decreases by more than the negative value of the envelope limit.
In some implementations, the signal processing system can include an envelope generator to low pass filter, based on a cutoff frequency, the magnitude of the preprocessed audio signal when the envelope increases; and scale the envelope by a release time when the envelope decreases.
In some implementations, the signal processing system can include a hardware processor; and storage medium encoded with instructions that, when executed by the hardware processor, cause the signal processing system to use the gain suppressor. In some implementations, the signal processing system can be a system on chip.
In some implementations, the signal processing system can include a beam-former to receive an input audio signal, wherein the input audio signal includes speech and ambient noise; and obtain the speech portions of the preprocessed audio signal by removing some noise from portions of the input audio signal that include both the speech and the ambient noise, and obtain the residual noise portions of the preprocessed audio signal by removing some noise from portions of the input audio signal that include only the ambient noise.
The disclosed technologies can result in one or more of the following potential advantages. For example, an audio signal that includes speech received from a speech direction and ambient noise received from other directions can be processed. A first signal processing stage obtains a preprocessed audio signal that includes residual noise representing a suppressed version of the ambient noise. The disclosed technologies can be used to obtain a processed audio signal in which the residual noise included in the preprocessed audio signal has been suppressed, and the speech included in the preprocessed audio signal has been maintained with minor distortion. As such, the speech distortion is hardly noticeable when a user listens to the processed audio signal.
Details of one or more implementations of the disclosed technologies are set forth in the accompanying drawings and the description below. Other features, aspects, descriptions and potential advantages will become apparent from the description, the drawings and the claims.
Certain illustrative aspects of the systems, apparatuses, and methods according to the disclosed technologies are described herein in connection with the following description and the accompanying figures. These aspects are, however, indicative of but a few of the various ways in which the principles of the disclosed technologies may be employed and the disclosed technologies are intended to include all such aspects and their equivalents. Other advantages and novel features of the disclosed technologies may become apparent from the following detailed description when considered in conjunction with the figures.
A preprocessed audio signal 101 received at the input port 102 includes portions of speech and portions of residual noise.
The gain controller 120 accesses the preprocessed audio signal 101 and generates a gain signal 121 based on information determined from the preprocessed audio signal, as described below in connection with
The envelope generator 222 determines (as described below in connection with
Additionally in the flow chart of
At 310, the zeroth sample of the envelope E, i.e., E(0), is initialized to an initial value. For example, the initial value of E(0) can be initialized to zero. As another example, the initial value of E(0) can be set to the magnitude of the zeroth sample of the preprocessed audio signal SRN(0), i.e., E(0)=abs(SRN(0)).
Loop 315 is used to determine the remaining samples of the envelope E. Each iteration is used to determine a sample of the envelope E(k) in the following manner.
At 320, it is determined whether a magnitude of the kth sample of the preprocessed audio signal SRN(k) is smaller than the prior (k−1)th sample of the envelope E(k−1), i.e., abs(SRN(k))<E(k−1). If a result of the determination performed at 320 is true, then it is inferred that the envelope E of the preprocessed audio signal SRN is decreasing. As such, at 330, the envelope E of the preprocessed audio signal SRN is scaled by a release time constant CRT. For example, the kth sample of the envelope E(k) is determined as:
$E(k) = C_{RT}\,E(k-1)$ (1).
At this point, a next iteration of the loop 315 is triggered to determine the next sample of the envelope E(k+1), and so on.
However, if a result of the determination performed at 320 is false, then it is inferred that the envelope E of the preprocessed audio signal SRN is increasing. As such, at 340, the envelope E of the preprocessed audio signal SRN is filtered using a first low pass filter having a first cutoff frequency fC1 that depends on the value of an attack time constant CAT, where the attack time constant CAT satisfies the inequality, 0≦CAT≦1. In this manner, the kth sample of the envelope E(k) is determined as a weighted sum of the magnitude of the kth sample of the preprocessed audio signal SRN(k) and a previous sample of the envelope E(k−1) in the following manner:
$E(k) = C_{AT}\,E(k-1) + (1 - C_{AT})\,\mathrm{abs}(S_{RN}(k))$ (2).
In Eq. No. (2), the attack time constant CAT weights the previous sample of the envelope E(k−1). As such, a large value of the attack time constant CAT (close to 1) corresponds to a small value of the first cutoff frequency fC1 associated with a slow first low pass filter; and a small value of the attack time constant CAT (close to 0) corresponds to a large value of the first cutoff frequency fC1 associated with a fast first low pass filter.
At this point, a next iteration of the loop 315 is triggered to determine the next sample of the envelope E(k+1), and so on.
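By way of illustration only, operations 310 through 340 can be sketched in Python as follows. The function name envelope, the parameter names c_rt and c_at, and their default values are assumptions made for this sketch and are not specified in the disclosure.

```python
def envelope(samples, c_rt=0.99, c_at=0.5, init_to_first=True):
    """Sketch of the envelope determination (operations 310-340).

    c_rt stands in for the release time constant CRT and c_at for the
    attack time constant CAT; the default values are hypothetical.
    """
    if not samples:
        return []
    # 310: initialize E(0) either to abs(SRN(0)) or to zero.
    env = [abs(samples[0]) if init_to_first else 0.0]
    for k in range(1, len(samples)):
        mag = abs(samples[k])
        prev = env[-1]
        if mag < prev:
            # 330, Eq. (1): envelope decreasing, scale by the release constant.
            env.append(c_rt * prev)
        else:
            # 340, Eq. (2): envelope increasing, weighted sum (low pass filter).
            env.append(c_at * prev + (1.0 - c_at) * mag)
    return env
```

For instance, with c_rt=0.5 and c_at=0.5, a single unit sample followed by silence yields an envelope that is halved at each subsequent sample.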
At 410, the zeroth sample of the filtered signal ES(0) is initialized to an initial value. For example, the initial value of ES(0) can be initialized to zero. As another example, the initial value of ES(0) can be set to the magnitude of the zeroth sample of the preprocessed audio signal SRN(0), i.e., ES(0)=abs(SRN(0)).
Loop 415 is used to determine the remaining samples of the filtered signal ES. Each iteration is used to determine a sample of the filtered signal ES(k) in the following manner.
At 420, a kth sample of the filtered signal ES(k) is determined as a weighted sum of the magnitude of the kth sample of the preprocessed audio signal SRN(k) and a previous sample of the filtered signal ES(k−1). For example, the kth sample of the filtered signal ES(k) is determined in the following manner:
$E_S(k) = \alpha\,E_S(k-1) + (1-\alpha)\,\mathrm{abs}(S_{RN}(k))$ (3),
where α is a weight, 0≦α≦1.
At 430, a change ΔES in the filtered signal is determined, e.g., based on:
$\Delta E_S = E_S(k) - E_S(k-1)$ (4).
At 440, it is determined whether the filtered signal increases by more than a positive value of an envelope limit, ΔES>+EL, where a magnitude of the envelope limit is EL. If a result of the determination performed at 440 is true, then, at 450, the change ΔES in the filtered signal is limited to the positive value of the envelope limit, such that the kth sample of the filtered signal ES(k) is determined as:
$E_S(k) = E_S(k-1) + E_L$ (5).
At this point, a next iteration of the loop 415 is triggered to determine the next sample of the filtered signal ES(k+1), and so on.
However, if a result of the determination performed at 440 is false, then, at 460, it is determined whether the filtered signal decreases by more than a negative value of the envelope limit, ΔES<−EL. If a result of the determination performed at 460 is true, then, at 470, the change ΔES in the filtered signal is limited to the negative value of the envelope limit, such that the kth sample of the filtered signal ES(k) is determined as:
$E_S(k) = E_S(k-1) - E_L$ (6).
At this point, a next iteration of the loop 415 is triggered to determine the next sample of the filtered signal ES(k+1), and so on. Moreover, if a result of the determination performed at 460 is false, then a next iteration of the loop 415 is still triggered to determine the next sample of the filtered signal ES(k+1), and so on.
When both results of the determination performed at 440 and the determination performed at 460 are false, a magnitude of the change ΔES in the filtered signal is smaller than or equal to a magnitude of the envelope limit, i.e., abs(ΔES)≦EL. Only when the foregoing inequality is satisfied does a value of the kth sample of the filtered signal ES(k) remain as determined at 420, in accordance with Eq. No. (3). As discussed above in connection with
The flow chart of the process 424 can be summarized using the following portion of pseudo-code:
ΔES = α·ES(k−1) + (1−α)·abs(SRN(k)) − ES(k−1);
If ΔES > +EL, then ΔES = +EL;
If ΔES < −EL, then ΔES = −EL;
ES(k) = ES(k−1) + ΔES.
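As a hedged illustration, the pseudo-code above can be written as the following runnable Python sketch. The function name nonlinear_filter, the parameter names alpha and env_limit (standing in for α and EL), and the default values are assumptions made here for illustration only.

```python
def nonlinear_filter(samples, alpha=0.9, env_limit=0.01, init_to_first=True):
    """Sketch of the nonlinear filter (operations 410-470).

    A one-pole low pass filter on abs(SRN(k)) whose per-sample change
    is limited to the range [-env_limit, +env_limit].
    """
    if not samples:
        return []
    # 410: initialize ES(0) either to abs(SRN(0)) or to zero.
    es = [abs(samples[0]) if init_to_first else 0.0]
    for k in range(1, len(samples)):
        prev = es[-1]
        # 420 and 430: candidate value from Eq. (3), change from Eq. (4).
        delta = alpha * prev + (1.0 - alpha) * abs(samples[k]) - prev
        # 440-470: limit the change per Eqs. (5) and (6).
        delta = max(-env_limit, min(env_limit, delta))
        es.append(prev + delta)
    return es
```

Because the change is limited, a sudden onset of speech raises the filtered signal by at most env_limit per sample, so the filtered signal tracks the slowly varying residual noise level rather than the utterances.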
At 510, the filtered signal ES is biased using a bias factor B, such that the kth sample of the first threshold Th1(k) is determined as:
$Th_1(k) = B\,E_S(k)$ (7).
The first threshold signal Th1 will be used by the gain suppressor 228 to determine a level of the envelope E of the preprocessed audio signal SRN to be suppressed. In other words, the first threshold signal will be used to differentiate between the portions of residual noise 105 and the portions of speech 103 of the preprocessed audio signal SRN. As such, the bias factor B can be used as a tuning parameter in accordance with Eq. No. (7) to determine the level of the envelope E of the preprocessed audio signal SRN to be suppressed, as described below in connection with
In some implementations, the first threshold signal can be set to a single constant value, e.g., Th1(k)=Th1, for all k=1 . . . N. In this case, the constant value Th1 can be the bias factor B, for instance.
At 520, the first threshold signal Th1 is biased using a threshold ratio R, such that the kth sample of the second threshold Th2(k) is determined as:
$Th_2(k) = R\,Th_1(k)$ (8).
The second threshold signal Th2 will be used by the gain suppressor 228 to determine an amount of the envelope E of the preprocessed audio signal SRN to be suppressed. In other words, the second threshold signal will be used to prevent complete suppression of the preprocessed audio signal SRN over its portions of residual noise 105, such that the processed audio signal 111 output by the amplifier 110 does not include portions of complete silence between the portions of speech 103. As such, the threshold ratio R can be used as a tuning parameter in accordance with Eq. No. (8) to determine the amount of the envelope E of the preprocessed audio signal SRN to be suppressed. For instance, the threshold ratio R can be in a range from 0.1 to 0.9.
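The threshold generation of operations 510 and 520 can be sketched as shown below. This is a minimal sketch: the function name thresholds, the optional constant_th1 argument (which models the constant first threshold variant), and the default values of bias and ratio are hypothetical and would be tuned in practice.

```python
def thresholds(filtered, bias=2.0, ratio=0.5, constant_th1=None):
    """Sketch of threshold generation (operations 510 and 520).

    bias stands in for the bias factor B and ratio for the threshold
    ratio R; the default values are illustrative assumptions.
    """
    if constant_th1 is not None:
        # Variant in which Th1(k) is a single constant value for all k.
        th1 = [constant_th1] * len(filtered)
    else:
        # 510, Eq. (7): Th1(k) = B * ES(k).
        th1 = [bias * es for es in filtered]
    # 520, Eq. (8): Th2(k) = R * Th1(k).
    th2 = [ratio * t for t in th1]
    return th1, th2
```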
In some implementations, the tuning of the bias factor B, or the threshold ratio R, or both, is carried out at design time, before fabrication of the gain controller 120. In some implementations, the tuning of the bias factor B, or the threshold ratio R, or both, is carried out at fabrication time, before shipping the gain controller 120 (e.g., either by itself or as part of the residual noise suppressor 100). In some implementations, the tuning of the bias factor B, or the threshold ratio R, or both, is carried out at run time (i.e., in the field), either by a user through a user interface of the gain controller 120, or by another process that interacts with the gain controller through an application programming interface (API).
At 610, it is determined whether a sampling time associated with the kth sample of the gain signal G(k) belongs to a portion of the envelope E of the preprocessed audio signal SRN that corresponds to residual noise 105. To make this determination, it is tested whether a value of the kth sample of the first threshold signal Th1(k) is larger than a value of the kth sample of the envelope E(k), i.e., E(k)<Th1(k).
Referring again to
Referring again to
Because it has been determined at 610 that E(k)<Th1(k) is satisfied, Eq. No. (9) ensures that a value of the kth sample of the gain signal G(k) is less than 1. In this manner, portions of the preprocessed audio signal 101 that do correspond to residual noise will be suppressed. At this point, a next iteration of the loop 605 is triggered to determine the next sample of the gain signal G(k+1), and so on.
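The gain rules stated in the summary above (maximum gain over portions of speech, the ratio of the envelope to the first threshold signal over portions of residual noise, and the ratio of the second threshold signal to the first threshold signal as a floor) can be sketched as follows. The function name gain_signal, the assumption that the maximum gain value equals 1, and the guard for a non-positive first threshold are choices made for this sketch only.

```python
MAX_GAIN = 1.0  # assumed maximum gain value


def gain_signal(env, th1, th2):
    """Sketch of per-sample gain selection from E(k), Th1(k) and Th2(k)."""
    gains = []
    for e, t1, t2 in zip(env, th1, th2):
        if t1 <= 0.0 or e > t1:
            # Portion of speech (or degenerate threshold): no suppression.
            gains.append(MAX_GAIN)
        elif e >= t2:
            # Residual noise at or above the second threshold: G(k) = E(k)/Th1(k).
            gains.append(e / t1)
        else:
            # Residual noise below the second threshold: G(k) = Th2(k)/Th1(k) = R.
            gains.append(t2 / t1)
    return gains
```

With th1 fixed at 1.0 and th2 fixed at 0.25, envelope values of 2.0, 0.5 and 0.1 map to gains of 1.0, 0.5 and 0.25, respectively, so the gain never drops below the threshold ratio.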
The first threshold signal Th1 represents a tuning parameter of the gain suppressor 125, as suggested in
Referring again to
As the second threshold signal Th2 is determined, in accordance with Eq. No. (8), to be a biased value of the first threshold signal Th1, where the bias factor is the threshold ratio R, the kth sample of the gain signal G(k) can be expressed as:
G(k)=R (10′),
for values of the portions of residual noise 105 of the preprocessed audio signal 101 that are smaller than the second threshold signal Th2. Sampling times corresponding to the foregoing condition can be identified in
Referring again to graph 670
In some implementations, the residual noise suppressor 100 can be implemented in software, as illustrated in
Applications are disclosed below, in which the residual noise suppressor 100, described above in connection with
The beam former 802 has two input ports 805A and 805B configured to receive (i) speech that arrives at the signal processing system 800 along a speech direction, and (ii) ambient noise along other directions, (in large part) different from the speech direction. Typically, the speech includes utterances separated by silence. As such, respective microphones included in the input ports 805A and 805B convert the received speech and ambient noise to input audio signals 801A and 801B. As such, each of the input audio signals 801A, 801B includes portions of noisy speech (corresponding to a combination of the utterances and ambient noise) separated by portions of ambient noise (corresponding only to the ambient noise that “fills” the silence between the utterances). The beam former 802 is configured to suppress the ambient noise from the input audio signals 801A, 801B, and maintain, undistorted, the portions of speech of the input audio signals. As such, the beam former 802 directionally filters the input audio signals 801A, 801B and outputs a preprocessed audio signal 101. In other words, the beam former 802 outputs a preprocessed audio signal 101 that corresponds to a beam that reaches the input ports 805A, 805B along the speech direction associated with the speech. Moreover, the preprocessed audio signal 101 includes portions of speech and portions of residual noise that separate the portions of speech. The residual noise suppressor 100 (i) receives the preprocessed audio signal 101, and (ii) further suppresses the preprocessed audio signal over portions of residual noise, and maintains, undistorted, the preprocessed audio signal over portions of speech. As such, the residual noise suppressor 100 outputs a processed audio signal 111 from which the residual noise has been suppressed.
In some implementations, the input ports 805A, 805B further include analog to digital converters (ADCs), such that the input audio signals 801A, 801B to be processed by the beam former 802 are digital signals. In such case, a sampling rate of the ADCs can be fS=8 kHz or 16 kHz, for instance, so the speech received by the input ports 805A, 805B can be adequately sampled.
The beam former 802 includes an averager 810 linked to the input ports 805A, 805B; and a subtractor 834 linked to the averager 810. The beam former 802 further includes a subtractor 824A; a gain and phase loop 820A linked to both the averager 810 and the subtractor 824A; and a delay 822A linked to both the input port 805A and the subtractor 824A. Also, the beam former 802 includes an adder 832 linked to the subtractor 834; and a noise cancellation adaptive (NCA) filter 830A linked to both the subtractor 824A and the adder 832. In addition, the beam former 802 includes a subtractor 824B; a gain and phase loop 820B linked to both the averager 810 and the subtractor 824B; a delay 822B linked to both the input port 805B and the subtractor 824B; and a NCA filter 830B linked to both the subtractor 824B and the adder 832. In some embodiments, the beam former 802 is implemented in accordance with the systems and techniques described in U.S. Pat. No. 9,276,618, issued on Mar. 1, 2016, which is hereby incorporated by reference in its entirety.
The components of the residual noise suppressor 100 were described in detail above in connection with
Functional aspects of the signal processing system 800 are described below as it is implemented to perform process 900 for suppressing ambient noise from audio signals, using multiple suppression stages.
At 910, the beam former 802 determines the preprocessed audio signal 101 that includes portions of speech 103 separated by portions of residual noise 105. To determine the preprocessed audio signal 101, the beam former 802 performs the following operations.
At 912, the beam former 802 receives the input audio signals 801A, 801B, where each of the input audio signals includes speech and ambient noise. Speech arriving at the input ports 805A, 805B of the beam former 802 along a speech direction is received by the input ports substantially at the same time, while the ambient noise arriving at the input ports along directions different from the speech direction is received by the input ports at different times. In this manner, portions of speech of the input audio signals 801A, 801B are in phase with each other, while portions of ambient noise of the input audio signals are out of phase with, or delayed with respect to, each other.
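The effect of the phase relationship noted above can be illustrated with a toy Python sketch. It models only an averaging of two channels, not the full beam former 802; the sinusoidal test signals, the half-period delay, and all names are assumptions chosen so that the out-of-phase noise cancels exactly.

```python
import math


def average_channels(a, b):
    """Average two equally long channels sample by sample."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


n = 64
# "Speech" reaches both inputs at the same time (in phase).
speech = [math.sin(2 * math.pi * k / 16) for k in range(n)]
# "Noise" with a period of 8 samples reaches the second input 4 samples late.
noise_a = [math.sin(2 * math.pi * k / 8) for k in range(n)]
noise_b = [math.sin(2 * math.pi * (k - 4) / 8) for k in range(n)]

mic_a = [s + v for s, v in zip(speech, noise_a)]
mic_b = [s + v for s, v in zip(speech, noise_b)]

# The in-phase speech survives the averaging; the half-period-delayed
# noise components are opposite in sign and cancel.
avg = average_channels(mic_a, mic_b)
```

In practice the ambient noise is not a single delayed sinusoid, so averaging alone suppresses it only partially, which is consistent with the residual noise that remains in the preprocessed audio signal 101.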
At 914, the beam former 802 suppresses some of the ambient noise 804 from the input audio signals 801A, 801B, as explained below. Referring to
The preprocessed audio signal 101 includes portions of speech (which correspond to the portions of speech of the average input audio signal 815 that have been reproduced without distortion) and portions of residual noise 105 that separate the portions of speech. The portions of residual noise 105 of the preprocessed audio signal 101 correspond to the portions of ambient noise 804 over which the average input audio signal 815 has been suppressed by the beam former 802.
Process 900 continues, at 920, where the residual noise suppressor 100 determines the processed audio signal 111 from the preprocessed audio signal 101. As the residual noise suppressor 100 uses the amplifier 110 to determine the processed signal 111, the latter is also referred to as the amplified signal 111. To determine the processed audio signal 111, the residual noise suppressor 100 performs the following operations.
At 922, the residual noise suppressor 100 determines the portions of speech 103 and portions of residual noise 105 of preprocessed audio signal 101. To perform 922, the residual noise suppressor 100 uses the gain controller 120 described above in connection with
At 924, the residual noise suppressor 100 controls the gain of the amplifier 110, based on the gain signal 121, to (i) reproduce the preprocessed audio signal 101 undistorted over the portions of speech 103, and (ii) suppress the preprocessed audio signal over the portions of residual noise 105. The residual noise suppressor 100 generates the gain signal 121, by using the gain controller 120, in accordance with operations 620-650 of process 628, as described above in connection with
In addition, the processed audio signal 111 output by the residual noise suppressor 100 includes portions of speech 103 (which correspond to the portions of speech of the preprocessed audio signal 101 that have been reproduced without distortion and suppression), and portions of suppressed residual noise 115 that separate the portions of speech. The portions of suppressed residual noise 115 of the processed audio signal 111 correspond to the portions of ambient noise 804 over which the average input audio signal 815 has been suppressed by the beam former 802 and the preprocessed audio signal 101 has been suppressed by the residual noise suppressor 100.
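The second suppression stage as a whole can be sketched by combining the envelope, the filtered signal, the two threshold signals, the gain signal and the amplifier in a single per-sample loop. This is an illustrative sketch under stated assumptions: the function name suppress_residual_noise, every default parameter value, the maximum gain of 1, and the synthetic test signal are hypothetical and are not taken from the disclosure.

```python
def suppress_residual_noise(samples, c_rt=0.9, c_at=0.5, alpha=0.9,
                            env_limit=0.001, bias=2.0, ratio=0.5):
    """Return (processed, gains) for a list of float samples.

    All default parameter values are hypothetical tuning choices.
    """
    processed, gains = [], []
    env = filt = None
    for s in samples:
        mag = abs(s)
        if env is None:
            # Initialize E(0) and ES(0) to abs(SRN(0)).
            env = filt = mag
        else:
            # Envelope E(k): release scaling, Eq. (1), or attack filtering, Eq. (2).
            env = c_rt * env if mag < env else c_at * env + (1.0 - c_at) * mag
            # Filtered signal ES(k): Eq. (3) with the change limited to +/-EL.
            delta = (1.0 - alpha) * (mag - filt)
            filt += max(-env_limit, min(env_limit, delta))
        th1 = bias * filt   # Eq. (7)
        th2 = ratio * th1   # Eq. (8)
        if th1 <= 0.0 or env > th1:
            g = 1.0         # portion of speech: assumed maximum gain of 1
        elif env >= th2:
            g = env / th1   # residual noise at or above Th2
        else:
            g = ratio       # residual noise below Th2, Eq. (10')
        gains.append(g)
        processed.append(g * s)  # amplifier applies G(k) to SRN(k)
    return processed, gains


# Synthetic test signal: low-level noise, a loud burst, low-level noise.
noise = [0.01 * (-1) ** k for k in range(200)]
burst = [1.0 * (-1) ** k for k in range(50)]
signal = noise + burst + noise
processed, gains = suppress_residual_noise(signal)
```

With these illustrative settings, the gain equals 1 throughout the loud burst, so the burst is reproduced unchanged, while the low-level portions are attenuated to about half of their amplitude, i.e., down to the threshold ratio rather than to complete silence.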
In some implementations, the beam former 802 and the residual noise suppressor 100 of the signal processing system 800 can be implemented in software, as illustrated in
A few embodiments have been described in detail above, and various modifications are possible. The disclosed subject matter, including the functional operations described in this specification, can be implemented in electronic circuitry, computer hardware, firmware, software, or in combinations of them, such as the structural means disclosed in this specification and structural equivalents thereof, including system on chip (SoC) implementations, which can include one or more controllers and embedded code.
While this specification contains many specifics, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments.
Other embodiments fall within the scope of the following claims.
This disclosure claims priority to U.S. Provisional Application Ser. No. 62/222,541, filed Sep. 23, 2015, the disclosure of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
62222541 | Sep 2015 | US