Embodiments of this disclosure relate to an apparatus and method for enhancing a signal that has a component that is wanted and a component that is unwanted.
It can be helpful to enhance a speech component in a noisy signal. For example, speech enhancement is desirable to improve the subjective quality of voice communication, e.g., over a telecommunications network. Another example is automatic speech recognition (ASR). If the use of ASR is to be extended, it needs to improve its robustness to noisy conditions. Some commercial ASR solutions claim to offer good performance, e.g., a word error rate (WER) of less than 10%. However, this performance is often only realisable under good conditions, with little noise. The WER can be larger than 40% under complex noise conditions.
One approach to enhancing speech is to capture the audio signal with multiple microphones and filter those signals with an optimum filter. The optimum filter is typically an adaptive filter that is subject to certain constraints, such as maximising the signal-to-noise ratio (SNR). This technique is based primarily on noise control and gives little consideration to auditory perception. It is not robust under high noise levels. Overly aggressive processing can also attenuate the speech component, resulting in poor ASR performance.
Another approach is based primarily on control of the foreground speech, as speech components tend to have distinctive features compared to noise. This approach increases the power difference between speech and noise using the so-called “masking effect”. According to psychoacoustics, if the power difference between two signal components is large enough, the masker (with higher power) will mask the maskee (with lower power) so that the maskee is no longer audibly perceptible. The resulting signal is an enhanced signal with higher intelligibility.
A technique that makes use of the masking effect is Computational Auditory Scene Analysis (CASA). It works by detecting the speech component and the noise component in a signal and masking the noise component. One example of a specific CASA method is described in CN105096961. An overview is shown in
It is an object of the disclosure to provide improved concepts for enhancing a wanted component in a signal.
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect, a signal enhancer is provided that comprises an input configured to receive an audio signal that has a wanted component and an unwanted component. It also comprises a perception analyser that is configured to split the audio signal into a plurality of spectral components. The perception analyser is also configured to, for each spectral component, designate that spectral component as belonging to the wanted component or the unwanted component in dependence on a power estimate associated with that spectral component. If a spectral component is designated as belonging to the unwanted component, the perception analyser is configured to adjust its power by applying an adaptive gain to that spectral component, wherein the adaptive gain is selected in dependence on how perceptible the spectral component is expected to be to a user. This improves the intelligibility of the wanted component.
In a first implementation form of the first aspect, the perception analyser may be configured to, for each spectral component that is designated as belonging to the unwanted component, compare its power estimate with a power threshold. The perception analyser may be configured to select the adaptive gain to be a gain that will leave the power associated with that spectral component unchanged if the power estimate is below the power threshold. The perception analyser may be configured to select the adaptive gain to be a gain that will reduce the power associated with that spectral component if the power estimate is above the power threshold. This increases the power of the wanted component relative to the unwanted component, which improves the intelligibility of the wanted component.
In a second implementation form of the first aspect, the power threshold of the first implementation form may be selected in dependence on a power at which that spectral component is expected to become perceptible to the user. This improves the intelligibility of the wanted component in a practical sense, since different frequency components are perceived differently by a human user.
In a third implementation form of the first aspect, the power threshold of the first or second implementation forms may be selected in dependence on how perceptible a spectral component is expected to be to a user given a power associated with one or more of the other spectral components. This improves the perceptibility of the wanted component in the enhanced signal.
In a fourth implementation form of the first aspect, the perception analyser of any of the first to third implementation forms may be configured to select the power threshold for each spectral component in dependence on a group associated with that spectral component, wherein the same power threshold is applied to the power estimates for all the spectral components comprised in a specific group. This is consistent with the principles of psychoacoustics.
In a fifth implementation form of the first aspect, the perception analyser of the fourth implementation form may be configured to select the power threshold for each group of spectral components to be a predetermined threshold that is assigned to that specific group in dependence on one or more frequencies that are represented by the spectral components in that group. This is consistent with the principles of psychoacoustics.
In a sixth implementation form of the first aspect, the perception analyser of the fourth or fifth implementation forms may be configured to determine the power threshold for a group of spectral components in dependence on the power estimates for the spectral components in that specific group. This considers the relative strength, in the signal, of spectral components that are similarly perceptible to a human user.
In a seventh implementation form of the first aspect, the perception analyser of the sixth implementation form may be configured to determine the power threshold for a specific group of spectral components by identifying the highest power estimated for a spectral component in that specific group and generating the power threshold by decrementing that highest power by a predetermined amount. This considers how perceptible a particular spectral component is likely to be given the power of other spectral components in its spectral group.
In an eighth implementation form of the first aspect, the perception analyser of any of the fourth to seventh implementation forms may be configured to select the power threshold for a group of spectral components by comparing a first threshold and a second threshold. The first threshold may be assigned to a specific group in dependence on one or more frequencies that are represented by the spectral components in that group. The second threshold may be determined in dependence on the power estimates for the spectral components in that group. The perception analyser may be configured to select, as the power threshold for the group, the lower of the first and second thresholds. The signal enhancer is thus able to select the more appropriate threshold.
In a ninth implementation form of the first aspect, the perception analyser of any of the first to eighth implementation forms may be configured to, for each spectral component that is designated as: (i) belonging to the unwanted component; and (ii) having a power estimate that is above the power threshold, select the adaptive gain to be a ratio between the power threshold and the power estimate for that spectral component. This reduces the power of the unwanted spectral component to an acceptable level.
In a tenth implementation form of the first aspect, the signal enhancer, in particular the signal enhancer of any of the above-mentioned implementation forms, comprises a transform unit. The transform unit may be configured to receive the audio signal in the time domain and convert that signal into the frequency domain, whereby the frequency domain version of the audio signal represents each spectral component of the audio signal by a respective coefficient. The perception analyser may be configured to adjust the power associated with a spectral component by applying the adaptive gain to the coefficient that represents that spectral component in the frequency domain version of the audio signal. Performing this adjustment in the frequency domain is convenient, because it is in the frequency domain that the perceptual differences between different parts of the audio signal become apparent.
In an eleventh implementation form of the first aspect, the perception analyser of the tenth implementation form may be configured to form a target audio signal to comprise non-adjusted coefficients, which represent the spectral components designated as belonging to the wanted component of the audio signal, and adjusted coefficients, which represent the spectral components designated as belonging to the unwanted component of the audio signal. This target audio signal can form a constraint for optimising the filtering of the audio signal and other audio signals. The target audio signal could be formed in the frequency domain or in the time domain.
In a twelfth implementation form of the first aspect, the transform unit of a signal enhancer of the eleventh implementation form may be configured to receive the target audio signal in the frequency domain and convert it into the time domain, wherein the output is configured to output the time domain version of the target audio signal. This generates a time domain signal that can be used as a target audio signal.
According to a second aspect, a method is provided that comprises obtaining an audio signal that has a wanted component and an unwanted component. The method comprises splitting the audio signal into a plurality of spectral components. It also comprises, for each spectral component, designating that spectral component as belonging to the wanted component or the unwanted component in dependence on a power estimate associated with that spectral component. The method comprises, if a spectral component is designated as belonging to the unwanted component, adjusting its power by applying an adaptive gain to that spectral component, wherein the adaptive gain is selected in dependence on how perceptible the spectral component is expected to be to a user.
According to a third aspect, a non-transitory machine readable storage medium is provided having stored thereon processor executable instructions implementing a method. That method comprises obtaining an audio signal that has a wanted component and an unwanted component. It also comprises, for each spectral component, designating that spectral component as belonging to the wanted component or the unwanted component in dependence on a power estimate associated with that spectral component. The method comprises, if a spectral component is designated as belonging to the unwanted component, adjusting its power by applying an adaptive gain to that spectral component, wherein the adaptive gain is selected in dependence on how perceptible the spectral component is expected to be to a user.
These and other aspects will now be described by way of example with reference to the accompanying drawings. In the drawings:
A signal enhancer is shown in
The perception analyser 202 comprises a frequency transform unit 207 that is configured to split the input signal into a plurality of spectral components. Each spectral component represents part of the input signal in a particular frequency band or bin. The perception analyser also includes a masking unit 203 that is configured to analyse each spectral component and designate it as being part of the wanted component or part of the unwanted component. The masking unit makes this decision in dependence on a power estimate that is associated with the spectral component. The perception analyser also includes an adaptive gain controller 204. If a spectral component is designated as being part of the unwanted component, the adaptive gain controller applies an adaptive gain to that spectral component. The adaptive gain is selected in dependence on how perceptible the spectral component is expected to be to a user.
The signal enhancer shown in
An example of a method for enhancing a signal is shown in
The structures shown in
The apparatus and method described herein can be used to implement speech enhancement in a system that uses signals from any number of microphones. In one example, the techniques described herein can be incorporated in a multi-channel microphone array speech enhancement system that uses spatial filtering to filter multiple inputs and to produce a single-channel, enhanced output signal. The enhanced signal that results from these techniques can provide a new constraint for spatial filtering by acting as a target signal. A signal that is intended to act as a target signal is preferably generated by taking account of psychoacoustic principles. This can be achieved by using an adaptive gain control that considers the estimated perceptual thresholds of different frequency components in the frequency domain.
A more detailed embodiment is shown in
The spectral components that are designated as belonging to noise are then compared with a masking power threshold 412 (which is output by masking power generator 209). This additional power threshold THD(i) relates to the power at which different spectral components are expected to become perceptible to a user. (There are various parameters that could be used to set this threshold, and some examples are described below.) The masking power threshold controls the adaptive gain decision 413 that is made by the adaptive gain controller 204. The gain g(i) that the controller applies to the spectral component in bin i, if that spectral component has been designated as noise, depends on whether that spectral component meets the masking power threshold or not (in 407). In one example, the power of a spectral component is left unchanged if its power estimate is below the masking power threshold and is reduced if its power estimate is above the masking power threshold. Spectral components that have been designated as including speech are to be left unchanged, so a gain of one is selected for them in 406. This is just an example, as any suitable gain could be applied. In one example, the spectral components that are designated as including speech could be amplified.
The gain selected for each spectral component is applied to each respective coefficient 408. The new coefficients X′(i) form the basis for an inverse frequency transform 409 to construct an output frame in the time domain 410.
In
In block 502 a time-frequency transform is performed on the audio signal received by the input to obtain its frequency spectrum. This step may be implemented by performing a Short-Time Discrete Fourier Transform (SDFT) algorithm. The SDFT is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time. The SDFT may be computed by dividing the audio signal into short segments of equal length (such as a frame 501) and then computing the Fourier transform separately on each short segment. The result is the Fourier spectrum for each short segment of the audio signal, which captures the changing frequency spectra of the audio signal as a function of time. Each spectral component thus has an amplitude and a time extension.
An SDFT 502 is performed for each frame of the input signal 501. If the sampling rate is 16 kHz, the frame size might be set as 16 ms. This is just an example and other sampling rates and frame sizes could be used. It should also be noted that there is no fixed relationship between sampling rate and frame size. So, for example, the sampling rate could be 48 kHz with a frame size of 16 ms. A 512-point SDFT can be implemented over the input signal. Performing the SDFT generates a series of complex-valued coefficients X(i) in the frequency domain (where coefficient index i=0, 1, 2, 3, etc. can be used to designate the index of the signal in the time domain or the index of coefficients in the frequency domain). These coefficients are Fourier coefficients and can also be referred to as spectral coefficients or frequency coefficients.
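The framing arithmetic and the per-frame transform can be illustrated with a minimal sketch. It assumes that the 256-sample frame (16 ms at 16 kHz) is zero-padded to 512 points, which is only one possible way of reconciling the frame size with the transform length, and it uses a naive DFT in place of a fast algorithm. The function names frame_length and dft are hypothetical.

```python
import cmath
import math

def frame_length(sample_rate_hz, frame_ms):
    """Number of samples in one frame of the given duration."""
    return int(round(sample_rate_hz * frame_ms / 1000))

def dft(frame, n_points):
    """Naive n_points DFT of a real frame, zero-padded to n_points (assumption)."""
    padded = list(frame) + [0.0] * (n_points - len(frame))
    return [
        sum(x * cmath.exp(-2j * cmath.pi * i * n / n_points)
            for n, x in enumerate(padded))
        for i in range(n_points)
    ]

fs = 16000
n = frame_length(fs, 16)  # samples per 16 ms frame
# One frame of a 1 kHz test tone (illustrative input only)
frame = [math.sin(2 * math.pi * 1000 * k / fs) for k in range(n)]
X = dft(frame, 512)  # complex coefficients X(i), i = 0..511
# Index of the strongest coefficient among the non-negative frequencies
peak = max(range(257), key=lambda i: abs(X[i]))
```

With these values the bin spacing is 16000/512 = 31.25 Hz, so the test tone falls on a single coefficient index.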
For each coefficient X(i) a corresponding power $P(i)=|X(i)|^{2}$ is computed 503. This can be defined by:
$$P(i)=\operatorname{real}(X(i))^{2}+\operatorname{imag}(X(i))^{2}$$
where real(*) and imag(*) are the real part and the imaginary part of the respective SDFT coefficient.
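As a small worked illustration of the power computation, the following sketch evaluates the sum of the squared real and imaginary parts for a few arbitrary coefficients. The function name power and the sample values are illustrative only.

```python
def power(coeffs):
    """Power of each complex coefficient: real part squared plus imaginary part squared."""
    return [c.real ** 2 + c.imag ** 2 for c in coeffs]

X = [3 + 4j, 1 - 1j, 0j]  # arbitrary example coefficients
P = power(X)
```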
A reference power Pf(i) is also estimated for each Fourier coefficient X(i) 516. The perceptual analyser 202 in
The next step is to implement binary masking by comparing the power P(i) for each coefficient against the corresponding reference power Pf(i) 505. This generates a binary masking matrix M(i):
where γ is a pre-defined power spectral density factor.
The value M(i)=1 in the binary mask indicates that the corresponding coefficient X(i), with power P(i), is voice-like. In this case, there is no need to change the coefficient X(i). The value M(i)=0 indicates that the corresponding coefficient X(i) is noise-like. In this case, the corresponding coefficient should be adjusted using adaptive gain control 506, 507.
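The binary masking step might be sketched as follows. The exact masking expression is not reproduced in this text, so the comparison used here, M(i)=1 when P(i) exceeds γ times Pf(i) and M(i)=0 otherwise, is an assumption, as are the function name binary_mask and the sample values.

```python
def binary_mask(P, Pf, gamma):
    """Assumed rule: 1 (voice-like) if P(i) > gamma * Pf(i), else 0 (noise-like)."""
    return [1 if p > gamma * pf else 0 for p, pf in zip(P, Pf)]

P = [10.0, 1.0, 4.0]   # power per coefficient (illustrative)
Pf = [2.0, 2.0, 2.0]   # reference power per coefficient (illustrative)
M = binary_mask(P, Pf, gamma=2.0)
```

Under this assumed rule, a larger γ designates fewer coefficients as voice-like.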
In the example of
In a first operation, an absolute hearing threshold (THD1) Th1(i) (i=0, 1, 2, . . . ) is provided for each frequency index (i=0, 1, 2, . . . ) 511. This threshold is set in dependence on how perceptible each spectral component is expected to be to a user. Each frequency can be associated with a respective power threshold THD1, which is determined by psychoacoustic principles. Above the threshold, a spectral component at that frequency will be perceptible to the human auditory system; otherwise, it will not be perceptible. THD1 can thus be pre-defined and can be provided as a look-up table to masking power generator 209.
In practice, an absolute hearing threshold does not necessarily need to be defined for each individual spectral component. Instead the SDFT coefficients can be divided into several groups, and all coefficients within the same group can be associated with the same absolute hearing threshold. In other words, Th1(i)=Th1(j) for any coefficient indices i, j that are part of the same respective group.
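A group-based look-up table of this kind could be sketched as below. The description does not say how the table is populated; this sketch uses Terhardt's well-known approximation of the threshold in quiet (in dB SPL), which is an assumption and not the method of the disclosure. Relating dB SPL to the digital power P(i) would additionally require a calibration that is not shown. The names absolute_threshold_db, group_thresholds and th1, and the bin-to-group assignment, are hypothetical.

```python
import math

def absolute_threshold_db(freq_hz):
    """Terhardt's approximation of the threshold in quiet, in dB SPL (assumption)."""
    f = max(freq_hz, 20.0) / 1000.0  # clamp so that DC does not divide by zero
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

def group_thresholds(group_centres_hz):
    """Look-up table holding one threshold per group."""
    return [absolute_threshold_db(fc) for fc in group_centres_hz]

def th1(i, group_of_bin, table):
    """Th1(i): every coefficient in a group shares that group's threshold."""
    return table[group_of_bin[i]]

group_of_bin = [0, 0, 0, 1, 1, 2]  # hypothetical assignment of six bins to three groups
table = group_thresholds([100.0, 1000.0, 3300.0])
```

Here th1(0, group_of_bin, table) and th1(2, group_of_bin, table) return the same value because bins 0 and 2 belong to the same group.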
The set of SDFT coefficients is preferably divided into groups of coefficients that are adjacent to each other (i.e. the coefficients represent adjacent frequency bins and thus have adjacent indices). A simple approach is to uniformly divide the coefficients into N groups (e.g., N=30), where each group has the same number of SDFT coefficients. Alternatively, the groups may be concentrated at certain frequencies. The number of coefficients assigned to a particular group may be different for a low frequency group than for a high frequency group. A preferred approach is to use the so-called Bark scale, which is similar to a log scale. This is consistent with the basics of psychoacoustics. In general, the number of coefficients in a low frequency group should be less than the number of coefficients in a high frequency group. The absolute hearing thresholds for different Bark bands are shown in
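One way of grouping coefficients on the Bark scale is sketched below. The description does not give a mapping from frequency to Bark band; this sketch assumes the commonly used Zwicker and Terhardt approximation and assigns each bin to the integer part of its Bark value. The names bark and bark_groups are hypothetical, and the 512-point transform at 16 kHz follows the earlier example values.

```python
import math

def bark(freq_hz):
    """Zwicker-Terhardt approximation of the Bark scale (assumption)."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

def bark_groups(n_fft, sample_rate_hz):
    """Group index for each non-negative-frequency bin 0..n_fft/2."""
    n_bins = n_fft // 2 + 1
    return [int(bark(i * sample_rate_hz / n_fft)) for i in range(n_bins)]

groups = bark_groups(512, 16000)

# Count how many coefficients fall into each group
counts = {}
for g in groups:
    counts[g] = counts.get(g, 0) + 1
```

With this mapping the lowest group contains only a handful of coefficients while the high-frequency groups contain many more, which matches the general rule stated above.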
In a second operation, a relative masking threshold (THD2) is estimated for each frequency index (i=0, 1, 2, . . . ) 513. THD2 can be set by considering the masking effect of different frequencies. THD2 is preferably not determined individually for each coefficient representing a frequency index i, but is instead determined for each group of frequency indices. The coefficients may be grouped together following any of the approaches described above. In each group, the coefficient that has the maximum power in the current group is set as the “masker” 512. THD2 may be set to the power of the masker minus some predetermined amount α, where α may be set according to the principles of psychoacoustics. A suitable value for the predetermined amount α might be 13 dB, for example.
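The relative masking threshold could be computed per group as in the following sketch. It assumes that the powers are held in linear units, so that subtracting α dB corresponds to dividing the masker power by 10^(α/10). The function name relative_masking_threshold and the sample values are illustrative.

```python
def relative_masking_threshold(P, group_of_bin, alpha_db=13.0):
    """THD2 per coefficient: power of the group's masker lowered by alpha_db."""
    masker = {}
    for p, g in zip(P, group_of_bin):
        # The masker is the coefficient with the maximum power in its group
        masker[g] = max(masker.get(g, 0.0), p)
    scale = 10.0 ** (-alpha_db / 10.0)  # dB offset expressed as a linear factor
    return [masker[g] * scale for g in group_of_bin]

P = [1.0, 100.0, 5.0, 2.0]  # linear powers (illustrative)
groups = [0, 0, 1, 1]       # two groups of two coefficients each
thd2 = relative_masking_threshold(P, groups, alpha_db=10.0)
```

In this example the maskers have powers 100 and 5, so with α = 10 dB every coefficient in the first group receives a threshold of about 10 and every coefficient in the second group a threshold of about 0.5.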
The final masking threshold for each coefficient index is then selected to be the minimum of THD1 and THD2 514, i.e. THD(i)=min{THD1(i), THD2(i)}.
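The selection of the final threshold reduces to an element-wise minimum, as in this short sketch. It assumes that both thresholds have already been expressed in the same units. The function name final_threshold and the values are illustrative.

```python
def final_threshold(thd1, thd2):
    """THD(i) = min{THD1(i), THD2(i)} for each coefficient index i."""
    return [min(a, b) for a, b in zip(thd1, thd2)]

thd1 = [4.0, 4.0, 1.0]   # absolute hearing thresholds (illustrative)
thd2 = [10.0, 2.0, 3.0]  # relative masking thresholds (illustrative)
thd = final_threshold(thd1, thd2)
```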
The third operation uses the binary mask determined in 505. For coefficients whose corresponding binary mask element is M(i)=1, the gain may be set to one, i.e. no change is made to that spectral component. For coefficients whose corresponding binary mask element is M(i)=0, the appropriate gain is determined by comparing the power determined for that spectral component in 504 with the threshold THD decided on in 514. This comparison is shown in block 515, and it can be expressed as follows:
where g(i) is the adaptive gain.
Essentially, if P(i)<THD(i) the spectral component is too weak to be heard and gain control is not required. If P(i)>THD(i) the spectral component is sufficiently strong to be heard and so its power is adjusted by applying an appropriate gain to its coefficient in the frequency domain. The adaptive gain control is applied in 508 by computing a new coefficient X′(i):
X′(i)=g(i)*X(i)
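The gain selection and its application to the coefficients might be sketched as follows. The gain expression itself is not reproduced in this text; the sketch assumes the form given for the ninth implementation form, namely a gain of one for voice-like coefficients and for noise-like coefficients at or below the threshold, and the ratio THD(i)/P(i) otherwise. The handling of the case P(i)=THD(i), the function names and the sample values are assumptions.

```python
def adaptive_gain(P, M, THD):
    """Assumed rule: g(i) = 1 if M(i) = 1 or P(i) <= THD(i), else THD(i) / P(i)."""
    gains = []
    for p, m, thd in zip(P, M, THD):
        if m == 1 or p <= thd:
            gains.append(1.0)      # leave the coefficient unchanged
        else:
            gains.append(thd / p)  # attenuate an audible noise-like coefficient
    return gains

def apply_gain(X, g):
    """X'(i) = g(i) * X(i)."""
    return [gi * xi for gi, xi in zip(g, X)]

X = [2 + 0j, 0 + 4j, 1 + 1j]  # coefficients (illustrative)
P = [4.0, 16.0, 2.0]          # their powers
M = [1, 0, 0]                 # binary mask: first coefficient is voice-like
THD = [1.0, 4.0, 4.0]         # final masking thresholds (illustrative)
g = adaptive_gain(P, M, THD)
X_new = apply_gain(X, g)
```

Because the gain multiplies the coefficient, the power is scaled by the square of the gain. In this example the second coefficient ends with a power of 1, which is below its threshold of 4.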
The new coefficients X′(i) (i=0, 1, 2, . . . ) form the basis for the inverse Fourier transform in 509 that constructs the output frame 510. In this example the adjusted and non-adjusted coefficients are combined in the frequency domain and are then transformed together into the time domain. The combination could equally take place after the transformation into the time domain, so that the adjusted and non-adjusted coefficients are transformed into the time domain separately and then combined together to form a single output frame.
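The reconstruction of a time-domain frame from adjusted coefficients can be illustrated with a naive inverse DFT. This sketch uses an 8-sample frame for brevity and assumes that the gains are symmetric about the Nyquist bin (the same gain for index i and index N-i) so that the output remains real-valued. Windowing and overlap between frames are not shown. The names dft and idft and the gain values are illustrative.

```python
import cmath

def dft(x):
    """Naive forward DFT of a real or complex sequence."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * i * k / n) for k in range(n))
            for i in range(n)]

def idft(X):
    """Naive inverse DFT; returns complex samples."""
    n = len(X)
    return [sum(X[i] * cmath.exp(2j * cmath.pi * i * k / n) for i in range(n)) / n
            for k in range(n)]

frame = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]  # tone occupying bins 2 and 6
X = dft(frame)
gains = [1.0, 1.0, 0.5, 1.0, 1.0, 1.0, 0.5, 1.0]    # symmetric: g[i] == g[8 - i]
X_new = [g * c for g, c in zip(gains, X)]
time_samples = idft(X_new)
output_frame = [v.real for v in time_samples]        # imaginary parts are negligible
```

Since the tone occupies only the two attenuated bins, the output frame is the input frame scaled by 0.5.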
In other implementations, the threshold THD(i) may be determined differently. For example,
The coefficient-wise gain control described above changes the frequency domain spectrum and masks the noise components so that they become less perceptible. This effect is illustrated in
In
$$\min_{A}\,\lVert A X - T\rVert^{2}$$
where X is a matrix expression of microphone signals x1(i) and x2(i) and T is a matrix expression of target signal t(i). The optimum filter parameters are defined by the matrix expression A. The set of optimum filter parameters A can then be used to filter the microphone signals to generate a single, enhanced signal.
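The fitting of filter parameters to a target signal can be illustrated with a deliberately simplified least-squares sketch. It assumes real-valued signals and a single scalar weight per microphone, whereas a practical system may use complex per-bin or multi-tap filters. It solves the two-by-two normal equations directly. The function name optimum_filter and the signal values are hypothetical.

```python
def optimum_filter(x1, x2, t):
    """Weights (a1, a2) minimising the sum of (a1*x1 + a2*x2 - t)^2 over all samples."""
    r11 = sum(a * a for a in x1)
    r12 = sum(a * b for a, b in zip(x1, x2))
    r22 = sum(b * b for b in x2)
    p1 = sum(a * c for a, c in zip(x1, t))
    p2 = sum(b * c for b, c in zip(x2, t))
    det = r11 * r22 - r12 * r12
    if det == 0:
        raise ValueError("microphone signals are linearly dependent")
    # Cramer's rule for the 2x2 normal equations
    a1 = (p1 * r22 - p2 * r12) / det
    a2 = (p2 * r11 - p1 * r12) / det
    return a1, a2

x1 = [1.0, 0.0, 2.0, 1.0]                           # first microphone (illustrative)
x2 = [0.0, 1.0, 1.0, 3.0]                           # second microphone (illustrative)
t = [0.5 * a + 0.25 * b for a, b in zip(x1, x2)]    # target built from known weights
a1, a2 = optimum_filter(x1, x2, t)
enhanced = [a1 * a + a2 * b for a, b in zip(x1, x2)]  # single filtered output signal
```

Because the target in this example is an exact weighted sum of the two microphone signals, the fit recovers the weights used to build it and the filtered output matches the target.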
The primary aim in an ASR scenario is to increase the intelligibility of the audio signal that is input to the ASR block. The original microphone signals are optimally filtered. Preferably, no additional noise reduction is performed to avoid removing critical voice information. For a voice communication scenario, a good trade-off between subjective quality and intelligibility should be maintained. Noise reduction should be considered for this application. Therefore, the microphone signals may be subjected to noise reduction before being optimally filtered. These alternatives are illustrated in
In
In
In the two-microphone arrangement shown in
It should be understood that where this explanation and the accompanying claims refer to the device doing something by performing certain steps or procedures or by implementing particular techniques that does not preclude the device from performing other steps or procedures or implementing other techniques as part of the same process. In other words, where the device is described as doing something “by” certain specified means, the word “by” is meant in the sense of the device performing a process “comprising” the specified means rather than “consisting of” them.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present disclosure may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the disclosure.
This application is a continuation of International Application No. PCT/EP2017/051311, filed on Jan. 23, 2017, the disclosure of which is hereby incorporated by reference in its entirety.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/EP2017/051311 | Jan 2017 | US
Child | 16520050 | | US