The present invention relates to a computer-implemented method, a server, a video-conferencing endpoint, and a non-transitory storage medium.
During video calls, acoustic noises such as kitchen noises, dogs barking, or interfering speech from other people who are not part of the call can be annoying and distracting to the call participants and disruptive to the meeting. This is especially true for noise sources which are not visible in the camera view, as the human auditory system is less capable of filtering out noises that are not simultaneously detected by the visual system.
An existing solution to this problem is to combine multiple microphone signals into a spatial filter (or beam-former) that is capable of filtering out acoustic signals coming from certain directions that are said to be out-of-beam, for example from outside the camera view. This technique works well for suppressing out-of-beam noise sources if the video system is used outdoors or in a very acoustically dry room, i.e. one where acoustic reflections are extremely weak. However, in the majority of rooms where a video conferencing system is used, an out-of-beam noise source will generate a plethora of acoustic reflections arriving from directions which are in-beam. These in-beam reflections of the noise source are not filtered out by the spatial filter, and are therefore transmitted un-attenuated to the far-end participants. Thus, even with an ideal spatial filter, out-of-beam noises can still be transmitted and disrupt the video conference.
US 2016/0066092 A1 proposes approaching this issue by filtering source signals from an output based on directional-filter coefficients using a non-linear approach. Owens A. and Efros A. A., "Audio-Visual Scene Analysis with Self-Supervised Multisensory Features", in Ferrari V., Hebert M., Sminchisescu C., Weiss Y. (eds), Computer Vision—ECCV 2018, Lecture Notes in Computer Science, vol. 11210, Springer, Cham (2018), proposes approaching this issue through the application of deep-learning based models.
Accordingly, in a first aspect, embodiments of the invention provide a computer-implemented method of processing an audio signal, the method comprising:
The above recited method allows out-of-beam noise sources to be suppressed, and so improves the intelligibility of the in-beam audio sources.
Optional features of the invention will now be set out. These are applicable singly or in any combination with any aspect of the invention.
The invention includes the combination of the aspects and preferred features described except where such a combination is clearly impermissible or expressly avoided.
Determining in-beam components of the audio signal may include applying a beam-forming process to the received audio signals. The beam-forming process may include estimating an in-beam signal as a linear combination of time-frequency signals from each of the plurality of microphones. The linear combination may take the form:
xIB(t, f)=w1(f)·x1(t, f)+w2(f)·x2(t, f)+ . . . +wn(f)·xn(t, f), where wi(f) are complex combination weights and xi(t, f) are time-frequency signals, one for each of the n microphones.
In some examples, the in-beam signal xIB(t, f) (not necessarily calculated using the equation above) itself corresponds to the in-beam level; computing the in-beam level therefore involves computing the in-beam signal, and computing the post-processing gain can include using the in-beam level to calculate a further parameter for use in the post-processing gain. In other examples, the in-beam level is calculated from the in-beam signal xIB(t, f). Both variants are discussed in more detail below.
At least one microphone of the two or more microphones may be a unidirectional microphone, and another microphone of the two or more microphones may be an omnidirectional microphone, and determining in-beam components of the audio signals may include utilising the audio signals received by the unidirectional microphone as a spatial filter.
The microphones may be installed within a video-conferencing endpoint.
The reference level may be computed as:
Lref(t,f)=γ×|xi(t,f)|^p+(1−γ)×Lref(t−1,f);
where γ is a smoothing factor, p is a positive number which may take a value of 1 or 2, and xi(t, f) is a time-frequency component resulting from the discrete Fourier transform of the received audio signals. The smoothing factor may take a value between 0 and 1 inclusive.
The in-beam level may be computed as:
LIB(t,f)=γ×|xIB(t,f)|^p+(1−γ)×LIB(t−1,f);
where γ is a smoothing factor, p is a positive number which may take a value of 1 or 2, and xIB(t, f) is the in-beam time-frequency component resulting from the discrete Fourier transform of the received audio signals. The smoothing factor may take a value between 0 and 1 inclusive.
The post-processing gain may be computed as:
g(t,f)=LIB(t,f)/Lref(t,f).
The method may further comprise applying a squashing function to the post-processing gain, such that the post-processing gain takes a value of at least 0 and no more than 1. The squashing function may utilise a threshold T, and may take the form:
h(s)=0 if s<0;
h(s)=β·s^α if 0≤s≤T;
h(s)=1 if s>T;
where α and β are positive real values. In some examples, α=1 and β=1. In other examples the squashing function is an implementation of the generalised logistic function.
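As an illustration only, the following Python sketch implements the squashing function set out above; the threshold T and the exponents α and β are tuning parameters, and the default values shown here are illustrative assumptions rather than values taken from the description.

    import numpy as np

    def squash(s, T=0.5, alpha=1.0, beta=1.0):
        """Non-decreasing mapping of a gain ratio onto [0, 1].

        h(s) = 0              for s < 0
        h(s) = beta * s^alpha for 0 <= s <= T
        h(s) = 1              for s > T
        """
        s = np.asarray(s, dtype=float)
        out = np.where(s < 0.0, 0.0, beta * np.power(np.clip(s, 0.0, None), alpha))
        out = np.where(s > T, 1.0, out)
        return np.clip(out, 0.0, 1.0)  # keep the gain within [0, 1] for any alpha, beta

    # Example: negative ratios are fully suppressed, large ratios pass un-attenuated.
    print(squash(np.array([-0.2, 0.1, 0.4, 0.9])))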
In a further example, when LIB(t, f)≤T·Lref(t, f) the post-processing gain is computed as:
g(t,f)=β×(LIB(t,f)/Lref(t,f))^α;
where α and β are positive real numbers; otherwise the post-processing gain is computed as:
g(t,f)=1.
Applying the post-processing gain to the in-beam components may include multiplying the post-processing gain by the in-beam components.
In a further example, the in-beam level may be used to compute a covariance, c(t, f), between the determined in-beam components of the audio signals and the received audio signals, and the computed covariance is used to compute the post-processing gain. For example, the covariance may be computed as:
c(t,f)=γ×Re{xi(t,f)×xIB*(t,f)}+(1−γ)×c(t−1,f);
where xi(t, f) is a reference time-frequency component resulting from the discrete Fourier transform of the received audio signals, xIB(t, f) is the in-beam time-frequency component resulting from the discrete Fourier transform of the received audio signals corresponding to the in-beam level, Re{·} denotes the real part, and * denotes complex conjugation.
In this case, the post-processing gain may be computed as
g(t,f)=c(t,f)/Lref(t,f).
A squashing function may also be applied to this variant of the post-processing gain, such that the post-processing gain takes a value of at least 0 and no more than 1. Therefore, the post-processing gain is:
g(t,f)=h(c(t,f)/Lref(t,f));
where h(s) is the squashing function, for instance one using a threshold T as described for h(s) above. Using the covariance c(t, f) can improve the performance of the post-processing filter because the in-beam signal xIB(t, f) may be correlated with a received out-of-beam signal xOB(t, f)=xi(t, f)−xIB(t, f), and this correlation is reflected in the covariance c(t, f).
Alternatively, the post-processing gain may be computed using a linear, or widely linear, filter. This may involve computing the post-processing gain using a pseudo-reference level and a pseudo-covariance. For example, the post-processing gain may be computed as:
where g0(t, f) is computed as:
g1(t, f) is computed as:
LPref(t, f) is a pseudo-reference level, for example, computed as:
LPref(t,f)=γ×xi(t,f)^2+(1−γ)×LPref(t−1,f);
cP(t, f) is a pseudo-covariance, for example, computed as:
cP(t,f)=γ×xi(t,f)×xIB(t,f)+(1−γ)×cP(t−1,f);
and h is a squashing function, such that the post-processing gain takes a value between 0 and 1.
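A minimal sketch of the recursive updates for the pseudo-reference level and pseudo-covariance is given below, assuming complex DFT coefficients and an illustrative smoothing factor; the widely linear gain terms g0(t, f) and g1(t, f) that would be built from these quantities are not reproduced here.

    import numpy as np

    def update_pseudo_stats(x_ref, x_ib, L_pref_prev, c_p_prev, gamma=0.05):
        """One exponential-smoothing update of the pseudo-reference level and
        pseudo-covariance for a single time-frequency bin.

        x_ref       -- complex reference coefficient x_i(t, f)
        x_ib        -- complex in-beam coefficient x_IB(t, f)
        L_pref_prev -- previous pseudo-reference level L_Pref(t-1, f)
        c_p_prev    -- previous pseudo-covariance c_P(t-1, f)
        """
        # Pseudo quantities use the plain (unconjugated) products, so they stay complex.
        L_pref = gamma * x_ref ** 2 + (1.0 - gamma) * L_pref_prev
        c_p = gamma * x_ref * x_ib + (1.0 - gamma) * c_p_prev
        return L_pref, c_p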
The method may further comprise computing a common gain factor from one or more of the plurality of time-frequency signals, and applying the common gain factor to one or more of the other time-frequency signals as the post-processing gain. Applying the common gain factor may include multiplying the common gain factor with the post-processing gain before applying the post-processing gain to one or more of the other time-frequency signals.
The method may further comprise taking as an input a frame of samples from the received audio signals and multiplying the frame with a window function. The method may further comprise transforming the windowed frame into the frequency domain through application of a discrete Fourier transform, the transformed audio signals comprising a plurality of time-frequency signals.
Determining in-beam components of the audio signals may include receiving, from a video camera, a visual field, and defining in-beam to be the spatial region corresponding to the visual field covered by the video camera.
In a second aspect, embodiments of the invention provide a server, comprising a processor and memory, the memory containing instructions which cause the processor to:
The memory of the second aspect may contain machine executable instructions which, when executed by the processor, cause the processor to perform the method of the first aspect including any one, or any combination insofar as they are compatible, of the optional features set out with reference thereto.
In a third aspect, embodiments of the invention provide a video-conferencing endpoint, comprising:
The memory of the third aspect may contain machine executable instructions which, when executed by the processor, cause the processor to perform the method of the first aspect including any one, or any combination insofar as they are compatible, of the optional features set out with reference thereto.
In a fourth aspect, embodiments of the invention provide a computer, containing a processor and memory, wherein the memory contains machine executable instructions which, when executed on the processor, cause the processor to perform the method of the first aspect including any one, or any combination insofar as they are compatible, of the optional features set out with reference thereto. The computer may be, for example, a video-conferencing end point and may be configured to receive a plurality of audio signals over a network.
Further aspects of the present invention provide: a computer program comprising code which, when run on a computer, causes the computer to perform the method of the first aspect; a computer readable medium storing a computer program comprising code which, when run on a computer, causes the computer to perform the method of the first aspect; and a computer system programmed to perform the method of the first aspect.
Embodiments of the invention will now be described by way of example with reference to the accompanying drawings in which:
Aspects and embodiments of the present invention will now be discussed with reference to the accompanying figures. Further aspects and embodiments will be apparent to those skilled in the art. All documents mentioned in this text are incorporated herein by reference.
Each digitized signal is then fed into an analysis filter bank, which transforms it into the time-frequency domain. More specifically, at regular intervals (such as every 10 ms) the analysis filter bank takes as input a frame of samples (e.g. 40 ms), multiplies that frame with a window function (e.g. a Hann window function) and transforms the windowed frame into the frequency domain using a discrete Fourier transform (DFT). In other words, every 10 ms for example, each analysis filter bank outputs a set of N complex DFT coefficients (e.g. N=256). These coefficients can be interpreted as the amplitudes and phases of a sequence of frequency components ranging from 0 Hz to half the sampling frequency (the upper half of the frequencies is ignored as it does not contain any additional information). These signals are referred to as time-frequency signals and are denoted by x1(t, f), x2(t, f), and x3(t, f), one for each microphone, where t is the time frame index, which takes integer values 0, 1, 2, . . . , and f is the frequency index, which takes integer values 0, 1, . . . , N−1.
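By way of illustration, a minimal Python sketch of such an analysis filter bank is given below; the 16 kHz sampling rate, 40 ms frame, 10 ms hop and Hann window are example choices, and the function name is arbitrary.

    import numpy as np

    FS = 16000                 # assumed sampling rate in Hz
    FRAME = int(0.040 * FS)    # 40 ms frame of samples
    HOP = int(0.010 * FS)      # a new frame every 10 ms
    N_BINS = FRAME // 2 + 1    # only the lower half of the spectrum is kept

    def analysis_filter_bank(samples):
        """Transform a 1-D microphone signal into time-frequency coefficients x(t, f)."""
        window = np.hanning(FRAME)
        frames = []
        for start in range(0, len(samples) - FRAME + 1, HOP):
            frame = samples[start:start + FRAME] * window   # windowed frame
            frames.append(np.fft.rfft(frame))                # complex DFT coefficients
        return np.array(frames)   # shape: (num_time_frames, N_BINS)

    # Example: 1 second of white noise standing in for one microphone signal.
    x1_tf = analysis_filter_bank(np.random.randn(FS))
    print(x1_tf.shape)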
The time-frequency signal for each frequency index f is then processed independently of the other frequency indexes. Hence, for simplicity, the processing is described below for a single frequency index f.
For each frequency index f, a spatial filter is used to filter out sound signals coming from certain directions, which are referred to as out-of-beam directions. The out-of-beam directions are typically chosen to be the directions not visible in the camera view. The spatial filter computes an in-beam signal xIB(t, f) as a linear combination of the time-frequency signals for the microphones. The estimate of the in-beam for time index t and frequency index f is a linear combination of the time-frequency signals for all microphones, that is:
xIB(t,f)=w1(f)·x1(t,f)+w2(f)·x2(t,f)+w3(f)·x3(t,f)
where the complex combination weights w1(f), w2(f), and w3(f) are time independent and can be found using beamforming design approaches known per se in the art.
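A sketch of this linear-combination spatial filter is given below; the weights are assumed to have been produced by a separate beamforming design step (not shown), and the simple averaging weights used in the example are placeholders only.

    import numpy as np

    def in_beam_signal(x_tf, weights):
        """Linear-combination spatial filter.

        x_tf    -- array of shape (num_mics, num_frames, num_bins) of complex
                   time-frequency signals x_1(t, f), ..., x_n(t, f)
        weights -- array of shape (num_mics, num_bins) of complex, time-independent
                   combination weights w_1(f), ..., w_n(f)
        Returns x_IB(t, f) with shape (num_frames, num_bins).
        """
        # Sum_i w_i(f) * x_i(t, f) for every time frame and frequency bin.
        return np.einsum('mf,mtf->tf', weights, x_tf)

    # Example with 3 microphones, 100 frames and 321 frequency bins of random data.
    rng = np.random.default_rng(0)
    x_tf = rng.standard_normal((3, 100, 321)) + 1j * rng.standard_normal((3, 100, 321))
    w = np.full((3, 321), 1.0 / 3.0, dtype=complex)   # placeholder weights (plain average)
    x_ib = in_beam_signal(x_tf, w)
    print(x_ib.shape)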
At this stage, the in-beam signal, which is the output of the spatial filter, may contain a significant amount of in-beam reflections generated by one or more out-of-beam sound sources. These unwanted reflections are filtered out by the post-processor which is discussed in detail below. After post-processing each frequency index f, a synthesis filter bank is used to transform the signals back into the time domain. This is the inverse operation of the analysis filter bank, which amounts to converting N complex DFT coefficients into a frame comprising, for example, 10 ms of samples.
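For completeness, a batch-mode sketch of one common synthesis scheme (window-compensated overlap-add) is shown below; a real-time implementation would instead emit one hop (e.g. 10 ms) of samples per frame, and the particular normalisation used here is a design choice rather than the scheme mandated by the description.

    import numpy as np

    FS = 16000
    FRAME = int(0.040 * FS)
    HOP = int(0.010 * FS)

    def synthesis_filter_bank(x_tf):
        """Inverse of the analysis filter bank: complex DFT coefficients back to samples."""
        window = np.hanning(FRAME)
        num_frames = x_tf.shape[0]
        length = (num_frames - 1) * HOP + FRAME
        out = np.zeros(length)
        norm = np.zeros(length)
        for t in range(num_frames):
            frame = np.fft.irfft(x_tf[t], n=FRAME) * window   # window again before overlap-add
            start = t * HOP
            out[start:start + FRAME] += frame
            norm[start:start + FRAME] += window ** 2
        return out / np.maximum(norm, 1e-12)   # compensate for the overlapping windows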
The post-processor takes two time-frequency signals as inputs. The first is a reference signal, here chosen to be the first time-frequency signal x1(t, f), although any of the other time-frequency signals could instead be used as the reference signal. The second input is the in-beam signal xIB(t, f), which is the output of the spatial filter. For each of these two inputs, a level is computed using exponential smoothing. That is, the reference level is:
Lref(t,f)=γ·|x1(t,f)|^p+(1−γ)·Lref(t−1,f)
where γ is a smoothing factor and p is a positive number which may take a value of 1 or 2. γ may take a value of between 0 and 1 inclusive. Similarly, the in-beam level in this example is
LIB(t,f)=γ·|xIB(t,f)|^p+(1−γ)·LIB(t−1,f)
Whilst in this example exponential smoothing has been used, a different formula could instead be used to compute the level, such as a sample variance over a sliding window (for example, over the last 1 ms of samples). The reference level and in-beam level are then used to compute a post-processing gain g(t, f) which is to be applied to the in-beam signal xIB(t, f). This gain g(t, f) is a number between 0 and 1, where 0 indicates that the in-beam signal for time index t and frequency index f is completely suppressed and 1 indicates that it is left un-attenuated. Ideally, therefore, the gain should be close to zero when the in-beam signal for a time index t and frequency index f is dominated by noisy reflections from an out-of-beam sound source, and close to one when it is dominated by an in-beam sound source. In this way, if the time-frequency representation is appropriately chosen, out-of-beam sound sources will be heavily suppressed and in-beam sound sources will pass through the post-processor largely un-attenuated. To provide this, an approximation to the Wiener filter can be used, which corresponds to the gain:
g(t,f)=SNR(t,f)/(SNR(t,f)+1)
where SNR(t, f) is the estimated signal-to-noise ratio (SNR) at time index t and frequency index f. This type of gain is known per se for conventional noise reduction, such as single-microphone spectral subtraction, where the stationary background signal is considered as noise and everything else is considered as signal. In applying this to the present method, however, different definitions are used: the in-beam signal xIB(t, f) is taken to be the signal and the out-of-beam signal xOB(t, f)=x1(t, f)−xIB(t, f) is taken to be the noise. Substituting these definitions of signal and noise into the Wiener filter formula gives:
g(t,f)=LIB(t,f)/Lref(t,f)
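The level tracking and the resulting gain can be sketched as follows, operating on one frame of complex DFT coefficients; the smoothing factor and the small constant guarding against division by zero are illustrative choices.

    import numpy as np

    def post_processing_gain(x_ref, x_ib, L_ref_prev, L_ib_prev, gamma=0.05, p=2):
        """Exponentially smoothed levels and the Wiener-like gain g = L_IB / L_ref.

        x_ref, x_ib           -- complex DFT coefficients of the reference and in-beam
                                 signals for the current frame (arrays over frequency f)
        L_ref_prev, L_ib_prev -- levels carried over from the previous frame
        """
        L_ref = gamma * np.abs(x_ref) ** p + (1.0 - gamma) * L_ref_prev
        L_ib = gamma * np.abs(x_ib) ** p + (1.0 - gamma) * L_ib_prev
        gain = L_ib / np.maximum(L_ref, 1e-12)   # avoid division by zero in silent bins
        return gain, L_ref, L_ib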
Due to the way in which the two levels are computed, the ratio LIB(t, f)/Lref(t, f) is not guaranteed to be less than or equal to one. For this reason, and in order to give more flexibility in tuning of the suppression performance, a squashing function h is applied to the ratio LIB(t, f)/Lref(t, f) so that the final post-processing gain is given by:
g(t,f)=h(LIB(t,f)/Lref(t,f))
Here, the squashing function h is defined as a non-decreasing mapping from the set of real numbers to the set [0, 1]. An example of such a squashing function is as follows:
h(s)=0 if s<0
h(s)=s if s≥0 and s≤T
h(s)=1 if s>T
where T≤1 is a positive threshold. This leads to the following formula for the post-processing gain: g(t,f)=LIB(t,f)/Lref(t,f) when LIB(t,f)≤T·Lref(t,f), and g(t,f)=1 otherwise.
In a variant, the in-beam level is used to compute a short-time estimate of the covariance c(t, f) between the in-beam signal and the reference signal, for example using exponential smoothing:
c(t,f)=γ·Re{x1(t,f)·xIB*(t,f)}+(1−γ)·c(t−1,f)
where xIB(t, f) is the in-beam time-frequency signal corresponding to the in-beam level in this example, γ is a smoothing factor, Re{·} denotes the real part, and * denotes complex conjugation. The post-processing gain is then:
g(t,f)=h(c(t,f)/v(t,f))
where v(t, f) is the short-time estimate of the covariance of the reference signal with itself, which is the same as an estimate of the variance of the reference signal and is calculated using the same equation as for Lref(t, f) in the previous variant. Thus, the post-processing gain is:
g(t,f)=h(c(t,f)/Lref(t,f)).
Whilst in this example exponential smoothing has been used, a different formula could instead be used to compute the short-time covariance, such as a sample covariance over a sliding window (for example, over the last 1 ms of samples).
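A sketch of this covariance-based variant is shown below; taking the real part of the conjugate product as the short-time covariance estimate is an assumption (consistent with the squashing function being defined for negative arguments), and the smoothing factor and threshold values are illustrative.

    import numpy as np

    def covariance_gain(x_ref, x_ib, c_prev, v_prev, gamma=0.05, T=0.5):
        """Post-processing gain based on the short-time covariance c(t, f).

        c(t, f) is a smoothed Re{x_1(t, f) * conj(x_IB(t, f))} (an assumed form), and
        v(t, f) is the smoothed variance of the reference signal, |x_1(t, f)|^2.
        """
        c = gamma * np.real(x_ref * np.conj(x_ib)) + (1.0 - gamma) * c_prev
        v = gamma * np.abs(x_ref) ** 2 + (1.0 - gamma) * v_prev
        ratio = c / np.maximum(v, 1e-12)
        # Squashing with alpha = beta = 1: 0 below zero, the ratio itself up to T, 1 above T.
        gain = np.clip(np.where(ratio > T, 1.0, ratio), 0.0, 1.0)
        return gain, c, v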
To demonstrate how the post-processing gain works, consider a spatial filter which creates a single beam in front of a video system and that T is set to 0.5. In the scenario shown in
Turning now to the scenario shown in
In other words, an in-beam sound source that is close to the video system is not attenuated by the post processor. But, beyond a certain distance from the video system, an in-beam sound source will be attenuated. This distance is determined at least in part by the room acoustics. An acoustically dry room will have a larger distance than a wet (e.g. highly reverberant) room.
In other words, an out-of-beam sound source that is close to the video system will be heavily attenuated by the post-processor. At larger distances, an out-of-beam sound source will still be attenuated, but not as much.
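A small worked example with made-up level values illustrates the two scenarios, again with T set to 0.5:

    # Hypothetical smoothed levels for one frequency bin, chosen only to illustrate the idea.
    T = 0.5

    # Scenario 1: an in-beam talker close to the system dominates both levels.
    L_ref, L_ib = 1.00, 0.90          # most of the reference energy is in-beam
    ratio = L_ib / L_ref              # 0.9 > T, so the squashing function returns 1
    print("in-beam source, gain =", 1.0 if ratio > T else ratio)

    # Scenario 2: an out-of-beam source; only weak in-beam reflections reach the beam.
    L_ref, L_ib = 1.00, 0.15          # the in-beam level is a small fraction of the reference
    ratio = L_ib / L_ref              # 0.15 <= T, so the gain equals the ratio (heavy attenuation)
    print("out-of-beam source, gain =", 1.0 if ratio > T else ratio)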
The post-processing gain described above for a given frequency index f is computed based on the information available for that frequency index only, and a good spatial filter is needed for it to function well. Typically, it is difficult to design good spatial filters for very low and very high frequencies, because of the limited physical volume available for microphone placement and practical limitations on the number of microphones and their pairwise distances. Therefore, an additional common gain factor can be computed from the frequency indexes which have a good spatial filter, and subsequently applied to the frequency indexes that do not. For example, the additional gain factor may be computed as:
where Tcommon<1 is a positive threshold, and Σf is a sum over all frequency indexes where a good spatial filter can be applied. If this additional factor is used, it is multiplied with the time-frequency gains g(t, f) before they are applied to the in-beam signals. This common gain factor can also serve as an effective way to further suppress out-of-beam sound sources whilst leaving in-beam sound sources un-attenuated.
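The exact formula for the common gain factor is not reproduced above; the sketch below therefore uses an assumed ratio-of-sums form, taking from the description only the threshold Tcommon<1, the sum over the well-filtered frequency indexes, and the multiplication with the per-frequency gains.

    import numpy as np

    def common_gain(L_ib, L_ref, good_bins, T_common=0.5):
        """Common gain factor computed from frequency indexes with a good spatial filter.

        L_ib, L_ref -- per-bin smoothed levels for the current frame
        good_bins   -- indices of the frequency bins where the spatial filter works well
        The ratio-of-sums form is an assumed reconstruction, not the formula itself.
        """
        ratio = np.sum(L_ib[good_bins]) / max(np.sum(L_ref[good_bins]), 1e-12)
        return 1.0 if ratio > T_common else min(max(ratio, 0.0), 1.0)

    def apply_common_gain(g, g_common):
        """Multiply the per-bin gains with the common factor before applying them."""
        return g * g_common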
These methods of post-processing allow through in-beam sound sources that are close to the microphone array whilst also significantly suppressing out-of-beam sound sources. The post-processor gain can be tuned to also significantly suppress in-beam sound sources which are far away from the microphone array. When applied to videoconferencing endpoints, the user will experience a bubble-shaped microphone pick-up pattern where the bubble extends from the microphone array and reaches out in front of the camera.
As an alternative, the post-processing gain may be computed using a linear, or widely linear, filter based on a pseudo-reference level and a pseudo-covariance, as set out above. The pseudo-reference level LPref(t, f) may, for example, be computed as:
LPref(t,f)=γ×xi(t,f)^2+(1−γ)×LPref(t−1,f);
and the pseudo-covariance cP(t, f) may, for example, be computed as:
cP(t,f)=γ×xi(t,f)×xIB(t,f)+(1−γ)×cP(t−1,f);
with a squashing function h applied such that the post-processing gain takes a value between 0 and 1. Alternative formulations of the post-processing gain may also be used.
The features disclosed in the description, or in the following claims, or in the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for obtaining the disclosed results, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof.
While the invention has been described in conjunction with the exemplary embodiments described above, many equivalent modifications and variations will be apparent to those skilled in the art when given this disclosure. Accordingly, the exemplary embodiments of the invention set forth above are considered to be illustrative and not limiting. Various changes to the described embodiments may be made without departing from the spirit and scope of the invention.
For the avoidance of any doubt, any theoretical explanations provided herein are provided for the purposes of improving the understanding of a reader. The inventors do not wish to be bound by any of these theoretical explanations.
Any section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described.
Throughout this specification, including the claims which follow, unless the context requires otherwise, the words "comprise" and "include", and variations such as "comprises", "comprising", and "including", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.
It must be noted that, as used in the specification and the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. Ranges may be expressed herein as from "about" one particular value, and/or to "about" another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by the use of the antecedent "about," it will be understood that the particular value forms another embodiment. The term "about" in relation to a numerical value is optional and means, for example, +/−10%.