The invention concerns a method for filtering spatial noise of at least one sound signal, whereby the invention may be implemented as a computer algorithm or as a system for filtering spatial noise comprising at least two microphones or an array of microphones.
Spaced pressure microphone arrays allow the design of spatial filters that can focus on one specific direction while suppressing noise or interfering sources from other directions, which is also referred to as beamforming. The most basic beamforming approaches are the conventional delay-and-sum and filter-and-sum techniques. The delay-and-sum beamformer estimates the time delays of the signals received by each microphone of an array and compensates for the time difference of arrival [5]. Narrow directivity patterns can be obtained, but this requires a large spacing between the microphones and a large number of microphones. An even frequency response for all audible frequencies can be created by using the filter-and-sum technique.
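For illustration of this background technique only, the following Python sketch shows the delay-compensate-and-sum principle for a far-field source and a linear array; the geometry, the function name and the sign convention of the steering delay are assumptions of this example, not part of the invention.

```python
# Illustrative delay-and-sum sketch for a uniform linear array (prior art,
# not the claimed invention). Geometry and sign conventions are assumptions.
import numpy as np

def delay_and_sum(mics, positions, theta, fs, c=343.0):
    """mics: (M, N) microphone signals; positions: (M,) x-coordinates in metres;
    theta: look direction in radians; fs: sampling rate in Hz."""
    M, N = mics.shape
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    out = np.zeros(N)
    for m in range(M):
        tau = positions[m] * np.cos(theta) / c      # time difference of arrival
        spec = np.fft.rfft(mics[m])
        # Advance each channel by its arrival delay before summing.
        out += np.fft.irfft(spec * np.exp(2j * np.pi * freqs * tau), n=N)
    return out / M
```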
In international patent application published under publication number WO 2007/106399 A2, a directional microphone array having at least two microphones generates forward and backward cardioid signals from two omnidirectional microphone signals. An adaptation factor is applied to the backward cardioid signal, and the resulting adjusted backward cardioid signal is subtracted from the forward cardioid signal to generate a first-order output audio signal corresponding to a beam pattern having no nulls for negative values of the adaptation factor. After low-pass filtering, it is proposed to apply spatial noise suppression to the output audio signal. Time-variant methods have been proposed to combine the microphones optimally to minimize the level of unwanted sources while retaining the signal arriving from the desired direction. One of the most well-known techniques in adaptive beamforming is the Minimum Variance Distortionless Response (MVDR), based on minimizing the power of the output while preserving the signal from the look direction by employing a set of weights and placing nulls at the directions of the interferers [6]. Such beamformers still require a relatively high number of microphones in a spatial arrangement of considerable dimensions.
A closely-spaced microphone array technique can also be used for beamforming, where microphone patterns of different orders are derived [7]. In that technique, the microphones are summed together in the same or opposite phase with different gains and frequency equalization, where typically microphone signals having directivity patterns following the spherical harmonics of different orders are targeted. Unfortunately, the response typically has tolerable quality only in a limited frequency window: at low frequencies the system suffers from amplification of the self-noise of the microphones, and at high frequencies the directivity patterns are deformed.
These beamforming techniques do not assume anything about the signals of the sources. Recently, some techniques have been proposed which assume that the signals arriving from different directions at the microphone array are sparse in the time-frequency domain, i.e., one of the sources is dominant at a given time-frequency position [19]. Each time-frequency frame is then attenuated or amplified according to the spatial parameters analyzed for the corresponding time-frequency position, which essentially assembles the beam. It is clear that such methods may introduce distortion into the output; however, the assumption is that the distortion is most prominent in the weakest time-frequency slots of the signals, making the artifacts inaudible or at least tolerable.
For such techniques, a microphone array consisting of two cardioid capsules facing opposite directions has been proposed in [15] and [16]. Correlation measures between the cardioid capsules are used, and Wiener filtering is applied to reduce the level of coherent sound in one of the microphone signals. This produces a directive microphone signal whose beam width can be controlled. An inherent result is that the width varies depending on the sound field. For example, with a few speech sources in relatively anechoic conditions, a prominent narrowing of the cardioid pattern is obtained. However, with many uncorrelated sources, and in a diffuse field, the method does not change the directivity pattern of the cardioid microphone at all. The method is still advantageous, as the number of microphones is low and the setup does not require a large spatial arrangement.
The assumption of the sparsity of the source signals is also utilized in another technique, Directional Audio Coding (DirAC) [11], which is a method to capture, process and reproduce spatial sound over different reproduction setups. The most prominent direction of arrival (DOA) and the diffuseness of the sound field are computed or measured as spatial parameters for each time-frequency position of the sound. The DOA is estimated as the opposite direction of the intensity vector, and the diffuseness is estimated by comparing the magnitude of the intensity vector with the total energy. In the original version of DirAC the parameters are utilized in reproduction to enhance audio quality. A variant of DirAC has been used for beamforming [12], where each time-frequency position of the sound is amplified or attenuated depending on the spatial parameters and a specified spatial filter pattern. In practice, if the DOA of a time-frequency position is far from the desired direction, the position is attenuated. Additionally, if the diffuseness is high, the attenuation is made milder, as the DOA is considered less certain. However, when two sources are active in the same time-frequency position, the analyzed DOA provides erroneous data and artifacts may occur. In Simeon Delikaris-Manias, Simulations of second order microphones in audio coding, 1 Jan. 2012, pages 1 to 6, XP055104330, as retrieved from the Internet under http://hal.archivesouvertes.fr/docs/00/61/67/63/PDF/report.pdf, a theoretical model for comparing higher-order with first-order inputs in DirAC analysis has been presented. In the theoretical model, the proposed gain is obtained by computing the cross-correlation between two signals, normalized with a normalization coefficient. The calculated virtual microphone signals that contain the signal information may then be filtered with the proposed DirAC gain.
One aim of the invention is to substantially improve the signal-to-spatial-noise ratio (SSNR) of an acoustic signal captured by an electric or electronic apparatus such as a microphone array, even in real time. Ideally, the spatial noise filtering should not leave acoustic artifacts or give rise to self-noise amplification resulting from the spatial noise filtering method. By the term “spatial noise” we mean, in this document, sounds coming from undesired or unwanted directions. Our aim is thus not only to improve the signal-to-spatial-noise ratio but also to enhance spatial noise filtering and suppress other sound sources.
A second aim of the invention is to reduce the number of microphones and similar hardware used for spatial filtering, since present-day telecom devices generally need to be small and light, in order to minimize the electric and electronic installation effort as well as to improve the practicability of the audio device, such as a mobile phone, computer, tablet or similar.
A third aim of the invention is to use established, that is, already existing audio recording devices with minimal or no additional hardware, by implementing the desired method as a computer-executable algorithm.
The above-mentioned aims are reached by the parametric spatial filtering method according to claim 1, by the computer readable storage medium according to claim 13 when executed in a machine or computer carrying out the method, and by the spatial filtering system according to claim 14.
The dependent claims describe various advantageous aspects and embodiments of the method and of the spatial filtering system. The method and the corresponding algorithm and system utilize Cross Pattern Correlation or Cross Pattern Coherence (CPC) between microphone signals, in particular between microphone signals with directivity patterns of different orders, as a criterion for focusing in specific directions. The cross-pattern correlation between the microphone signals is estimated in the time-frequency domain, where the similarity of the microphone signals is measured for each time-frequency frame. A spatial parameter is extracted and used to assign gain/attenuation values to a coincidentally captured audio signal.
The parametric method for spatial filtering of at least one sound signal includes the following steps:
The method can be applied advantageously to systems that use focusing or background noise suppression, such as teleconferencing systems. Moreover, although this method is presented for monophonic reproduction, as the beam aims towards one direction at a time, it can be extended to multichannel reproduction systems by steering multiple beams towards each loudspeaker direction.
Ideally, the cross-pattern correlation or coherence is used to define a correlation or coherence measure between the captured signals for the same look direction, where the measure of correlation or coherence is high (exceeds a pre-defined threshold), and/or where the first and second directivity patterns have high sensitivity (exceeding a pre-defined threshold) and/or equal phase for the same look direction. In this way, either the proper microphone with the most convenient order of directivity pattern can be selected, for instance a dipole microphone and a quadrupole microphone, to fit the direction of intended operation, or alternatively the best look direction of a particular microphone setup can be determined. For the latter, the method is carried out for many or all possible look directions in order to define a look direction of optimal signal-to-spatial-noise ratio and attenuation performance for the first and second microphone at peak values of the coherence measure. The coherence between two microphone signals of different orders receives its maximum value when the directivity patterns of the microphones have equal phase and high amplitude sensitivity towards the arrival direction of the desired signal. Advantageously, a first and a second sound signal can be captured and treated simultaneously. The method has proven very effective even in distinguishing two independent sound signals. With this quality our method has an advantage over the DirAC technique, and it can be used to produce a much narrower directivity pattern than DirAC.
In one embodiment described in the figures, the first directivity pattern is equivalent to a directivity pattern of first order, and the second directivity pattern is equivalent to a directivity pattern of second order. Due to the different spatial patterns, specially optimized look directions may be created. The method proves very flexible in generating optimized look directions (with high SSNR values) in the desired direction.
A normalization of the cross-pattern correlation can be used to compensate for the magnitudes of the first and second captured signals, for instance by normalizing with the energy of both captured signals. The normalization is effective and easy to implement, because it takes into account common features of the signals of multiple orders.
The gain factor depends on the cross-pattern correlation or the normalized cross-pattern correlation, which is why it should ideally be time-averaged to eliminate signal level fluctuations and to provide smoothing. In this way the systematic error of the gain factor can be reduced regardless of the temporal magnitude characteristics of the captured sound signal.
If the gain factor is half-wave rectified in order to obtain a unique beamformer at the desired look direction, possible artifacts can be avoided, since the correlation would otherwise also take negative values, which could be troublesome during signal synthesis. In the synthesis, the gain factor is applied to a microphone stream or a third captured signal, imposing the direction-dependent gain on the stream or the third captured signal and thereby attenuating input from directions with a low coherence measure. Therefore the gain factor may very well also be called an attenuation factor, which attenuates unwanted (non-coherent) parts of the captured signals more strongly than the coherent ones.
The method may be implemented as a computer programme, an algorithm or machine code, which might be stored on a computer readable storage medium such as a hard drive, disc, CD, DVD, smart card, USB stick or similar. This medium would hold one or more sequences of instructions for a machine or computer to carry out the method according to the invention with at least the first microphone, the second microphone and the third microphone. This would be the easiest and most economic way to employ the method on already existing (tele-)communication systems having three or more microphones.
The spatial filtering system based on cross-pattern coherence comprises acoustic streaming inputs for a microphone array with at least a first microphone and a second microphone, and an analysis module configured to perform the following steps:
The system can be adapted to suppress noise in multi-party telecommunication systems or mobile phones with a hands-free option.
The system may further comprise an equalization module equalizing the first captured signal and the second captured signal to both have the same phase and magnitude responses before the analysis module calculates the gain factor. This type of equalization is especially advantageous for conditioning sound signal streams for the proposed spatial filtering method.
The invention is based on insights stemming from the idea of Modal Microphone Array Processing, which was chosen as the mathematical framework of the invention. For general background on Modal Microphone Array Processing the reader is referred to references [3] and [4]. Relevant for the invention are the zeroth- and higher-order signals of the resulting microphone signals for each sample n:
A_{pq}^{\sigma}(n) = \left\{ \left[ Y_{pq}^{\sigma}(\varphi,\theta) \right]^{T} Y_{pq}^{\sigma}(\varphi,\theta) \right\}^{-1} \left[ Y_{pq}^{\sigma}(\varphi,\theta) \right]^{T} H_{m}(n) \qquad (1)
where H_m(n) is a matrix containing the signals from each microphone m and Y_{pq}^σ(φ, θ) are the spherical harmonic coefficients for azimuth φ and elevation θ for the pth order and qth degree. A_{pq}^σ are the resulting microphone signals. Each spherical harmonic function consists of the gain matrix for each separate microphone. The term {[Y_{pq}^σ(φ, θ)]^T Y_{pq}^σ(φ, θ)}^{−1} [Y_{pq}^σ(φ, θ)]^T is the Moore-Penrose pseudoinverse of Y_{pq}^σ(φ, θ) [2]. The encoding process is illustrated in the accompanying drawings. The real spherical harmonics are given by
Y_{pq}^{\sigma}(\varphi,\theta) = N_{pq}\, P_{pq}(\cos\theta) \begin{cases} \cos(q\varphi), & \sigma = +1 \\ \sin(q\varphi), & \sigma = -1 \end{cases} \qquad (2)

with the normalization term

N_{pq} = \sqrt{ (2-\delta_{q0})\, \frac{2p+1}{4\pi}\, \frac{(p-q)!}{(p+q)!} } \qquad (3)

where δ_{q0} is the Kronecker delta and P_{pq}(cos(θ)) are the associated Legendre functions. In a general fashion these functions have been extensively discussed in [1].
The algorithm according to the invention is simple to implement and offers the capability of coping with interfering sources at different spatial locations, with or without the presence of background noise. It can be implemented using any kind of microphones that share the same look direction and have the same magnitude and phase responses.
The signals obtained from a microphone array are transformed into the time-frequency domain through a Fourier transform, such as the Short Time Fourier Transform (STFT). Given a microphone signal A_{pq}^σ(n), the corresponding complex time-frequency representation is denoted as A_{pq}^σ(k, i), where k is the frequency bin and i the time frame.
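A minimal sketch of this transform step, assuming SciPy is available; the sampling rate and window parameters below are illustrative choices only:

```python
# STFT of one encoded microphone signal: A_pq(n) -> A_pq(k, i).
import numpy as np
from scipy.signal import stft

fs = 48000                       # assumed sampling rate
a_pq = np.random.randn(fs)       # stand-in for one encoded signal A_pq(n)
freqs, times, A_pq = stft(a_pq, fs=fs, nperseg=1024, noverlap=512)
# A_pq[k, i] is the complex coefficient for frequency bin k and time frame i.
```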
As mentioned before, the correlation and the coherence are measured between signals originating from different orders of spherical harmonics. For this operation, the output signals from the matrixing process are equalized so that the resulting spectra of the different orders match each other. In other words, the responses need not be spectrally flat; however, both the phase and the magnitude responses need to be equal in the signals of different orders. This differs from conventional equalization methods, where the microphone signals are equalized according to direct inversion of the radial weightings [7] or modified radial weightings when the microphone array is baffled [21]. Such matching is achieved by using a regularized inversion of the radial weightings W_r [7] to control the inversion.
The resulting equalized signals are:
M_{q}^{p}(k,i) = \mathrm{EQ}_{pq}^{\sigma}(k,i)\, A_{pq}^{\sigma}(k,i) \qquad (4)
The equalizer EQ_{pq}^σ(k, i) for each signal is calculated by using a regularization coefficient to control the output [8],[9]:
\mathrm{EQ}_{pq}^{\sigma}(k,i) = \frac{ W_{r}^{*}(k) }{ \left| W_{r}(k) \right|^{2} + \beta } \qquad (5)
where β is the regularization coefficient. The regularization parameter is frequency dependent, specifies the amount of inversion within a frequency region, and can be used to control the power output. A regularization value on the order of 10^{-6} is applied within the frequency limits where the performance is designed to be optimal.
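A minimal sketch of one common form of such a regularized inversion, following Eq. (5); the function name and the scalar default for β are assumptions:

```python
# Regularized inversion of the radial weightings W_r per frequency bin.
import numpy as np

def regularized_equalizer(Wr, beta=1e-6):
    """Wr: complex radial weighting per frequency bin; returns EQ per bin."""
    return np.conj(Wr) / (np.abs(Wr) ** 2 + beta)

# Example: EQ = regularized_equalizer(np.array([0.01 + 0.1j, 1.0, 2.0]))
```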
The aim of the method according to the invention is to capture a sound signal originating from one specific direction while attenuating signals from other directions. It employs a spatial filtering technique that reduces background noise and interfering sources around the desired sound source by using a coherence measure. The main idea behind this contribution is that the correlation or coherence between two microphone signals of different orders receives its maximum value when the directivity patterns of the microphones have equal phase and high amplitude sensitivity towards the arrival direction of the sound signal. In other words, a plane wave is captured coherently by carefully selected microphone signals of different orders only when the DOA of the plane wave coincides with the selected direction. In all other cases the correlation/coherence is reduced.
The method/algorithm indicates that microphone signals bearing the positive phase of their directivity patterns in the same direction should be utilized for spatial filtering. The spherical or cylindrical harmonic framework can be used for straightforward matrixing to derive the microphone patterns.
One important step of the method according to the invention is to compute the cross-pattern correlation Γ between two different microphone signals:
\Gamma(k,i) = M_{1}^{1}(k,i) \left[ M_{1}^{2}(k,i) \right]^{*} \qquad (6)
where M_1^1(k, i) and M_1^2(k, i) are the time-frequency representations of separate microphone signals whose directivity patterns have the same look direction. From (6) it is clear that Γ(k, i) depends on the magnitudes of the microphone signals, which is not desired, as the spatial parameter should depend only on the direction of arrival of the sound. To circumvent this, a normalization is used in the present approach to derive a spatial parameter G:
G(k,i) = \frac{ 2\, \Re\{ \Gamma(k,i) \} }{ \left| M_{1}^{1}(k,i) \right|^{2} + \left| M_{-1}^{1}(k,i) \right|^{2} + \left| M_{1}^{2}(k,i) \right|^{2} + \left| M_{-1}^{2}(k,i) \right|^{2} } \qquad (7)
where ℜ{Γ} denotes the real part of the cross-pattern correlation Γ. In this document, G denotes the normalized correlation and is referred to as the spatial parameter of the Cross-Pattern Coherence (CPC) algorithm. In (7), M_{-1}^1 and M_{-1}^2 are microphone signals with directivity patterns M_{-1}^1(ψ) and M_{-1}^2(ψ) selected in such a way that:
\left[ M_{1}^{n}(\psi) \right]^{2} + \left[ M_{-1}^{n}(\psi) \right]^{2} = \left[ M_{0}(\psi) \right]^{2} \qquad (8)
for n = 1 and n = 2, where M_0(ψ) is the directivity pattern of the signal M_0 that will be used as the audio signal attenuated selectively in the time-frequency domain, ψ ∈ [0°, 360°), and M_1^1(ψ), M_1^2(ψ) are the directivity patterns of the signals M_1^1 and M_1^2. Equation (8) should be satisfied for all plane waves with direction of arrival ψ. The normalization process in (7) ensures that with all inputs the computed coherence value is bounded within the interval [−1, 1], and that values near unity are obtained only when the signals M_1^1(k, i) and M_1^2(k, i) are equivalent in both phase and magnitude.
As coherence values near unity imply that there is some sound arriving from the look direction, values near zero or below indicate that the sound of the analyzed time-frequency frame does not originate from the look direction. Taking this into consideration, a rule may be defined whereby only the positive part of this lobe is chosen, yielding a unique beamformer at the look direction.
This may be performed with a half-wave rectifier. If M_x and M_y, where x and y represent the different microphone orders, are identical for one specific direction, then their power spectra are equal and the value of G is unity. If M_x and M_y are completely uncorrelated, G receives a value of zero. Therefore the interval [0, 1] indicates the level of coherence between the microphone signals: the higher the coherence, the higher the value of G. So far we have introduced an attenuation/gain value G that can be used to synthesize the output signal of the proposed spatial filtering technique. The synthesis part consists of a single output signal S, which can be computed by straightforward multiplication of the half-wave rectified function G with a microphone signal M_0:
S(k,i)=max(0,G(k,i))M0(k,i). (9)
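A minimal sketch of this analysis and synthesis, following Eqs. (6) to (9); the argument names for the first- and second-order cosine/sine signals and the small guard value in the division are assumptions:

```python
# CPC analysis (Eqs. (6)-(7)) and synthesis (Eq. (9)) on STFT-domain arrays
# of shape (K, I): M1p/M1m are the first-order cosine/sine signals, M2p/M2m
# the second-order ones, M0 the audio signal to be attenuated.
import numpy as np

def cpc_gain(M1p, M1m, M2p, M2m, eps=1e-12):
    corr = M1p * np.conj(M2p)                              # Eq. (6)
    energy = (np.abs(M1p) ** 2 + np.abs(M1m) ** 2
              + np.abs(M2p) ** 2 + np.abs(M2m) ** 2)
    return 2.0 * np.real(corr) / (energy + eps)            # Eq. (7), in [-1, 1]

def synthesize(G, M0):
    return np.maximum(0.0, G) * M0                         # Eq. (9), half-wave rectified
```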
In order to obtain good sound quality, the signal M_0 needs to have a spectrally flat response. The level of self-noise produced by the microphone should also be low. An exemplary solution is to use a zeroth-order microphone for this purpose, as available pressure microphones typically have a flat magnitude response with a tolerable noise level.
The value of the spatial parameter G for each time-frequency frame is calculated according to the correlation/coherence between the microphone signals. In a recording of a real sound scenario the levels of sound sources with different directions of arrival may fluctuate rapidly and result in rapid changes in the calculated spatial parameter G. Taking the product of the microphone signal and the spatial parameter as in (9) then produces clearly audible artifacts in the output. The main cause is the relatively fast fluctuation of G, and the artifact is referred to as the bubbling effect. Similar effects have been reported for adaptive feedback cancellation processors used in hearing aids [22], [23] and for spatial filtering techniques using DirAC [13]. In order to mitigate these artifacts in the reproduction chain, temporal averaging can be performed on the parameter G. This type of averaging, or smoothing, which is essentially a single-pole recursive filter, is defined as:
Ĝ(k,i) = α(k) max(0, G(k,i)) + (1 − α(k)) Ĝ(k,i−1) (10)
where Ĝ(k, i) are the smoothed gain coefficients for a frequency bin k and time bin i, and α(k) are the smoothing coefficients for each frequency frame. Informal listening of the output signal with input from various acoustical conditions, such as cases with single and multiple talkers and with or without background noise, revealed that the level of the artifacts is clearly lowered when using Ĝ instead of G. An additional rule can be defined, which was found to further suppress the remaining artifacts. A minimum value λ may be introduced for the Ĝ function following the averaging process, which imposes a lower bound on the gain and thus limits the maximum attenuation:
\hat{G}^{+}(k,i) = \max\left( \lambda,\, \hat{G}(k,i) \right) \qquad (11)
where λ is a lower bound for the parameter Ĝ. The minimum value of the derived parameter Ĝ+ can be adjusted according to the application, being a compromise between the effectiveness of the spatial filtering method and the preservation of the quality of the unprocessed signal. By modifying (9) accordingly, the output Ŝ is:
Ŝ(k,i)=Ĝ+(k,i)M0(k,i), (12)
to which an inverse Short Time Fourier Transform (iSTFT) can be applied to obtain the time domain signal Ŝ(n). The signal M_0(k, i), being attenuated by the time-frequency factors contained in Ĝ+(k, i), should originate from a microphone pattern of low order, not suffering from amplified low-frequency noise. The attenuation parameters of Ĝ+(k, i), though, are computed using higher-order microphone signals with time averaging. M_0 can originate from any kind of microphone as long as it satisfies (8). The low-frequency noise in the higher-order signals potentially causes some erroneous analysis results in the computation of the parameters; however, the temporal averaging mitigates these noise effects. The low-frequency noise in M_1 and M_2 is not audible as noise in the resulting audio signal Ŝ(n), since the higher-order signals are not used as audio signals in reproduction.
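A minimal sketch of the smoothing, flooring and synthesis of Eqs. (10) to (12), assuming per-frequency smoothing coefficients α(k) and an illustrative floor value λ:

```python
# Temporal smoothing (Eq. (10)), flooring (Eq. (11)) and synthesis (Eq. (12)).
import numpy as np
from scipy.signal import istft

def smooth_and_floor(G, alpha, lam=0.1):
    """G: (K, I) raw gains; alpha: (K,) smoothing coefficients; lam: floor."""
    G_hat = np.zeros_like(G)
    prev = np.zeros(G.shape[0])
    for i in range(G.shape[1]):
        prev = alpha * np.maximum(0.0, G[:, i]) + (1.0 - alpha) * prev  # Eq. (10)
        G_hat[:, i] = prev
    return np.maximum(lam, G_hat)                                       # Eq. (11)

# S_hat = smooth_and_floor(G, alpha) * M0                               # Eq. (12)
# _, s_hat = istft(S_hat, fs=48000, nperseg=1024, noverlap=512)         # back to time domain
```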
Optional: Multi-Resolution Short Time Fourier Transform (STFT) Implementation of Cross Pattern Coherence
The use of a multi-resolution STFT in the proposed algorithm offers a great advantage, as it increases the temporal resolution. Each microphone signal is first divided into different frequency regions, and the method/algorithm is applied to each region separately. An inverse STFT is then applied to transform the signal back to the time domain. Different window sizes in the initial STFT shift the resulting signals in time, and thus a time alignment process is needed before the summation.
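A minimal sketch of such a band split, assuming SciPy's centered STFT/iSTFT pair, which already returns time-aligned band signals; a custom frame layout would require the explicit alignment mentioned above. The per-band processing is left abstract:

```python
# Multi-resolution split with per-band STFT processing; band edges and
# window sizes follow the example values given in the text below.
import numpy as np
from scipy.signal import stft, istft

def multi_resolution(x, fs, process, edges=(380.0, 1500.0), wins=(1024, 128, 32)):
    bands = [(0.0, edges[0]), (edges[0], edges[1]), (edges[1], fs / 2)]
    out = np.zeros(len(x))
    for (lo, hi), nperseg in zip(bands, wins):
        f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
        mask = (f >= lo) & (f < hi)
        Z[~mask, :] = 0.0                    # keep only this band's bins
        Z[mask, :] = process(Z[mask, :])     # e.g. apply the CPC gains here
        _, xb = istft(Z, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
        n = min(len(x), len(xb))
        out[:n] += xb[:n]
    return out

# Example: y = multi_resolution(np.random.randn(48000), 48000, lambda Z: Z)
```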
Further advantageous implementations of the invention can be taken from the description of the figures as well as the dependent claims.
In the following, the invention is disclosed in more detail with reference to the exemplary embodiments illustrated in the accompanying drawings.
The same reference symbols refer to the same features in all figures.
In the following, the method is demonstrated with some embodiments in various scenarios, where the input consists of microphone signals with three different arbitrary orders, for example zeroth-, first- and second-order signals. More and/or other orders of signals may be employed. The method measures the correlation/coherence between two of the captured sound signals having the positive-phase maximum of their directivity responses towards the desired direction in each time-frequency position. A time-dependent attenuation factor is computed for each time-frequency position based on the time-averaged coherence between the two captured sound signals. The corresponding time-frequency positions in the third captured signal are then attenuated at the positions where low coherence is found. In other words, the method according to the invention is feasible with any order of directivity patterns available, and the directivity of the beam can be altered by changing the formation of the directivity patterns of the signals from which the correlation/coherence is computed.
Even though the matrixing 10 and the equalization unit 11 are advantageously carried out as proposed here and shown in the accompanying drawings, other arrangements providing equivalent signals may be used.
The sound signal inputs 13 of different orders stem from the respective microphones 12, which may be of any order, in particular of higher orders. These are fed into the matrixing 10 and subsequently treated in the equalization unit 11. After the equalization they are ready to be fed into the CPC module CPCM.
Numerical Simulations using an Ideal Array
1) An implementation of the Cross Pattern Coherence (CPC) algorithm according to the spatial filtering method is now derived for a typical case, where zeroth-order (Wns), first-order (Xns and Yns) and second-order (Uns and Vns) signals are available. The subscript ns indicates that the signals are calculated for the numerical simulation. The flow diagram of the method in this case is shown in the accompanying drawings.
The CPC module (CPCM) employs five microphone stream inputs 23 to feed the captured signals 23 into the CPC module, where they are immediately Fourier transformed by the Short Time Fourier Transformation (STFT) units. The optional energy unit 24 computes the energy based on the higher-order captured microphone signals and feeds the result to the normalization unit 27. Two streams of higher-order signals are processed in the correlation unit 26. The correlation is then passed through the normalization unit 27, which yields the gain parameter G(k,i).
The optional but very effective time-averaging step is carried out in the time-averaging unit 28. The half-wave rectification is carried out in the following rectifier 29. After that, the gain parameter is given to the synthesis module 22, which applies the gain parameter to a separate microphone stream 23, imposing the spatial noise suppression. It is to be noted here that even though the number of microphone stream inputs 23 and stream arrays 20 is five in our example, more or fewer of them can be used. However, a minimum of three is required.
The microphone patterns are derived on the basis of simple cosine and sine functions. For two sound sources s1(n) and s2(n), the 0th-, 1st- and 2nd-order signals are defined as:
W_{ns}(n) = s_1(n) + s_2(n) + n_w(n)
X_{ns}(n) = \cos(\varphi_1)\, s_1(n) + \cos(\varphi_2)\, s_2(n) + n_x(n)
Y_{ns}(n) = \sin(\varphi_1)\, s_1(n) + \sin(\varphi_2)\, s_2(n) + n_y(n)
U_{ns}(n) = \cos(2\varphi_1)\, s_1(n) + \cos(2\varphi_2)\, s_2(n) + n_u(n)
V_{ns}(n) = \sin(2\varphi_1)\, s_1(n) + \sin(2\varphi_2)\, s_2(n) + n_v(n) \qquad (13)
where φ1 and φ2 indicate the azimuth directions of the separate sources. In that way we are able to position sound sources at specific azimuthal locations around the ideal microphone signals. The noise components are indicated with nw(n), nx(n), ny(n), nu(n), nv(n) for each order. Filtered zero-mean white Gaussian processes with unit variance are added to each ideal microphone signal to simulate the internal microphone noise: a 0th-order low-pass filter is applied to nw(n) to simulate the internal noise of the 0th-order microphone signal, a 1st-order low-pass filter to nx(n) and ny(n), and a 2nd-order one to nu(n) and nv(n). The signal-to-noise ratio (SnR) between the test signals and nw(n) is 20 dB. The time-frequency representation of each microphone component (Wns, Xns, Yns, Uns, Vns) is then computed. By substituting M_1^1 = Xns, M_1^2 = Uns, M_{-1}^1 = Yns and M_{-1}^2 = Vns in Eq. (7), the spatial parameter Gns in the analysis part of the CPC algorithm is:
G_{ns}(k,i) = \frac{ 2\, \Re\{ X_{ns}(k,i)\, U_{ns}^{*}(k,i) \} }{ \left| X_{ns}(k,i) \right|^{2} + \left| Y_{ns}(k,i) \right|^{2} + \left| U_{ns}(k,i) \right|^{2} + \left| V_{ns}(k,i) \right|^{2} } \qquad (14)
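A minimal sketch of this simulation, following Eqs. (13) and (14); the stand-in source signals, the simplified (unfiltered) sensor noise and the STFT parameters are assumptions:

```python
# Ideal-array simulation: two sources at 0 and 90 degrees encoded into
# 0th/1st/2nd-order signals with additive noise, then the CPC parameter.
import numpy as np
from scipy.signal import stft

fs, n = 48000, 48000
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(n), rng.standard_normal(n)
phi1, phi2 = np.deg2rad(0.0), np.deg2rad(90.0)
nz = 10 ** (-20 / 20) * rng.standard_normal((5, n))      # roughly 20 dB SnR

W = s1 + s2 + nz[0]
X = np.cos(phi1) * s1 + np.cos(phi2) * s2 + nz[1]
Y = np.sin(phi1) * s1 + np.sin(phi2) * s2 + nz[2]
U = np.cos(2 * phi1) * s1 + np.cos(2 * phi2) * s2 + nz[3]
V = np.sin(2 * phi1) * s1 + np.sin(2 * phi2) * s2 + nz[4]

def tf(x):
    return stft(x, fs=fs, nperseg=1024, noverlap=512)[2]

Xk, Yk, Uk, Vk = tf(X), tf(Y), tf(U), tf(V)
G_ns = (2 * np.real(Xk * np.conj(Uk))
        / (np.abs(Xk) ** 2 + np.abs(Yk) ** 2
           + np.abs(Uk) ** 2 + np.abs(Vk) ** 2 + 1e-12))  # Eq. (14)
```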
The process of CPC for this case is summarized in a block diagram in the accompanying drawings.
In the multi-resolution STFT, three different frequency regions are used: the first with an upper cut-off frequency of 380 Hz, the second with a lower cut-off of 380 Hz and an upper cut-off of 1500 Hz, and the third with a lower cut-off of 1500 Hz. The STFT window sizes of the frequency bands were N = 1024, 128 and 32, respectively, with a hop size of N/2. Two talker sources are virtually positioned at φ1 = 0° and φ2 = 90° in the azimuthal plane. The parameter Gns is then calculated for different beam directions, starting at 0° and rotating every 45°.
The functioning of the CPC algorithm is demonstrated by deriving the directivity attenuation patterns in different sound scenarios. A similar method for assessing the performance of a real-weighted beamformer has been used in [25], employing the ratio of the power of the beamformer output in the steering direction over the average power of the system. The directivity patterns in this case are derived by steering the beamformer every 5° and calculating the Ĝ+ value for each position, while maintaining the sound sources at their initial positions. Scenarios with single and multiple sound sources have been simulated, with and without background noise and with different SnRs, the sources being positioned at various angles around the virtual microphone array.
In the illustrated single-source scenario with diffuse background noise, the sound source 50 is positioned at 0°. The diffuse noise has been generated with 23 noise sources 51 positioned equidistantly around the virtual microphone array. The directivity pattern shows the performance of the beamformer under different SnR values between the single sound source and the sum of the noise sources. While the beam is steered towards the target source at 0°, the attenuation is 4 dB at an SnR of 20 dB. The corresponding pattern S20 is the most asymmetric and the most advantageous choice. As the beam is steered away from the target source, there is a noticeable attenuation of up to 12 dB in the area of ±60°. Outside the area of ±60° the attenuation level varies between 15 and 19 dB. With an SnR of 10 dB the level that the beamformer applies to the target source is −10 dB, and the output is attenuated by 18 dB outside the area of ±30°, as can be seen in the pattern S10. For lower SnR values of 0 dB, pattern S0, and −inf, pattern SI, in diffuse field conditions the beamformer assigns a uniform attenuation of 18 dB for all directions. This part of the simulation thus suggests that in diffuse conditions the SnR has to be approximately 20 dB in a given time-frequency frame for CPC to be effective.
The directivity attenuation patterns in double sound source scenarios are likewise illustrated in the accompanying figures.
A multiple talker scenario has also been simulated.
It is thus evident that in the case of one or two interfering sources the performance of CPC is consistent and provides stable filtering results, not only for the cases of high SnR (20 and 10 dB) but also for some cases where the SnR is 0 dB. The advantages shown by this simulation are that the algorithm provides a high response when the direction of the beamformer coincides with the direction of a sound source. This is evident through the calculation of Ĝ+ for the diffuse field case with positive SnR values. For the cases of 20 and 10 dB SnR in a single or multi sound source scenario, the Ĝ+ values towards the direction of the main sound source differ from the original level by 1 to 2 dB. It is also evident that in all cases there is no high response towards any direction where there is no sound source, even in the case of diffuse noise only.
If we consider speech signals as sound sources, then due to the sparsity and the varying nature of speech, the spectrum of two added speech signals can be approximated by the maximum of the two individual spectra at each time-frequency frame. It is thus unlikely that two speech signals carry significant energy in the same time-frequency frame [26]. Hence, when the coherence between the microphone patterns is calculated in the analysis part of the CPC, the Ĝ+ values will be calculated reliably for the steered direction, which motivates the use of the CPC algorithm in teleconferencing applications. In other words, for simultaneous talkers the resulting directivity of the CPC algorithm can be assumed to fall into case (a) shown in the figures.
The performance of the CPC algorithm is also tested with a real microphone array. An eight-microphone, rigid-body cylindrical array of 1.3 cm radius and 16 cm height is employed, with equidistant sensors in the horizontal plane every 45°. The microphones are mounted perimetrically at half the height of the rigid cylinder. The more sensors are used, the higher the spatial aliasing frequency becomes, compared to an array of the same radius with fewer sensors.
The encoding equations to derive the microphone components for the specific array up to second-order, following (4) and the equalization process of (5), using the cylindrical harmonic framework, are:
W_{re}(k,i) = \mathrm{EQ}_{0}(k) \sum_{m=1}^{8} H_{m}(k,i)
X_{re}(k,i) = \mathrm{EQ}_{1}(k) \sum_{m=1}^{8} H_{m}(k,i) \cos(\varphi_m)
Y_{re}(k,i) = \mathrm{EQ}_{1}(k) \sum_{m=1}^{8} H_{m}(k,i) \sin(\varphi_m)
U_{re}(k,i) = \mathrm{EQ}_{2}(k) \sum_{m=1}^{8} H_{m}(k,i) \cos(2\varphi_m)
V_{re}(k,i) = \mathrm{EQ}_{2}(k) \sum_{m=1}^{8} H_{m}(k,i) \sin(2\varphi_m) \qquad (15)
where H_m(k, i) is the time-frequency representation of the m-th microphone signal, φ_m its azimuth angle, and W_re(k, i), X_re(k, i), Y_re(k, i), U_re(k, i) and V_re(k, i) are the equalized microphone components. In contrast to the numerical simulation, the equalization process for a real array is more demanding, as the microphones are not ideal and the directivity patterns of the microphone components vary with frequency.
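A minimal sketch of such an encoding for the eight-microphone circular layout, consistent with Eq. (15); the 1/M and 2/M normalization factors and the per-order equalizer placeholders eq0/eq1/eq2 (derived per Eq. (5)) are assumptions of this example:

```python
# Second-order cylindrical-harmonic encoding for an 8-mic circular layout.
import numpy as np

def encode_cylindrical(H, eq0=1.0, eq1=1.0, eq2=1.0):
    """H: (8, N) microphone signals, sensors spaced every 45 degrees."""
    M = H.shape[0]
    phi = 2 * np.pi * np.arange(M) / M            # sensor azimuths
    W = eq0 * H.mean(axis=0)
    X = eq1 * (2.0 / M) * (np.cos(phi) @ H)
    Y = eq1 * (2.0 / M) * (np.sin(phi) @ H)
    U = eq2 * (2.0 / M) * (np.cos(2 * phi) @ H)
    V = eq2 * (2.0 / M) * (np.sin(2 * phi) @ H)
    return W, X, Y, U, V

# Example: W, X, Y, U, V = encode_cylindrical(np.random.randn(8, 48000))
```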
All other parameters, such as the minimum value of attenuation λ, the temporal averaging coefficient α and the frequency regions for the multi-resolution STFT, are set as described previously.
An example case of the performance of the CPC algorithm in a multi-speaker scenario is shown in the accompanying figures.
Directivity measurements are performed in an anechoic environment to show the performance of the CPC algorithm utilizing the cylindrical microphone array. White noise with a duration of two seconds is used as the stimulus signal. The stimulus is fed to a single loudspeaker, and the array is placed 1.5 meters away from the loudspeaker. The microphone array is mounted on a turntable able to perform consecutive rotations of 5 degrees, and one measurement is performed for each angle.
Each set of measurements is transformed into the STFT domain and the spatial parameter Ĝ+ values are calculated for each rotation angle with static sources. In that way a directivity plot of the specific microphone array is obtained in this sound setting.
A stable performance is obtained in the horizontal plane, where the Ĝ+ function is constant in the frequency range from 50 Hz to 10 kHz, the latter being approximately the spatial aliasing frequency. The beamformer receives a constant Ĝ+ value in the horizontal plane in the look direction of 0° with an angle span of approximately ±20°. In the vertical plane the method is capable of delivering valid Ĝ+ values for elevated sources that are not on the same plane as the microphones of the array. The maximum angle span where the beamformer provides high Ĝ+ values in that case is ±50° in elevation. In that case a noticeable spectral coloration is observed for directions between [20°, 50°] and [300°, 340°] due to the frequency-dependent Ĝ+ values.
In summary, the Cross Pattern Coherence (CPC) method is a parametric beamforming technique utilizing microphone components of different orders, which have different directivity patterns but an equal response towards the direction of the beam. A normalized correlation value between two signals is computed in the time-frequency domain and used to derive a gain/attenuation function for each time-frequency position. A third audio signal, measured in the same spatial location, is then attenuated or amplified using these factors at the corresponding time-frequency positions. Practical implementations, both in the numerical simulation and with the real array, indicate that the method is robust with a few sound sources and becomes less effective with diffuse noise and low SnR values.
In other words, the spatial filtering system is comprised in a teleconference apparatus comprising an array of microphones, or is connected to the teleconference apparatus, and is configured to apply the gain factor Ĝ+ to the corresponding time-frequency positions in the third captured sound signal M_0 or W(n) in the microphone streams 23 in real time during a meeting or teleconference.
The system or the apparatus may comprise a database 91 or another data repository and be configured or configurable to apply the gain factor Ĝ+ to the corresponding time-frequency positions in the third captured sound signal M0 or W(n) that have been stored in the database 91 or in the other data repository.
The system may further comprise a means for manually or automatically entering or selecting the desired look direction ψ. By selecting the desired look direction ψ it is, at least in principle, possible to differentiate between a number of simultaneous talkers seated around a conference table. The differentiation (i.e. the separation of each talker's voice, of a particular talker's voice or of some talkers' voices) may be carried out in real time or afterwards.
In other words, the parametric method for spatial filtering of at least one first sound signal includes the following steps:
In still other words, the spatial filtering system based on cross-pattern correlation or cross-pattern coherence comprises acoustic streaming inputs for a microphone array with at least a first microphone and a second microphone, and an analysis module performing the steps:
The invention is not to be understood as limited to the attached patent claims but must be understood to encompass all their legal equivalents.
The following references are used in the description of the prior art of the technical field as well as for the characterization of the mathematical modelling of the invention: