The invention concerns a microphone probe, a method for processing audio signals from the microphone probe, audio acquisition software and a computer program product for audio acquisition. More particularly, the invention concerns a microphone probe, a method for audio acquisition and an audio acquisition system dedicated to recording multisource audio data into channels corresponding to the particular sources.
Recording and distributing music is known to be difficult, time-consuming and expensive. A musical band wishing to publish its music needs to hire a professional studio to record the music, then process it and finally store it on a carrier. The last step has nearly been eliminated by replacing the distribution of music carriers, i.e. tapes, compact discs etc., with network transmission.
If the professional studio could be eliminated from the process, recording music would become considerably simpler. However, that would require the ability to extract, from the sound generated by a playing musical band, tracks corresponding to the particular sources, using a relatively simple device and eliminating echo and interference.
US patent application no. US 20030147539 A1 discloses a microphone array-based audio system that is designed to support representation of auditory scenes using second-order harmonic expansions based on the audio signals recorded with the microphone array. For example, in one embodiment, the quoted invention comprises a plurality of microphones i.e., audio sensors mounted on the surface of an acoustically rigid sphere.
US patent document no. US 2008247565 A discloses an audio system that generates position-independent auditory scenes using harmonic expansions based on audio signals recorded with a microphone array. In one embodiment, a plurality of audio sensors are mounted on the surface of a sphere. The number and location of the audio sensors on the sphere are designed so as to enable the audio signals generated by those sensors to be decomposed into a set of eigenbeam outputs. Compensation data corresponding to at least one of the estimated distance and the estimated orientation of the sound source relative to the array are generated from eigenbeam outputs and used to generate an auditory scene. Compensation based on estimated orientation involves steering a beam formed from the eigenbeam outputs in the estimated direction of the sound source to increase direction independence, while compensation based on estimated distance involves frequency compensation of the steered beam to increase distance independence.
The audio systems disclosed in US applications nos. US 20030147539 A1 and US 2008247565 A have a disadvantage related to the need to perform analog-to-digital conversion of the signal from every audio sensor in the matrix. They are also susceptible to external interference. The manufacturing process of spherical arrays of analogue audio sensors proved to be quite time-consuming and complicated.
International patent application no. PCT/US2010/061445 and US patent application no. US 20140270245 A1 disclose that using PCB technology and surface-mounted MEMS microphones and associated electronics can greatly simplify the construction of a 3D array and thereby can result in a design that is less expensive to manufacture. The physical microphone design results in some physical limitations that are made to optimize the acoustic performance of the microphone. However, MEMS microphones used as digital audio sensors proved to have a low signal-to-noise ratio, which makes them unsuitable for recording music. The solution to this problem suggested in US 20140270245 A1 was to use multiple MEMS elements serving as a single audio sensor.
Generally, state of the art methods, devices and systems seem to be susceptible to noise and interferences, which is tolerable in numerous applications, but not in recording music.
It is an object of the present invention to provide a microphone probe, a method for processing audio signals from the microphone probe, audio acquisition software and a computer program product for audio acquisition that would allow recording high-quality multichannel sound generated by playing instruments in a fairly arbitrary environment.
A microphone probe according to the invention has a body being substantially a first solid of revolution with a number of audio sensors distributed thereon. The audio sensors are located in recesses having substantially the shape of a second body of revolution having an axis of symmetry perpendicular to the surface of the body. The audio sensors are connected to an acquisition unit that delivers audio signals received by the audio sensors to an output. The audio sensors are digital audio sensors, each comprising a printed circuit board with a MEMS microphone element mounted thereon. The MEMS microphone element is mounted on the side of the printed circuit board facing the interior of the body so that the sound reaches the MEMS microphone element via the recess and an opening in the printed circuit board. This method of mounting results in the microphone being mounted in a spatial filtering element formed by the recess and an additional conduit formed by the opening in the PCB. Such a configuration proved to be efficient in preventing spatial aliasing. The depth of the recesses is within a range between 3 and 20 mm. The acquisition unit has a clocking device determining a common time base for the audio sensors. The acquisition unit is adapted to feed the signals from particular audio sensors to a processing unit. Such a configuration results in synchronization between the audio sensors good enough to provide data for further beamforming and processing.
Preferably the processing unit is integrated with the microphone probe.
Preferably the acquisition unit is implemented as an FPGA unit with BF-bit logic while the digital audio sensors provide Bs-bit samples, wherein BF is lower than or equal to Bs, and wherein a conversion is done with a module having a (2Bs-BF)-bit buffer. Preferably BF is equal to 16 and Bs is equal to 24. The module is adapted to:
Preferably the body of the microphone probe is substantially spherical and preferably has at least 20, advantageously 32, or even more preferably 62 digital audio sensors. The term substantially spherical refers to any sphere-like shape, in particular a dodecahedron or another spherical polyhedron. If the probe is supposed to be located on a table, it is possible to eliminate the bottom (south pole) sensor and reduce the number of sensors to 19 while still keeping the functionality of the probe.
Digital audio sensors are preferably distributed in evenly spaced layers or parallel layers corresponding to evenly distributed angles of latitude.
Also preferably the body of the microphone probe is substantially cylindrical and the digital audio sensors are uniformly distributed on its lateral surface.
A method according to the invention is a method of processing audio signals comprising the steps of:
Consequently, the method according to the invention can be used in a broader frequency band than the methods known in the state of the art and provides the processing required for sound originating from musical instruments.
Preferably, determining the direction of arrival of the sound from the number of sources includes receiving at least a partial indication of the location of at least one source via a user interface prior to, during or after the acquisition.
The reception of at least a partial indication of the location of at least one source via the user interface preferably precedes the acquisition of signals from the audio sensors. Additionally, the method preferably includes an additional step of determining the impulse response or transmittance of a link between at least one source and the digital audio sensors of the probe. This step is executed before the acquisition. The measured impulse response or the spatial channel transfer function is used to compensate for the effect of the environment on the sound from at least one source.
Preferably, the number of digital audio sensors used in beamforming depends on the frequency band and is selected so that the spacing between sensors is greater than 0.05 of the wavelength and lower than 0.5 of the wavelength in each of the frequency bands.
The upper limit of 0.5 wavelength corresponds to the possibility of implementing beamforming without spatial aliasing. The lower limit is dictated by the increase of noise related to beamforming. Keeping these limits is difficult when processing music because the large bandwidth results in a wide range of wavelengths for which the condition has to be met. Having a greater number of audio sensors and using only some of them in the frequency bands for which the lower condition is not met solves the problem.
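The spacing rule can be checked numerically. The following is only a minimal sketch: the function names are hypothetical, and the speed of sound of 340 m/s is the approximate value for air quoted elsewhere in this description.

```python
SPEED_OF_SOUND = 340.0  # m/s, approximate value in air (assumption)


def spacing_ok(spacing_m, freq_hz, c=SPEED_OF_SOUND):
    """True if the sensor spacing lies between 0.05 and 0.5 of the wavelength."""
    wavelength = c / freq_hz
    return 0.05 * wavelength < spacing_m < 0.5 * wavelength


def usable_band(spacing_m, c=SPEED_OF_SOUND):
    """Frequency range in Hz in which a given spacing satisfies both limits."""
    lower = 0.05 * c / spacing_m   # below this the spacing is under 0.05 wavelength
    upper = 0.5 * c / spacing_m    # above this the spacing exceeds 0.5 wavelength
    return lower, upper
```

For example, a hypothetical spacing of 30 mm is usable from roughly 567 Hz to 5667 Hz; using every second sensor (60 mm) shifts the usable band down by a factor of two, which illustrates why only a subset of sensors is used in the lower bands.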
Preferably the method includes adaptive Wiener filtration of at least a first channel, preferably involving adaptive filtering and subtraction of signals from at least two other channels. That kind of filtration increases the signal-to-interference ratio in the first channel, taking advantage of the signals collected in the other channels.
Preferably, the beamforming is based on a correlation matrix Sxx between the signals from the audio sensors of the microphone probe or, alternatively, on the frequency response matrix of the microphone probe, preferably a frequency response matrix measured earlier in an anechoic chamber.
An audio acquisition system according to the invention comprises a microphone probe according to the invention, a processing unit capable of carrying out a method according to the invention, and an external interface to output the channels containing sound originating from particular sources.
A computer program product according to the invention is adapted to be executed on a computer connected via a USB interface with a probe according to the invention and is adapted to carry out a method according to the invention. Preferably, this product contains measurement results of the frequency response matrix of at least one particular microphone probe.
The invention has been described in detail below with reference to the attached drawings, wherein:
In its first embodiment shown in
In an another embodiment shown in
In yet another embodiment shown in
Further alternative is a conical shape illustrated in
As presented schematically in
The audio sensors 2.1, 2.2, 2.3, 2.4, 2.N comprise MEMS microphone elements InvenSense ICS-434342 providing 24-bit audio samples with a sampling frequency fs provided by a clock module 5 connected to the acquisition unit. The sampling frequency is selected from the range of 8 to 96 kHz. Any of the typical values of 8000 Hz, 11025 Hz, 16000 Hz, 22050 Hz, 24000 Hz, 32000 Hz, 44100 Hz, 48000 Hz, 96000 Hz can be used. Experiments made by the inventors have shown that beamforming gives better results for higher sampling frequencies, preferably above 40000 Hz. The acquisition unit 3 comprises an FPGA unit with 16-bit logic mounted on a second printed circuit board with peripherals as shown in
Such a configuration also makes it easy to use two or more MEMS microphone elements per sensor location and to increase the SNR by averaging their signals or by other, more advanced processing techniques. In this way the directivity of the sensor can also be increased.
It should be noted that the FPGA unit used in the acquisition unit 3 uses 16-bit logic, as opposed to the 24-bit samples of the MEMS microphone elements. Hence, a conversion is required. It is done as follows:
That approach can be generalized to any combination of a Bs-bit sample X and BF-bit logic of the acquisition unit, where Bs>BF. The method may be denoted as follows:
1. Expand the Bs-bit word of sample X, with replication of the sign bit, to a (2Bs-BF)-bit word temp:
temp[Bs:2Bs-BF-1]=X[Bs-1]
temp[0:Bs-1]=X[0:Bs-1]
where “x:y” denotes a vector comprising bits from the x-th one to the y-th one.
2. Apply gain by shifting bits to the left:
temp=temp<<G
where the gain is equal to 2^G and G is a number selected from 0 to (Bs-BF-1).
3. Return either saturation information or the value of the BF bits from (Bs-BF) to (Bs-1) of the buffer “temp” as a return value. Saturation is detected when either
temp[2Bs-BF-1]==0 and temp[Bs-1:2Bs-BF-2]!=0 (not all of these bits are zeros), which implies plus sign saturation, or
temp[2Bs-BF-1]==1 and temp[Bs-1:2Bs-BF-2]!=1 (not all of these bits are ones), which implies minus sign saturation.
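The three steps can be sketched in software as follows. This is an illustrative model only, not the FPGA implementation: the function name is hypothetical, the returned word is assumed to be the BF bits whose sign bit sits at position Bs-1, and clipping to the extreme value on saturation is an assumption (the description only says that saturation information is returned).

```python
def convert_sample(x, gain_shift, bs=24, bf=16):
    """Convert a signed Bs-bit sample to BF bits with gain 2**gain_shift.

    Returns (value, saturated). Clipping on saturation is an assumption.
    """
    if not 0 <= gain_shift <= bs - bf - 1:
        raise ValueError("gain_shift must be in 0..Bs-BF-1")
    width = 2 * bs - bf                  # buffer width, 32 bits for Bs=24, BF=16
    mask = (1 << width) - 1

    # Step 1: sign-extend X into the buffer (two's complement on `width` bits).
    temp = x & mask
    # Step 2: apply the gain by shifting left.
    temp = (temp << gain_shift) & mask
    # Step 3: compare bits Bs-1 .. width-2 with the sign bit (bit width-1).
    sign = (temp >> (width - 1)) & 1
    guard_len = width - bs               # number of bits in Bs-1 .. width-2
    all_ones = (1 << guard_len) - 1
    guard = (temp >> (bs - 1)) & all_ones
    if sign == 0 and guard != 0:
        return (1 << (bf - 1)) - 1, True     # plus sign saturation
    if sign == 1 and guard != all_ones:
        return -(1 << (bf - 1)), True        # minus sign saturation

    out = (temp >> (bs - bf)) & ((1 << bf) - 1)   # bits Bs-BF .. Bs-1
    if out >> (bf - 1):                           # reinterpret as signed BF-bit
        out -= 1 << bf
    return out, False
```

With the default widths, `convert_sample(256, 0)` yields the value 1 without saturation, while `convert_sample(1 << 22, 1)` reports plus sign saturation because the amplified sample no longer fits in 24 bits.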
The probe 1 according to the present invention has 32 MEMS audio sensors in total. They are arranged in such a way that they form apexes of a body highly resembling a pentakis dodecahedron. However, as it is impossible to circumscribe a sphere on all pentakis dodecahedron apexes, the ones lying below or above the spherical surface are shifted along the sphere radius to this surface. Hence, all audio sensors lie on the spherical surface of the body 2. A method of distributing audio sensors on a sphere was disclosed by P. Santos, G. Kearney and M. Gorzel in “Construction of a Multipurpose Spherical Microphone Array”, ESMAE-IPP, 7-8 Oct. 2011.
Every array of audio sensors has a cut-off frequency above which beamforming results in additional interference, so-called spatial aliasing. In the spatial domain the cut-off frequency fspat is equal to 1/(2d), where d stands for the distance between the audio sensors. This frequency is expressed in [1/meter]. All frequencies above this limit are affected by the so-called aliasing effect, which causes irregularities in the directivity characteristics. The sound spatial aliasing cut-off frequency fcutoff, expressed in Hz and corresponding to this spatial frequency, can be calculated when the speed of sound c in the medium is known: fcutoff=fspat·c. In air, the speed of sound is approximately 340 [m/s].
The radius of the sphere forming the body 2 of the probe is 52.5 mm. Given that there are 32 microphones, the cut-off frequency of the probe is approximately 6 kHz. Above this value spatial aliasing is a significant obstacle to effective beamforming. Spatial aliasing is determined by the distance between neighboring sensors. Hence, there are two relatively simple ways to mitigate it: either reduce the radius of the sphere or use more sensors.
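The relation between sensor spacing and the spatial aliasing cut-off frequency can be illustrated with a short calculation. This is a sketch under stated assumptions: the function name is hypothetical, and the 28 mm spacing used in the example is an illustrative value, not a dimension taken from this description.

```python
def spatial_aliasing_cutoff_hz(spacing_m, speed_of_sound=340.0):
    """Cut-off frequency in Hz for a given distance d between sensors."""
    f_spat = 1.0 / (2.0 * spacing_m)   # spatial cut-off frequency, in 1/m
    return f_spat * speed_of_sound     # converted to Hz with the speed of sound


# Illustrative spacing of 28 mm (assumption) gives a cut-off of about 6.07 kHz.
example_cutoff = spatial_aliasing_cutoff_hz(0.028)
```

Halving the spacing doubles the cut-off frequency, which is why either a smaller sphere or a larger number of sensors mitigates spatial aliasing.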
European patent application EP 2773131 A1 discloses a spherical microphone array with improved frequency range for use in a modal beamformer system that comprises a sound-diffracting structure, e.g. a rigid sphere, with cavities in the perimeter of the diffracting structure and microphones located in or at the ends of said cavities respectively, where the cavities are shaped to form both a spatial low-pass filter, e.g. exhibiting a wide opening, and a concave focusing element so that sound entering the cavities in a direction perpendicular to the perimeter of the diffracting structure converges to the microphones, e.g. by providing a parabolic surface, in order to minimize spatial aliasing. Application of the solution according to EP 2773131 A1 is limited by the fact that the depth of the cavities is limited by the size of the sphere, which also has to contain other electronic equipment, and by the size of the audio sensor, as conventional microphone sensors are rather large and hence difficult to locate in the small focal point of the cavity.
The microphone probe according to the invention offers yet another way to at least partly solve this problem. The directivity of the MEMS audio sensors appears to be increased at higher frequencies due to the shape of the hemispherical recesses 11.1, 11.2, 11.3, 11.N and due to an additional sound conduit formed along the thickness of the PCB, namely the openings 12.1, 12.2, 12.3, 12.N. That additional sound conduit, in combination with the shape of the cavity, offers significantly higher directivity of a single digital audio sensor 2.1, 2.2, 2.3, 2.4, . . . , 2.N at higher frequencies. Hence, in the high part of the sound bandwidth the beam of a single sensor is narrow enough to select a sound source formed by a single instrument in a musical band. The directivity of a single sensor placed in the hemispherical recess increases at high frequencies. This means that the increase of directivity corresponds to the frequency bands above the spatial aliasing cut-off frequency. That makes recording possible even when spatial aliasing affects conventional beamforming.
Other mechanical structures increasing directivity can also be applied. Similar approaches were used in parabolic microphones or in the Neumann TLM 50 microphone.
A microphone probe having 32 audio sensors offers a selection from 32 directivity patterns. At high frequencies these directivity patterns are elongated and referred to as beams. It should be stressed that the directivity pattern of the whole probe 1 in which one audio sensor has been selected can be slightly tuned with the use of sound signals received with adjacent sensors, added and aligned in phase but with smaller weights. Consequently, even when the mode of processing above the upper frequency limit is changed from typical beamforming to audio sensor selection, it is still possible to slightly tune the directivity pattern.
The audio sensors are distributed on a sphere in a latitude manner. One of the directions, denoted as Z in the examples below, is distinguished. The audio sensors are distributed in layers spaced in the Z direction. The highest and the lowest layer always contain only a single audio sensor. The middle, center layer contains the maximal number of audio sensors. Under these constraints there are a number of approaches to selecting the number of audio sensors per layer and the relative distance and rotation of the layers.
The distances between the layers can be selected based on either an angular or a linear approach. In the angular approach, the layers are uniformly distributed in the domain of latitudes, i.e. latitudes of adjacent layers always differ by the same angle. In the linear approach, the distances between layers in the Z direction are equal.
The relative rotation of adjacent layers is selected so that the audio sensors in one layer are located at the longitudes of the centers of the gaps between the audio sensors in adjacent layers. That allows more effective use of the surface of the body 2.
In the embodiment with the spherical shape the linear approach results in a higher density of the audio sensors in the central region of the body 2. That in turn gives better separation of the sources located at elevations close to 0 degrees, i.e. in the horizontal plane. On the other hand, the angular approach results in more uniform quality of beamforming in the whole range of elevation angles.
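The difference between the angular and the linear approach can be sketched as follows. The function names are hypothetical, and the staggering rule (every second layer rotated by half of its own gap) is a simplifying assumption that only approximates the relative rotation described above.

```python
import math


def layer_latitudes_deg(n_layers, approach="angular"):
    """Latitudes of the layers, from the top (+90) to the bottom (-90)."""
    if approach == "angular":
        step = 180.0 / (n_layers - 1)          # equal steps in latitude
        return [90.0 - k * step for k in range(n_layers)]
    if approach == "linear":
        lats = []
        for k in range(n_layers):
            z = 1.0 - 2.0 * k / (n_layers - 1)  # equal steps in Z (z/R)
            lats.append(math.degrees(math.asin(max(-1.0, min(1.0, z)))))
        return lats
    raise ValueError("approach must be 'angular' or 'linear'")


def sensor_positions(counts, radius_m, approach="angular"):
    """Cartesian sensor positions for the given number of sensors per layer."""
    points = []
    lats = layer_latitudes_deg(len(counts), approach)
    for idx, (lat, count) in enumerate(zip(lats, counts)):
        gap = 360.0 / count
        offset = gap / 2.0 if idx % 2 else 0.0  # hypothetical staggering rule
        for j in range(count):
            lon = math.radians(offset + j * gap)
            la = math.radians(lat)
            points.append((radius_m * math.cos(la) * math.cos(lon),
                           radius_m * math.cos(la) * math.sin(lon),
                           radius_m * math.sin(la)))
    return points
```

For seven layers the angular approach places the layers at 90, 60, 30, 0, -30, -60 and -90 degrees, whereas the linear approach places the intermediate layers at about 41.8 and 19.5 degrees, i.e. closer to the horizontal plane, which is consistent with the higher sensor density in the central region.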
Exemplary distribution of 32 audio sensors in 7 layers including [1, 3, 6, 12, 6, 3, 1] audio sensors, respectively, with the angular distribution of layers is given in
It should be noted that thanks to the small dimension of the MEMS audio sensors, the total number of audio sensors can be further increased. That allows more precise determination of the direction of arrival and increases spatial aliasing cut-off frequency.
In an alternative embodiment, the body 2 is cylindrical and the MEMS audio sensors are distributed on its lateral surface. The radius of the cylindrical body is 57.3 mm and the height is 78 mm. The sensors are distributed in 7 layers with 24 sensors per layer. Adjacent sensors are spaced 15 mm from each other, forming a mesh of equilateral triangles with sensors in the vertices. The above distribution of sensors is illustrated in
An embodiment of the audio recording system adapted to execute a method according to the invention is presented in
Once signals from the audio sensors are acquired, the method according to the embodiment of the invention is executed by the processing unit 4 having the following blocks implemented in hardware or software: preprocessing block 25, beamforming block 21, and filtration block 24, presented in
The signals from the M channels are processed in the filtration block 24. Parameters and essential features of the filtering process are adaptively changed by a steering unit 20, which computes statistics of the respective sources and communicates with the user interface (UI) block 23 and with the Direction of Arrival (DOA) block 22. The DOA block is fed with signals s1, s2, . . . , sN to perform the direction of arrival analysis and provide its results to the steering unit 20. The steering unit 20 is adapted to present the directions of arrival to the user and optionally receive an indication of the relevant ones, as well as source-specific information, from the UI block 23. The source-specific information is utilized in the preprocessing block. The number and location of the sources are fed to the beamforming block 21. The beamforming block 21 is adapted to form M channels corresponding to M sources. Finally, the processed samples in the channels ch1 . . . chM are fed to the Digital Audio Workstation, i.e. DAW software (DAW block 7).
According to the invention there are two basic configurations of this system. In the first configuration the processing unit 4 is integrated with the probe 1. In fact, it is implemented in the same FPGA unit that acts as the acquisition unit 3. In the second configuration the whole processing unit is implemented in a computer system that is only connected to the probe 1. This computer system preferably already has the DAW block 7 implemented.
The direction of arrival analysis is executed by the DOA block 22. There are a number of state-of-the-art methods applicable to the direction of arrival analysis applied in the method and system according to the invention. The one below is given by way of example, including some unique modifications. The DOA analysis is based on the part of the frequency spectrum of the input signals s1, s2, . . . , sN below the spatial aliasing cut-off frequency. Namely, it is the lower part of the STFT spectrum which is taken into account in the analysis.
That approach is successful also in localizing instruments, even though some of them operate in rather high frequency band. Usually, even when sound of the instrument occupies rather high frequency band some components have small amount of energy and are detectable at frequencies below spatial aliasing cut-off frequency.
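Restricting the analysis to the lower part of the spectrum amounts to selecting the frequency bins that lie below the cut-off. The sketch below is illustrative only; the function name is hypothetical, and the example values (frame of 2048 samples, 48000 Hz sampling, 6 kHz cut-off) are taken from figures quoted elsewhere in this description.

```python
def bins_below_cutoff(frame_len, fs_hz, f_cutoff_hz):
    """Indices i of the one-sided spectrum whose frequency i*fs/K is below the cut-off."""
    return [i for i in range(frame_len // 2 + 1)
            if i * fs_hz / frame_len < f_cutoff_hz]


# Example values: K = 2048 samples, fs = 48000 Hz, cut-off of 6000 Hz.
low_bins = bins_below_cutoff(2048, 48000.0, 6000.0)
```

With these values the bin spacing is 23.4375 Hz, so the analysis would use bins 0 to 255.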
An example of DOA operation is the so-called WDO approach described by O. Yilmaz and S. Rickard in “Blind separation of speech mixtures via time-frequency masking,” in IEEE Trans. Signal Process., vol. 52, no. 7, pp. 1830-1847, July 2004.
A wideband variation of MUSIC algorithm is described by S. Argentieri and P. Danés in “Broadband variations of the MUSIC high-resolution method for sound source localization in robotics,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), November 2007, pp. 2009-2014.
The Independent Component Analysis-Generalised State Coherence Transform (ICA-GSCT) algorithm is disclosed by F. Nesta and M. Omologo in “Generalized state coherence transform for multidimensional TDOA estimation of multiple sources,” in IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 246-262, January 2012.
Particularly good results were obtained with the constant-time analysis zone algorithm described by D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris in “Real-Time Multiple Sound Source Localization and Counting Using a Circular Microphone Array”, in IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 10, Oct. 2013.
The core stages of the method proposed therein are:
The above cited paper describes the method in detail and in such a manner that it can be easily repeated by a person skilled in the art. However, it should be noted that the inventors have introduced two advantageous modifications.
The first modification consists in amending the matching pursuit algorithm by replacing the fixed Blackman window width used for removing the source contribution with iterative selection of the Blackman window width, so as to obtain minimum value of the histogram energy remaining after removing the source contribution.
The second modification concerns the trigger value of the source contribution factor. Instead of using fixed values, an adaptively determined value is applied. An arbitrary value is used only for the first source detection. In all further repetitions of finding contributions from sources and removing them, a value of GAMMAj is selected on the basis of the ratio between the normalized source energy and the normalized histogram energy. Reasonable results were obtained for GAMMAj values being mean values of such ratios for all previously detected and removed contributions.
In an alternative embodiment, the direction of arrival analysis may be reinforced or even replaced by prompting the user via the user interface and letting the user select the sources. The latter may be advisable if the distribution of the sources and the probe 1 within the room is repeatable. Such circumstances can be advantageous and open the possibility of improving the quality of recording with preprocessing that includes deconvolution of the link impulse response, which in that situation can be determined with additional measurements.
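The adaptive selection of the trigger value can be sketched as below. The function and parameter names are hypothetical, and computing the energy ratios themselves is outside the scope of this sketch.

```python
def next_gamma(previous_ratios, initial_gamma):
    """Trigger value for the next source detection.

    previous_ratios: ratios of normalized source energy to normalized
    histogram energy for all contributions detected and removed so far.
    initial_gamma: arbitrary value used only for the first detection.
    """
    if not previous_ratios:
        return initial_gamma
    return sum(previous_ratios) / len(previous_ratios)   # mean of past ratios
```

For instance, after two removed contributions with ratios 0.4 and 0.6 the next trigger value would be 0.5.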
The direction of arrival analysis may be reduced merely to prompting user with user interface to enter location of sources and—optionally—parameters of these sources. These parameters may include in particular frequency band occupied by a source and/or the type of a source, e.g. drums or vocal.
Alternatively, when user enters locations and parameters of the sources, the direction of arrival analysis presented above is used to track the subsequent changes of location.
Preprocessing is an optional step. It is executed only in some embodiments of the invention. The functional block diagram of the preprocessing block 25 is presented in
Firstly, N input signals are divided into frames of K samples in the time domain block 251. If not specifically explained otherwise the frame length is 2048 samples in all the examples given in this specification. Hence, here K=2048.
The H-estimator is used for estimation of the parameters of the link in the propagation environment. It requires an additional source of a noise-like reference signal used prior to recording music. This source is placed successively in the intended locations of the sources to be recorded.
For every single source location, the H-estimator 252a accepts a matrix of K-sample waveforms from N microphones and provides the estimated impulse response to the steering unit 20 and the beamforming block 21 via a dedicated communication bus for further use during the actual recording of sound. The impulse responses can be deconvolved from the signals corresponding to the particular sources to compensate for the effect of the environment on the sound, including cancellation of echo. Additionally, the impulse responses are optionally used in beamforming by providing an indication of the expected directions of arrival of the loudest reflected signals, which are then cancelled in the beamforming block 21. Processing according to the paper by Jacob Benesty, “Adaptive eigenvalue decomposition algorithm for passive acoustic source localization,” in J. Acoust. Soc. Am. 107 (1), January 2000, is applied in this example.
The 2d FFT filter block 252b is used optionally for elimination of interferences that are typical for MEMS audio sensors. It is a 2D median filter.
The pre-filters dynamic list block 253 comprises a sequence of filters used for individual correction of the signals from digital audio sensors 2.1, 2.2, 2.3, 2.4, 2.N with coefficients determined in a calibration, a process known from the art and not described herein. Alternatively, lowpass filters can be used. Filtration is executed in the frequency domain. Frames are transformed with the Fast Fourier Transform and then multiplied by the filter transfer function H(n,ωi), where n∈{1, . . . , N}. Those skilled in the art know multiple ways of selecting shapes of H(n,ωi) suitable for the given sensors' properties and the interference in the environment.
Finally, the frames are rebuilt in the frame rebuilder block 254.
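Filtration of a single frame in the frequency domain can be illustrated as follows. This is a minimal sketch: the function names are hypothetical, a naive discrete Fourier transform is used only to keep the example self-contained (a practical implementation would use an FFT), and frame overlap handling during rebuilding is not modelled.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (illustration only, O(K^2))."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def filter_frame(frame, transfer_function):
    """Multiply the frame spectrum by H(n, w_i) for one sensor n and go back.

    transfer_function holds one value per frequency bin; it should be
    conjugate-symmetric if a real-valued output frame is expected.
    """
    spectrum = dft(frame)
    filtered = [s * h for s, h in zip(spectrum, transfer_function)]
    return [v.real for v in dft(filtered, inverse=True)]
```

As a sanity check, a transfer function equal to one in every bin returns the frame unchanged, while a transfer function that keeps only the zero-frequency bin returns the mean value of the frame in every sample.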
The beamforming block 21 is responsible for synthesizing multiple directivity patterns of the probe 1, each corresponding to a particular channel. A directivity pattern is a function describing the ability of the probe 1 to respond to audio signals arriving from a given direction. The directivity pattern also depends on the frequency of the signal. In practice it is represented as a function of direction, i.e. a pair of angles in azimuth and elevation (φ,θ), respectively, as well as a function of frequency f or pulsation ω, where ω=2πf. This function can be designed to optimize reception of a signal having a particular frequency spectrum and an origin located in a particular place in space.
The general problem of audio sensor array beamforming, described in detail in section 5.1 of the book by Boaz Rafaely, “Fundamentals of Spherical Array Processing”, Springer Topics in Signal Processing, Volume 8, ISBN 978-3-662-45664-4, can be defined as the problem of designing a vector of weights $w(\omega)=[w_1(\omega), w_2(\omega), \ldots, w_n(\omega), \ldots, w_N(\omega)]^T$ such that, for a given array input $s(t)=[s_1(t), s_2(t), \ldots, s_n(t), \ldots, s_N(t)]^T$, the audio sensor array output $ch(t)$ is produced with some desired properties, where:
$$Ch(\omega)=w^H(\omega)\cdot S(\omega)$$
and where N is the number of the audio sensors, n∈{1, . . . , N} is a variable used to index them, the vector $w^H(\omega)$ stands for the conjugate transpose of $w(\omega)$, $S(\omega)$ represents a vector of complex amplitudes at pulsation ω of the sound signal received by all audio sensors, $S(\omega)=[S_1(\omega), S_2(\omega), \ldots, S_n(\omega), \ldots, S_N(\omega)]^T$, and $Ch(\omega)$ represents the complex amplitude at pulsation ω of the beamformer output sound signal. From the above it is clear that beamforming is done in the frequency domain in a frame-by-frame manner and that:
$$S_n(\omega)=\mathcal{F}\{s_n(t)\}$$
$$ch(t)=\mathcal{F}^{-1}\{Ch(\omega)\}$$
In the system according to the present invention the problem has one additional dimension as there are at least two outputs each corresponding to different source. In the frequency domain output channels are represented by vector Ch(ω)=[Ch1(ω), Ch2(ω), . . . , Chm(ω), . . . , ChM(ω)]T, where M is a number of channels produced by the beamforming block 21 and m ∈ {1, . . . , M} is used to index them. Following notation used in
$$Ch_m(\omega_i)=w_m^H(\omega_i)\cdot S(\omega_i),$$
where $w_m(\omega_i)$ represents the vector of weights corresponding to the m-th channel. This means that in a single beamforming operation M directivity patterns are applied to obtain M respective channels by applying the matrix $w^H(\omega_i)$ of weights, formed of the rows corresponding to the respective channels:
$$Ch(\omega_i)=w^H(\omega_i)\cdot S(\omega_i).$$
It should be stressed that the above formula represents amplitudes corresponding to a given discrete value of pulsation ωi and a given frame; hence Ch(ωi) and S(ωi) are frequency-domain representations of the signals calculated for this frame.
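The matrix form of the beamforming operation for a single pulsation ωi can be sketched as follows. The function name is hypothetical, and the weights in the example are arbitrary illustrative numbers rather than values from a designed filter table.

```python
def apply_beamformer(weights, spectrum):
    """Compute Ch_m = w_m^H * S for every channel m at one pulsation.

    weights:  list of M weight vectors w_m, each with N complex entries.
    spectrum: list of N complex amplitudes S_n for the same pulsation.
    """
    return [sum(w.conjugate() * s for w, s in zip(w_m, spectrum))
            for w_m in weights]


# Two channels formed from two sensors (illustrative numbers only).
channels = apply_beamformer([[0.5, 0.5], [1j, 0]], [1 + 1j, 1 - 1j])
```

Here the first channel averages the two sensor amplitudes and the second applies a single conjugated complex weight to the first sensor, giving 1 and 1-1j respectively.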
According to one embodiment of the invention, for the values of ωi corresponding to frequencies below the spatial aliasing cut-off frequency fcutoff, the operation consists in determining and applying a filter from the table of filters wH. The beamforming block 21 is adapted to operate in four modes described below. It should be stressed that in order to evaluate this condition the sampling frequency must be known. It has to be explicitly stored, as in the numerical analysis frequency is normalized to the sampling frequency fs. Only then is it possible to identify the values of i for which ωi does not meet this condition. Above fcutoff conventional beamforming is not applied. What is done instead is the selection of the signal from a single sensor. Because the sensors are located in cavities 11.1, 11.2, 11.3, with a further contribution of the opening 12.1 in the PCB, at frequencies above the spatial aliasing cut-off frequency the digital audio sensors have their own directivity pattern in the form of a beam narrow enough to select a single instrument from a musical band.
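The combination of the two regimes for one output channel can be sketched as below. The names are hypothetical, and how the single sensor is chosen above the cut-off (for example, the sensor facing the source) is an assumption passed in as a parameter.

```python
def channel_spectrum(weights_per_bin, spectra_per_bin, cutoff_bin, selected_sensor):
    """Frequency-domain representation of one channel for one frame.

    weights_per_bin[i]: N complex weights for bin i (used for i < cutoff_bin).
    spectra_per_bin[i]: N complex sensor amplitudes for bin i.
    cutoff_bin:         first bin at or above the spatial aliasing cut-off.
    selected_sensor:    index of the sensor used above the cut-off (assumption).
    """
    out = []
    for i, s in enumerate(spectra_per_bin):
        if i < cutoff_bin:
            # Conventional beamforming below the cut-off: w^H * S.
            out.append(sum(w.conjugate() * x
                           for w, x in zip(weights_per_bin[i], s)))
        else:
            # Sensor selection above the cut-off.
            out.append(s[selected_sensor])
    return out
```

The cut-off bin itself follows from the stored sampling frequency and the frame length, which is why the sampling frequency has to be known explicitly.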
Consequently, signals in the respective channels in the system according to the invention have a frequency-domain representation obtained with beamforming below the spatial aliasing cut-off frequency and with sensor selection above the spatial aliasing cut-off frequency. That results in very effective extraction of the signals from given sources even when the frequency band of the source covers the spatial aliasing cut-off frequency. For the m-th source located in direction (φm, θm) and for the value of pulsation ωi:
Let us now refer to modes of beamforming block operation below spatial aliasing cutoff frequency.
The beamforming block 21 in this mode provides constant gain in a given direction, while signals from the remaining directions are minimized. In this mode the beamforming block 21 may or may not use the input of the DOA block 22 to locate the sources.
Without this support, it is the user who provides M arbitrarily chosen directions via the user interface UI block 23. The coordinates are communicated to the beamforming block 21 via the steering unit SU 20. The beamforming block 21 then operates to maximize the signal-to-interference ratio in M channels corresponding to the results of M beamformers steered to the M given directions.
In cooperation with the Direction of Arrival block, it is the DOA block 22 that is used to identify the directions of arrival and the types of sources. The results are communicated to the SU 20 and displayed in a circular diagram by the UI block 23. The user is prompted to select and possibly manually tune the autodetected directions. The beamforming block 21 then operates to maximize the signal-to-interference ratio in M channels corresponding to the M given directions. With the support of the DOA block 22, the user selects the directions of the sound sources with the user interface and assigns attributes to them.
The minimal angular step of the direction depends on the step used while creating the table of filters and is typically in the range of 1 to 5 degrees.
Denoting the correlation matrix of the acoustic field sampled by the audio sensors as Sxx, the matrix of constraints of size N×M as V, and the vector of gains for particular channels of size 1×M as c, the beamforming applied in this mode is a solution to the following optimization problem:
This is called LCMV beamforming. The principle of designing a filter table implementing it is described in section 7.5 of Boaz Rafaely, Fundamentals of Spherical Array Processing, Springer Topics in Signal Processing Volume 8, ISBN 978-3-662-45664-4. The minimization criterion is as follows: c is a one-element vector and V contains one steering vector, as described in subsection 7.55. Formula 7.60 is applied to calculate the table of filter weights:
$$w^{H} = c^{H}\left(V^{H} S_{xx}^{-1} V\right)^{-1} V^{H} S_{xx}^{-1}$$
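For illustration only, the closed-form LCMV weights can be evaluated numerically for a single value of ωi as sketched below. The function name lcmv_weights and the array shapes are assumptions; the constraint is taken in the usual LCMV convention V^H w = c, and Sxx is assumed Hermitian and positive definite.

```python
import numpy as np

def lcmv_weights(Sxx, V, c):
    """Return the weight vector w (length N) whose conjugate transpose is
    w^H = c^H (V^H Sxx^-1 V)^-1 V^H Sxx^-1.

    Sxx : (N, N) Hermitian positive-definite correlation matrix
    V   : (N, M) constraint matrix whose columns are steering vectors
    c   : (M,) prescribed gains for the constrained directions
    """
    # Sxx^-1 V, computed by solving a linear system instead of inverting Sxx.
    sinv_v = np.linalg.solve(Sxx, V)
    # Small M x M matrix V^H Sxx^-1 V (Hermitian because Sxx is Hermitian).
    gram = V.conj().T @ sinv_v
    # w = Sxx^-1 V (V^H Sxx^-1 V)^-1 c, the conjugate transpose of the formula.
    return sinv_v @ np.linalg.solve(gram, c)
```

One row of the filter table for the given ωi is then np.conj(w), and the channel sample is obtained as np.conj(w) @ S for the vector S of sensor spectra.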
The correlation of interference is a function of an isotropic noise field. In the system according to the present invention the diffusion of the noise is modelled according to I. A. McCowan, “Robust Speech Recognition using Microphone Arrays”, PhD Thesis, Queensland University of Technology, Australia 2001:
where λ = 2πv/ω is the wavelength of sound propagating with velocity v and corresponding to ω, dij is the distance between sensors i and j, while
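Since the model formula itself is not reproduced in the text, the sketch below uses the commonly cited coherence of a spherically isotropic (diffuse) noise field, sin(2πdij/λ)/(2πdij/λ), as an assumption rather than as the exact expression of the cited thesis. The function name, the sensor position array and the default sound velocity of 343 m/s are likewise illustrative.

```python
import numpy as np

def diffuse_noise_coherence(positions, omega, v=343.0):
    """Synthesize an N x N coherence matrix of a diffuse noise field.

    positions : (N, 3) sensor coordinates in metres
    omega     : pulsation in rad/s (must be positive)
    v         : sound velocity in m/s (assumed value)
    """
    # Pairwise distances d_ij between sensors i and j.
    diff = positions[:, None, :] - positions[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    wavelength = 2.0 * np.pi * v / omega
    # np.sinc(x) is sin(pi x) / (pi x), so x = 2 d / lambda gives
    # sin(2 pi d / lambda) / (2 pi d / lambda).
    return np.sinc(2.0 * d / wavelength)
```

A matrix of this kind, computed from the nominal sensor geometry alone, is one way a synthesized Sxx could be obtained for each ω without any measurement.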
The beamforming block 21 in this mode provides constant gain in the range <0;1> for a given direction, together with suppression of signals coming from one or more unwanted directions, which are minimized as described with reference to mode I. Using the UI block 23, the user may manually select desired directions and define M corresponding channels, and then for every channel may select unwanted directions, i.e. the ones corresponding to the origins of the signals to be minimized. Use of the DOA block 22 allows for automatic detection of the directions corresponding to all origins; the user then defines attributes: either desired or unwanted (i.e. interference). The number of channels M is equal to the number of directions having the attribute set to “desired”. Locations of particular sources are tracked in time. Beamforming filters are updated in real time and modified using adaptive signal processing techniques, or are partially stored in memory in the table of filters.
The criteria for minimization are changed as follows: Sxx is a synthesized correlation matrix between the interfering signals to be minimized. In this mode V contains steering vectors indicating the directions that correspond to the signals prescribed to be either attenuated or amplified to a precisely prescribed value. For a given direction the gain and suppression are constant for all frequencies.
In this mode the beamforming block 21 optimizes in a domain including two dimensions, direction and frequency, not only to amplify the signal originating from the “desired” direction but also to generate a null for the unwanted direction, but only for the values of ωi corresponding to a source marked as unwanted. It should be noted that introducing frequency-specific tags can reduce the computational power required and is useful in the filtration that follows beamforming. More precisely, sources can be assigned additional tags indicating the width of the occupied frequency spectrum. Preferably, these tags correspond to the typical audio tracks: “vocal”, “violin”, “piano”, “drums”, “flute”, “saxophone” etc. Every tag represents the particular frequency spectrum occupied by the signal as well as a model of the source of sound, and is used together with the direction information.
The beamforming block 21 in this mode is applicable for elimination of reflections of sound from the walls of the room in which the probe is located.
The criteria for minimization are similar to the ones used in mode II, but due to the application of tags each frequency is considered independently, and so are the correlation matrices, weights, constraints and gains. Namely, the optimization problem for each ωi is solved separately:
IV. Virtual Microphone with Directivity Shaping Mode
In the virtual microphone mode the beamforming block 21 optimizes directivity pattern of the probe 1 to match a directivity pattern arbitrarily given by the user.
Those skilled in the art are able to implement the above modes easily by using the LCMV algorithm described in “Fundamentals of Spherical Array Processing” by Boaz Rafaely, sections 7.6, 7.7, and 7.8, and in “Design of Circular Differential Microphone Arrays” by Jacob Benesty.
An optional but advantageous modification of the operation of the beamforming block 21 consists in applying additional weights to the sensors prior to execution of the beamforming operations indicated above. The distribution of weights depends on the source towards which the beam is supposed to be directed, as schematically illustrated in
An additional advantageous embodiment of the beamforming operation is related to the use of single MEMS microphone elements as audio sensors. The small dimensions of MEMS microphones make it possible to use 32 or even more, preferably 62, digital audio sensors on a sphere. Locating sensors more densely contributes to an increase of the spatial aliasing cutoff frequency and allows for using sensors having higher directivity and a narrower beam of the directivity pattern, but on the other hand may cause problems due to the limited precision of the sensor location.
There is a strong dependence between the properties of the interference correlation matrix and the noise at the output of the beamformer. White noise gain can be controlled algorithmically and geometrically. In the present invention the level of white Gaussian noise is estimated according to I. A. McCowan, “Robust Speech Recognition using Microphone Arrays”, PhD Thesis, Queensland University of Technology, Australia 2001. Consequently, a constraint on the weights w is applied according to the formula:
where k is a propagation vector and σ2 is a value not lower than acceptable SNR.
This formula is true under the condition that each sensor has the same noise and the location of the sensor is precise. The latter can be assumed true when the spacing between sensors is greater than or equal to 0.1 of the wavelength.
That approach allows a denser distribution of the audio sensors 2.1, 2.2, 2.3, . . . , 2.N on the surface of the body 2 of the probe 1. Namely, it enables increasing N without increasing the dimensions of the body 2. In an advantageous example the beamforming block 21 operates in 3 sub-bandwidths. For each of the sub-bandwidths a different subset of digital audio sensors is used. Consequently, at lower frequencies, with longer wavelengths, the spacing between particular audio sensors is greater. As the frequency is increased, more audio sensors are selected and effectively the spacing between the sensors used is smaller. A table indicating constraints for sensor selection is presented in
In an alternative embodiment a different beamforming principle is applied. It requires an initial measurement of the properties of the probe 1, i.e. the probe characterization, which results in obtaining a frequency response matrix H(ω) of size N×L. Each element of the matrix comprises a Fourier transform of the impulse response of a particular sensor corresponding to a particular direction of arrival. N is the number of sensors and L is the number of directions of arrival. In the further description beamforming is done on a frequency-by-frequency basis. The sole symbol H then denotes the N by L samples corresponding to a single frequency and consequently a single value of ω.
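As a rough illustration of how measured impulse responses could be turned into the per-frequency matrices H, consider the sketch below. The function name, the layout of the input array and the FFT length of 2048 (borrowed from the frame length used elsewhere in this description) are assumptions, not part of the characterization procedure itself.

```python
import numpy as np

def frequency_response_matrices(impulse_responses, n_fft=2048):
    """Convert measured impulse responses into per-frequency matrices H.

    impulse_responses : (N, L, n_taps) real array, one impulse response per
                        sensor n and direction of arrival l (assumed layout)
    Returns an array of shape (n_bins, N, L); element [i] is the N x L
    matrix H for the i-th discrete frequency.
    """
    # Fourier transform along the time axis of every impulse response.
    spectra = np.fft.rfft(impulse_responses, n=n_fft, axis=-1)
    # Move the frequency axis to the front so that H[i] is an N x L matrix.
    return np.moveaxis(spectra, -1, 0)
```

Indexing the result with a single bin number then yields the matrix referred to by the sole symbol H.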
Using the measured frequency response matrix H has an advantage over the use of the synthesized correlation matrix Sxx and the LCMV algorithm described above, in that the Sxx matrix results from purely geometrical calculations done for the given geometry of the sensors of the probe 1 and under the assumption that sound propagates in a linear manner. Moreover, the results obtained with Sxx are susceptible to errors caused by misplacement of the sensors during production, which can be difficult to detect. On the other hand, the use of the frequency response matrix H requires individual characterization of every probe 1 that is produced, which is time consuming and requires an anechoic chamber. In this respect simplicity is an advantage of using the synthesized correlation matrix.
The probe characterization procedure consists in locating a source of sound in a number of locations with respect to the probe 1 and recording responses of all N digital sound sensors present on the probe. It has to be done in an anechoic chamber to guarantee a single line of propagation of sound. When the probe 1 is located on a platform revolving in the vertical plane in an anechoic chamber in front of nine computer-controlled sources of sound, it is possible to record responses of the probe 1 on the 3D distribution of sources. Relative distribution of the sound sources with respect to the center of the probe 1 is shown in
Once the frequency response matrix is determined, determining and applying the filter table wmH(ω) is required for completing the beamforming operation for every value of ω. The filter table elements wmH naturally depend on frequency, but the operations are done according to the same principle for all frequencies, and hence the dependence on frequency can be omitted in the presented operations.
For given ω and for the m-th channel the values of the filter table are determined according to the formula:
where I stands for an identity matrix and β is selected to improve the numerical conditioning of the equation. In the case of a well-conditioned equation β is exactly equal to zero, while for an ill-conditioned one a small value is selected to improve conditioning. It is a well-known operation in numerical processing. The vector gm(ω) represents the 3-D directivity pattern desired to be formed by the beamforming block 21 at the given ω for the m-th channel. As the directivity pattern is in the general case a function g(φ, θ) of two dimensions representing the angles of azimuth and elevation (φ, θ), the result of sampling it at a given frequency is a two-dimensional matrix. The vector gm(ω) consists of the concatenated columns of such a matrix of samples desired for the m-th channel at the given ω.
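The role of I and β can be illustrated with a regularized least-squares fit of the array response to the desired pattern. Because the formula is not reproduced in the text, the form used below, w = (H H^H + βI)^-1 H g*, is an assumption chosen so that w^H applied to each column of H approximates the corresponding sample of g; the function name and shapes are likewise illustrative.

```python
import numpy as np

def pattern_matching_weights(H, g, beta=0.0):
    """Fit weights w so that w^H H approximates the desired pattern g.

    H    : (N, L) frequency response matrix for one frequency
    g    : (L,) samples of the desired directivity pattern
    beta : non-negative regularization constant; zero when the system is
           well conditioned, a small positive value otherwise
    """
    n_sensors = H.shape[0]
    # Regularized normal-equation matrix H H^H + beta I, size N x N.
    A = H @ H.conj().T + beta * np.eye(n_sensors)
    # Solve (H H^H + beta I) w = H conj(g) without forming an explicit inverse.
    return np.linalg.solve(A, H @ np.conj(g))
```

The pattern actually achieved can be inspected as np.conj(w) @ H and compared sample by sample with g; increasing β trades accuracy of the fit for smaller weights.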
A typical choice of the shape of the directivity pattern is a trigonometric polynomial function as described in Boaz Rafaely, “Fundamentals of Spherical Array Processing”, and in “Design of Circular Differential Microphone Arrays” by Jacob Benesty. In the latter, formulas for hyper- and super-cardioids in particular are given. Formula (2.34) in section 2.2 of said book defines the general form of the trigonometric polynomial used in the example below.
Let us assume a simple case of four instruments. The number of instruments imposes a number of corresponding channels M=4. The number of channels imposes an order of cardioid as the number of nullified directions depends on the order and for each instrument the remaining three ones should be nullified for the corresponding channel. For the given example the order of the cardioid has to be equal to 3.
Under the above assumptions the initial problem is to design four vectors g1, g2, g3, g4 containing samples of the directivity patterns corresponding to the four channels. These four radiation patterns have to be mutually orthogonal. Additionally, each of them has to meet the condition of having maximal gain corresponding to one instrument and zero gain corresponding to the remaining three. Assuming that the instruments are located at the same elevation and distributed uniformly in terms of the azimuth angle, at 0°, 90°, 180°, 270° respectively, directivity patterns having cross-sections presented in
Directivity patterns should be sampled in such a manner so as to obtain vector gm of concatenated columns that has a length equal to the one dimension of H matrix of impulse responses transformed to the frequency domain, allowing matrix multiplication HHgm.
The kind of post-processing filtration applied depends on the kind of recorded sound and can be selected by the user. A functional block diagram of an exemplary filtration block 24 is presented in
After optional execution of selected processing operations, processed signal is transformed to the time domain and outputted. The selection of processing operation is done by the steering unit 20 and depends on the instructions given by the user with the user interface when the sources were defined. Also, additional information from particular blocks executing particular processing operations may be returned to the steering unit 20.
In the simplest example of Wiener filtration, during processing of the signal in channel chy, the signal from channel ch that is considered unwanted is adaptively filtered and subtracted from the signal from channel chy to meet the minimum energy criterion. That approach allows for elimination of signals reflected from walls and cross-talked into another channel.
A further application of Wiener filtering consists in minimization of the energy by subtracting more than one filtered channel from the useful signal. It is applied in the frequency domain in a frame-by-frame manner with a frame of 2048 sound samples. That means that in each step a matrix of 2048 samples×N sensors is processed. Information regarding beamforming criteria is supplied to the filtration block 24 from the steering unit 20.
The filtration block 24 receives M channels ch1, . . . , chM from the beamformer block 21.
Let us consider signal ch1 in the first channel corresponding to the first source of sound, e.g. the first instrument. Signals of remaining sound sources are in ch1 treated as interferences. These signals are contained in the remaining channels ch2, . . . , chM.
As all operations are executed in the frequency domain, the following vectors and matrices are used:
Ch1(ωi)=FFT{ch1(tk)}
U(ωi)=[FFT{ch2(tk)}, FFT{ch3(tk)}, . . . , FFT{chM(tk)}]
Ch1(ωi) represents the spectrum samples resulting from transformation of the frame of 2048 samples of the signal ch1 in channel 1 to the frequency domain with the fast Fourier transform. Matrix U represents the spectral samples of the remaining channels.
As beamformer block 21 operates also in frequency domain the processing can be done in the same frames without rebuilding the frames in between.
Let us consider frame number nf and the signal in channel 1 during this frame. Its initial spectrum is denoted by Ch1(ωi). The spectrum Ch1′(ωi) of the signal after filtration is calculated according to the pair of formulas:
where UT stands for the transposition of U, Ch1′* stands for the conjugation of Ch1′, α is a constant that satisfies the criterion α<2 and in this example is equal to 1.2, and Pest is an estimate of the average power calculated over subsequent frames according to the formula:
where γ is the so-called forgetting factor, γ∈(0,1). In this example γ is equal to 0.4.
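The frame-by-frame operation described above resembles a normalized adaptive canceller working independently in every frequency bin. Because the formulas are not reproduced in the text, the sketch below is only one plausible form: the per-bin coefficient array h, the recursion used for the power estimate and the small constant eps are assumptions, and the defaults α = 1.2 and γ = 0.4 are taken from this example.

```python
import numpy as np

def cancel_interference_frame(ch1, U, h, p_est, alpha=1.2, gamma=0.4, eps=1e-12):
    """Process one frame of channel 1 in the frequency domain.

    ch1   : (n_bins,) spectrum Ch1 of the channel being cleaned
    U     : (n_bins, M-1) spectra of the remaining channels
    h     : (n_bins, M-1) adaptive coefficients, one per bin and channel
    p_est : (n_bins,) running estimate of the power of the remaining channels
    Returns the filtered spectrum Ch1' and the updated h and p_est.
    """
    # Subtract the adaptively filtered interfering channels from channel 1.
    ch1_filtered = ch1 - np.sum(h * U, axis=1)
    # Recursive power estimate with forgetting factor gamma (assumed form).
    p_est = gamma * p_est + (1.0 - gamma) * np.sum(np.abs(U) ** 2, axis=1)
    # Normalized gradient step that reduces the energy of the output.
    h = h + alpha * np.conj(U) * ch1_filtered[:, None] / (p_est[:, None] + eps)
    return ch1_filtered, h, p_est
```

The function is called once per frame, with h and p_est initialized to zeros before the first frame and carried over from one frame to the next. Normalizing by the smoothed power rather than the instantaneous one makes the step size less sensitive to frame-to-frame fluctuations of the interfering channels.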
Additionally, feedback regarding possible tuning of the operation of the beamforming block 21 is optionally returned to the steering unit 20.
Possible implementations of Wiener filtration are described in detail in I. A. McCowan, “Robust Speech Recognition using Microphone Arrays”, PhD Thesis, Queensland University of Technology, Australia 2001.
Kalman filtration is used for speech and instrument tracking, for removing pulsed, broadband sounds, e.g. drums, and for eliminating interference caused by side lobes if they appear either in the directivity pattern of a particular audio sensor or in the synthesized directivity pattern of the whole probe 1. Possible implementations of Kalman filtering are discussed in Adaptive Filter Theory. In the present example it is implemented for vocals, as described in “Springer Handbook of Speech Processing”, “The Kalman Filter”, section 8.4. The voice is modelled according to the autoregressive model. The same model is used for other instruments.
PCA tracking is used for removing non-percussive sounds from the drums channel and for removing drum sounds from polyphonic channels. An implementation is disclosed in the article by Daniel P. Jarrett, Emanuël A. P. Habets, Patrick A. Naylor, “Eigenbeam-based Acoustic Source Tracking in Noisy Reverberant Environments”, Conference Record of the Forty Fourth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), 2010.
An alternative implementation is disclosed in “Extraction of drum tracks from polyphonic music using independent subspace analysis” by Christian Uhle, Christian Dittmar, Thomas Sporer, 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), April 2003, Nara, Japan.
Spectrum masking consists in baseband filtration based on the tag information regarding instrument type and resulting bandwidth occupation.
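A minimal sketch of such masking is given below, assuming that the tag of a source has already been translated into a lower and an upper band limit. The function name and the idea of passing the limits as plain numbers are illustrative; the description does not specify the actual band limits associated with each tag.

```python
import numpy as np

def band_mask(spectrum, freqs_hz, f_low_hz, f_high_hz):
    """Zero every spectral sample outside the band occupied by the source.

    spectrum  : (n_bins,) complex spectrum of one channel for one frame
    freqs_hz  : (n_bins,) centre frequency of each bin in Hz
    f_low_hz, f_high_hz : band limits derived from the tag (assumed inputs)
    """
    inside = (freqs_hz >= f_low_hz) & (freqs_hz <= f_high_hz)
    # Keep the samples inside the band and set all others to zero.
    return np.where(inside, spectrum, 0.0)
```

In practice the limits would come from a lookup keyed by the tag assigned to the source by the user, and a smoother transition at the band edges may be preferable to the hard cut used here.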
Model filtering is based on modeling sources and extraction of model parameters. Three models are used:
Identification of the source model and its parameters allows more effective elimination of interference during the recording that follows.
Guitar in the first channel ch1 and drums in the second channel ch2. The transient model describes time slots in which drums are being hit and recorded in ch2. Drums generate pulsed broadband sound. That means that elimination of drums is likely to affect the useful signal in the guitar channel ch1. Using information from analysis of ch2 signal allows applying the pulsed interference elimination techniques exactly in the time slots when they appear and hence reduces the risk of affecting the useful signal.
Guitar in the first channel ch1 and violin in the third channel ch3. The sound of a guitar, which is the useful signal in ch1, is represented by a model of stable tone trajectories and transients without FM modulation. Conversely, the sound of a violin, which is the useful signal in ch3, has trajectories with apparent FM modulations. Masking all components of the sound in ch1 having FM modulations allows enhancing the signal-to-interference ratio, as only the sound of the violin, which is considered an interference in ch1, is thereby suppressed. Inverse masking in ch3 allows elimination of the guitar from the violin channel.
Those skilled in the art will easily recognize that numerous other signal processing and filtration techniques can be used to extract sound of instrument from the channel and use it for elimination of this instrument from the other channels by processing it with Wiener or some other adaptive filtration schemes.
Also, those skilled in the art will easily recognize that once the concept of using different beamforming methods in different frequency bands to form a composite spectrum chm(ωi), and of finally extracting the resulting signal by inverse transformation of the composite spectrum, is disclosed, there is a plurality of methods to use and numerous divisions into frequency bands can be applied.
It is also apparent from the present description that the disclosed processing can be applied in essentially any audio sensor array, not limited to cylindrical or spherical ones, and not necessarily even having the shape of a solid of revolution.
Also specialists in the field of signal processing are able to routinely apply modes of filtering adapted to particular sound sources not mentioned above.
Number | Date | Country | Kind |
---|---|---|---|
PL416068 | Feb 2016 | PL | national |
PL417913 | Jul 2016 | PL | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2017/050714 | 2/9/2017 | WO | 00 |