This application is a Section 371 National Stage Application of International Application No. PCT/FR2017/053622, filed Dec. 15, 2017, the content of which is incorporated herein by reference in its entirety, and published as WO 2018/115666 on Jun. 28, 2018, not in English.
This invention relates to the field of audio or acoustic signal processing, and more particularly to the processing of actual multichannel sound content in ambiophonic format (or “ambisonic” hereinafter).
The ambisonic technique consists in using in each frequency band a sub-set of channels that have sought directivity characteristics. By way of example of application, mention can be made of:
Ambisonics consists in projecting an acoustic field onto a basis of spherical harmonic functions (basis shown in
where P̃mn(cos ϕ) is a polar function involving the Legendre polynomial:
As shown in
In practice, an actual ambisonic encoding is carried out using a network of sensors, generally distributed over a sphere, which are combined in order to synthesise an ambisonic content of which the channels best respect the directivities of the spherical harmonics (as shown in
The basic principles of ambisonic encoding are described hereinafter.
The ambisonic formalism, initially limited to the representation of spherical harmonic functions of order 1, was subsequently extended to the higher orders. The ambisonic formalism with a higher number of components is commonly referred to as “Higher Order Ambisonics” (or “HOA” hereinafter).
To each order m corresponds 2m+1 spherical harmonic functions, as shown in
The term “ambisonic components” hereinafter means the ambisonic signal in each ambisonic channel, in reference to the “vector components” in a vector base that would be formed by each spherical harmonic function. Thus for example, it is possible to count:
The ambisonic signals captured for these various components are then distributed over a number N of channels, which is deduced from the maximum order M that is to be captured in the sound stage. For example, if a sound stage is captured with an ambisonic microphone having 20 piezoelectric capsules, the maximum captured ambisonic order is M=3, since the number of channels N=(M+1)² cannot exceed 20. The number of ambisonic components considered is then 7+5+3+1=16, so that the number of channels is N=16, as also given by the relationship N=(M+1)² with M=3.
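The channel-count arithmetic above can be sketched as follows. This is a minimal illustration; the function names are hypothetical and are not taken from the source.

```python
def components_per_order(m: int) -> int:
    """Number of spherical harmonic functions at ambisonic order m."""
    return 2 * m + 1


def channel_count(max_order: int) -> int:
    """Total number of ambisonic channels up to and including max_order."""
    return sum(components_per_order(m) for m in range(max_order + 1))


def max_order_for_capsules(n_capsules: int) -> int:
    """Largest order M such that (M + 1)**2 does not exceed n_capsules."""
    order = 0
    while (order + 2) ** 2 <= n_capsules:
        order += 1
    return order
```

For a microphone with 20 capsules, `max_order_for_capsules(20)` gives order 3 and `channel_count(3)` gives the 16 channels of the example.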
The ambisonic capture x(t) of order M and comprised of N sound sources si of incidence (θi, ϕi) propagating in a free field can then be written mathematically in the following matrix form:
where A is a matrix referred to as the “mixing matrix”, of dimensions (M+1)²×N, of which each column Ai contains the mixing coefficients of the source i.
Physically, this matrix A corresponds to the encoding coefficients of each source i, associated with the direction of that source. In order to extract the sources from such a content, a matrix B referred to as the “separating matrix”, inverse of the matrix A, must be estimated. In order to obtain the matrix B, a step of blind source separation can be implemented, for example by using an independent component analysis (or “ICA” hereinafter) algorithm, or a principal component analysis algorithm. The matrix B=A⁻¹ allows for the extraction of the sources via the following operation:
s(t)=Bx(t)
This step amounts to forming beams (or “beamforming” hereinafter), i.e. combining various channels that have separate directivities in order to create a new component that has the desired directivity. An example of beamforming in order to extract three components, for a HOA content of order 2, 4 or 6, is shown in
In practice, generating ambisonic signals x(t)=As(t) passes through an intermediate step of microphone capture such as shown in
s(t)=Bx(t)
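The relationship between mixing and separation can be illustrated numerically. The sketch below assumes that the mixing matrix A is known and uses random values in place of actual spherical harmonic coefficients; in the blind separation case described in the text, B would instead be estimated from x(t) alone. NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)

n_channels = 9      # order-2 content: (2 + 1)**2 channels
n_sources = 3
n_samples = 1000

# Hypothetical mixing matrix: random values stand in for the
# spherical harmonic coefficients of each source direction.
A = rng.standard_normal((n_channels, n_sources))
s = rng.standard_normal((n_sources, n_samples))

x = A @ s                 # ambisonic capture x(t) = A s(t)
B = np.linalg.pinv(A)     # separating matrix (pseudo-inverse of A)
s_hat = B @ x             # extracted sources s(t) = B x(t)

max_error = float(np.max(np.abs(s_hat - s)))
```

Because A has more lines than columns, the pseudo-inverse is used here in place of an ordinary inverse.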
To decode an HOA content on a system of speakers, the approach is similar. Ambisonic signals in N channels x1, x2, . . . , xN are acquired but, here, instead of considering s(t) as the sum of the contributions of sources, s(t) is considered as the sum of the signals emitted by a set of speakers (which then effectively makes it possible to supply these speakers with the signals s1, s2, s3 . . . ). The decoding matrix B is therefore formulated here using the positions of the speakers of a sound restitution system, and the signals intended for the speakers are extracted according to the same method as the one used for the source separation.
In reality, the sensors used have physical limitations that cause a degradation in the microphone encoding, and therefore a degradation in the directivity of the ambisonic components. For example, the encoding of the high frequencies is degraded when the inter-sensor spacing becomes approximately greater than one half-wavelength: this is due to the phenomenon of spatial aliasing. At low frequencies, the microphone capsules tend to become omnidirectional and it becomes impossible to obtain the sought directivities. More precisely, the degradations at low frequencies are more marked when it entails synthesising ambisonic components of a high order. Generally, associated directivities are more complex and therefore more sensitive to variations in the properties of the sensors.
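The half-wavelength condition mentioned above gives a rough upper frequency limit for a given inter-sensor spacing. The sketch below is an order-of-magnitude estimate only; the speed of sound and the function name are assumptions, and the actual limit depends on the microphone geometry.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 °C


def aliasing_limit_hz(spacing_m: float, c: float = SPEED_OF_SOUND) -> float:
    """Frequency at which the spacing equals one half-wavelength."""
    if spacing_m <= 0:
        raise ValueError("spacing must be positive")
    return c / (2.0 * spacing_m)
```

For example, a spacing of 2 cm corresponds to `aliasing_limit_hz(0.02)`, i.e. 8575 Hz under the assumed speed of sound.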
More generally, when the theoretical directivity is not respected, the beamforming carried out no longer makes it possible to suitably extract the sought components. For example, this results in the appearance of interferences during source separation. This can also result in a degradation of the spatial resolution in the frequency bands concerned during a multichannel diffusion. More particularly, a loss of energy in the low frequencies is observed in the high orders during encoding. As a result, the sources extracted using channels of high orders can lose part of their energy in the frequencies concerned.
Beamforming is already used for source separation and for multichannel decoding, whether for the restitution of an ideal ambisonic content or of a multichannel capture. For source separation, an inversion of the mixing matrix estimated via independent component analysis is used in order to extract the sources. For multichannel decoding, the matrix of the ambisonic coefficients relating to the speakers can be inverted. On the other hand, the processing of an actual ambisonic content, affected by the physical limitations of the recording system, is not addressed in prior art. The only solution currently proposed is to limit the total bandwidth of the extracted sources, which is not satisfactory.
This invention improves this situation.
It proposes for this purpose a method, implemented by computer means, for processing an ambisonic content comprising a plurality of ambisonic components of a plurality of orders defining a succession of ambisonic channels in each of which an ambisonic component is represented, the method comprising:
The term “sound source” here means:
A frequency band can be defined by several frequency bands or frequency sub-bands.
The developing of ambisonic decoding sub-matrices for each frequency band, and for each ambisonic order, makes it possible to benefit in each frequency band from a maximum number of ambisonic channels which are actually valid in each sub-matrix, in order to restore a decoded signal that is not or is hardly degraded.
According to an embodiment, each ambisonic decoding sub-matrix is associated with a frequency band selected according to a validity criterion of the ambisonic components of the order with which said sub-matrix is associated, in said selected frequency band.
Such an embodiment makes it possible to isolate the ambisonic components that form each order, so as to process them in the range of frequencies wherein they are valid. The term “valid” means compliance with the theoretical ambisonic representation, such as for example the order m=4 in the frequency band 4000 to 6000 Hz in the example of
Thus, in an embodiment, the validity criterion of the components can be defined by conditions for capturing said ambisonic components, by at least one ambisonic microphone.
In this embodiment for example, the method can further comprise:
Knowledge of the data of the ambisonic microphone used for the ambisonic capture makes it possible to refine the determining of the frequency bands selected for the development of the sub-matrices. Indeed, the ambisonic processing is done on sub-matrices of which the ambisonic components strictly meet the validity criterion in the associated frequency bands.
However, the data of the ambisonic microphone used for the capturing are not always accessible. Alternatively, it is therefore possible to provide for the determining of the frequency bands using a chart established beforehand using measurements taken over a plurality of ambisonic microphones, so as to establish “average” frequency ranges, associated with an ambisonic order, wherein the ambisonic components of each ambisonic order generally meet the aforementioned validity criterion.
Thus, according to an embodiment, each ambisonic decoding sub-matrix being associated with an ambisonic order and a frequency band selected for this ambisonic order,
In an embodiment where the frequency bands are obtained by fast Fourier transform (FFT), a frequency band associated with an ambisonic order can comprise several frequency bands FFT. Thus, several frequency bands can be associated with an ambisonic order.
In an example of this embodiment where an FFT is used, for a signal sampled at 48 kHz and for an FFT size of 4096 points (2^12), the bands no. 10 to 910 correspond approximately to the frequency band 100 Hz to 10 kHz and are associated with ambisonic order m=1.
Thus, it is possible to define a validity criterion based on average values of the frequency bands for each ambisonic order, even if the data of the ambisonic microphone used for the capturing of ambisonic components is not accessible.
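The correspondence between FFT band numbers and frequencies in the example above can be sketched as follows. The function names are hypothetical, and the sketch assumes that band number k is centred on k times the sampling rate divided by the FFT size.

```python
import math


def bin_to_hz(index: int, sample_rate: float, fft_size: int) -> float:
    """Centre frequency of FFT bin `index`."""
    return index * sample_rate / fft_size


def band_to_bins(f_low: float, f_high: float,
                 sample_rate: float, fft_size: int) -> range:
    """Bins whose centre frequency lies in [f_low, f_high]."""
    resolution = sample_rate / fft_size
    first = math.ceil(f_low / resolution)
    last = math.floor(f_high / resolution)
    return range(first, last + 1)
```

Under this assumption, with a 48 kHz sampling rate and 4096 points, band 10 is centred near 117 Hz and band 910 near 10.66 kHz, so the band numbers quoted in the text are best read as approximate.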
According to a particular embodiment, the processing of the ambisonic decoding matrix comprises:
It is thus understood that a frequency filtering of the components of order m=4 between 4000 to 6000 Hz, in the example of
In an embodiment, the processing of the ambisonic content is conducted for a source separation and said decoding matrix is a blind source separation matrix developed from ambisonic components.
For example, the separating matrix can be developed using ambisonic components filtered at a selected frequency band and preferably wherein the number of valid ambisonic channels according to the aforementioned criterion is maximal.
Thus, the channels are retained for a representation accuracy at such an ambisonic order that is the highest, but also in order to retain a maximum of correctly represented channels in this frequency band, at lower ambisonic orders.
In this embodiment, it is possible to simplify the mixing sub-matrices before the inversion thereof, via a reduction in the number of columns of each sub-matrix, with the remaining columns of the sub-matrices being selected in such a way as to retain signals with the highest energies after application of the decoding sub-matrices.
Indeed, retaining the signals with the highest energy makes it possible to better represent, and therefore better restore, the sound field.
As a complement or as an alternative, it is possible to select to favour extracted signals that are the most decorrelated, or the most independent according to a selected independence criterion.
Thus, in this embodiment, mixing sub-matrices are simplified before the inversion thereof, via a reduction in the number of columns of each sub-matrix, with the remaining columns of the sub-matrices being selected in such a way as to retain the least correlated signals after application of the decoding sub-matrices.
Moreover, in a reverberating environment, the signal is formed of direct fields coming from the “free field” equivalent propagation of each source and from reflections on the walls of the acoustic environment. Thus, in an alternative or complementary embodiment, mixing sub-matrices are simplified before the inversion thereof, via a reduction in the number of columns of each sub-matrix, with the remaining columns of the sub-matrices being selected in such a way as to retain the signals corresponding to direct sound fields after application of the decoding sub-matrices.
Of course, in an embodiment where the processing of the ambisonic content is conducted for an ambisonic restitution on a plurality of speakers, the aforementioned decoding matrix can be an inverse matrix of relative spatial positions of the speakers.
In an embodiment shown hereinafter in reference to
This invention also relates to a computer program comprising instructions for implementing the method when this program is executed by a processor. An example logical diagram of the general algorithm of such a program is shown in
This invention also relates to a computer device comprising:
An example of such a device is shown in
This invention thus proposes to use the formation of beams using an actual ambisonic encoding by taking advantage, in each frequency band, of all of the channels of which the directivity respects the ambisonic formalism. An embodiment presented hereinabove then makes it possible to determine one or several mixing matrices Ak, corresponding to sub-matrices obtained from the theoretical matrix A, and each formulated in a frequency band, then inverted in order to give the decoding matrices Bk.
Thus, the invention offers a generic processing of any ambisonic content, and in particular actual, possibly affected by the physical limitations of a recording system, and this without any constraint aimed at limiting the total bandwidth of the extracted sources.
Other advantages and characteristics of the invention shall appear when reading the detailed description hereinafter of embodiments of the invention, and when examining the accompanying drawings wherein:
The general diagram of a global method of ambisonic processing in terms of the invention is shown in
In the step S1, there is an ambisonic content x(t) comprising a plurality of ambisonic components CA, of successive orders m=0, 1, . . . , M (with for example M=4), coming from a recording, or from a “capture”, by at least one ambisonic microphone MIC. An ambisonic microphone is a microphone comprised of a plurality of microphone capsules generally distributed spherically and as evenly as possible. These capsules play the role of sound signal sensors. The microphone capsules are arranged on the ambisonic microphone in such a way as to capture the sound signals according to their directivity in space. As shown in
However, frequency variations in the accuracy of the ambisonic representation of each order of
The step S2 therefore aims to recover the data that characterises the ambisonic microphone MIC (and possibly the conditions for capturing the ambisonic content x(t), and/or the reverberation conditions during the capturing, or others).
More generally, a characterising piece of data of the ambisonic microphone MIC can be the inter-capsule spacing. Indeed, the encoding of high frequencies is degraded when the inter-capsule spacing becomes greater than one half-wavelength. This is due to the phenomenon of spatial aliasing. Inversely, for a low frequency signal, microphone capsules that are too close cannot generate the designed directivity.
In the step S3, it is possible to apply an analysis filter bank AFB to the ambisonic content x(t) so as to then select, in the step S31, ambisonic component signals filtered in the range of frequencies wherein the ambisonic representation for a given order m is the most accurate (thus respecting a “validity criterion” of the ambisonic representation), and this according to the data of the microphone defined hereinabove.
According to the type of processing applied to the ambisonic content x(t), between a source separation processing SAS or a processing for a restitution on speakers RES, the step S4 aims to obtain a decoding matrix B, according to the type of processing selected. In the case of an ambisonic restitution on speakers, the decoding matrix B is the inverse of a matrix A containing coefficients proper to the spatial positions of the speakers used for the restitution.
In the case of source separation, the decoding matrix B is initially developed in the step S4 for the purpose of a blind source separation processing using filtered and selected ambisonic components. More particularly, this decoding matrix B is developed for the frequency band containing the largest number of valid ambisonic channels (and the highest order able to be obtained M).
The determining of the frequency bands of validity of the various ambisonic orders can be adapted to the ambisonic microphone that was used for the capturing of the ambisonic components to be decoded. To do this, it is possible for example to use as a base the frequency variations in the accuracy of the ambisonic representation for various orders m, of the type shown in
More generally, an “average” rate of the frequency variations in the accuracy of the ambisonic representation can be determined for the various orders m for different ambisonic microphone models, and these average rates can be used if this data is not available at decoding.
In the step S7, at least two matrices B1, B2 are determined, coming from the matrix reduction of the decoding matrix B for each frequency sub-band (in the example shown the frequency sub-bands f1 and f2). A more accurate embodiment of this matrix reduction will be described hereinafter in reference to
In the step S9, the vectors of extracted signals s1 (1 for k=1) and s2 (2 for k=2) are combined in order to obtain the full-band reconstructed signals (by application for example of a synthesis filter bank).
In the step S4, as described hereinabove, the decoding matrix B defined hereinabove is obtained. In the step S5 it is possible to carry out an inversion of this decoding matrix B (or equivalently, a determining of its pseudo-inverse) in order to obtain the corresponding mixing matrix A (step S51). In the case of source separation, the mixing matrix A can thus contain coefficients relative to respective positions of sound sources to be extracted. In the case of a restitution on speakers, the mixing matrix A can contain coefficients relative to the position of the speakers on which it is desired to restore the decoded signals. More precisely, the lines of the mixing matrix A correspond to the successive ambisonic channels (defining successively the orders m=0 to m=M, where M is the maximum ambisonic order available) and its columns correspond to the sources or to the speakers.
In the step S6, it is possible to reduce the dimensions of the mixing matrix A, in order to obtain sub-matrices A1, A2. This is a matrix reduction of which the number of lines corresponds to the numbers of ambisonic channels for each order. Typically, if the ambisonic signals are indeed encoded in the band from 100 to 1000 Hz, where the order m=1 is indeed respected (at least for the ambisonic microphone of
Regarding the sub-matrix A2, these lines of the global matrix A can be used, as well as the following, up to the line:
For the mixing matrix A2, corresponding to the order 2 of the ambisonic content x(t), and therefore to the sub-band f2, nine lines are therefore retained, corresponding to the nine channels of order 2, and the number of sources to be extracted in columns.
Each mixing sub-matrix thus obtained is of dimension N×Ntarget, with Ntarget the number of sources coming from the blind source separation or the number of speakers provided for a restitution.
In the case of a restitution on speakers, the number of speakers is preferably equal to or greater than the number of lines. For example, for the mixing matrix A1 of four lines, only a set of four columns may be retained. In the case of source separation, the number of columns can be less than or equal to the number of lines. For example, for the mixing matrix A1 of four lines, columns can be suppressed so as to retain, for example, the sources of which the signals are of greater energy and/or those which are the least correlated (sources that are the least “mixed” possible) and/or the signals that correspond to the direct field of the sources, or others.
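One possible way of retaining the columns associated with the signals of greatest energy is sketched below. The function name and the way the energies are measured (sum of squared samples of each associated signal) are assumptions made for illustration; the correlation and direct-field criteria are not covered. NumPy is assumed to be available.

```python
import numpy as np


def keep_strongest_columns(A_k: np.ndarray, signals: np.ndarray,
                           n_keep: int) -> tuple[np.ndarray, np.ndarray]:
    """Keep the n_keep columns of A_k whose signals carry the most energy.

    A_k     : (channels, sources) mixing sub-matrix
    signals : (sources, samples) signals associated with each column
    Returns the reduced sub-matrix and the retained column indices.
    """
    energy = np.sum(signals ** 2, axis=1)
    strongest = np.argsort(energy)[::-1][:n_keep]
    kept = np.sort(strongest)          # preserve the original column order
    return A_k[:, kept], kept
```

The retained indices are returned alongside the reduced sub-matrix so that the extracted signals can still be matched to their original sources.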
In the step S71 an inversion of each mixing sub-matrix A1, A2 is carried out in order to respectively obtain the decoding sub-matrices B1, B2 presented hereinabove (step S7). Passing through the mixing matrix A makes it possible in particular to retain satisfactory energy levels of the ambisonic components linked to each order, despite the matrix reductions. In other terms, the steps S5 to S71 make it possible to “refine” the decoding of the ambisonic content x(t).
The word “channels” is used to refer to the ambisonic microphone sources and “sources” for the signals to be extracted (sources effectively to be extracted or the supply signals of the speakers). In the step S1, there is an ambisonic content x(t) of order M, comprising a plurality of recorded ambisonic channels N to be processed. Generally, the number of recorded ambisonic channels is equal to N=(M+1)². In the step S2, there is data relative to the ambisonic capture of the content x(t) (data relative to the ambisonic microphone MIC used, etc.).
Knowing the validity limits of the microphone encoding, a frequency band is determined for each ambisonic order. A filter bank allowing for a reconstruction is applied to the N ambisonic channels in the step S3, in order to give K sub-bands noted as xk. The sub-bands are selected to correspond to the different validity ranges of the microphone encoding.
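A filter bank allowing for a reconstruction can be approximated, for illustration, by assigning each FFT bin of the signal to exactly one sub-band. This is a simplified stand-in chosen for brevity, not the filter bank of the source; the function name and the band edges are hypothetical. NumPy is assumed to be available.

```python
import numpy as np


def split_bands(x: np.ndarray, sample_rate: float,
                edges_hz: list[float]) -> list[np.ndarray]:
    """Split x (channels, samples) into len(edges_hz) + 1 sub-bands.

    Each FFT bin is assigned to exactly one band, so the bands sum
    back to x.
    """
    n = x.shape[-1]
    spectrum = np.fft.rfft(x, axis=-1)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    band_index = np.digitize(freqs, edges_hz)   # band number of each bin
    bands = []
    for k in range(len(edges_hz) + 1):
        masked = np.where(band_index == k, spectrum, 0.0)
        bands.append(np.fft.irfft(masked, n=n, axis=-1))
    return bands
```

Since the masks partition the spectrum, adding the sub-band signals restores the input, which is the reconstruction property relied upon at the synthesis stage.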
In a particular embodiment in the step S4A shown as a solid line, a source separation matrix B developed according to the frequency filtered ambisonic components (top arrow coming onto rectangle S4A) is used. More particularly, a blind source separation method is applied in the sub-band containing the most valid channels, in order to obtain a separating matrix B of dimensions Ntarget×N, Ntarget being the number of sources obtained by the blind source separation in the selected frequency sub-band.
The valid channels are determined using a validity criterion relative to each order of the ambisonic content x(t) according to each frequency band of the filter bank. More generally, in order to maximise the quality of the source separation, a frequency band is selected that has the most ambisonic components that are valid. The term “valid” means components of which the energy criteria or directivity were not biased during the ambisonic capture, as presented hereinabove in reference to
For example, the ambisonic channels of order 1 tend to be valid in a frequency band ranging from 100 Hz to about 10 kHz. The frequency band in which the ambisonic channels of order 2 can be more generally valid can for example range from 1 kHz to 9 kHz, etc.
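A chart of validity ranges of this kind can be represented as a simple lookup. In the sketch below, the ranges for orders 1 and 2 are the illustrative figures given above, while the range for order 0 and all names are assumptions.

```python
# Validity ranges in Hz per ambisonic order. Orders 1 and 2 use the
# illustrative figures from the text; the order-0 range is an assumption.
VALIDITY_HZ = {
    0: (20.0, 20000.0),   # assumed
    1: (100.0, 10000.0),
    2: (1000.0, 9000.0),
}


def highest_valid_order(freq_hz: float) -> int:
    """Highest order M such that every order 0..M is valid at freq_hz."""
    order = -1
    for m in sorted(VALIDITY_HZ):
        low, high = VALIDITY_HZ[m]
        if not (low <= freq_hz <= high):
            break
        order = m
    return order


def valid_channel_count(freq_hz: float) -> int:
    """Number of valid channels, (M + 1)**2, at freq_hz."""
    return (highest_valid_order(freq_hz) + 1) ** 2
```

With this chart, a frequency of 500 Hz yields four valid channels and a frequency of 5 kHz yields nine, so a sub-band around 5 kHz would be preferred for estimating the separating matrix.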
In an alternative embodiment for the purpose of a restitution of a sound stage over several speakers (more than two in general), in the step S4B (shown as a dotted line in
Returning to the general processing (for a restitution or for a separation of sources), in the step S5, the “theoretical” mixing matrix A (for the two aforementioned alternatives) is constructed through inversion of B. For source separation, the mixing matrix is comprised of N lines and of Ntarget columns, the ith column containing the spherical harmonic coefficients, relative to the coordinates (θi, ϕi) of the source si. Hereinbelow is an example of a mixing matrix A in the case of a separation of sources for an ambisonic content of order 2 comprised of five sources:
For the diffusion on speakers, A is comprised of N lines and of a minimum of N columns, the ith column containing the spherical harmonic coefficients, relative to the coordinates (θi, ϕi) of the speaker i.
In the step S6, and for each sub-band k, a mixing sub-matrix Ak is constructed, such that Ak is a truncated version of the matrix A, retaining only the Nk lines that correspond to the channels that are effectively valid in this sub-band k.
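The truncation of step S6 can be sketched as a selection of lines (rows). The sketch assumes that the lines of A are ordered by increasing ambisonic order, as stated for the mixing matrix above, and that the valid channels of a sub-band are all those up to a given order; the function name is hypothetical. NumPy is assumed to be available.

```python
import numpy as np


def truncate_rows(A: np.ndarray, valid_order: int) -> np.ndarray:
    """Keep the first (valid_order + 1)**2 rows (channels) of A."""
    n_rows = (valid_order + 1) ** 2
    if n_rows > A.shape[0]:
        raise ValueError("A has fewer channels than the requested order")
    return A[:n_rows, :]
```

For a 9-line matrix A of order 2, `truncate_rows(A, 1)` returns the four lines of orders 0 and 1, and `truncate_rows(A, 2)` returns A unchanged.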
For source separation, if Nk is less than the number of sources Ntarget sought in the sub-band, only a set of Ntarget,k columns (with Ntarget,k less than or equal to Nk) is retained, selected according to energy criteria (for example by separating the sources that have the largest contribution) or according to other criteria of interest as defined hereinabove. The matrix Ak thus has dimensions Nk×Ntarget,k, with Ntarget,k=min(Nk, Ntarget) for example. Hereinbelow is an example of a truncated matrix Ak(4×4) at ambisonic order 1:
For the restitution on speakers, a set of Nk speakers is selected for the restitution, and Ak therefore has for dimensions Nk×Nk.
In the step S7, the matrix Ak is inverted in order to give Bk. When the sub-matrix Ak is not a square matrix, there are an infinite number of possibilities for the inversion. A pseudo-inversion can be applied, or an inversion by applying additional constraints (for example selection of the solution that gives the most direct beamforming, or that minimises the secondary lobes).
Generally, the term “matrix inversion” means a conventional matrix inversion as well as a pseudo-inversion as presented hereinabove.
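The pseudo-inversion of a non-square sub-matrix can be illustrated with the Moore-Penrose pseudo-inverse, which is one particular choice among the possible inversions and does not implement the additional constraints mentioned above. The matrix values below are random placeholders. NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 4 x 3 mixing sub-matrix: 4 valid channels, 3 sources.
A_k = rng.standard_normal((4, 3))

# Moore-Penrose pseudo-inverse: the least-squares choice of B_k.
B_k = np.linalg.pinv(A_k)            # shape (3, 4)

# B_k is a left inverse of A_k when A_k has full column rank.
left_identity_error = float(np.max(np.abs(B_k @ A_k - np.eye(3))))
```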
Then, in the step S8, Bk is applied to the sub-band xk in order to obtain the signals sk such that
sk=Bk·xk
Once the sources have been extracted in each sub-band, the corresponding full-band signals are reconstructed by a synthesis filter bank using the sub-band signals of the same direction, in the step S9.
Hereinbelow, a particular embodiment of the method according to the invention is described by way of example.
There is an ambisonic content of order 2 (9 channels) sampled at 16 kHz, noted as x(t) comprised of 3 sources that are to be extracted. The ambisonic encoding at orders 0 and 1 is valid between 200 Hz and 8000 Hz. The encoding of the order 2 is valid between 900 Hz and 8000 Hz.
A filter bank is implemented, formed from two frequency bands, 200 Hz-900 Hz (up to order 1) and 900 Hz-8000 Hz (use of order 2).
The filter bank is applied to x(t), in order to form x1(t) and x2(t). x1(t) is formed from 4 channels (ambisonics of order 1) and x2(t) contains 9 channels (ambisonics of order 2).
A separating matrix B of dimensions 3×9 is estimated via independent component analysis carried out in the sub-band 900 Hz-8000 Hz i.e. x2(t).
A theoretical mixing matrix A, of dimensions 9×3, is deduced by inversion of B, each column i containing the spherical harmonic coefficients of the source i.
At the same time, the matrices A1 and A2 are calculated using A in order to extract the sources in each sub-band:
The three sources are extracted in each respective sub-band of indexes 1 and 2:
s1=B1·x1 and s2=B2·x2
Then, the full-band sources are reconstituted by application of the synthesis filter bank to the signals in sub-bands s1 and s2, for example by adding, band by band (if the analysis filter bank was in base band):
s=s1+s2
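The two-band example above can be reproduced numerically under simplifying assumptions. In the sketch below, the mixing matrix is random rather than built from spherical harmonic coefficients, it is assumed to be known rather than estimated by independent component analysis, and the content of each band is generated directly instead of being produced by a filter bank. NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples = 256

# Hypothetical 9 x 3 mixing matrix for three sources at order 2.
A = rng.standard_normal((9, 3))

# Stand-ins for the source content in each band (assumed already split).
s_low = rng.standard_normal((3, n_samples))    # 200 Hz - 900 Hz content
s_high = rng.standard_normal((3, n_samples))   # 900 Hz - 8000 Hz content

A1 = A[:4, :]    # band 1: only orders 0 and 1 are valid (4 channels)
A2 = A           # band 2: order 2 is valid (9 channels)

x1 = A1 @ s_low      # sub-band capture restricted to its valid channels
x2 = A2 @ s_high

B1 = np.linalg.pinv(A1)   # 3 x 4 decoding sub-matrix
B2 = np.linalg.pinv(A2)   # 3 x 9 decoding sub-matrix

s1 = B1 @ x1
s2 = B2 @ x2
s = s1 + s2               # full-band reconstruction

reconstruction_error = float(np.max(np.abs(s - (s_low + s_high))))
```

Each band is decoded only with the channels that are valid in it, and the three sources are nevertheless recovered over both bands.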
In reference to
Of course, this invention is not limited to the embodiments described hereinabove by way of example; it extends to all alternatives.
Typically, the frequency ranges for which the ambisonic representation is valid are given hereinabove by way of example and can differ according to the nature of the ambisonic microphone or microphones used for the capturing, even the capturing conditions themselves.

| Number | Date | Country | Kind |
|---|---|---|---|
| 16 63079 | Dec 2016 | FR | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/FR2017/053622 | 12/15/2017 | WO | 00 |

| Publishing Document | Publishing Date | Country | Kind |
|---|---|---|---|
| WO2018/115666 | 6/28/2018 | WO | A |

| Number | Date | Country |
|---|---|---|
| 20190335291 A1 | Oct 2019 | US |