The present invention relates to the technical field of processing signals representative of a sound field.
In particular, it relates to a method for converting a first set of signals representative of a sound field into a second set of signals and an associated electronic device.
It has already been proposed to convert a first set of signals representative of a sound field into a second set of signals, for example to allow the restitution of the sound field by applying the signals of the second set to a reproduction system (audio headset or loudspeakers).
In this situation, the signals of the first set sometimes have a format that is not directly usable by the reproduction system. It is typically a scene-based format, such as the HOA (“High-Order Ambisonics”) format.
A solution of this type is proposed in the article “COMPASS: Coding and Multidirectional Parametrization of Ambisonic Sound Scenes”, A. Politis, S. Tervo and V. Pulkki in Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018.
Like other solutions mentioned in this article, this solution is based on the estimation of at least one dominant direction per frequency band by analysis of the signals of the first set.
However, this analysis has a significant computational cost and therefore requires a non-negligible processing time.
In this context, the present invention provides a method for converting a first set of signals representative of a sound field in a space into a second set of signals by means of an electronic device, characterized in that the electronic device stores, for each temporal frequency band of a plurality of temporal frequency bands of the sound field, at least one data item associated with a particular spatial direction, the set of these particular spatial directions associated with a data item for at least one temporal frequency band forming a mesh over the set of spatial directions, and in that the method comprises the following steps:
for each of the signals of the first set, determining values associated with said temporal frequency bands, respectively;
for each temporal frequency band, converting the values associated with the relevant temporal frequency band and determined for the different signals of the first set, into at least one value representative of a virtual sound source oriented along the spatial direction associated with the data item stored for the relevant temporal frequency band;
for each temporal frequency band, determining, on the basis of said at least one value representative of a virtual sound source and obtained at the conversion step for the relevant temporal frequency band, a plurality of values associated with the different signals of the second set, respectively;
constructing each signal of the second set on the basis of the values associated with this signal of the second set and obtained for the different temporal frequency bands, respectively.
The use of predefined directions, for which associated data items are stored in the electronic device, avoids the analysis processing tasks used in the prior solutions.
These directions nevertheless form a mesh (or grid) covering all the possible directions, so that waves present in the sound field are represented in the constructed signals (signals of the second set), regardless of their dominant direction.
The electronic device stores for example, for each temporal frequency band, data items associated with a number of particular spatial directions equal to the number of signals in the first set of signals, which allows obtaining an optimum processing. It may be provided that, at the conversion step related to a given temporal frequency band, the values associated with the given temporal frequency band and determined for the different signals of the first set are converted into a plurality of values representative of virtual sound sources oriented along the respective spatial directions associated with the data items stored for the given temporal frequency band. Therefore, for each temporal frequency band, the input signals are converted into a plane wave representation along the different directions associated with the relevant frequency band.
The particular directions associated with the data items stored for a given temporal frequency band are for example distributed (potentially on a regular basis) among the set of spatial directions.
The number of signals in the second set is for example strictly higher than the number of signals in the first set. The conversion allows in this case an artificial increase of the spatial resolution of the sound scene represented.
Moreover, it may be provided that two directions associated with two data items stored for two respective adjacent frequency bands are neighbours in the mesh (or grid). This avoids performing very different processing tasks for neighbour frequency bands, which could create unwanted artefacts.
The set of said particular directions may include at least 50 particular directions, for example between 50 and 5000 particular directions.
The values associated with said temporal frequency bands, respectively, can be determined by time-frequency transformation on the basis of the signals of the first set. Each signal of the second set can itself be constructed by frequency-time transformation on the basis of the values associated with this signal of the second set and obtained for the different temporal frequency bands, respectively.
As described hereinafter, for each temporal frequency band, the conversion step can be carried out in practice by matrix multiplication of a vector comprising the values associated with the relevant temporal frequency band and determined for the different signals of the first set. The matrix used for this matrix multiplication as regards a given temporal frequency band can comprise the data stored for this given temporal frequency band and associated with the different particular directions allocated to this given temporal frequency band.
Moreover, for each temporal frequency band, the step of determining a plurality of values associated with the different signals of the second set, respectively, can be carried out by matrix multiplication of a vector comprising said at least one value representative of a virtual sound source and obtained at the conversion step for the relevant temporal frequency band. It is therefore possible to pass from a plane wave representation (by means of the values representative of sound sources) to a representation corresponding to the signals of the second set (output signals).
The method can also comprise preliminary steps of defining a plurality of spatial directions by an optimization process, allocating spatial directions of the plurality to said temporal frequency bands, and storing, for each temporal frequency band, said at least one data item associated with the spatial direction allocated to the relevant frequency band.
The invention further proposes an electronic device for converting a first set of signals representative of a sound field in a space into a second set of signals, characterized in that the electronic device comprises:
a storage unit adapted to store, for each temporal frequency band of a plurality of temporal frequency bands of the sound field, at least one data item associated with a particular spatial direction, so that the set of these particular spatial directions associated with a data item for at least one temporal frequency band forms a mesh over the set of spatial directions;
a transformation module adapted to determine, for each of the signals of the first set, values associated with said temporal frequency bands, respectively;
a decoding module adapted to convert, for each temporal frequency band, the values associated with the relevant temporal frequency band and determined for the different signals of the first set, into at least one value representative of a virtual sound source oriented along the spatial direction associated with the data item stored for the relevant temporal frequency band;
an encoding module adapted to determine, for each temporal frequency band, a plurality of values associated with the different signals of the second set, respectively, on the basis of said at least one value representative of a virtual sound source and obtained by the decoding module for the relevant temporal frequency band;
a construction module adapted to construct each signal of the second set on the basis of the values associated with this signal of the second set and obtained for the different temporal frequency bands, respectively.
Of course, the different features, alternatives and embodiments of the invention may be associated with each other according to various combinations, insofar as they are not mutually incompatible or exclusive.
Moreover, various other features of the invention will be apparent from the appended description made with reference to the drawings that illustrate non-limitative embodiments of the invention, and wherein:
This memory can moreover store the above-mentioned computer program instructions.
The input signals (or signals of the first set) are for example ambisonic signals of order L. The first set comprises in this case (L+1)^2 signals. The case of ambisonic input signals of order 1 (i.e. L=1) is described herein by way of illustration, the first set then comprising 4 signals.
The processing performed by the electronic device 2 over a given time interval is now described; this processing may be repeated for subsequent time intervals. In the following, bE(t) will be used to denote the vector formed by the values taken by the different signals of the first set, respectively, at different times t of the considered time interval. (In the case of ambisonic input signals of order L, each vector bE(t) is hence of dimension (L+1)^2, herein of dimension 4.) The number of successive times t at which the signals bE(t) are considered is for example between 100 and 1000 for each time interval. The values taken by the different signals (and hence the different elements of the vectors bE(t)) are for example complex values; as an alternative, these values could be real values.
Moreover, in the following, a plurality of temporal frequency bands of the sound field is considered. (The term “temporal frequency” is used in the present description to make it clear that these are not spatial frequencies, a notion that is also used in the present technical field.) In the example described herein, these temporal frequency bands are pairwise disjoint (or separated) and together cover the spectrum of the audible frequencies. The plurality of temporal frequency bands comprises for example between 100 and 1000 temporal frequency bands, here 256 temporal frequency bands. Each temporal frequency band has for example a width between 10 Hz and 500 Hz.
The electronic device 2 comprises a storage unit 4 adapted to store, for each temporal frequency band of this plurality of temporal frequency bands, at least one data item associated with a particular spatial direction Ωj (i.e. a particular direction of the space mentioned above).
In the example described herein, the storage unit 4 stores, for each temporal frequency band, data items associated with a number of particular spatial directions Ωj equal to the number of signals in the first set of signals (input signals), i.e. (L+1)^2 in the case of ambisonic input signals of order L. The directions so associated with a given temporal frequency band are denoted hereinafter Ω1(f), Ω2(f), . . . , Ω_{(L+1)^2}(f).
The data item associated with a particular spatial direction Ωj can be a data item defining this particular spatial direction, for example by means of an azimuth angle and/or an elevation angle.
The data item associated with a particular spatial direction Ωj can also be a data item making it possible to perform a calculation related to this particular direction.
In the example described herein, several coefficients Dk,i(f) (forming a row of a matrix D(f)) are for example associated with a particular direction Ωk(f); these coefficients make it possible to obtain the contribution of the different input signals, respectively, to a plane wave in the particular direction Ωk(f), as explained hereinafter.
Each particular direction Ωj is defined herein by an azimuth angle θ (x-axis in
The set of particular spatial directions Ωj associated with a data item stored for at least one temporal frequency band forms a mesh (or grid) over the set of spatial directions (i.e. a mesh or grid covering the set of possible directions in the space mentioned above). The set of particular directions Ωj comprises for example more than 50 particular directions.
As can be seen in
According to a possible implementation, for any azimuth value range having a width of 60° and any elevation value range having a width of 30°, the set of particular directions Ωj comprises at least 5 particular directions Ωj defined by an azimuth θ included in this azimuth value range and an elevation ε included in this elevation value range.
According to another possible implementation (potentially compatible with the previous one), for any elevation value range having a width of 30° and any particular direction Ωj of the set defined by an elevation ε included in this elevation value range and by a given azimuth θ, the set of particular directions comprises at least one particular direction Ωj, defined by an elevation ε′ included in this elevation value range and by an azimuth θ′ that is different from the given azimuth θ by less than 30° (i.e. |θ′-θ|<30°, where |x| is the absolute value of x).
According to another possible implementation (potentially compatible with the previous ones), for any azimuth value range having a width of 60° and any particular direction Ωj of the set defined by an azimuth θ included in this azimuth value range and by a given elevation ε, the set of particular directions comprises at least one other particular direction defined by an azimuth θ′ included in this azimuth value range and by an elevation ε′ that is different from the given elevation ε by less than 30° (i.e. |ε′-ε|<30°).
A method for defining and allocating these particular spatial directions Ωj to the different temporal frequency bands will be described hereinafter with reference to
The electronic device 2 comprises a reception module 6 adapted to receive data representative of the input signals (signals of the first set), here the vectors bE(t) respectively associated with the successive times of the considered time interval. This reception module 6 can be a communication module adapted to receive the data representative of the input signals coming from another electronic device. As an alternative, the reception module 6 can be a module for reading the data representative of the input signals from a memory (such as the already-mentioned memory of the electronic device 2).
The electronic device 2 comprises a configuration module 8 adapted to configure the other modules, as a function in particular of the input signals bE(t) (in particular, as a function of the format of the input signals bE(t)). For that purpose, the electronic device 2 can comprise a detection module 10 adapted to analyse the input signals bE(t) and to provide the configuration module 8 with information I indicative of the format of the input signals bE(t). This information I is for example the number of signals of which the input signals bE(t) are made.
As an alternative, the data representative of the input signals bE(t) (received by the reception module 6) can comprise metadata M indicative of the format of the input signals bE(t). It can be provided in this case that the reception module 6 transmits these metadata M to the configuration module 8, as shown in dotted-line in
The operation of the configuration module 8 is described in detail hereinafter with reference to
The electronic device 2 moreover comprises a transformation module 12 adapted to determine, for each of the input signals (signals of the first set), values associated with the different temporal frequency bands, respectively.
Using βi(t) to denote the values taken over time (on the considered interval) by each input signal (so that bE(t)=[β1(t), β2(t), . . . , β(L+1)
For a given signal of the first set, the values αi(f) associated with the different temporal frequency bands, respectively, are for example determined by time-frequency transformation (such as a short-time Fourier transform) on the basis of the values βi(t) taken over time (on the considered time interval) by this signal of the first set.
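As a rough illustration of this transformation, the sketch below computes the values αi(f) for one time interval with a plain real FFT. The use of NumPy, the function name `to_frequency_bands`, the choice of one FFT bin per temporal frequency band and the interval length of 510 samples (which yields 256 bins) are assumptions made for illustration only; the description does not prescribe a particular transform.

```python
import numpy as np

def to_frequency_bands(b_in: np.ndarray) -> np.ndarray:
    """Map input signals of shape (num_signals, num_samples) to values
    alpha of shape (num_bands, num_signals), one FFT bin per band
    (assumption: each temporal frequency band is a single FFT bin)."""
    # Real FFT along the time axis: num_samples // 2 + 1 bins per signal.
    spectrum = np.fft.rfft(b_in, axis=1)
    # Put the band index first so that alpha[f] is the vector alpha(f).
    return spectrum.T

rng = np.random.default_rng(0)
b_in = rng.standard_normal((4, 510))   # 4 signals (order L=1), 510 samples
alpha = to_frequency_bands(b_in)       # 256 bands, 4 values per band
```

With this layout, `alpha[f]` plays the role of the vector of values associated with band f for the different input signals.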
For each frequency band, α(f) is used in the following to denote the vector formed by the values αi(f) associated with the different input signals, respectively, for the relevant frequency band: α(f)=[α1(f), α2(f), . . . , α_{(L+1)^2}(f)]^T.
The electronic device 2 comprises a decoding module 14 adapted to convert, for each temporal frequency band, the values α1(f), α2(f), α(L+1)
δ(f) is used in the following to denote the vector formed (for a temporal frequency band) by these values δ1(f), δ2(f), . . . , δ_{(L+1)^2}(f):
δ(f)=[δ1(f), δ2(f), . . . , δ_{(L+1)^2}(f)]^T.
The decoding module 14 performs for example, for each temporal frequency band, the above-mentioned conversion by matrix multiplication of the vector α(f), which comprises, as already indicated, the values α1(f), α2(f), . . . , α_{(L+1)^2}(f).
For that purpose, the decoding module 14 uses for example a plurality of matrices D(f) associated with the different temporal frequency bands, respectively, and, for each temporal frequency band, multiplies the above-mentioned vector α(f) by the relevant matrix D(f) in order to obtain the values δ1(f), δ2(f), . . . , δ_{(L+1)^2}(f):
δ(f)=D(f)α(f).
The matrices D(f) are such that the values α1(f), α2(f), α(L+1)
Each matrix D(f) is hence formed of elements Dk,i that each represent the coefficient to be allocated to a value αi(f) (obtained for an input signal βi(t)) to determine its contribution to the plane wave emitted by the virtual sound source oriented along the direction Ωk(f). Indeed, the above matrix product means that we have:
δk(f)=Σi Dk,i(f) αi(f).
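The per-band matrix product can be sketched as a single batched operation. In the snippet below, the arrays `D` and `alpha` are random stand-ins (hypothetical data, not matrices from the description), and the use of NumPy's `einsum` is merely one convenient way of writing the product for all bands at once.

```python
import numpy as np

rng = np.random.default_rng(1)
num_bands, num_in = 256, 4

# Hypothetical stand-ins for the stored matrices D(f) and the values alpha(f).
D = rng.standard_normal((num_bands, num_in, num_in))
alpha = (rng.standard_normal((num_bands, num_in))
         + 1j * rng.standard_normal((num_bands, num_in)))

# delta[f, k] = sum_i D[f, k, i] * alpha[f, i],
# i.e. delta(f) = D(f) alpha(f) evaluated for every band f.
delta = np.einsum("fki,fi->fk", D, alpha)

# The same quantity written out for one band f and one direction index k.
f, k = 10, 2
delta_fk = sum(D[f, k, i] * alpha[f, i] for i in range(num_in))
```

Each row k of `D[f]` thus holds the coefficients that weight the input values to obtain the contribution to the plane wave in direction Ωk(f).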
In the example described herein, in which the storage unit 4 stores, for each temporal frequency band, data associated with a number of particular spatial directions Ωj equal to the number of signals in the first set of signals (input signals), each matrix D(f) is a square matrix, of dimension equal to the number of signals in the first set, here (L+1)^2.
In the case where the input signals are ambisonic, aE(Ωj) is used to denote the vector whose coefficients express the transfer function between a plane wave propagating from the direction Ωj and the different ambisonic signals of order L:
aE(Ωj)=[Y_0^0(Ωj), Y_1^{-1}(Ωj), . . . , Y_l^m(Ωj), . . . , Y_L^L(Ωj)]^T,
where Y_l^m(·) is the spherical harmonic function of order l and degree m.
For each temporal frequency band, the matrix D(f) can then be, in this case, defined by:
D(f)=pinv([aE(Ω1(f)), aE(Ω2(f)), . . . , aE(Ω_{(L+1)^2}(f))]),
where pinv(·) represents the Moore-Penrose pseudo-inverse.
In the case where the matrix D(f) is square as indicated hereinabove, it can then be written:
D(f)=[aE(Ω1(f)), aE(Ω2(f)), . . . , aE(Ω_{(L+1)^2}(f))]^{-1}.
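For order L=1, the construction of D(f) from four directions can be sketched as follows. The real spherical harmonic convention used here (ACN channel order, N3D normalisation), the helper names `steering_vector_order1` and `decoding_matrix`, and the choice of a regular tetrahedron as the group of directions are assumptions for illustration; the description does not fix a convention.

```python
import numpy as np

def steering_vector_order1(azimuth: float, elevation: float) -> np.ndarray:
    """First-order real spherical harmonics for one direction (radians).
    Assumed convention: ACN channel order, N3D normalisation."""
    x = np.cos(elevation) * np.cos(azimuth)
    y = np.cos(elevation) * np.sin(azimuth)
    z = np.sin(elevation)
    return np.array([1.0, np.sqrt(3) * y, np.sqrt(3) * z, np.sqrt(3) * x])

def decoding_matrix(directions) -> np.ndarray:
    """D(f) = pinv([a_E(Omega_1(f)), ..., a_E(Omega_4(f))])."""
    A = np.column_stack([steering_vector_order1(az, el) for az, el in directions])
    return np.linalg.pinv(A)

# Four directions forming a regular tetrahedron, as (azimuth, elevation).
el = np.arcsin(1 / np.sqrt(3))
tetra = [(np.pi / 4, el), (3 * np.pi / 4, -el),
         (-np.pi / 4, -el), (-3 * np.pi / 4, el)]

A = np.column_stack([steering_vector_order1(az, e) for az, e in tetra])
D = decoding_matrix(tetra)
```

Because the matrix of steering vectors is square and invertible for this group of directions, the pseudo-inverse coincides with the ordinary inverse.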
As can be seen in
The electronic device 2 comprises an encoding module 18 adapted to determine, for each temporal frequency band, a plurality of values λ1(f), λ2(f), . . . , λN(f) associated with the different signals of the second set (output signals), respectively, on the basis of the values δ1(f), δ2(f), . . . , δ_{(L+1)^2}(f).
As indicated hereinabove, N is used to denote the number of signals of the second set.
For example, when the output signals are ambisonic signals of order L′, we have: N=(L′+1)^2.
In the example described herein, the number N of signals in the second set is strictly higher than the number of signals (here equal to (L+1)^2) in the first set. This is in particular the case when the processing performed by the electronic device, described hereinafter with reference to
For example, when the input signals and the output signals are ambisonic signals, the order L′ of the output signals is strictly higher than the order L of the input signals.
In the example described herein, the encoding module 18 determines, for each temporal frequency band, the plurality of values λ1(f), λ2(f), . . . , λN(f) associated with the different signals of the second set, respectively, by matrix multiplication (by means of a matrix E(f)) of the vector δ(f) comprising the values δ1(f), δ2(f), . . . , δ_{(L+1)^2}(f).
Such a matrix E(f) hence has here a number of columns equal to the number of signals in the first set (here (L+1)^2) and a number of rows equal to the number N of signals in the second set.
In the case where the output signals are ambisonic signals, the encoding module 18 uses, for each frequency band, a matrix E(f) allowing the passage from a plane wave representation to an ambisonic representation, here of order L′:
E(f)=[as(Ω1(f)), as(Ω2(f)), . . . , as(Ω_{(L+1)^2}(f))],
with as(Ωj)=[Y_0^0(Ωj), Y_1^{-1}(Ωj), . . . , Y_l^m(Ωj), . . . , Y_{L′}^{L′}(Ωj)]^T,
where, as already indicated, Y_l^m(·) is the spherical harmonic function of order l and degree m.
By noting λ(f)=[λ1(f), λ2(f), . . . , λN(f)]^T, we then have: λ(f)=E(f)δ(f).
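The passage from order L=1 to order L′=2 for one frequency band can be sketched as below. The spherical harmonic convention (real harmonics, ACN order, N3D normalisation), the helper names and the tetrahedral group of directions are assumptions chosen for illustration. The example feeds in a plane wave arriving exactly from the first direction of the group, a case in which the result can be checked by hand.

```python
import numpy as np

def unit_vector(azimuth, elevation):
    """Cartesian unit vector for a direction given in radians."""
    return np.array([np.cos(elevation) * np.cos(azimuth),
                     np.cos(elevation) * np.sin(azimuth),
                     np.sin(elevation)])

def steering_vector(azimuth, elevation, order):
    """Real spherical harmonics up to `order` (1 or 2).
    Assumed convention: ACN channel order, N3D normalisation."""
    x, y, z = unit_vector(azimuth, elevation)
    sh = [1.0, np.sqrt(3) * y, np.sqrt(3) * z, np.sqrt(3) * x]
    if order == 2:
        sh += [np.sqrt(15) * x * y, np.sqrt(15) * y * z,
               np.sqrt(5) / 2 * (3 * z**2 - 1),
               np.sqrt(15) * x * z, np.sqrt(15) / 2 * (x**2 - y**2)]
    return np.array(sh)

# Group of 4 directions (regular tetrahedron) for the band considered.
el = np.arcsin(1 / np.sqrt(3))
directions = [(np.pi / 4, el), (3 * np.pi / 4, -el),
              (-np.pi / 4, -el), (-3 * np.pi / 4, el)]

A_in = np.column_stack([steering_vector(az, e, 1) for az, e in directions])
D = np.linalg.pinv(A_in)                                                  # 4 x 4
E = np.column_stack([steering_vector(az, e, 2) for az, e in directions])  # 9 x 4

alpha = steering_vector(*directions[0], 1)  # plane wave from the first direction
delta = D @ alpha                           # plane-wave representation
lam = E @ delta                             # order-2 values (N = 9)
```

In this particular case `delta` selects the first virtual source only, and `lam` equals the order-2 steering vector of that direction; for waves arriving between the stored directions the result is an approximation.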
As can be seen in
The electronic device 2 finally comprises a construction module 22 adapted to construct each signal σi(t) of the second set on the basis of the values λi(f) associated with this signal σi(t) of the second set and obtained for the different temporal frequency bands, respectively.
The construction module 22 constructs for example each signal σi(t) of the second set by frequency-time transformation (such as an inverse short-time Fourier transform) on the basis of the values λi(f) associated with this signal of the second set and obtained for the different temporal frequency bands, respectively.
N output signals (signals of the second set) are hence obtained; more precisely, for each output signal, a set of values σi(t) is obtained, forming this output signal for the different (successive) times t of the considered time interval. The values of the different output signals for each time t can be written in vector form: bs(t)=[σ1(t), σ2(t), . . . , σN(t)]^T.
The method of
This step E2 here makes it possible to determine the number of signals present in the first set of signals.
The method of
This step E2 can further comprise the configuration (here by the configuration module 8) of other elements of the electronic device 2, such as the transformation module 12 and/or the construction module 22. For example, the configuration module 8 configures the transformation module 12 and/or the construction module 22 as a function of the number of temporal frequency bands to be used (this number can be stored in a memory of the electronic device 2 and/or input by a user via a user interface—not shown—of the electronic device 2).
For example, during the configuration step E4, the configuration module 8 determines (as a function of the format determined at step E2) the matrices D(f) to be used, and configures the respective conversion units 16 by means of these matrices D(f).
The configuration module 8 determines for example the matrices D(f) to be used as a function of the number of signals present in the first set of signals.
According to a first possibility, as a function of the number of signals in the first set of signals (i.e. the number of input signals), the configuration module 8 reads a set of matrices D(f) stored (for example in the memory of the electronic device 2) in association with this number of signals in the first set of signals. As an alternative, the configuration module 8 could emit this number of signals in the first set of signals towards a remote server and receive as an answer the associated set of matrices D(f).
According to another possibility (for example implemented the first time the number of input signals determined at step E2 is met), the configuration module 8 carries out a method such as that described hereinafter in
Likewise, during the configuration step E4, the configuration module 8 can determine the matrices E(f) to be used (for example as a function of the format of the output signals, here the number of output signals, that can be stored and/or input by a user via the user interface of the electronic device 2), and configure the processing units 20 by means of these matrices E(f).
The configuration module 8 determines for example the matrices E(f) to be used as a function of the number of signals present in the second set of signals (output signals).
According to a first possibility, as a function of the number of signals in the second set of signals (i.e. the number of output signals), the configuration module 8 reads a set of matrices E(f) stored (for example, in the memory of the electronic device 2) in association with this number of signals in the second set of signals. As an alternative, the configuration module 8 could emit this number of signals in the second set of signals towards a remote server and receive as an answer the associated set of matrices E(f).
According to another possibility (for example implemented the first time the chosen number of output signals is met), the configuration module 8 runs a method such as that described hereinafter with reference to
The method of
This determination step E6 is herein carried out by the transformation module 12. As already indicated, the values αi(f) associated with said temporal frequency bands, respectively, can be determined by time-frequency transformation on the basis of the signals βi(t) of the first set.
The method of
This conversion step E8 is herein implemented by the decoding module 14, for example as already indicated, by performing the matrix products D(f)α(f) to obtain the different vectors δ(f)=[δ1(f), δ2(f), . . . , δ_{(L+1)^2}(f)]^T.
Precisely, for each temporal frequency band, one of the conversion units 16 performs a matrix product D(f)α(f) to obtain a vector δ(f) formed of the values δ1(f), δ2(f), . . . , δ_{(L+1)^2}(f).
The method of
Step E10 is herein implemented by the encoding module 18, for example as already indicated, by performing the matrix products E(f)δ(f) to obtain the different vectors λ(f)=[λ1(f), λ2(f), . . . , λN(f)]^T.
Precisely, for each temporal frequency band, one of the processing units 20 performs a matrix product E(f)δ(f) to obtain a vector λ(f) formed of the values λ1(f), λ2(f), . . . , λN(f) associated with the signals σ1(t), σ2(t), . . . , σN(t) of the second set, respectively.
In the example described herein, the different values λi(f) obtained for the different temporal frequency bands and associated with a same signal σi(t) of the second set form a representation of this signal σi(t) of the second set in the frequency domain.
The method of
Step E12 is herein implemented in the construction module 22.
As already indicated, each signal σi(t) of the second set can be constructed by frequency-time transformation on the basis of the values λi(f) associated with this signal σi(t) of the second set and obtained for the different temporal frequency bands, respectively.
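Steps E6 to E12 can be chained as in the following minimal sketch. It assumes one FFT bin per temporal frequency band, a single time interval without windowing or overlap, and randomly generated real matrices standing in for the stored D(f) and E(f); the function name `convert` and all array names are hypothetical.

```python
import numpy as np

def convert(b_in: np.ndarray, D: np.ndarray, E: np.ndarray) -> np.ndarray:
    """b_in: (num_in, num_samples); D: (num_bands, num_dirs, num_in);
    E: (num_bands, num_out, num_dirs). Returns (num_out, num_samples)."""
    num_samples = b_in.shape[1]
    alpha = np.fft.rfft(b_in, axis=1).T                 # step E6: values alpha(f)
    delta = np.einsum("fki,fi->fk", D, alpha)           # step E8: delta(f) = D(f) alpha(f)
    lam = np.einsum("fnk,fk->fn", E, delta)             # step E10: lambda(f) = E(f) delta(f)
    return np.fft.irfft(lam.T, n=num_samples, axis=1)   # step E12: output signals

rng = np.random.default_rng(0)
num_in, num_out, num_samples = 4, 9, 510
num_bands = num_samples // 2 + 1                        # 256 bands

b_in = rng.standard_normal((num_in, num_samples))
# Hypothetical stand-ins for the stored matrices (kept real and well conditioned).
D = np.eye(num_in) + 0.1 * rng.standard_normal((num_bands, num_in, num_in))
E = rng.standard_normal((num_bands, num_out, num_in))

b_out = convert(b_in, D, E)                             # 9 output signals

# Sanity check of the chain: with E(f) = D(f)^-1 the input is recovered.
b_back = convert(b_in, D, np.linalg.inv(D))
```

A practical implementation would more likely process overlapping windowed frames; that refinement is left out here to keep the correspondence with the four steps visible.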
This method starts with a step E20 of defining a plurality of spatial directions by an optimization process, here a so-called “Thomson problem” optimization process.
The plurality of so-obtained spatial directions forms a mesh (or grid) over the set of spatial directions, as already indicated.
This optimization process is described in the case of ambisonic input signals of order 1: in this case, as already indicated, 4 particular directions Ωj are used for each temporal frequency band.
If F is used to denote the number of temporal frequency bands used (as already indicated, F is for example between 100 and 1000, here F=256), here F groups of 4 particular directions Ωj are provided (the number of particular directions per group is equal to the number of input signals, here 4 input signals for ambisonic signals of order L=1 as already indicated).
In each group, the particular directions are distributed in space and thus form, in the example described herein, a tetrahedron (for example a regular tetrahedron).
Rotations can be defined, which each allow passing from a tetrahedron defined for a group of particular directions to another tetrahedron, defined for another group of particular directions.
Each of the 4F particular directions Ωj is modelled as a charged particle located at the surface of a sphere, and moving rigidly together with the other directions belonging to the same group, i.e. to the same tetrahedron. Two charged particles exert on each other a repulsive force similar to the electrostatic interaction.
A cost function corresponding to the total potential energy of the so-modelled system is then defined.
By successive iterations, the above-mentioned rotations are changed so as to reach a minimum of potential energy (Thomson problem). Since the potential energy increases as the particles come closer to each other, this optimization leads to an optimum distribution of the directions on the sphere.
F tetrahedrons are hence obtained, arranged in such a way as to provide a regular sampling (and hence a mesh or grid) of all the possible spatial directions.
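The optimization described above can be illustrated with a deliberately simple sketch. The description only states that the rotations are changed by successive iterations; the random local search used below, the 1/distance energy, the small number of groups (8 instead of F=256) and all function names are assumptions made to keep the example short, not the procedure of the description.

```python
import numpy as np

# Base regular tetrahedron (unit vectors), rotated rigidly for each group.
BASE = np.array([[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]) / np.sqrt(3)

def rotation_matrix(rotvec: np.ndarray) -> np.ndarray:
    """Rodrigues' formula: rotation by |rotvec| radians about rotvec."""
    angle = np.linalg.norm(rotvec)
    if angle < 1e-12:
        return np.eye(3)
    kx, ky, kz = rotvec / angle
    K = np.array([[0, -kz, ky], [kz, 0, -kx], [-ky, kx, 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def potential_energy(groups: np.ndarray) -> float:
    """Coulomb-like energy: sum of 1/distance over all pairs of particles."""
    pts = groups.reshape(-1, 3)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    upper = np.triu_indices(len(pts), k=1)
    return float(np.sum(1.0 / dist[upper]))

def optimise(num_groups=8, iterations=2000, step=0.2, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial orientation for each tetrahedron.
    groups = np.stack([BASE @ rotation_matrix(rng.normal(size=3)).T
                       for _ in range(num_groups)])
    initial = energy = potential_energy(groups)
    for _ in range(iterations):
        g = rng.integers(num_groups)
        trial = groups.copy()
        # Small random rotation of one whole group (its 4 directions move together).
        trial[g] = groups[g] @ rotation_matrix(step * rng.normal(size=3)).T
        trial_energy = potential_energy(trial)
        if trial_energy < energy:          # keep only energy-lowering moves
            groups, energy = trial, trial_energy
    return groups, initial, energy

groups, e_start, e_end = optimise()
```

Only moves that lower the energy are accepted, so the final energy never exceeds the initial one, and each group remains a regular tetrahedron because it is only ever rotated as a whole. A gradient-based optimizer over the rotation parameters would be a natural alternative.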
The method of
For that purpose, any one of the tetrahedrons (i.e. one of the particular direction groups) may be randomly allocated to the first temporal frequency band (the temporal frequency bands being for example ordered by increasing central frequency).
The tetrahedron allocated to the second temporal frequency band is that which corresponds to the smallest rotation with respect to the tetrahedron allocated to the first temporal frequency band. The other tetrahedrons are thus allocated successively to the different temporal frequency bands in such a way that the angular distance between two successive direction groups is as small as possible.
Two particular directions allocated to two adjacent frequency bands are hence neighbours in the mesh, which avoids abrupt jumps in the processing performed for two neighbouring frequency bands.
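The successive allocation can be sketched as a greedy nearest-neighbour ordering. The measure used below to compare two groups (mean angle from each direction of one group to the nearest direction of the other) is an assumed proxy for the "smallest rotation" criterion, and the function names and the toy set of five groups are hypothetical.

```python
import numpy as np

BASE = np.array([[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]) / np.sqrt(3)

def group_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Assumed proxy for the rotation between two groups: mean angle from
    each direction of `a` to its nearest direction in `b`."""
    cosines = np.clip(a @ b.T, -1.0, 1.0)
    return float(np.mean(np.arccos(cosines.max(axis=1))))

def allocate_to_bands(groups, first=0):
    """Greedy ordering: each band receives the not-yet-used group closest
    to the group allocated to the previous band."""
    remaining = set(range(len(groups)))
    remaining.remove(first)
    order = [first]
    while remaining:
        prev = groups[order[-1]]
        nxt = min(remaining, key=lambda g: group_distance(prev, groups[g]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def rotate_about_z(points: np.ndarray, degrees: float) -> np.ndarray:
    c, s = np.cos(np.radians(degrees)), np.sin(np.radians(degrees))
    Rz = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    return points @ Rz.T

# Toy example: five tetrahedra rotated about the vertical axis, stored unsorted.
angles = [0, 40, 10, 30, 20]
groups = [rotate_about_z(BASE, a) for a in angles]
order = allocate_to_bands(groups)   # band index -> group index
```

In this toy example the greedy ordering visits the groups by increasing rotation angle, so that consecutive bands receive groups 10° apart.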
A group of particular directions Ω1(f), Ω2(f), . . . , Ω(L+1)
In the example described herein, for each temporal frequency band, the step E24 comprises constructing and storing the matrix D(f) and/or the matrix E(f) as indicated hereinabove, on the basis of the particular directions Ω1(f), Ω2(f), . . . , Ω_{(L+1)^2}(f).
The just-described invention can be applied in different situations in which it is desired to convert a first set of signals having a first format into a second set of signals having a second format.
For example, when it is desired to reproduce ambisonic signals of relatively low order L (for example, of order L=1) by means of a significant number of loudspeakers (for example, by means of 10 loudspeakers or more), it is desirable to convert the ambisonic signals of order L into ambisonic signals of order L′, strictly higher than L, and to reproduce the converted signals on the loudspeakers in such a way as to avoid the production of artefacts unpleasant to the ear.
According to another example schematically shown in
For example, in order to reproduce sounds represented that way, it is possible, in this case, to convert the ambisonic signals bE(t) of order L into ambisonic signals bs(t) of order L′ thanks to the electronic device 2 and/or to the method of
Moreover, although the above examples use ambisonic input and output signals, it is alternatively possible to use input or output signals of another type, for example multi-channel signals.
In this case, the different signals, each corresponding to a given loudspeaker position, are considered as a scene-based format in which the space-function base that is used consists of so-called “panning” functions. A panning function expresses the gains applied to the different loudspeakers to give the impression to a listener that a sound source is located in a given direction. The VBAP (“Vector Base Amplitude Panning”) method, for example, makes it possible to calculate panning functions for a given set of loudspeakers. For example, reference can be made to the article “Virtual Sound Source Positioning Using Vector Base Amplitude Panning”, by V. Pulkki, in Journal of the Audio Engineering Society, 45(6), pp. 456-466, June 1997.
The above-mentioned matrices D(f) and E(f) can in this case be constructed by concatenating the vectors consisting of the panning gains for the different plane wave directions.
Number | Date | Country | Kind |
---|---|---|---|
2006878 | Jun 2020 | FR | national |