Embodiments of the present invention refer to an apparatus for transforming an audio stream with more than one channel into another representation. Further embodiments refer to a corresponding method and to a corresponding computer program.
Further embodiments refer to an apparatus for transforming an audio stream in a directional audio coding system. Further embodiments refer to a corresponding method and computer program.
Additional embodiments refer to an encoder comprising one of the above-defined apparatuses and to a corresponding method for encoding, as well as to a decoder comprising one of the above-discussed apparatuses and a corresponding method for decoding. Embodiments refer in general to the technical field of compression of audio channels by a prediction based on acoustic model parameters. Relevant conventional technology for the embodiments mainly comes from two previously known audio coding schemes: directional audio coding (DirAC) and metadata-assisted EVS coding, also referred to as spatial audio reconstruction (SPAR).
Both concepts will be summarized briefly:
DirAC is a parametric technique for the encoding and reproduction of spatial sound fields [1, 2, 3, 4]. It is justified by the psychoacoustical argument that human listeners can only process two cues per critical band at a time [4]: the direction of arrival (DOA) of one sound source and the inter-aural coherence [4]. Consequently, it is sufficient to reproduce two streams per critical band: a directional one comprising the coherent channel signals from one point source from a given direction and a diffuse one comprising incoherent diffuse signals [4].
The analysis stage on the encoder side is depicted in the diagram of
The input is provided in the form of four B-format channel signals and analyzed with a filter bank (FB). For each band of this FB, the DOA of the point source and the diffuseness are extracted [3, 4]. These two parameters in each band, the DOA represented by the azimuth and elevation angles and the diffuseness, constitute the DirAC metadata [3, 4], whose efficient compression has been treated in Refs. [3, 4, 5].
As it is shown by
An extension of this system to higher-order Ambisonics (HOA) together with multi-channel (MC) or object based audio has been presented by Fuchs et al. [5]. There, the authors propose to perform additional processing of the B-format input signal in order to select suitable downmix channels or find suitable beams of virtual microphones to capture the transport streams as depicted in
Hence, the stream of data transmitted from the encoder to the decoder must contain both the EVS bitstreams and the DirAC metadata streams, and care must be taken to find the optimal distribution of the available bits between the metadata and the individual EVS-coded channels of the downmix.
An alternative approach to the encoding and reproduction of spatial audio recordings that has previously been proposed in standards organizations is a metadata-assisted EVS coder [7]. It is also referred to as spatial audio reconstruction (SPAR) [7].
The downmix is performed in such a way that an energy compaction of the FOA signal is achieved (see
When spatial sound scenes are to be reproduced on headphones, it is necessary to track the movement of the listener's head and rotate the sound scene accordingly in order to produce a consistent and realistic experience. To this end, a widely-adopted technique is to rotate the scene in the Ambisonics domain by pre-multiplication of a rotation matrix to the vector of channel signals [8, 9, 10]. This rotation matrix is typically computed by the method of Ref. [11]. An alternative approach is to render the output signal to virtual loudspeakers and perform the rotation by amplitude panning [9, 6].
All of the above-described solutions have drawbacks as will be discussed below. A remedy for these drawbacks is part of the invention.
In both of the systems referenced above, some of the key challenges are to (i) select the most well-suited channels of the input signal for the transport via EVS, (ii) find a representation of these channels that reduces redundancies between them, and (iii) distribute the available bitrate between the metadata and the individual EVS encoded audio streams such that the best possible perceptual quality is attained. As these decisions are highly dependent on the signal characteristics, signal-adaptive processing must be implemented.
An embodiment may have an apparatus for transforming an audio stream with more than one channel into another representation, apparatus being on an encoder side and having: means for deriving one or more parameters describing an acoustic or psychoacoustic model of the audio stream on the encoder side, wherein the means for deriving are configured to calculate prediction coefficients as the one or more parameters, wherein the prediction coefficients are calculated based on a covariance matrix by the means for deriving; means for transforming the audio stream in a signal-adaptive way dependent on the one or more parameters; and wherein the one or more parameters have at least an information on at least one direction of arrival (DOA), wherein the means for transforming are configured to perform a downmixing or other transforming of the audio stream on the encoder side.
Another embodiment may have an apparatus for transforming an audio stream with more than one channel into another representation, the apparatus being on a decoder side and having: means for receiving one or more parameters describing an audio scene with an acoustic or psychoacoustic model on the decoder side; means for transforming the audio stream in a signal-adaptive way dependent on the one or more parameters; and wherein the one or more parameters have at least an information on at least one direction of arrival (DOA), wherein the means for transforming are configured to perform upmix or other transform generation of the audio stream on the decoder side.
Another embodiment may have an apparatus for transforming an audio stream in a directional audio coding system, the apparatus being on an encoder side and having: means for deriving one or more acoustic model parameters of a model of the audio stream, wherein the one or more acoustic model parameters are transmitted to enable restoring all channels of the audio stream and have at least an information on direction of arrival (DoA), means for transforming the audio stream in a signal-adaptive way dependent on the one or more acoustic model parameters; where all or a subset of the channels of the audio stream are transformed; wherein the means for transforming are configured to perform a downmixing or other transforming of the audio stream on the encoder side.
Another embodiment may have an apparatus for transforming an audio stream in a directional audio coding system, apparatus being on a decoder side and having: means for receiving one or more acoustic model parameters of a model of the audio stream, wherein the one or more acoustic model parameters are received to restore all channels of the audio stream and have at least an information on direction of arrival (DoA), means for transforming the audio stream in a signal-adaptive way dependent on the one or more acoustic model parameters; where all or a subset of the channels of the audio stream are transformed; wherein the means for transforming are configured to perform upmix or other transform generation of the audio stream on the decoder side.
Another embodiment may have an encoder having an apparatus for transforming an audio stream with more than one channel into another representation, apparatus being on an encoder side and having: means for deriving one or more parameters describing an acoustic or psychoacoustic model of the audio stream on the encoder side, wherein the means for deriving are configured to calculate prediction coefficients as the one or more parameters, wherein the prediction coefficients are calculated based on a covariance matrix by the means for deriving; means for transforming the audio stream in a signal-adaptive way dependent on the one or more parameters; and wherein the one or more parameters have at least an information on at least one direction of arrival (DOA), wherein the means for transforming are configured to perform a downmixing or other transforming of the audio stream on the encoder side.
Another embodiment may have a decoder having an apparatus for transforming an audio stream with more than one channel into another representation, the apparatus being on a decoder side and having: means for receiving one or more parameters describing an audio scene with an acoustic or psychoacoustic model on the decoder side; means for transforming the audio stream in a signal-adaptive way dependent on the one or more parameters; and wherein the one or more parameters have at least an information on at least one direction of arrival (DOA), wherein the means for transforming are configured to perform upmix or other transform generation of the audio stream on the decoder side.
According to another embodiment, a system may have an inventive encoder and an inventive decoder, wherein the encoder is configured to calculate a prediction matrix and/or a downmix or other transform and wherein the decoder is configured to calculate an upmix or other transform matrix from estimated parameters or the one or more parameters of the acoustic model independently of each other.
Another embodiment may have a method for transforming an audio stream with more than one channel into another representation, performed on an encoder side and having the following steps: deriving the one or more parameters describing an acoustic or psychoacoustic model of an audio stream from the audio stream, wherein deriving has calculating prediction coefficients as the one or more parameters, wherein the prediction coefficients are calculated based on a covariance matrix in the deriving step and wherein the one or more parameters have at least an information on direction of arrival (DOA); and transforming the audio stream in a signal-adaptive way dependent on the one or more parameters; wherein transforming has a downmixing or other transforming of the audio stream on the encoder side.
Another embodiment may have a method for transforming an audio stream with more than one channel into another representation, performed on a decoder side and having the following steps: receiving one or more parameters describing an audio scene with an acoustic or psychoacoustic model on the decoder side, wherein the one or more parameters have at least an information on direction of arrival (DOA); and transforming the audio stream in a signal-adaptive way dependent on the one or more parameters; wherein transforming has upmixing or other transforming of the audio stream on the decoder side.
Another embodiment may have a method for transforming an audio stream in a directional audio coding system, performed on an encoder side and having the steps of: deriving one or more acoustic model parameters of a model of the audio stream parametrized by direction of arrival (DOA) and diffuseness or energy-ratio parameters, wherein said acoustic model parameters are transmitted to restore all channels of an input audio stream and have at least an information on DOA, and wherein all or a subset of the channels of the audio stream are transformed; and transforming the audio stream in a signal-adaptive way dependent on the one or more acoustic model parameters, wherein transforming has a downmixing or other transforming of the audio stream on the encoder side.
Another embodiment may have a method for transforming an audio stream in a directional audio coding system, performed on a decoder side and having the steps of: receiving one or more acoustic model parameters of a model of the audio stream parametrized by direction of arrival (DOA) and diffuseness or energy-ratio parameters, wherein said acoustic model parameters are received to restore all channels of an input audio stream and have at least an information on DOA, and wherein all or a subset of the channels of the audio stream are transformed; and transforming the audio stream in a signal-adaptive way dependent on the one or more acoustic model parameters, wherein transforming has upmixing or other transforming of the audio stream on the decoder side.
Another embodiment may have a non-transitory digital storage medium having stored thereon a computer program for performing any of the above methods according to the invention, when the computer program is run by a computer.
An embodiment of the present invention provides an apparatus for transforming an audio stream with more than one channel into another representation. The apparatus comprises means for transforming and means for deriving and/or means for receiving. The means for transforming are configured to transform the audio stream in a signal-adaptive way dependent on one or more parameters. The means for deriving are configured to derive the one or more parameters describing an acoustic or psychoacoustic model of the audio stream (signal). Note that at the decoder side, the prediction parameters can be received (cf. means for receiving). Said parameters comprise at least an information on DOA (direction of arrival), where the one or more parameters may be derived from the audio stream, e.g. at the encoder side (or just received, e.g. at the decoder side).
According to further embodiments, the means for deriving are configured to calculate prediction coefficients, e.g. based on a covariance matrix or on parameters of an acoustic signal.
According to embodiments, the means for deriving are configured to calculate a covariance matrix from the model/acoustic model or, in general, based on the DOA and an additional diffuseness factor or an energy ratio.
It should be noted that according to embodiments the one or more parameters comprise prediction parameters.
Embodiments of the present invention are based on the principle that prediction coefficients on both the encoder and decoder side can be approximated from a model like an acoustic model or acoustic model parameters. In directional audio coding systems, these parameters are present at the decoder side and, consequently, no additional metadata bits are transmitted for the prediction. Thus, the amount of additional metadata needed to enable the reconstruction of the downmix channels at the decoder side is strongly reduced as compared to the naïve implementation of prediction. Expressed in other words, this means that the combination of deriving one or more parameters describing an acoustic model and transforming the audio stream in a signal adaptive way provides an approach to compress downmix channels in directional audio coding systems or other applications via the application of inter-channel prediction based on acoustic models of the input signal.
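To illustrate this principle, consider the following minimal Python sketch (all helper names are hypothetical; SN3D-normalized FOA with $Y_{0,0}=1$ and real first-order spherical harmonics are assumptions of the sketch, not requirements stated in the text). Encoder and decoder evaluate the same function on the same transmitted DirAC parameters, so no prediction metadata needs to be sent:

```python
import numpy as np

def first_order_sh(azimuth: float, elevation: float) -> np.ndarray:
    """Real first-order spherical harmonics (SN3D convention assumed) for the
    x, y and z channels at the DOA angles; w corresponds to Y_00 = 1."""
    return np.array([
        np.cos(azimuth) * np.cos(elevation),  # x ~ Y_{1,1}
        np.sin(azimuth) * np.cos(elevation),  # y ~ Y_{1,-1}
        np.sin(elevation),                    # z ~ Y_{1,0}
    ])

def prediction_coefficients(azimuth, elevation, diffuseness):
    """Predict x/y/z from w using only the model parameters:
    C_wi = (1 - psi) * E * Y_i(DOA) and C_ww = (1 - psi) * E + psi * E = E,
    so the total energy E cancels in the ratio C_wi / C_ww."""
    c_wi = (1.0 - diffuseness) * first_order_sh(azimuth, elevation)
    c_ww = 1.0  # normalized w-channel variance
    return c_wi / c_ww

# Both sides compute the coefficients from the (quantized) DirAC metadata:
p_encoder = prediction_coefficients(np.pi / 4, 0.1, diffuseness=0.3)
p_decoder = prediction_coefficients(np.pi / 4, 0.1, diffuseness=0.3)
assert np.allclose(p_encoder, p_decoder)  # identical, nothing extra to send
```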
In the above-discussed embodiments, mainly a DOA parameter has been discussed. According to further embodiments, additionally a diffuseness information/diffuseness factor may be used. Thus, said parameters used by the means for transforming and derived by the means for deriving may comprise an information on a diffuseness factor or on one or more DOAs or on energy ratios. For example, the one or more parameters are derived from the audio stream itself.
Regarding the prediction coefficients, it should be mentioned that according to further embodiments, the prediction coefficients are calculated based on the real or complex spherical harmonics $Y_{l,m}$ with degree $l$ and index $m$, evaluated at the angles corresponding to a DOA.
Regarding the covariance matrix, it should be noted that according to further embodiments, the means for deriving are configured to calculate a covariance matrix based on an information about diffuseness, spherical harmonics and a time-dependent scalar-valued signal. For example, the calculation may be based on the following formula (cf. the derivation below):

$$C_{w,x} = \int \mathrm{d}t\; s^2(t)\, Y_{0,0}\, Y_{1,1}(\theta_\mathrm{D}),$$

where $Y_{l,m}$ is a spherical harmonic with the degree and index $l$ and $m$ and where $s(t)$ is a time-dependent scalar-valued signal.
According to further embodiments, the calculation may be based on a signal energy, for example by using the following formula, in which $\psi$ describes the diffuseness and $E$ the signal energy:

$$C_{w,x} = (1-\psi)\, E\, Y_{0,0}\, Y_{1,1}(\theta_\mathrm{D}).$$
Alternatively or additionally, the following formula may be used:

$$C_{w,w} = (1-\psi)\, E\, Y_{0,0}^2 + \psi\, E,$$

where $E$ is again the signal energy.
Alternatively or additionally, the following formula may be used (and analogously for the y and z channels):

$$C_{x,x} = (1-\psi)\, E\, Y_{1,1}^2(\theta_\mathrm{D}) + \tfrac{1}{3}\,\psi\, E.$$
According to embodiments, the energy E is directly calculated from the audio stream (signal). Alternatively or additionally, the energy E is estimated from the model of the signal.
According to further embodiments, the audio stream is preprocessed by a parameter estimator, or by a parameter estimator comprising a metadata encoder or metadata decoder, and/or by an analysis filterbank.
According to further embodiments, the input audio stream is a higher-order Ambisonics signal and the parameter estimation is based on all or a subset of these input channels. For example, this subset can comprise the channels of the first order. Alternatively it can consist of the planar channels of any order or any other selection of channels.
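As an illustrative sketch of such channel subsets (ACN channel ordering with index $l^2+l+m$ is an assumption here, as is identifying the planar subset with the sectoral channels $|m| = l$; the helper names are not from the source):

```python
import numpy as np

def acn_index(l: int, m: int) -> int:
    # ACN (Ambisonics Channel Number) convention: index = l^2 + l + m.
    return l * l + l + m

def first_order_subset(hoa: np.ndarray) -> np.ndarray:
    # Channels of degree l <= 1, i.e. the FOA part of the HOA signal.
    return hoa[:4]

def planar_subset(hoa: np.ndarray, order: int) -> np.ndarray:
    # Horizontal-plane (sectoral) channels of any degree: those with |m| = l.
    idx = sorted({acn_index(l, m) for l in range(order + 1) for m in (-l, l)})
    return hoa[idx]

hoa = np.random.randn(16, 960)         # 3rd-order HOA: (3+1)^2 = 16 channels
foa = first_order_subset(hoa)          # 4 channels for parameter estimation
planar = planar_subset(hoa, order=3)   # 7 channels: l = 0..3 with m = +/-l
```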
As discussed above, embodiments provide an encoder comprising the above-discussed apparatus. Further embodiments provide a decoder comprising the above-discussed apparatus. On the encoder side, the apparatus may comprise means for transforming which are configured to perform a mixing, e.g. a downmixing of the audio stream. On the decoder side, the means for transforming are configured to perform a mixing, e.g. an upmixing or an upmix generation of the audio streams.
The above-discussed apparatus may also be used for transforming an audio stream in a directional audio coding system. According to embodiments, the apparatus comprises means for transforming and means for deriving. The means for transforming are configured to transform the audio stream in a signal-adaptive way dependent on one or more acoustic model parameters. The means for deriving are configured to derive the one or more acoustic model parameters of a model of the audio stream (parametrized by the DOA and/or the diffuseness and/or energy-ratio parameter). Said acoustic model parameters are transmitted to restore all channels of the audio stream and comprise at least an information on DOA. The transmitted audio streams are derived by transforming all or a subset of the channels of the audio stream. According to embodiments, the transmitted parameters are quantized prior to transmission. According to embodiments, the parameters are dequantized after transmission. According to further embodiments, the parameters may be smoothed over time. According to further embodiments the quantized parameters may be compressed by means of entropy coding.
Regarding the transform, it should be noted that according to further embodiments, the transform is computed such that correlations between transport channels are reduced. According to embodiments, the inter-channel covariance matrix of the input audio stream is estimated from a model of the signal of the audio stream. For example, a transform matrix is derived from a covariance matrix of a model of the audio stream signal. The covariance matrix may be calculated using different methods for different frequency bands. Regarding the transformation performed by the means for transforming, it should be noted that according to an embodiment at least one of the transform methods is a multiplication of the vector of the audio channels by a constant matrix. According to another embodiment, the transform methods use prediction based on the inter-channel covariance matrix of an audio signal vector. According to another embodiment, at least one of the transform methods uses prediction based on the inter-channel covariance matrix of the model signal described by DOAs and/or diffuseness factors and/or energy ratios.
According to another embodiment, and mainly applicable for the apparatus for transforming an audio stream in a directional audio coding system, the scene encoded by the audio stream (signal) is rotatable in such a way that a rotation is applied to a subset of the channels in the Ambisonics domain and to the DOA parameters in the metadata domain, so that the remaining channels are reconstructed including the scene rotation (cf. the discussion of head tracking below).
As discussed above, the apparatus may be applied to an encoder and a decoder. Another embodiment provides a system comprising an encoder and a decoder. The encoder and the decoder are configured to calculate a prediction matrix and/or a downmix and/or upmix matrix from the estimated or transmitted parameters of the acoustic model independently of each other.
According to further embodiments, the above-discussed approach may be implemented by a method. Another embodiment provides a method for transforming an audio stream with more than one channel into another representation, comprising the following steps: deriving (or, at the decoder side, receiving) one or more parameters describing an acoustic or psychoacoustic model of the audio stream, and transforming the audio stream in a signal-adaptive way dependent on the one or more parameters.
Another embodiment provides a method for transforming an audio stream in a directional audio coding system, comprising the steps of deriving (or receiving) one or more acoustic model parameters of a model of the audio stream and transforming the audio stream in a signal-adaptive way dependent on the one or more acoustic model parameters.
According to further embodiments, the method may be computer-implemented. Thus, an embodiment provides a computer program for performing, when running on a computer, the method according to the above disclosure.
Embodiments of the present invention will be detailed subsequently referring to the appended drawings.
Below, embodiments of the present invention will subsequently be discussed referring to the enclosed figures, wherein identical reference numerals are provided to objects that have an identical or similar function, so that the description thereof is interchangeable or mutually applicable.
Before discussing embodiments of the present invention, some features of the invention will be discussed separately.
For the compression of the transport channels it is known that the optimal decorrelation and therefore energy compaction would be obtained by the Karhunen-Loève transform (KLT) (see e.g. [12]). The KLT transforms the signal vector to a basis of the eigenvectors of the inter-channel covariance matrix. For a B-format input signal of the form

$$\mathbf{s}(t) = \big(w(t),\, x(t),\, y(t),\, z(t)\big)^\top,$$

the elements of the inter-channel covariance matrix $\mathbf{C}$ are given by

$$C_{w,x} = \int \mathrm{d}t\; w(t)\, x(t)$$
and analogously for the other channel combinations. With the KLT, the covariance matrix $\mathbf{C}$ is diagonalized and all inter-channel correlations are fully removed, therefore yielding the least redundant representation of the signal. There are, however, two difficulties which prevent the implementation of the KLT in most real-world systems: the computational complexity of the needed eigenvector calculations and the metadata bit usage for the transmission of the resulting transform matrices are often considered too high.
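For concreteness, a minimal numpy sketch of the KLT on one frame of B-format channels (frame length and all names are illustrative assumptions):

```python
import numpy as np

def klt(frame: np.ndarray):
    """Karhunen-Loeve transform of a (channels x samples) frame. Returns the
    decorrelated channels plus the eigenvector basis, which is exactly the
    per-frame metadata that makes the full KLT expensive to transmit."""
    cov = frame @ frame.T / frame.shape[1]   # inter-channel covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # the costly eigendecomposition
    basis = eigvecs[:, np.argsort(eigvals)[::-1]]  # strongest component first
    return basis.T @ frame, basis

frame = np.random.randn(4, 960)              # one frame of w, x, y, z
decorrelated, basis = klt(frame)
residual_cov = decorrelated @ decorrelated.T / frame.shape[1]
# residual_cov is (numerically) diagonal: all correlations removed
```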
As a compromise, one can remove only the correlations of the x, y, and z channels with the w channel via the prediction matrix

$$\mathbf{P} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -P_x & 1 & 0 & 0 \\ -P_y & 0 & 1 & 0 \\ -P_z & 0 & 0 & 1 \end{pmatrix}, \qquad P_{x/y/z} = \frac{C_{w,x/y/z}}{C_{w,w}}.$$
In this approach, no matrix diagonalization is needed and only the three prediction coefficients $P_{x/y/z}$ are to be transmitted. Depending on the frame length and the signal characteristics, the amount of metadata for this approach can still be considerable; according to our experiments, it is of the order of 10 kbps. This is especially noteworthy as these metadata would be transmitted along with those needed for the DirAC system itself, raising the overall bit requirement.
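For comparison, a sketch of the prediction approach (the residual convention chosen here is an assumption):

```python
import numpy as np

def prediction_matrix(cov: np.ndarray) -> np.ndarray:
    """Build the 4x4 matrix that removes the correlation of x, y, z with w;
    only the three coefficients P_x, P_y, P_z would need to be transmitted."""
    p = cov[0, 1:] / cov[0, 0]   # P_i = C_{w,i} / C_{w,w}
    mat = np.eye(4)
    mat[1:, 0] = -p              # residual_i(t) = s_i(t) - P_i * w(t)
    return mat

frame = np.random.randn(4, 960)
cov = frame @ frame.T / frame.shape[1]
residuals = prediction_matrix(cov) @ frame  # w unchanged, x/y/z decorrelated from w
```

No eigendecomposition is required, at the price of leaving the correlations among x, y and z untouched.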
This naturally invites the question as to how these two metadata streams are connected. The invention described in the following clarifies the connection between the prediction for the purpose of the compression of the DirAC or SPAR transport channels and the model parameters transmitted in DirAC to allow for the decoder-side reconstruction of the full HOA input signal. We provide a path to the re-use of metadata already transmitted as part of the DirAC system for the compression of the transport channels. Our method can therefore improve the perceptual quality of DirAC as compared to a passive downmix by a static selection of transport channels while avoiding additional metadata transmission.
Both of the approaches to scene rotation as discussed above have significant drawbacks. For the former, the computational complexity is very high due to the matrix multiplication for every sample of the signal. For the latter, the quality is less than optimal [9]. It is therefore desirable to reduce the complexity of the former method without compromising on the quality too much. Our invention provides a path to applying the rotation in a lower-dimensional space. Within the framework of the two aforementioned systems for parametric coding of spatial audio, this can be realized by combining the rotation of a subset of the channels in the Ambisonics domain with a suitable transform in the metadata domain.
Above it has been established that a compression of transport channels can be achieved by reducing correlations via transforms derived from the covariance matrix. The discussion below will show how such transforms can be obtained independently on both the encoder and the decoder side from the readily available DirAC model parameters or general acoustic model parameters.
According to embodiments a covariance matrix may be determined from the model signal.
We consider one of the parameter bands of directional audio coding (cf. above). For brevity, we omit the frequency-band index in the notation. First we focus on the non-diffuse, directional part of the signal. Let

$$\mathbf{r}_\mathrm{DOA}(\theta_\mathrm{D})$$

be the direction of arrival (DOA) of the sound from a point source on the unit sphere, specified by the compound angle variable $\theta_\mathrm{D} := (\phi, \theta)$. The sound pressure due to this source on the unit sphere is then given by

$$p(\theta, t) = s(t)\,\delta(\theta - \theta_\mathrm{D}),$$

with the time-dependent signal $s(t)$ and the Dirac distribution on the sphere $\delta(\theta)$.
We consider a B-format or first-order Ambisonics (FOA) signal that comprises a directional part from a panned point source at $\mathbf{r}_\mathrm{DOA}$ and an uncorrelated diffuse part with no correlation between the individual channels. The signal vector for the directional part then becomes

$$\mathbf{s}^\mathrm{dir}(t) = s(t)\,\big(Y_{0,0},\; Y_{1,1}(\theta_\mathrm{D}),\; Y_{1,-1}(\theta_\mathrm{D}),\; Y_{1,0}(\theta_\mathrm{D})\big)^\top,$$

where $Y_{l,m}$ are the spherical harmonics with the degree and index numbers $l$ and $m$. This result can be readily read off from the expansion of the Dirac function above up to the first order in the spherical harmonics (see also [13]).
Together with the diffuse part, the full B-format signal then becomes

$$\mathbf{s}(t) = \mathbf{s}^\mathrm{dir}(t) + \Big(w^\mathrm{diff}(t),\; \tfrac{1}{\sqrt{3}}\,x^\mathrm{diff}(t),\; \tfrac{1}{\sqrt{3}}\,y^\mathrm{diff}(t),\; \tfrac{1}{\sqrt{3}}\,z^\mathrm{diff}(t)\Big)^\top.$$

The prefactor of $1/\sqrt{3}$ in the $l=1$ components in the diffuse part arises from the normalization of the signal.
Given this model signal, one can now straightforwardly evaluate the covariance matrix elements. For the off-diagonal matrix elements we find

$$C_{w,x} = \int \mathrm{d}t\; s^2(t)\, Y_{0,0}\, Y_{1,1}(\theta_\mathrm{D}),$$

where the terms involving integrals over the products $s^\mathrm{diff}_{w/x/y/z}(t)\, s(t)$ vanish since the diffuse components are assumed to exhibit no correlations with $s(t)$ or between each other. With the directional energy of the signal $E^\mathrm{dir} = \int \mathrm{d}t\; s^2(t)$, this can be cast as

$$C_{w,x} = E^\mathrm{dir}\, Y_{0,0}\, Y_{1,1}(\theta_\mathrm{D}).$$
The diagonal matrix element $C_{w,w}$ becomes

$$C_{w,w} = E^\mathrm{dir}\, Y_{0,0}^2 + E_w^\mathrm{diff},$$

with the diffuse energy $E_w^\mathrm{diff}$ defined analogously to the directional one. The other diagonal matrix elements follow in the same way.
Using the relations above and expressing the directional and diffuse energies $E^\mathrm{dir}$ and $E^\mathrm{diff}$ in terms of the total signal energy $E$, the only remaining parameters are the angles $\theta_\mathrm{D}$ and the diffuseness or energy ratios, which are present in the DirAC decoder at all times. Therefore, the need to transmit additional prediction coefficients can be entirely avoided.
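Making this explicit under the assumptions used here ($Y_{0,0}=1$, directional energy $E^\mathrm{dir}=(1-\psi)E$ and diffuse energy $E_w^\mathrm{diff}=\psi E$ for diffuseness $\psi$; a plausible reading of the derivation, not a formula quoted from the source), the prediction coefficients reduce to expressions in the DirAC parameters alone:

$$P_x = \frac{C_{w,x}}{C_{w,w}} = \frac{(1-\psi)\,E\,Y_{1,1}(\theta_\mathrm{D})}{(1-\psi)\,E + \psi\,E} = (1-\psi)\,Y_{1,1}(\theta_\mathrm{D}),$$

and analogously $P_y = (1-\psi)\,Y_{1,-1}(\theta_\mathrm{D})$ and $P_z = (1-\psi)\,Y_{1,0}(\theta_\mathrm{D})$: the signal energy $E$ cancels, leaving only $\theta_\mathrm{D}$ and the diffuseness.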
Alternatively, the model can be enabled for a subset of the frequency bands only. For the other bands the prediction coefficients will then be calculated from the exact covariance matrix and transmitted explicitly. This can be useful in cases where a very accurate prediction is needed for the perceptually most relevant frequencies. Often it is desirable to have a more accurate reproduction of the input signal at lower frequencies, e.g. below 2 kHz. The choice of the cross-over frequencies can be motivated from two different arguments.
Firstly, the localization of sound sources is known to rely on different mechanisms for low and high frequencies [14]. While the inter-aural phase difference (IPD) is evaluated at low frequencies, the inter-aural level difference (ILD) dominates for the localization of sources at higher frequencies [14]. Therefore, it is more important to achieve a high accuracy of the prediction and a more accurate reproduction of the phases at lower frequencies. Consequently, one may wish to resort to the more demanding but more accurate transmission of the prediction parameters for lower frequencies.
Secondly, and because of the above argument, perceptual audio coders for the resulting downmix channels often reproduce low frequency bands more accurately than higher ones. For example, at low bitrates, higher frequencies can be quantized to zero and restored from a copy of lower ones [15]. In order to deliver consistent quality across the whole system, it can therefore be desirable to choose the cross-over frequency according to the internal parameters of the core coder employed.
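A sketch of such a hybrid per-band decision (the 2 kHz cross-over, the band layout and all names are illustrative assumptions):

```python
import numpy as np

CROSSOVER_HZ = 2000.0  # illustrative cross-over frequency

def per_band_prediction(band_centers_hz, measured_cov, sh_at_doa, diffuseness):
    """Below the cross-over: exact coefficients from the measured covariance,
    flagged for explicit transmission. Above: model-based coefficients derived
    from the DirAC parameters, costing no extra metadata."""
    coeffs, send_explicitly = [], []
    for b, f_hz in enumerate(band_centers_hz):
        if f_hz < CROSSOVER_HZ:
            p = measured_cov[b][0, 1:] / measured_cov[b][0, 0]
            send_explicitly.append(True)
        else:
            p = (1.0 - diffuseness[b]) * sh_at_doa[b]  # cf. the model above
            send_explicitly.append(False)
        coeffs.append(p)
    return coeffs, send_explicitly

bands = np.array([400.0, 1200.0, 3000.0, 8000.0])
cov = [np.eye(4) for _ in bands]                 # measured per-band covariances
sh = [np.array([0.5, 0.5, 0.7]) for _ in bands]  # Y_{1,m} at each band's DOA
coeffs, flags = per_band_prediction(bands, cov, sh, np.full(len(bands), 0.3))
# flags == [True, True, False, False]: only the low bands cost metadata bits
```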
The signal path of the resulting DirAC system is depicted in
Head Tracking with Low Complexity
Let $\mathbf{s}_{\mathrm{HOA},L}(t)$ be the vector of the output channel signals in HOA of order $L$. The dimension of this vector is then given by $N=(L+1)^2$. In order to perform a rotation of the scene by the conventional method, this signal would first be reconstructed in the DirAC or SPAR decoder and multiplied by a rotation matrix $\mathbf{R}_{\mathrm{HOA},L}$ of size $N \times N$ at each sample of the signal.
Let now $\mathbf{s}^\mathrm{trans}(t)$ be the signal vector of the transported channels after applying the inverse transform as shown in
The key novelty of our invention is now to exploit the properties of $\mathbf{R}_{\mathrm{HOA},L}$: it is block diagonal, with each block belonging to a specific degree $l$, and the matrix elements for $l=1$ are identical to those of the same rotation applied to any vector in $\mathbb{R}^3$ [11]. Consequently, one can apply the $l=1$ block of $\mathbf{R}_{\mathrm{HOA},L}$ to the DOA vector $\mathbf{r}_\mathrm{DOA}$ prior to the reconstruction of the channels with $l>1$. As a result, these channels are reconstructed including the scene rotation and the need to perform a matrix multiplication with the full dimensionality $N$ can be avoided, yielding a large reduction in computational complexity.
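A sketch of this low-complexity rotation (only a yaw rotation is shown for brevity; the mapping of the $l=1$ block onto the x, y, z channels is an assumed convention):

```python
import numpy as np

def yaw_rotation(yaw: float) -> np.ndarray:
    """3x3 rotation about the vertical axis, identical to the l = 1 block of
    the HOA rotation matrix (up to channel-ordering conventions)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotate_scene(foa_transport, doa_vectors, yaw):
    """Rotate only the three first-order channels and the per-band DOA vectors.
    Channels with l > 1 are later reconstructed from the rotated DOAs, so they
    come out rotated for free: no N x N multiplication with N = (L+1)^2 per
    sample is needed."""
    rot = yaw_rotation(yaw)
    w = foa_transport[:1]               # omnidirectional channel: invariant
    xyz = rot @ foa_transport[1:4]      # rotate x, y, z sample-wise
    rotated_doas = doa_vectors @ rot.T  # rotate one DOA per parameter band
    return np.vstack([w, xyz]), rotated_doas

foa = np.random.randn(4, 960)           # transported w, x, y, z
doas = np.random.randn(12, 3)           # unit DOA vectors for 12 bands
doas /= np.linalg.norm(doas, axis=1, keepdims=True)
foa_rot, doas_rot = rotate_scene(foa, doas, yaw=np.pi / 6)
```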
The above discussed approach can be used by an apparatus as it is shown by
It is assumed that the apparatus 100, being part of an encoder, receives a HOA representation. This representation is provided to the entities 110 and 120. For example, a preprocessing of the HOA signal, e.g. by an analysis filterbank or a DirAC parameter estimator, is performed (not shown). The one or more parameters describe an acoustic or psychoacoustic model of the input audio stream HOA. For example, they may comprise at least an information on a direction of arrival (DOA) or, optionally, information on a diffuseness or an energy ratio.
The entity 120 performs a deriving of one or more parameters, e.g. prediction parameters/prediction coefficients.
The diffuseness and/or direction of arrival may be parameters of the mentioned acoustic model. Based on the acoustic model or based on the parameters describing the acoustic model, the prediction coefficients may be calculated by the entity 120. According to a further embodiment, an interim step may be used: the prediction coefficients are, according to further embodiments, calculated based on a covariance matrix which is also calculated by the means for deriving 120, e.g. from the acoustic model. Often such a covariance matrix is calculated based on information about the diffuseness, spherical harmonics and/or a time-dependent scalar-valued signal; for example, the formula given above may be used, where $Y_{l,m}$ is a spherical harmonic with the degree and index $l$ and $m$ and where $s(t)$ is a time-dependent scalar-valued signal. The calculation of the covariance matrix has been discussed above in great detail. According to further embodiments, the additional calculation methods discussed above may be used.
This means that, according to embodiments, the entity 120 performs the following calculation: extracting acoustic or psychoacoustic model parameters, like a DOA or a diffuseness, out of the audio stream HOA and deriving the prediction coefficients from them.
The entity 110 is configured to perform the transformation, e.g. a downmix generation. This downmix generation is based on the input signal, here the HOA signal. In this case, the transformation is applied in a signal-adaptive way dependent on the one or more parameters as derived by the entity 120.
Due to the novel approach in which parameters, e.g. inter-channel prediction coefficients, are derived from the acoustic signal model or the parameters of the acoustic signal model, it is possible to perform a transformation like a mixing/downmixing in a signal-adaptive way. For example, this principle can be used to develop an extension to the DirAC system for spatial audio signals. This extension improves the quality as compared to a static selection of a subset of the channels of the HOA input signal as transport channels. In addition, it reduces the metadata bit usage as compared to previous approaches to signal-adaptive transforms that reduce the inter-channel correlation. The savings on the metadata can in turn free more bits for the EVS bitstreams and further improve the perceptual quality of the system. The additional computational complexity is negligible. These advantages result directly from the derivation of a mathematical connection between the signal model considered in the DirAC system and the prediction coefficients typically transmitted as side information in predictive coding schemes.
Though the principle has been discussed in the context of an encoder, it can also be applied to the decoder side. At the decoder side, the apparatus also comprises transforming means and means for deriving one or more parameters (cf. reference numeral 120) which are used by the transforming means 110. For example, the decoder receives metadata comprising information on the acoustic/psychoacoustic model or on parameters of the acoustic/psychoacoustic model (in general, parameters enabling the determination of the prediction coefficients) together with a coded signal, like an EVS bitstream. The EVS bitstream is provided to the transforming means 110, wherein the metadata are used by the means for deriving 120. The means for deriving 120 determine, based on the metadata, parameters e.g. comprising an information on a DOA. For example, the parameters to be determined may be prediction parameters. It should be noted that the metadata are derived from the audio stream, e.g. at the encoder side. These parameters/prediction parameters are then used by the transforming means 110, which may be configured to perform an inverse transform like an upmixing so as to output a decoded signal like a FOA signal, which can then be further processed so as to determine the HOA signal or directly a loudspeaker signal. The further processing may, for example, comprise a DirAC synthesis including an analysis filterbank.
It should be noted that the calculation of the prediction coefficients may be performed in the same way in the decoder as in the encoder. In this case, the parameters may be preprocessed by a metadata decoder.
With respect to
The entity 120e comprises in this embodiment two entities, namely an entity for determining a model and/or a model covariance matrix, which is marked by the reference numeral 121, as well as an entity for determining prediction coefficients, which is marked by the reference numeral 122. According to embodiments, the entity 121 performs the determination of the covariance matrix, e.g. based on one or more model parameters, like the DOA. The entity 122 determines the prediction coefficients, e.g. based on the covariance matrix.
The entity 120e may, according to further embodiments, receive a HOA signal or a derivative of the HOA signal, e.g. preprocessed by a DirAC parameter estimator 232 and an analysis filterbank 231. The output of the DirAC parameter estimator 232 may give information on a direction of arrival (DOA, as discussed above). This information is then used by the entity 120e and especially by the entity 121. According to further embodiments, the estimated parameters of the entity 232 may also be used by a metadata encoder 233, wherein the encoded metadata stream is multiplexed together with the EVS-coded stream by the multiplexer 230 so as to output the encoded HOA signal/encoded audio stream.
The DirAC synthesis entity is marked by the reference numeral 335. The output of the DirAC synthesis entity 335 may be further processed by a synthesis filterbank 336 so as to output a HOA signal or a headphone/loudspeaker signal.
The metadata, e.g. the metadata decoded by the metadata decoder 333, are used for determining the parameters obtained by the entity 120d. In this case, the entity 120d comprises the two entities for determining the model/the model covariance matrix (marked by the reference numeral 121) and for determining the prediction coefficients/general parameters (marked by the reference numeral 122). The output of the entity 120d is used for the transformation performed by the entity 110d.
Below, further aspects are discussed. The above-discussed embodiments start from the assumption that an audio stream with more than one channel should be transformed into another representation. The above-discussed embodiments may also be applied for transforming audio streams in a directional audio coding system. Thus, embodiments provide an apparatus and method to transform audio streams in a directional audio coding system where the transform is performed in a signal-adaptive way dependent on one or more acoustic model parameters comprising at least an information on a DOA.
According to embodiments, a sound scene can be rotated in such a way that the rotation of a subset of the channels in the Ambisonics domain is combined with a suitable transform in the metadata domain (cf. the head-tracking discussion above).
In general, embodiments refer to an apparatus and method to transform audio streams with more than one channel into another representation such that the transform is signal-adaptive and dependent on one or more parameters of an acoustic or psychoacoustic model, comprising at least an information on a DOA.
According to further embodiments, the transform is computed such that correlations between the transport channels are reduced. For example, an inter-channel covariance matrix may be used; here the inter-channel covariance matrix of the input signal is estimated from a model of the signal. According to further embodiments, a transform matrix is derived from the covariance matrix of the model. According to embodiments, such transform matrices are calculated using different methods for different frequency bands.
In the following, additional embodiments and aspects of the invention will be described which can be used individually or in combination with any of the features and functionalities and details described herein.
A first aspect relates to an apparatus for transforming an audio stream with more than one channel into another representation, apparatus being on an encoder side and comprising: means for deriving one or more parameters describing an acoustic or psychoacoustic model of the audio stream on the encoder side, wherein the means for deriving are configured to calculate prediction coefficients as the one or more parameters, wherein the prediction coefficients are calculated based on a covariance matrix by the means for deriving; means for transforming the audio stream in a signal-adaptive way dependent on the one or more parameters; and wherein the one or more parameters comprise at least an information on at least one direction of arrival (DOA), wherein the means for transforming are configured to perform a downmixing or other transforming of the audio stream on the encoder side.
A second aspect relates to an apparatus for transforming an audio stream with more than one channel into another representation, the apparatus being on a decoder side and comprising: means for receiving one or more parameters describing an audio scene with an acoustic or psychoacoustic model on the decoder side; means for transforming the audio stream in a signal-adaptive way dependent on the one or more parameters; and wherein the one or more parameters comprise at least an information on at least one direction of arrival (DOA), wherein the means for transforming are configured to perform upmix or other transform generation of the audio stream on the decoder side.
According to a third aspect when referring back to the first aspect, the prediction coefficients are calculated based on $Y_{l,m}$, especially based on the formula

$$P_{x/y/z} = \frac{C_{w,x/y/z}}{C_{w,w}},$$

with the covariance matrix elements evaluated from the model.
According to a fourth aspect when referring back to any of the previous aspects, the one or more parameters further comprise at least an information on a diffuseness factor or on one or more DOAs or on energy ratios, and/or wherein the one or more parameters are derived from the audio stream.
According to a fifth aspect when referring back to the first aspect, the means for deriving are configured to calculate a covariance matrix, in particular a covariance matrix from the acoustic or psychoacoustic model.
According to a sixth aspect when referring back to any of the previous aspects, the means for deriving are configured to calculate a covariance matrix based on the DoA and a diffuseness factor or an energy ratio.
According to a seventh aspect when referring back to the sixth aspect, the means for deriving are configured to calculate a covariance matrix based on an information about diffuseness, spherical harmonics and a time-dependent scalar-valued signal, especially based on the formula

$$C_{w,x} = \int \mathrm{d}t\; s^2(t)\, Y_{0,0}\, Y_{1,1}(\theta_\mathrm{D}),$$

where $Y_{l,m}$ is a spherical harmonic with the degree and index $l$ and $m$ and where $s(t)$ is a time-dependent scalar-valued signal; and/or based on a signal energy, especially based on the formula

$$C_{w,x} = (1-\psi)\,E\,Y_{0,0}\,Y_{1,1}(\theta_\mathrm{D}),$$

where $\psi$ describes the diffuseness and where $E$ describes the signal energy for the audio stream; and/or based on the formula

$$C_{w,w} = (1-\psi)\,E\,Y_{0,0}^2 + \psi\,E,$$

where $E$ is the signal energy; and/or based on the formula

$$C_{x,x} = (1-\psi)\,E\,Y_{1,1}^2(\theta_\mathrm{D}) + \tfrac{1}{3}\,\psi\,E,$$

and for the y and z channels analogously.
According to an eighth aspect when referring back to the seventh aspect, the signal energy E is directly calculated from the audio stream; or the signal energy E is estimated from the model of the audio stream.
According to a ninth aspect when referring back to any of the previous aspects, the audio stream is preprocessed by a parameter estimator or wherein the audio stream is preprocessed by a parameter estimator comprising a metadata encoder or metadata decoder and/or wherein the audio stream is preprocessed by an analysis filterbank.
A tenth aspect relates to an apparatus for transforming an audio stream in a directional audio coding system, being on an encoder side and comprising: means for deriving one or more acoustic model parameters of a model of the audio stream, wherein the one or more acoustic model parameters are transmitted to enable restoring all channels of the audio stream and comprise at least an information on direction of arrival (DoA), means for transforming the audio stream in a signal-adaptive way dependent on the one or more acoustic model parameters; where all or a subset of the channels of the audio stream are transformed; wherein the means for transforming are configured to perform a downmixing or other transforming of the audio stream on the encoder side.
An eleventh aspect relates to an apparatus for transforming an audio stream in a directional audio coding system, apparatus being on a decoder side and comprising: means for receiving one or more acoustic model parameters of a model of the audio stream, wherein the one or more acoustic model parameters are received to restore all channels of the audio stream and comprise at least an information on direction of arrival (DoA), means for transforming the audio stream in a signal-adaptive way dependent on the one or more acoustic model parameters; where all or a subset of the channels of the audio stream are transformed; wherein the means for transforming are configured to perform upmix or other transform generation of the audio stream on the decoder side.
According to a twelfth aspect when referring back to any of the previous aspects, the one or more parameters are quantized prior to a transmission.
According to a thirteenth aspect when referring back to any of the previous aspects, the one or more parameters are dequantized after a transmission.
According to a fourteenth aspect when referring back to any of the previous aspects, the parameters are smoothed over time.
According to a fifteenth aspect when referring back to any of the previous aspects, a transform is computed such that correlations between transport channels are reduced by use of a Karhunen-Loève transform or a prediction matrix.
According to a sixteenth aspect when referring back to any of the previous aspects, an inter-channel covariance matrix of the audio stream is estimated from the model or the acoustic or psychoacoustic model of the audio stream.
According to a seventeenth aspect when referring back to any of the previous aspects, a transform matrix is derived from a covariance matrix of the model or the acoustic or psychoacoustic model of the audio stream.
According to an eighteenth aspect when referring back to any of the previous aspects, a transform matrix is calculated using the covariance matrix from the acoustic or psychoacoustic model for one or more frequency bands and a different method to calculate the covariance matrix for one or more other frequency bands.
According to a nineteenth aspect when referring back to any of the previous aspects, at least one of transform methods used by the means for transforming is multiplication of a vector of audio channels by a constant matrix.
According to a twentieth aspect when referring back to any of the previous aspects, at least one of transform methods used by the means for transforming uses prediction based on the inter-channel covariance matrix of a vector of audio channels.
According to a twenty-first aspect when referring back to any of the previous aspects, at least one of transform methods used by the means for transforming uses prediction based on inter-channel covariance matrix based on the DOA and an additional diffuseness factor or an energy ratio.
According to a twenty-second aspect when referring back to any of the previous aspects, the means for deriving the one or more parameters are configured to process all or a subset of the channels of a first-order or higher-order Ambisonics input signal of the audio stream.
According to a twenty-third aspect when referring back to any one of aspects 10-22, a sound scene of the audio stream is rotatable in such a way that a rotation is applied to a subset of the channels in the Ambisonics domain and to the DOA parameters in the metadata domain, so that the remaining channels are reconstructed including the scene rotation.
A twenty-fourth aspect relates to an encoder comprising an apparatus according to one of the aspects 1, 3-22 having backreference to aspect 1.
A twenty-fifth aspect relates to a decoder comprising an apparatus according to one of the aspects 2 or 3-22 having backreference to aspect 2.
A twenty-sixth aspect relates to a system comprising an encoder according to aspect 24 and a decoder according to aspect 25, wherein the encoder is configured to calculate a prediction matrix and/or a downmix or other transform and wherein the decoder is configured to calculate an upmix or other transform matrix from estimated parameters or the one or more parameters of the acoustic model independently of each other.
A twenty-seventh aspect relates to a method for transforming an audio stream with more than one channel into another representation, performed on an encoder side and comprising the following steps: deriving the one or more parameters describing an acoustic or psychoacoustic model of an audio stream from the audio stream, wherein deriving comprises calculating prediction coefficients as the one or more parameters, wherein the prediction coefficients are calculated based on a covariance matrix in the deriving step and wherein the one or more parameters comprise at least an information on direction of arrival (DOA); and transforming the audio stream in a signal-adaptive way dependent on the one or more parameters; wherein transforming comprises a downmixing or other transforming of the audio stream on the encoder side.
A twenty-eighth aspect relates to a method for transforming an audio stream with more than one channel into another representation, performed on a decoder side and comprising the following steps: receiving one or more parameters describing an audio scene with an acoustic or psychoacoustic model on the decoder side, wherein the one or more parameters comprise at least an information on direction of arrival (DOA); and transforming the audio stream in a signal-adaptive way dependent on the one or more parameters; wherein transforming comprises upmixing or other transforming of the audio stream on the decoder side.
A twenty-ninth aspect relates to a method for transforming an audio stream in a directional audio coding system, performed on an encoder side and comprising the steps of: deriving one or more acoustic model parameters of a model of the audio stream parametrized by direction of arrival (DOA) and diffuseness or energy-ratio parameters, wherein said acoustic model parameters are transmitted to restore all channels of an input audio stream and comprise at least an information on DOA, and wherein all or a subset of the channels of the audio stream are transformed; and transforming the audio stream in a signal-adaptive way dependent on the one or more acoustic model parameters, wherein transforming comprises a downmixing or other transforming of the audio stream on the encoder side.
A thirtieth aspect relates to a method for transforming an audio stream in a directional audio coding system, performed on a decoder side and comprising the steps of: receiving one or more acoustic model parameters of a model of the audio stream parametrized by direction of arrival (DOA) and diffuseness or energy-ratio parameters, wherein said acoustic model parameters are received to restore all channels of an input audio stream and comprise at least an information on DOA, and wherein all or a subset of the channels of the audio stream are transformed; and transforming the audio stream in a signal-adaptive way dependent on the one or more acoustic model parameters, wherein transforming comprises upmixing or other transforming of the audio stream on the decoder side.
A thirty-first aspect relates to a computer program for performing, when running on a computer, any of the above methods according to the invention.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
The inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
While this invention has been described in terms of several advantageous embodiments, there are alterations, permutations, and equivalents, which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
This application is a continuation of copending International Application No. PCT/EP2023/052331, filed Jan. 31, 2023, which is incorporated herein by reference in its entirety, and additionally claims priority from International Application No. PCT/EP2022/052642, filed Feb. 3, 2022, which is also incorporated herein by reference in its entirety.