The present invention relates to audio encoding or decoding, and particularly to hybrid encoder/decoder parametric spatial audio coding.
Transmitting an audio scene in three dimensions entails handling multiple channels, which usually engenders a large amount of data to transmit. Moreover, 3D sound can be represented in different ways: traditional channel-based sound, where each transmission channel is associated with a loudspeaker position; sound carried through audio objects, which may be positioned in three dimensions independently of loudspeaker positions; and scene-based audio (or Ambisonics), where the audio scene is represented by a set of coefficient signals that are the linear weights of spatially orthogonal spherical harmonic basis functions. In contrast to the channel-based representation, the scene-based representation is independent of a specific loudspeaker set-up and can be reproduced on any loudspeaker set-up at the expense of an extra rendering process at the decoder.
For each of these formats, dedicated coding schemes were developed for efficiently storing or transmitting the audio signals at low bit rates. For example, MPEG Surround is a parametric coding scheme for channel-based surround sound, while MPEG Spatial Audio Object Coding (SAOC) is a parametric coding method dedicated to object-based audio. A parametric coding technique for Higher Order Ambisonics was also provided in the recent standard MPEG-H Phase 2.
In this transmission scenario, spatial parameters for the full signal are part of the coded and transmitted signal, i.e. estimated and coded in the encoder based on the fully available 3D sound scene and decoded and used for the reconstruction of the audio scene in the decoder. Rate constraints for the transmission typically limit the time and frequency resolution of the transmitted parameters which can be lower than the time-frequency resolution of the transmitted audio data.
Another possibility to create a three-dimensional audio scene is to upmix a lower-dimensional representation, e.g. a two-channel stereo or a first-order Ambisonics representation, to the desired dimensionality using cues and parameters estimated directly from the lower-dimensional representation. In this case the time-frequency resolution can be chosen as fine as desired. On the other hand, the lower-dimensional and possibly coded representation of the audio scene leads to sub-optimal estimation of the spatial cues and parameters. Especially if the analyzed audio scene was coded and transmitted using parametric and semi-parametric audio coding tools, the spatial cues of the original signal are disturbed more than the lower-dimensional representation alone would cause.
Low-rate audio coding using parametric coding tools has seen recent advances. Such advances in coding audio signals at very low bit rates have led to the extensive use of so-called parametric coding tools to ensure good quality. While waveform-preserving coding, i.e., coding where only quantization noise is added to the decoded audio signal, is advantageous, e.g. time-frequency-transform-based coding with shaping of the quantization noise by a perceptual model as in MPEG-2 AAC or MPEG-1 MP3, it leads to audible quantization noise, particularly at low bit rates.
To overcome this problem, parametric coding tools were developed, where parts of the signal are not coded directly but regenerated in the decoder using a parametric description of the desired audio signals, where the parametric description needs a lower transmission rate than waveform-preserving coding. These methods do not try to retain the waveform of the signal but generate an audio signal that is perceptually equal to the original. Examples of such parametric coding tools are bandwidth extensions like Spectral Band Replication (SBR), where the high-band parts of a spectral representation of the decoded signal are generated by copying waveform-coded low-band spectral signal portions and adapting them according to transmitted parameters. Another method is Intelligent Gap Filling (IGF), where some bands of the spectral representation are coded directly, while the bands quantized to zero in the encoder are replaced by other, already decoded bands of the spectrum, again chosen and adjusted according to transmitted parameters. A third parametric coding tool is noise filling, where parts of the signal or spectrum are quantized to zero and are filled with random noise adjusted according to the transmitted parameters.
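The noise-filling idea described above can be sketched as follows; this is a simplified illustration under our own assumptions (the function and variable names are hypothetical, not taken from any standard):

```python
import numpy as np

def noise_fill(decoded_spectrum, noise_gains, band_edges, rng=None):
    """Fill zero-quantized spectral lines with random noise scaled by
    transmitted per-band gain parameters (simplified illustration)."""
    rng = rng or np.random.default_rng(0)
    out = np.array(decoded_spectrum, dtype=float)
    for b, (lo, hi) in enumerate(zip(band_edges[:-1], band_edges[1:])):
        band = out[lo:hi]                # view into the output spectrum
        zeros = band == 0.0              # lines the encoder quantized to zero
        band[zeros] = noise_gains[b] * rng.standard_normal(int(zeros.sum()))
    return out
```

Directly coded lines are left untouched; only the lines quantized to zero receive scaled noise, so the waveform-coded part of the spectrum is preserved.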
Recent audio coding standards used for coding at medium to low bit rates use a mixture of such parametric tools to achieve high perceptual quality at those bit rates. Examples of such standards are xHE-AAC, MPEG-H and EVS.
DirAC spatial parameter estimation and blind upmix is a further procedure. DirAC is a perceptually motivated spatial sound reproduction technique. It is assumed that at one time instant and in one critical band, the spatial resolution of the auditory system is limited to decoding one cue for direction and another for inter-aural coherence or diffuseness.
Based on these assumptions, DirAC represents the spatial sound in one frequency band by cross-fading two streams: a non-directional diffuse stream and a directional non-diffuse stream. The DirAC processing is performed in two phases: the analysis and the synthesis as pictured in
In the DirAC analysis stage shown in
The analysis stage in
The DirAC synthesis stage illustrated in
The component signal in the direct signal branch 1015 is also gain-adjusted using a gain parameter derived from the direction parameter, which consists of an azimuth angle and an elevation angle. Particularly, these angles are input into a VBAP (vector base amplitude panning) gain table 1011. The result is input into a loudspeaker gain averaging stage 1012 for each channel and a further normalizer 1013, and the resulting gain parameter is then forwarded to the amplifier or gain adjuster in the direct signal branch 1015. The diffuse signal generated at the output of a decorrelator 1016 and the direct signal or non-diffuse stream are combined in a combiner 1017 and, then, the other subbands are added in another combiner 1018 which can, for example, be a synthesis filter bank. Thus, a loudspeaker signal for a certain loudspeaker is generated, and the same procedure is performed for the other channels of the other loudspeakers 1019 in a certain loudspeaker setup.
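The per-band combination of the gain-adjusted direct stream and the decorrelated diffuse stream can be sketched as follows; a minimal illustration assuming the common sqrt(1-ψ)/sqrt(ψ) energy split between the two streams, with hypothetical names:

```python
import numpy as np

def synthesize_channel_band(band_signal, decorrelated, diffuseness, vbap_gain):
    """Combine the direct (non-diffuse) and diffuse streams for one
    loudspeaker and one subband; diffuseness in [0, 1] sets the energy
    split, vbap_gain is the panning gain from a VBAP gain table."""
    direct = np.sqrt(1.0 - diffuseness) * vbap_gain * np.asarray(band_signal)
    diffuse = np.sqrt(diffuseness) * np.asarray(decorrelated)
    return direct + diffuse
```

For diffuseness 0 only the panned direct stream is reproduced; for diffuseness 1 only the decorrelated signal reaches the loudspeaker.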
The high-quality version of DirAC synthesis is illustrated in
The aim of the synthesis of the diffuse sound is to create a perception of sound that surrounds the listener. In the low-bit-rate version, the diffuse stream is reproduced by decorrelating the input signal and reproducing it from every loudspeaker. In the high-quality version, the virtual-microphone signals of the diffuse stream are already incoherent to some degree, and they need to be decorrelated only mildly.
The DirAC parameters, also called spatial metadata, consist of tuples of diffuseness and direction, the latter represented in spherical coordinates by two angles, azimuth and elevation. If both the analysis and the synthesis stage are run at the decoder side, the time-frequency resolution of the DirAC parameters can be chosen to be the same as that of the filter bank used for the DirAC analysis and synthesis, i.e. a distinct parameter set for every time slot and frequency bin of the filter-bank representation of the audio signal.
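The estimation of such a parameter tuple from a B-format signal can be sketched as below. This is a simplified illustration based on the active intensity vector, with constants and sign conventions reduced to their simplest form (real implementations differ in normalization, temporal smoothing and B-format conventions):

```python
import numpy as np

def dirac_parameters(W, X, Y, Z):
    """Estimate (azimuth, elevation, diffuseness) for one frequency band
    from complex filter-bank coefficients of the B-format components over
    several time slots. Simplified: unit sound speed/density, no smoothing."""
    # Active intensity per time slot (real part of the conjugate product).
    I = np.stack([np.real(np.conj(W) * X),
                  np.real(np.conj(W) * Y),
                  np.real(np.conj(W) * Z)])
    # Instantaneous energy per time slot.
    E = 0.5 * (np.abs(W)**2 + np.abs(X)**2 + np.abs(Y)**2 + np.abs(Z)**2)
    I_mean = I.mean(axis=1)
    # The direction of arrival points against the intensity flow.
    azimuth = np.arctan2(-I_mean[1], -I_mean[0])
    elevation = np.arctan2(-I_mean[2], np.hypot(I_mean[0], I_mean[1]))
    # Diffuseness: 1 minus the ratio of net intensity to total energy.
    diffuseness = 1.0 - np.linalg.norm(I_mean) / max(E.mean(), 1e-12)
    return azimuth, elevation, diffuseness
```

A single coherent plane wave yields a diffuseness near 0, while mutually incoherent components with no net intensity push the diffuseness towards 1.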
The problem of performing the analysis in a spatial audio coding system only on the decoder side is that, at medium to low bit rates, parametric tools as described in the previous section are used. Due to the non-waveform-preserving nature of those tools, the spatial analysis for spectral portions that are mainly coded parametrically can lead to vastly different values for the spatial parameters than an analysis of the original signal would have produced.
Recently, a spatial audio coding method that uses DirAC analysis in the encoder and transmits the coded spatial parameters to the decoder was disclosed in [3][4].
The resulting B-format signal is introduced into a DirAC analyzer 210 in order to derive DirAC metadata such as direction of arrival metadata and diffuseness metadata, and the obtained signals are encoded using a spatial metadata encoder 220. Moreover, the B-format signal is forwarded to a beam former/signal selector in order to downmix the B-format signals into a transport channel or several transport channels that are then encoded using an EVS based core encoder 140.
The output of block 220 on the one hand and block 140 on the other hand represent an encoded audio scene. The encoded audio scene is forwarded to a decoder, and in the decoder, a spatial metadata decoder 700 receives the encoded spatial metadata and an EVS-based core decoder 500 receives the encoded transport channels. The decoded spatial metadata obtained by block 700 is forwarded to a DirAC synthesis stage 800 and the decoded one or more transport channels at the output of block 500 are subjected to a frequency analysis in block 860. The resulting time/frequency decomposition is also forwarded to the DirAC synthesizer 800 that then generates, for example, as a decoded audio scene, loudspeaker signals or first order Ambisonics or higher order Ambisonics components or any other representation of an audio scene.
In the procedure disclosed in [3] and [4], the DirAC metadata, i.e., the spatial parameters, are estimated and coded at a low bitrate and transmitted to the decoder, where they are used to reconstruct the 3D audio scene together with a lower dimensional representation of the audio signal.
To achieve the low bit rate for the metadata, the time-frequency resolution of the transmitted parameters is smaller than the time-frequency resolution of the filter bank used in the analysis and synthesis of the 3D audio scene.
According to an embodiment, an audio scene encoder for encoding an audio scene, the audio scene having at least two component signals, may have: a core encoder for core encoding the at least two component signals, wherein the core encoder is configured to generate a first encoded representation for a first portion of the at least two component signals, and to generate a second encoded representation for a second portion of the at least two component signals, wherein the core encoder is configured to form a time frame from the at least two component signals, wherein a first frequency subband of the time frame of the at least two component signals is the first portion of the at least two component signals and a second frequency subband of the time frame is the second portion of the at least two component signals, wherein the first frequency subband is separated from the second frequency subband by a predetermined border frequency, wherein the core encoder is configured to generate the first encoded representation for the first frequency subband having M component signals, and to generate the second encoded representation for the second frequency subband having N component signals, wherein M is greater than N, and wherein N is greater than or equal to 1; a spatial analyzer for analyzing the audio scene having the at least two component signals to derive one or more spatial parameters or one or more spatial parameter sets for the second frequency subband; and an output interface for forming an encoded audio scene signal, the encoded audio scene signal having the first encoded representation for the first frequency subband having the M component signals, the second encoded representation for the second frequency subband having the N component signals, and the one or more spatial parameters or one or more spatial parameter sets for the second frequency subband.
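The split of a time frame at the predetermined border frequency into a first portion with M component signals and a second portion with N component signals (M > N ≥ 1) can be sketched as follows; the function and variable names are illustrative, not part of the embodiment:

```python
import numpy as np

def split_at_border(spectra, border_bin, M, N):
    """spectra: (num_components, num_bins) filter-bank coefficients of one
    time frame. Returns the first portion (M components, below the border
    frequency) and the second portion (N components, above it)."""
    assert M > N >= 1 and spectra.shape[0] >= M
    first = spectra[:M, :border_bin]   # more components, finer representation
    second = spectra[:N, border_bin:]  # fewer components, plus spatial metadata
    return first, second
```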
According to another embodiment, an audio scene decoder may have: an input interface for receiving an encoded audio scene signal having a first encoded representation of a first portion of at least two component signals, a second encoded representation of a second portion of the at least two component signals, and one or more spatial parameters for the second portion of the at least two component signals; a core decoder for decoding the first encoded representation and the second encoded representation to obtain a decoded representation of the at least two component signals representing an audio scene; a spatial analyzer for analyzing a portion of the decoded representation corresponding to the first portion of the at least two component signals to derive one or more spatial parameters for the first portion of the at least two component signals; and a spatial renderer for spatially rendering the decoded representation using the one or more spatial parameters for the first portion and the one or more spatial parameters for the second portion as included in the encoded audio scene signal.
According to another embodiment, a method of encoding an audio scene, the audio scene having at least two component signals, may have the steps of: core encoding the at least two component signals, wherein the core encoding has generating a first encoded representation for a first portion of the at least two component signals, and generating a second encoded representation for a second portion of the at least two component signals; wherein the core encoding has forming a time frame from the at least two component signals, wherein a first frequency subband of the time frame of the at least two component signals is the first portion of the at least two component signals and a second frequency subband of the time frame is the second portion of the at least two component signals, wherein the first frequency subband is separated from the second frequency subband by a predetermined border frequency, wherein the core encoding has generating the first encoded representation for the first frequency subband having M component signals, and generating the second encoded representation for the second frequency subband having N component signals, wherein M is greater than N, and wherein N is greater than or equal to 1; analyzing the audio scene having the at least two component signals to derive one or more spatial parameters or one or more spatial parameter sets for the second frequency subband; and forming the encoded audio scene signal, the encoded audio scene signal having the first encoded representation for the first frequency subband having the M component signals, the second encoded representation for the second frequency subband having the N component signals, and the one or more spatial parameters or the one or more spatial parameter sets for the second frequency subband.
According to still another embodiment, a method of decoding an audio scene may have the steps of: receiving an encoded audio scene signal having a first encoded representation of a first portion of at least two component signals, a second encoded representation of a second portion of the at least two component signals, and one or more spatial parameters for the second portion of the at least two component signals; decoding the first encoded representation and the second encoded representation to obtain a decoded representation of the at least two component signals representing the audio scene; analyzing a portion of the decoded representation corresponding to the first portion of the at least two component signals to derive one or more spatial parameters for the first portion of the at least two component signals; and spatially rendering the decoded representation using the one or more spatial parameters (840) for the first portion and the one or more spatial parameters for the second portion as included in the encoded audio scene signal.
Another embodiment may have a non-transitory digital storage medium having stored thereon a computer program for performing a method of encoding an audio scene, the audio scene having at least two component signals, the method having the steps of: core encoding the at least two component signals, wherein the core encoding has generating a first encoded representation for a first portion of the at least two component signals, and generating a second encoded representation for a second portion of the at least two component signals; wherein the core encoding has forming a time frame from the at least two component signals, wherein a first frequency subband of the time frame of the at least two component signals is the first portion of the at least two component signals and a second frequency subband of the time frame is the second portion of the at least two component signals, wherein the first frequency subband is separated from the second frequency subband by a predetermined border frequency, wherein the core encoding has generating the first encoded representation for the first frequency subband having M component signals, and generating the second encoded representation for the second frequency subband having N component signals, wherein M is greater than N, and wherein N is greater than or equal to 1; analyzing the audio scene having the at least two component signals to derive one or more spatial parameters or one or more spatial parameter sets for the second frequency subband; and forming the encoded audio scene signal, the encoded audio scene signal having the first encoded representation for the first frequency subband having the M component signals, the second encoded representation for the second frequency subband having the N component signals, and the one or more spatial parameters or the one or more spatial parameter sets for the second frequency subband, when said computer program is run by a computer.
Still another embodiment may have a non-transitory digital storage medium having stored thereon a computer program for performing a method of decoding an audio scene, having the steps of: receiving an encoded audio scene signal having a first encoded representation of a first portion of at least two component signals, a second encoded representation of a second portion of the at least two component signals, and one or more spatial parameters for the second portion of the at least two component signals; decoding the first encoded representation and the second encoded representation to obtain a decoded representation of the at least two component signals representing the audio scene; analyzing a portion of the decoded representation corresponding to the first portion of the at least two component signals to derive one or more spatial parameters for the first portion of the at least two component signals; and spatially rendering the decoded representation using the one or more spatial parameters (840) for the first portion and the one or more spatial parameters for the second portion as included in the encoded audio scene signal, when said computer program is run by a computer.
According to another embodiment, an encoded audio scene signal may have: a first encoded representation for a first frequency subband of a time frame of at least two component signals of an audio scene, wherein the first encoded representation for the first frequency subband has M component signals; a second encoded representation for a second frequency subband of the time frame of the at least two component signals, wherein the second encoded representation for the second frequency subband has N component signals, wherein M is greater than N, wherein N is greater than or equal to 1, wherein the first frequency subband is separated from the second frequency subband by a predetermined border frequency; and one or more spatial parameters or one or more spatial parameter sets for the second frequency subband.
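The content of such an encoded audio scene signal can be summarized with a hypothetical container structure; the field names below are illustrative only:

```python
from dataclasses import dataclass, field

@dataclass
class EncodedAudioScene:
    first_representation: list   # M encoded component signals, below the border
    second_representation: list  # N encoded component signals, above the border
    border_frequency_hz: float   # predetermined border frequency
    spatial_params: dict = field(default_factory=dict)  # second subband only

# Example instance: M = 3 components low band, N = 1 component high band.
scene = EncodedAudioScene(["W", "X", "Y"], ["W"], 4000.0,
                          {"azimuth": 0.1, "elevation": 0.0, "diffuseness": 0.4})
```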
The present invention is based on the finding that improved audio quality, higher flexibility and, in general, improved performance are obtained by applying a hybrid encoding/decoding scheme, where the spatial parameters used to generate a decoded two-dimensional or three-dimensional audio scene in the decoder are, for some parts of a time-frequency representation of the scene, estimated in the decoder based on a coded, transmitted and decoded, typically lower-dimensional audio representation, and are, for other parts, estimated, quantized and coded within the encoder and transmitted to the decoder.
Depending on the implementation, the division between encoder-side estimated and decoder-side estimated regions can differ for the different spatial parameters used in the generation of the three-dimensional or two-dimensional audio scene in the decoder.
In embodiments, this partition into different portions or, advantageously, time/frequency regions can be arbitrary. In an embodiment, however, it is advantageous to estimate the parameters in the decoder for parts of the spectrum that are mainly coded in a waveform-preserving manner, while coding and transmitting encoder-calculated parameters for parts of the spectrum where parametric coding tools were mainly used.
Embodiments of the present invention aim to propose a low bit-rate coding solution for transmitting a 3D audio scene by employing a hybrid coding system where spatial parameters used for the reconstruction of the 3D audio scene are for some parts estimated and coded in the encoder and transmitted to the decoder, and for the remaining parts estimated directly in the decoder.
The present invention discloses a 3D audio reproduction based on a hybrid approach: decoder-only parameter estimation is used for parts of the signal whose spatial cues are retained well after reducing the spatial representation to a lower dimension in the audio encoder and encoding the lower-dimensional representation; for parts of the spectrum where the lower dimensionality, together with the coding of the lower-dimensional representation, would lead to a sub-optimal estimation of the spatial parameters, the spatial cues and parameters are instead estimated and coded in the encoder and transmitted from the encoder to the decoder.
In an embodiment, an audio scene encoder is configured for encoding an audio scene comprising at least two component signals, and the audio scene encoder comprises a core encoder configured for core encoding the at least two component signals, where the core encoder generates a first encoded representation for a first portion of the at least two component signals and a second encoded representation for a second portion of the at least two component signals. A spatial analyzer analyzes the audio scene to derive one or more spatial parameters or one or more spatial parameter sets for the second portion, and an output interface then forms the encoded audio scene signal, which comprises the first encoded representation, the second encoded representation and the one or more spatial parameters or one or more spatial parameter sets for the second portion. Typically, spatial parameters for the first portion are not included in the encoded audio scene signal, since those spatial parameters are estimated from the decoded first representation in a decoder. The spatial parameters for the second portion, on the other hand, are already calculated within the audio scene encoder based on the original audio scene or an already processed audio scene that has been reduced with respect to its dimension and, therefore, with respect to its bitrate.
Thus, the encoder-calculated parameters can carry high-quality parametric information, since these parameters are calculated in the encoder from data that is highly accurate, not affected by core-encoder distortions and potentially even available in a very high dimension, such as a signal derived from a high-quality microphone array. Because such very high-quality parametric information is preserved, it is then possible to core encode the second portion with less accuracy or, typically, lower resolution. By quite coarsely core encoding the second portion, bits can be saved which can then be given to the representation of the encoded spatial metadata. Bits saved by a coarse encoding of the second portion can also be invested into a high-resolution encoding of the first portion of the at least two component signals. A high-resolution or high-quality encoding of the at least two component signals is useful since, at the decoder side, no parametric spatial data exists for the first portion; it is instead derived within the decoder by a spatial analysis. Thus, by not calculating all spatial metadata in the encoder but core encoding at least two component signals, any bits that would otherwise be used for the encoded metadata can be saved and invested into a higher-quality core encoding of the at least two component signals in the first portion.
Thus, in accordance with the present invention, the separation of the audio scene into the first portion and the second portion can be done in a highly flexible manner, for example depending on bitrate requirements, audio quality requirements, or processing requirements, i.e., whether more processing resources are available in the encoder or the decoder, and so on. In an embodiment, the separation into the first and second portion is done based on the core encoder functionalities. Particularly, for high-quality and low-bitrate core encoders that apply parametric coding operations for certain bands, such as spectral band replication, intelligent gap filling or noise filling, the separation with respect to the spatial parameters is performed in such a way that the non-parametrically encoded portions of the signal form the first portion and the parametrically encoded portions of the signal form the second portion. Thus, for the parametrically encoded second portion, which is typically the lower-resolution encoded portion of the audio signal, a more accurate representation of the spatial parameters is obtained, while for the better encoded, i.e., high-resolution encoded first portion, transmitted high-quality parameters are not as necessary, since quite high-quality parameters can be estimated on the decoder side using the decoded representation of the first portion.
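The separation rule just described can be stated compactly. The sketch below assumes, hypothetically, that the core encoder exposes which bands it coded in a waveform-preserving way; the names are illustrative:

```python
def spatial_parameter_source(band, waveform_coded_bands):
    """For waveform-coded bands the spatial parameters are estimated at the
    decoder from the decoded signal; for parametrically coded bands
    (SBR/IGF/noise filling) they are estimated, coded and transmitted by
    the encoder."""
    return "decoder_estimated" if band in waveform_coded_bands else "transmitted"
```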
In a further embodiment, and in order to reduce the bitrate even further, the spatial parameters for the second portion are calculated within the encoder at a certain time/frequency resolution, which can be a high or a low time/frequency resolution. In case of a high time/frequency resolution, the calculated parameters are then grouped in a certain way in order to obtain low-resolution spatial parameters. These low-resolution spatial parameters are nevertheless high-quality spatial parameters that merely have a low resolution. The low resolution is useful in that bits are saved for the transmission, since the number of spatial parameters for a certain time length and a certain frequency band is reduced. This reduction, however, is typically not very problematic, since the spatial data does not change too much over time and frequency. Thus, a low-bitrate but nevertheless good-quality representation of the spatial parameters for the second portion can be obtained.
Since the spatial parameters for the first portion are calculated on the decoder side and do not have to be transmitted, no compromises with respect to resolution have to be made. Therefore, a high-time- and high-frequency-resolution estimation of spatial parameters can be performed on the decoder side, and this high-resolution parametric data then helps in providing a good spatial representation of the first portion of the audio scene. Thus, the “disadvantage” of calculating the spatial parameters on the decoder side based on the at least two transmitted components for the first portion can be reduced or even eliminated by calculating high-time- and high-frequency-resolution spatial parameters and by using these parameters in the spatial rendering of the audio scene. This does not incur any penalty in bitrate, since processing performed on the decoder side has no influence on the transmitted bitrate in an encoder/decoder scenario.
A further embodiment of the present invention relies on a situation where, for the first portion, at least two components are encoded and transmitted so that, based on the at least two components, a parametric data estimation can be performed on the decoder side. In an embodiment, however, the second portion of the audio scene can be encoded at a substantially lower bitrate, since it is advantageous to encode only a single transport channel for the second representation. This transport or downmix channel is represented with a very low bitrate compared to the first portion, since in the second portion only a single channel or component is to be encoded, while in the first portion two or more components are to be encoded, so that enough data for a decoder-side spatial analysis is available.
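This asymmetry can be sketched as follows: at least two components are kept for the first portion, while the second portion is reduced to a single transport channel (here assumed, for illustration, to be the omnidirectional component); indices and names are hypothetical:

```python
import numpy as np

def build_transport_channels(first_portion, second_portion, w_index=0):
    """first_portion, second_portion: (num_components, num_bins) arrays.
    The first portion keeps two or more components so the decoder can run
    its own spatial analysis; the second portion is reduced to one
    downmix channel plus transmitted spatial metadata."""
    first_transport = first_portion[:2]                     # at least two components
    second_transport = second_portion[w_index:w_index + 1]  # single channel
    return first_transport, second_transport
```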
Thus, the present invention provides additional flexibility with respect to bitrate, audio quality, and processing requirements available on the encoder or the decoder-side.
Embodiments of the present invention are subsequently described with respect to the accompanying drawings, in which:
The second encoded representation for the second portion can consist of the same number of components or, alternatively, of a lower number, such as only a single omnidirectional component that has been encoded by the core encoder for the second portion. In the implementation where the core encoder 100 reduces the dimensionality of the original audio scene 110, the reduced-dimensionality audio scene can optionally be forwarded to the spatial analyzer via line 120 instead of the original audio scene.
The encoded representation comprising the first encoded representation 410 for the first portion and the second encoded representation 420 for the second portion is input into a core decoder for decoding the first encoded representation and the second encoded representation to obtain a decoded representation of the at least two component signals representing an audio scene. The decoded representation comprises a first decoded representation for the first portion indicated at 810 and a second decoded representation for the second portion indicated at 820. The first decoded representation is forwarded to a spatial analyzer 600 for analyzing a portion of the decoded representation corresponding to the first portion of the at least two component signals to obtain one or more spatial parameters 840 for the first portion of the at least two component signals. The audio scene decoder also comprises a spatial renderer 800 for spatially rendering the decoded representation which comprises, in the
Alternatively, in case of intelligent gap filling (IGF) or noise filling (NF), the bands are selected arbitrarily in line with a signal analysis and, therefore, the first portion could, for example, consist of bands B1, B2, B4, B6 and the second portion could be B3, B5 and possibly another higher frequency band. Thus, a very flexible separation of the audio signal into bands can be performed, irrespective of whether the bands are, as is of advantage and illustrated in
The core encoder 100 of
The audio encoder 160a for the first encoded representation can comprise a waveform-preserving, non-parametric, or high-time- or high-frequency-resolution encoder, while the audio encoder 160b can be a parametric encoder such as an SBR encoder, an IGF encoder or a noise-filling encoder, or any other low-time- or low-frequency-resolution encoder. Thus, the audio encoder 160b will typically result in a lower-quality output representation compared to the audio encoder 160a. This “disadvantage” is addressed by performing a spatial analysis, via the spatial data analyzer 210, of the original audio scene or, alternatively, of a dimension-reduced audio scene when the dimension-reduced audio scene still comprises at least two component signals. The spatial data obtained by the spatial data analyzer 210 is then forwarded to a metadata encoder 220 that outputs encoded low-resolution spatial data. Both blocks 210, 220 may be included in the spatial analyzer block 200 of
Advantageously, the spatial data analyzer performs a spatial data analysis with a high resolution such as a high frequency resolution or a high time resolution and then, in order to keep the bitrate used for the encoded metadata in a reasonable range, the high resolution spatial data may be grouped and entropy encoded by the metadata encoder in order to obtain encoded low resolution spatial data. When, for example, a spatial data analysis is performed for eight time slots per frame and ten bands per time slot, one could group the spatial data into a single spatial parameter per frame and, for example, five bands per parameter.
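The grouping described above can be sketched as a simple averaging over time/frequency regions. This is a minimal illustrative sketch, not the claimed implementation; the function name, the use of plain averaging, and the choice of band groups are assumptions for illustration only.

```python
import numpy as np

def group_spatial_params(params, band_groups, time_groups=1):
    """Reduce high-resolution spatial parameters (time_slots x bands)
    to a low-resolution grid by averaging over each group.
    Hypothetical helper; real metadata encoders may use other
    aggregation rules and subsequent entropy coding."""
    time_slots, _ = params.shape
    slots_per_group = time_slots // time_groups
    out = np.empty((time_groups, len(band_groups)))
    for t in range(time_groups):
        t_sl = slice(t * slots_per_group, (t + 1) * slots_per_group)
        for b, (lo, hi) in enumerate(band_groups):
            # one low-resolution parameter per time group and band group
            out[t, b] = params[t_sl, lo:hi].mean()
    return out

# 8 time slots x 10 bands per frame, grouped into 1 time group x 5 band groups
hi_res = np.random.rand(8, 10)
lo_res = group_spatial_params(
    hi_res, band_groups=[(0, 2), (2, 4), (4, 6), (6, 8), (8, 10)])
```

With this grouping, 80 high-resolution parameters per frame are reduced to 5, which the metadata encoder can then entropy encode.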
It is of advantage to calculate directional data on the one hand and diffuseness data on the other hand. The metadata encoder 220 could then be configured to output the encoded data with different time/frequency resolutions for the directional and diffuseness data. Typically, directional data are used with a higher resolution than diffuseness data. An advantageous way to calculate the parametric data with different resolutions is to perform the spatial analysis with a high, and typically equal, resolution for both parameter kinds and to then perform a grouping in time and/or frequency of the parametric information in different ways for the different parameter kinds, in order to obtain an encoded low resolution spatial data output 330 that has, for example, a medium resolution in time and/or frequency for the directional data and a low resolution for the diffuseness data.
The core decoder 500 of
The result of block 710 is then a collection of decoded, advantageously high resolution, parameters for the second portion that typically have the same resolution as the parameters 840 for the first portion. Also, the encoded representation of the second portion is decoded by the audio decoder 510b to obtain the decoded second portion 820, typically having at least one component or being a signal having at least two components.
Alternatively, the signal input into the format converter or the core encoder could be a signal captured by an omnidirectional microphone positioned at a first position and another signal captured by an omnidirectional microphone positioned at a second position different from the first position. Again, alternatively, the audio scene comprises, as a first component signal, a signal captured by a directional microphone directed to a first direction and, as a second component, at least one signal captured by another directional microphone directed to a second direction different from the first direction. These “directional microphones” do not necessarily have to be real microphones but can also be virtual microphones.
The audio input into block 900 or output by block 900 or generally used as the audio scene can comprise A-format component signals, B-format component signals, first order Ambisonics component signals, higher order Ambisonics component signals or component signals captured by a microphone array with at least two microphone capsules or component signals calculated from a virtual microphone processing.
The output interface 300 of
Thus, when the parameters 330 for the second portion are direction of arrival data and diffuseness data, the first encoded representation for the first portion will not comprise direction of arrival data and diffuseness data but can, of course, comprise any other parameters that have been calculated by the core encoder such as scale factors, LPC coefficients, etc.
Moreover, the band separation performed by the signal separator 140, when the different portions are different bands, can be implemented in such a way that a start band for the second portion is lower than the bandwidth extension start band and, additionally, the core noise filling does not necessarily have to apply any fixed crossover band, but can be applied gradually to more parts of the core spectrum as the frequency increases.
Moreover, the parametric or largely parametric processing for the second frequency subband of a time frame comprises calculating an amplitude-related parameter for the second frequency subband and the quantization and entropy coding of this amplitude-related parameter instead of the individual spectral lines in the second frequency subband. Such an amplitude-related parameter forming a low resolution representation of the second portion is, for example, given by a spectral envelope representation having only, for example, one scale factor or energy value for each scale factor band, while the high resolution first portion relies on individual MDCT or FFT or, generally, individual spectral lines.
Thus, a first portion of the at least two component signals is given by a certain frequency band for each component signal, and the certain frequency band for each component signal is encoded with a number of spectral lines to obtain the encoded representation of the first portion. With respect to the second portion, however, an amplitude-related measure such as the sum of the individual spectral lines for the second portion or a sum of squared spectral lines representing an energy in the second portion or the sum of spectral lines raised to the power of three representing a loudness measure for the spectral portion can be used as well for the parametric encoded representation of the second portion.
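The three amplitude-related measures named above (sum of spectral lines, sum of squared lines as an energy, and lines raised to the power of three as a loudness measure) can be written out directly. This is a minimal sketch of those measures only; the function name and the dictionary layout are illustrative assumptions, not part of the described coder.

```python
import numpy as np

def band_measures(spectral_lines):
    """Amplitude-related measures for one frequency band, which can be
    transmitted instead of the individual spectral lines of the band."""
    x = np.abs(np.asarray(spectral_lines, dtype=float))
    return {
        "amplitude": float(x.sum()),          # sum of absolute spectral lines
        "energy":    float((x ** 2).sum()),   # sum of squared spectral lines
        "loudness":  float((x ** 3).sum()),   # lines raised to the power of three
    }

# four spectral lines of one hypothetical scale factor band
m = band_measures([0.5, -0.25, 1.0, 0.1])
```

Each measure reduces a whole band of spectral lines to a single value, which is what makes the second-portion representation low resolution compared to the line-by-line first portion.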
Again referring to
On the decoder-side, the encoded spatial metadata is input into the spatial metadata decoder 700 to generate the parameters for the second portion illustrated at 830. The core decoder, which in an embodiment is typically implemented as an EVS-based core decoder consisting of elements 510a, 510b, outputs the decoded representation consisting of both portions where, however, both portions are not yet separated. The decoded representation is input into a frequency analyzing block 860 and the frequency analyzer 860 generates the component signals for the first portion and forwards same to a DirAC analyzer 600 to generate the parameters 840 for the first portion. The transport channel/component signals for the first and the second portions are forwarded from the frequency analyzer 860 to the DirAC synthesizer 800. Thus, the DirAC synthesizer operates, in an embodiment, as usual, since the DirAC synthesizer does not have, and actually does not require, any specific knowledge of whether the parameters for the first portion and the second portion have been derived on the encoder side or on the decoder side. Instead, both parameters “do the same” for the DirAC synthesizer 800 and the DirAC synthesizer can then generate, based on the frequency representation of the decoded representation of the at least two component signals representing the audio scene indicated at 862 and the parameters for both portions, a loudspeaker output, a first order Ambisonics (FOA), a high order Ambisonics (HOA) or a binaural output.
Alternatively, the mode controller can comprise a tonality mask processing as known from intelligent gap filling that analyzes the spectrum of the input signal in order to determine bands that have to be encoded with a high spectral resolution that end up in the encoded first portion and to determine bands that can be encoded in a parametric way that will then end up in the second portion. The mode controller 166 is configured to also control the spatial analyzer 200 on the encoder-side and advantageously to control a band separator 230 of the spatial analyzer or a parameter separator 240 of the spatial analyzer. This makes sure that, in the end, only spatial parameters for the second portion, but not for the first portion are generated and output into the encoded scene signal.
Particularly, when the spatial analyzer 200 directly receives the audio scene signal either before being input into the analysis filter bank or subsequent to being input into the filter bank, the spatial analyzer 200 calculates a full analysis over the first and the second portion and, the parameter separator 240 then only selects for output into the encoded scene signal the parameters for the second portion. Alternatively, when the spatial analyzer 200 receives input data from a band separator, then the band separator 230 already forwards only the second portion and, then, a parameter separator 240 is not required anymore, since the spatial analyzer 200 anyway only receives the second portion and, therefore, only outputs the spatial data for the second portion.
Thus, a selection of the second portion can be performed before or after the spatial analysis and may be controlled by the mode controller 166 or can also be implemented in a fixed manner. The spatial analyzer 200 relies on an analysis filter bank of the encoder or uses its own separate filter bank that is not illustrated in
When the spatial analyzer 200 relies on the band separator 168 of the core encoder, a separate band separator 230 is not required. When, however, the spatial analyzer 200 relies on the band separator 230, then the connection between block 168 and block 200 of
Thus, while
The first portion can be directly forwarded to the spatial analyzer 600, or the first portion can be derived from the decoded representation at the output of the synthesis filter bank 169 via a band separator 630. Depending on the situation, the parameter separator 640 is used or not. If the spatial analyzer 600 receives the first portion only, then the band separator 630 and the parameter separator 640 are not required. If the spatial analyzer 600 receives the decoded representation and the band separator is not there, then the parameter separator 640 is used. If the decoded representation is input into the band separator 630, then the spatial analyzer does not need the parameter separator 640, since the spatial analyzer 600 then only outputs the spatial parameters for the first portion.
Alternatively, when the second portion is only available in a single component, then the time/frequency tiles for the first portion are input into the virtual microphone processor 870a, while the time/frequency tiles for the second portion, which has a single component or a lower number of components, are input into the processor 870b. The processor 870b, for example, only has to perform a copying operation, i.e., to copy the single transport channel into an output signal for each loudspeaker signal. Thus, the virtual microphone processing 870a of the first alternative is replaced by a simple copying operation.
Then, the output of block 870a in the first embodiment, or of 870a for the first portion and 870b for the second portion, is input into a gain processor 872 for modifying the output component signal using the one or more spatial parameters. The data are also input into a weighter/decorrelator processor 874 for generating a decorrelated output component signal using the one or more spatial parameters. The output of block 872 and the output of block 874 are combined within a combiner 876 operating for each component so that, at the output of block 876, one obtains a frequency domain representation of each loudspeaker signal.
Then, by means of a synthesis filter bank 878, all frequency domain loudspeaker signals can be converted into a time domain representation and the generated time domain loudspeaker signals can be digital-to-analog converted and used to drive corresponding loudspeakers placed at the defined loudspeaker positions.
Typically, the gain processor 872 operates based on spatial parameters and advantageously, directional parameters such as the direction of arrival data and, optionally, based on diffuseness parameters. Additionally, the weighter/decorrelator processor operates based on spatial parameters as well, and, advantageously, based on the diffuseness parameters.
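The interplay of gain processor 872, weighter/decorrelator processor 874 and combiner 876 can be sketched with a common DirAC-style weighting, where the non-diffuse stream is scaled by the square root of one minus the diffuseness and the decorrelated diffuse stream by the square root of the diffuseness. This is an illustrative sketch under that assumption; the exact gain laws of blocks 872/874, the function name, and the example numbers are not taken from the document.

```python
import numpy as np

def synthesize_tile(tile, pan_gains, diffuseness, decorrelated):
    """Combine non-diffuse and diffuse streams for one time/frequency
    tile, per loudspeaker channel (hypothetical DirAC-style weighting)."""
    # gain processor: directional gains from the direction of arrival,
    # attenuated by the non-diffuse energy fraction
    direct = np.sqrt(1.0 - diffuseness) * pan_gains * tile
    # weighter/decorrelator processor: decorrelated signal weighted by
    # the diffuse energy fraction
    diffuse = np.sqrt(diffuseness) * decorrelated
    # combiner: per-channel sum of both streams
    return direct + diffuse

tile = 0.8                    # one downmix time/frequency bin (real-valued here)
gains = np.array([0.7, 0.3])  # panning gains for two loudspeakers
out = synthesize_tile(tile, gains, diffuseness=0.2,
                      decorrelated=np.array([0.1, -0.1]))
```

A synthesis filter bank, as in block 878, would then convert the per-channel frequency domain tiles back to time domain loudspeaker signals.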
Thus, in an implementation, the gain processor 872 represents the generation of the non-diffuse stream in
Exemplary benefits and advantages of embodiments over the state of the art are:
V. Pulkki, M-V Laitinen, J Vilkamo, J Ahonen, T Lokki and T Pihlajamäki, “Directional audio coding - perception-based reproduction of spatial sound”, International Workshop on the Principles and Application on Spatial Hearing, November 2009, Zao; Miyagi, Japan.
Ville Pulkki, “Virtual source positioning using vector base amplitude panning”, J. Audio Eng. Soc., 45(6):456-466, June 1997.
European patent application No. EP17202393.9, “EFFICIENT CODING SCHEMES OF DIRAC METADATA”.
European patent application No EP17194816.9 “Apparatus, method and computer program for encoding, decoding, scene processing and other procedures related to DirAC based spatial audio coding”.
An inventively encoded audio signal can be stored on a digital storage medium or a non-transitory storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier or a non-transitory storage medium.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which will be apparent to others skilled in the art and which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
18154749.8 | Feb 2018 | EP | regional |
18185852.3 | Jul 2018 | EP | regional |
This application is a continuation of copending U.S. Patent Application No. 17/645,110, filed Dec. 20, 2021, which is a continuation of U.S. Pat. No. 11,361,778, issued Jun. 14, 2022, which is a continuation of International Application No. PCT/EP2019/052428, filed Jan. 31, 2019, which is incorporated herein by reference in its entirety, and additionally claims priority from European Application No. 18154749.8, filed Feb. 01, 2018, and from European Application No. 18185852.3, filed Jul. 26, 2018, which are also incorporated herein by reference in their entirety.
Number | Date | Country | |
---|---|---|---|
Parent | 17645110 | Dec 2021 | US |
Child | 18330953 | US | |
Parent | 16943065 | Jul 2020 | US |
Child | 17645110 | US | |
Parent | PCT/EP2019/052428 | Jan 2019 | WO |
Child | 16943065 | US |