Directivity is an important acoustic property of a sound source, e.g. in an immersive reproduction environment. Directivity is frequency dependent and may be measured at discrete frequencies on an octave or third-octave frequency grid. For a given frequency, the directivity is a scalar value defined on the unit sphere. The estimation may be done using a number of microphones distributed evenly on a sphere. The measurements are then post-processed and accurately interpolated on a fine or very fine spherical grid. The values are saved into one of the available interoperability file formats, such as SOFA files [1]. These files can be quite large, up to several megabytes.
However, for inclusion into a bitstream for transmission, a much more compact representation is needed, where the size is reduced to between several hundred bytes and, at most, a few kilobytes, depending on the number of frequency bands and the accuracy desired for reconstruction (e.g., reduced accuracy on mobile devices).
There are several file formats supporting directivity data, like SOFA [1] and OpenDAFF [2]; however, their main goal is to be very flexible interchange formats and to preserve a significant amount of additional metadata, such as how the data was generated and what equipment was used for the measurements. This additional metadata makes it easier to interpret and load the data automatically in research applications, because some file formats allow a large number of heterogeneous data types. Moreover, the spherical grid usually defined is fine or very fine, so that the much simpler approach of closest-neighbor search can be used instead of 2D interpolation.
A system for obtaining more compact representations is therefore pursued.
According to an embodiment, an apparatus for decoding audio values from a bitstream, the audio values being according to different directions, the directions being associated with discrete positions on a unit sphere, the discrete positions on the unit sphere being displaced according to parallel lines from an equatorial line towards a first pole and from the equatorial line towards a second pole, may have: a bitstream reader configured to read prediction residual values from the bitstream; a prediction section configured to obtain the audio values by prediction and from the prediction residual values, the prediction section using a plurality of prediction sequences including: at least one initial prediction sequence, along a line of adjacent discrete positions, predicting audio values based on the audio values of the immediately preceding discrete positions in the same initial prediction sequence; and at least one subsequent prediction sequence, divided among a plurality of subsequences, each subsequence moving along a parallel line and being adjacent to a previously predicted parallel line, and being such that audio values along a parallel line being processed are predicted based on at least: audio values of the adjacent discrete positions in the same subsequence; and interpolated versions of the audio values of the previously predicted adjacent parallel line, each interpolated version of the adjacent previously predicted parallel line having the same number of discrete positions as the parallel line being processed.
According to another embodiment, an apparatus for encoding audio values according to different directions, the directions being associated with discrete positions on a unit sphere, the discrete positions on the unit sphere being displaced according to parallel lines from an equatorial line towards two poles, may have: a predictor block configured to perform a plurality of prediction sequences including: at least one initial prediction sequence, along a line of adjacent discrete positions, by predicting audio values based on the audio values of the immediately preceding discrete positions in the same initial prediction sequence; and at least one subsequent prediction sequence, divided among a plurality of subsequences, each subsequence moving along a parallel line and being adjacent to a previously predicted parallel line, and being such that audio values are predicted based on at least: audio values of the adjacent discrete positions in the same subsequence; and interpolated versions of the audio values of the previously predicted adjacent parallel line, each interpolated version having the same number of discrete positions as the parallel line; a prediction residual generator configured to compare the predicted values with actual audio values to generate prediction residual values; and a bitstream writer configured to write the prediction residual values, or a processed version thereof, into a bitstream.
According to another embodiment, an apparatus for decoding audio metadata from a bitstream, the audio metadata being according to different directions, the directions being associated with discrete positions on a unit sphere, the discrete positions on the unit sphere being displaced according to parallel lines from an equatorial line towards a first pole and from the equatorial line towards a second pole, may have: a bitstream reader configured to read prediction residual values of the encoded audio metadata from the bitstream; a prediction section configured to obtain the audio metadata by prediction and from the prediction residual values of the audio metadata, the prediction section using a plurality of prediction sequences including: at least one initial prediction sequence, along a line of adjacent discrete positions, predicting the audio metadata based on the immediately preceding audio metadata in the same initial prediction sequence; and at least one subsequent prediction sequence, divided among a plurality of subsequences, each subsequence moving along a parallel line and being adjacent to a previously predicted parallel line, and being such that audio metadata along a parallel line being processed are predicted based on at least: audio metadata of the adjacent discrete positions in the same subsequence; and interpolated versions of the audio metadata of the previously predicted adjacent parallel line, each interpolated version of the adjacent previously predicted parallel line having the same number of discrete positions as the parallel line being processed.
According to another embodiment, an audio decoding method for decoding audio values according to different directions, the directions being associated with discrete positions on a unit sphere, the discrete positions on the unit sphere being displaced according to parallel lines from an equatorial line towards a first pole and from the equatorial line towards a second pole, may have the steps of: reading prediction residual values from a bitstream; and obtaining the audio values from the prediction residual values and from predicted values obtained using a plurality of prediction sequences including: at least one initial prediction sequence, along a line of adjacent discrete positions, predicting audio values based on the audio values of the immediately preceding discrete positions in the same initial prediction sequence; and at least one subsequent prediction sequence, divided among a plurality of subsequences, each subsequence moving along a parallel line and being adjacent to a previously predicted parallel line, and being such that audio values along a parallel line being processed are predicted based on at least: the audio values of the adjacent discrete positions in the same subsequence; and interpolated versions of the audio values of the adjacent previously predicted parallel line, each interpolated version of the adjacent previously predicted parallel line having the same number of discrete positions as the parallel line being processed.
Another embodiment may have a non-transitory digital storage medium having a computer program stored thereon to perform the method for decoding audio values according to different directions, when said computer program is run by a computer.
There is proposed an apparatus for decoding an audio signal encoded in a bitstream, the audio signal having different audio values according to different directions, the directions being associated with discrete positions on a unit sphere, the discrete positions on the unit sphere being displaced according to parallel lines from an equatorial line towards a first pole and from the equatorial line towards a second pole, the apparatus comprising:
There is also proposed an apparatus for encoding an audio signal, the audio signal having different audio values according to different directions, the directions being associated with discrete positions on a unit sphere, the discrete positions on the unit sphere being displaced according to parallel lines from an equatorial line towards two poles, the apparatus comprising:
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
Encoder and Encoding Method
The audio signal 102 may be a preprocessed version of an audio signal 101 (e.g. as outputted by a preprocessor 105). The preprocessor 105 may, for example, perform at least one of:
The preprocessor 105 may decompose the audio signal 101 into different frequency bands, so that the preprocessed audio signal 102 includes a plurality of frequency bands (e.g., from a lowest frequency band to a highest frequency band). The operations at the predictor block 110, the prediction residual generator 120 (or, more in general, at the prediction section 110′), and/or the bitstream writer 130 may be repeated for each band.
It will be shown that it is also possible to perform a prediction selection to decide which type (e.g. order) of prediction is to be performed (see below).
A first encoding operation 502 (first stage) may be a sampling operation, according to which a directional signal is obtained. However, the sampling operation 502 does not necessarily have to be performed in the method 500 or by the encoder 100, 100a, 100b; it can be performed, for example, by an external device (and the audio signal 101 may therefore be stored in a storage, or transmitted to the encoder 100, 100a, 100b).
A step 504 comprises a conversion into decibels or another logarithmic scale of the values obtained and/or a decomposition of the audio signal 101 into different frequency bands. The subsequent steps 508-514 may therefore be performed for each band, e.g. in the logarithmic (e.g. decibel) domain.
At step 508, a third stage of differentiating may be performed (e.g., to obtain a differential value for each frequency band). This step may be performed by the differentiation generator 105a, and may be skipped in some examples (e.g. in
At least one of the steps 504 and 508 (second and third stages) may be performed by the preprocessor 105, and may provide, for example, a processed version 102 of the audio signal 101 (the prediction may be performed on the processed version). However, it is not strictly necessary that the steps 504 and 508 are performed by the encoder 100, 100a, 100b, 100d, 100e, 100f: in some examples, the steps 504 and/or 508 may be performed by an external device, and the processed version 102 of the audio signal 101 may be used for the prediction.
At steps 509 and 510, a fourth stage of predicting audio values (e.g., for each frequency band) is performed (e.g. by the predictor block 110). An optional step 509 of selecting the prediction may be performed by simulating different predictions (e.g. different orders of prediction) and deciding to use the prediction which, according to the simulation, provides the best prediction effect. For example, the best prediction effect may be the one which minimizes the prediction residuals and/or the one which minimizes the length of the bitstream 104. At step 510, the prediction is performed (if step 509 has been performed, the prediction is the one chosen at step 509; otherwise, the prediction is predetermined).
At step 512, a prediction residual calculating step may be performed. This can be performed by the prediction residual generator 120 (or, more in general, by the prediction section 110′). For example, the prediction residual 122 between the audio signal 101 (or its processed version 102) and the predicted values 112 may be calculated, to be encoded in the bitstream.
At step 514, a fifth stage of bitstream writing may be performed, for example, by the bitstream writer 130. The bitstream writing 514 may be subjected, for example, to a compression, e.g. by substituting the prediction residuals 122 with codes, so as to minimize the bitlength in the bitstream 104.
Different frequency bands may have the same spatial resolution.
Decoder and Decoding Method
The output of the prediction residual adder 220 may be the decoded values 202. The already decoded values of the audio signal are submitted to a predictor block 210, so that predicted values 212 may be obtained.
In general terms, the predictor 210 and the adder 220 (and integrator block 205a, if provided) are part of a prediction section 210′.
The values 202 may then be subjected to a post-processor 205, e.g. for converting from the logarithmic (decibel) domain to the linear domain and/or for recomposing the different frequency bands.
Different frequency bands may have the same spatial resolution.
Coordinates in the Unit Sphere
In examples, the coordinates may be expressed, instead of angles, in terms of indexes, such as:
Preprocessing and Differentiating at the Encoder
Some preprocessing (e.g. 504) and differentiating (e.g. 508) may be performed on the audio signal 101, to obtain a processed version 102, e.g. through the preprocessor 105, and/or to obtain a differentiation residual version 105a′, e.g. through the differentiation residual generator 105a.
For example, the audio signal 101 may be decomposed (at 504) into the different frequency bands. Each prediction process (e.g. at 510) may subsequently be performed for a specific frequency band. Therefore, the encoded bitstream 104 may have, encoded therein, different prediction residuals for different frequency bands. Hence, in some examples, the discussion below regarding the predictions (prediction sequences, prediction subsequences, unit sphere, and so on) is valid for each frequency band, and may be repeated for the other frequency bands. Further, the audio values may be converted (e.g. at 504) onto a logarithmic scale, such as the decibel domain. It is possible to select the quantization step, e.g. between a fine step (0.25 dB) and a coarse step (6 dB).
The audio values along the different positions of the unit sphere 1 may be subjected to differentiation. For example, a differential audio value 105a′ at a particular discrete position of the unit sphere 1 may be obtained by subtracting, from the audio value at the particular discrete position, the audio value of an adjacent discrete position (which may be an already differentiated discrete position). A predetermined path may be followed for differentiating the different audio values. For example, a particular first point may not be provided differentially (e.g., the south pole), while all the remaining differentiations may be performed along a predefined path. In examples, sequences may be defined which may be the same sequences as for the prediction. In some examples, it is possible to separate the audio signal into different frequency bands, and to perform a prediction for each frequency band.
It is to be noted that the predictor block 110 generally receives as input the preprocessed audio signal 102, and not the differentiation residual 105a′. Subsequently, the prediction residual generator 120 will generate the prediction residual values 122.
The techniques above may be combined with each other. For example, the values for a first frequency band (e.g., the lowest frequency band) may be obtained by differentiating between adjacent discrete positions within the same frequency band, while for the remaining frequency bands (e.g., higher frequencies) it is possible to perform the differentiation relative to the immediately preceding adjacent frequency band.
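By way of non-normative illustration, the combined scheme may be sketched as follows (the array layout and the function name are hypothetical; the actual codec operates on quantized dB values arranged along the predefined path):

import numpy as np

def differentiate(values):
    # values: array of shape (num_bands, num_points); the points follow
    # the predefined path over the unit sphere, starting at the south pole.
    res = np.empty_like(values)
    # Lowest band: spatial differentiation along the path; the first
    # point (e.g. the south pole) is kept non-differential.
    res[0, 0] = values[0, 0]
    res[0, 1:] = values[0, 1:] - values[0, :-1]
    # Higher bands: differentiation relative to the preceding band.
    res[1:, :] = values[1:, :] - values[:-1, :]
    return res

The decoder inverts the scheme in the same order: the lowest band is recovered by a cumulative sum along the path, and each higher band by adding its residual to the previously reconstructed band.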
Prediction at the Encoder and at the Decoder
A description is now provided of the prediction as carried out at the predictor block 110 of the encoder and at the predictor block 210 of the decoder, or as carried out at step 510.
It is noted that, when the prediction is performed at the encoder, the input is the preprocessed audio signal 102.
A prediction of the audio values along the entire unit sphere 1 may be performed according to a plurality of prediction sequences. In examples, there may be performed at least one initial prediction sequence and at least one subsequent prediction sequence. The at least one initial prediction sequence (which can be embodied by two initial prediction sequences 10, 20) may extend along a line (e.g. a meridian) of adjacent discrete positions, by predicting audio values based on the audio values of the immediately preceding discrete positions in the same initial prediction sequence. For example, there may be at least a first sequence 10 (which may be a meridian initial prediction sequence) which extends from the south pole 2 towards the north pole 4, along the at least one meridian. Prediction values may therefore be propagated along the reference meridian line (azimuth=0°). It will be shown that, at the south pole 2 (starting position of the first sequence), a non-predicted value may be inserted, but the subsequent prediction values are propagated through the meridian towards the north pole 4.
A second initial prediction sequence 20 may be defined along the equatorial line. Here, the line of adjacent discrete positions is formed by the equatorial line (equatorial circumference), and the audio values are predicted according to a predefined circumferential direction, e.g., from the minimum positive azimuth (closest to 0°) towards the maximum azimuth (closest to 360°). Notably, the second sequence 20 starts with a value at the intersection of the predicted meridian line (predicted in the first sequence 10) and the equatorial line. That position is the starting position 20a of the second sequence 20 (and may be the value with azimuth 0° and elevation 0°). After the second prediction sequence 20, therefore, at least one discrete position on the at least one meridian line (e.g. the reference meridian) and at least one discrete position on each parallel line have been predicted.
At least one subsequent prediction sequence 30 may include, for example, a third sequence 30 for predicting discrete positions in the northern hemisphere, between the equatorial line and the north pole 4. A fourth sequence 40 may predict positions in the southern hemisphere, between the equatorial line and the south pole 2 (the positions already predicted during the initial sequences 10 and 20 are generally not predicted again in the subsequent prediction sequences 30, 40).
Each of the subsequent prediction sequences (third prediction sequence 30, fourth prediction sequence 40) may be in turn subdivided into a plurality of subsequences. Each subsequence may move along one parallel line adjacent to a previously predicted parallel line. For example,
Since the equatorial line (circumference) is longer than the parallel line on which the first subsequence 31 is processed, there is not an exact correspondence between the discrete positions of the parallel line in which the first subsequence 31 is carried out and the discrete positions of the equatorial line (i.e. the discrete positions of the equatorial line and of the parallel line are misaligned with each other). However, it has been understood that it is possible to interpolate the audio values of the equatorial line to reach an interpolated version of the equatorial line with the same number of discrete positions as the parallel line.
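For example, assuming simple linear interpolation with wrap-around at 360° (the exact interpolation rule is implementation-dependent and an assumption here), resampling a previously predicted circular line onto the number of positions of the current parallel line may be sketched as follows:

import numpy as np

def interpolate_line(prev_values, n_cur):
    # Resample a previously predicted circular line (e.g. the equator)
    # so that it has n_cur evenly spaced positions, azimuth 0 at index 0.
    n_prev = len(prev_values)
    out = np.empty(n_cur)
    for j in range(n_cur):
        x = j * n_prev / n_cur          # fractional index on the source line
        i0 = int(x) % n_prev
        i1 = (i0 + 1) % n_prev          # wrap around at 360 degrees
        f = x - int(x)
        out[j] = (1.0 - f) * prev_values[i0] + f * prev_values[i1]
    return out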
The same is repeated, parallel line by parallel line, for the remaining subsequences of the same hemisphere. In some examples:
While the third sequence 30 moves from the equatorial line towards the north pole 4, propagating audio values in the northern hemisphere, the fourth sequence 40 moves from the equatorial line towards the south pole 2, propagating audio values in the southern hemisphere. Apart from that, the third and fourth sequences 30 and 40 are analogous to each other.
Different orders of prediction may be defined.
An example is provided in
Let us now examine the first and second sequences 10 and 20 according to the second order, illustrated in section b) of
Let us now examine the third and fourth sequences 30 and 40 in
For example, at least one of the following pre-defined orders may be defined (the symbols and reference numerals are completely generic, only for the sake of understanding):
Even if reference has been made to subsequence 32, this holds in general for the third sequence 30 and the fourth sequence 40.
The type of ordering may be signalled in the bitstream 104. The decoder will adopt the same prediction signalled in the bitstream.
The prediction orders discussed below may be selectively chosen (e.g., by block 109a and/or at step 509) for each prediction sequence (e.g. one selection for the initial prediction sequences 10 and 20, and one selection for the subsequent prediction sequences 30 and 40).
For example, it may be signalled that the first and second initial sequences 10 and 20 are to be performed with order 1 or with order 2, and it may be signalled that the third and fourth sequences 30 and 40 are to be performed with an order selected among 1, 2, 3, and 4. The decoder will read the signalling and will perform the prediction according to the selected order(s). It is noted that the orders 1 and 2 (
Basically, the encoder may select (e.g., at block 109a and/or at step 509), e.g. based on simulations, to perform the at least one subsequent prediction sequence (30, 40) by moving along the parallel line adjacent to a previously predicted parallel line, such that audio values along a parallel line being processed are predicted based only on audio values of the adjacent discrete positions in the same subsequence (31, 32, 33). The decoder will follow the encoder's selection based on the signalling in the bitstream 104, and will perform the prediction as requested, e.g. according to the selected order.
It is noted that, after the prediction carried out by the predictor block 210, the predicted values 212 may be added (at adder 220) with the prediction residual values 222, so as to obtain signal 202.
With reference to the decoder 200 or 200a, a prediction section 210′ may be considered to include the predictor 210 and an adder 220, so as to add the residual value (or the integrated signal 105a′ generated by the integrator 205a) to the predicted value 212. The obtained value may then be postprocessed.
With reference to the above, it is noted that the first sequence 10 may start (e.g. at the south pole) with a value obtained from the bitstream (e.g. the value at the south pole). In the encoder and/or in the decoder, this value may be non-residual.
Residual Generator and Bitstream Writer at the Encoder
A bitstream writer may write the prediction residual values 122 onto the bitstream 104. The bitstream writer may, in some cases, encode the bitstream 104 by using a single-stage encoding. In examples, more frequent predicted audio values (e.g. 112), or processed versions thereof (e.g. 122), are associated with codes of shorter length than the less frequent predicted audio values, or processed versions thereof.
In some cases, it is possible to perform a two-stage encoding.
Bitstream Reader at the Decoder
The reading to be performed by the bitstream reader 230 substantially follows the rules described for encoding the bitstream 104, which are therefore not repeated in detail.
The bitstream reader 230 may, in some cases, read the bitstream 104 using a single-stage decoding. In examples, more frequent predicted audio values (e.g. 112), or processed versions thereof (e.g. 122), are associated with codes of shorter length than the less frequent predicted audio values, or processed versions thereof.
In some cases, it is possible to perform a two-stage decoding.
Postprocessing and Rendering at the Decoder
Some postprocessing may be performed on the audio signal 202 to obtain a processed version 201 of the audio signal to be rendered. A postprocessor 205 may be used. For example, the audio signal 201 may be obtained by recomposing the different frequency bands.
Further, the audio values may be reconverted from the logarithmic scale (e.g., the decibel domain) to the linear domain.
The audio values along the different positions of the unit sphere 1 (which may be defined as differential values) may be recomposed, e.g. by adding the value of the immediately preceding adjacent discrete position (apart from a first value, e.g. at the south pole, which may be non-differential). A predefined ordering is defined, which is the same as the one used by the preprocessor 105 of the encoder 100 (the ordering may be the same as the one used for predicting: at first the first sequence 10, then the second sequence 20, then the third sequence 30, and finally the fourth sequence 40).
Example of Decoding
It is here discussed in concrete terms how to carry out the present examples, in particular from the point of view of the decoder 200.
Directivity is used to auralize the Directivity property of Audio Elements. To do this, the Directivity tool comprises two components: the coding of the Directivity data and the rendering of the Directivity data. The Directivity is represented as a number of Covers, where each Cover is arithmetically coded. The rendering of the Directivity is done by checking which render items (RIs) use Directivity, taking the filter gain coefficients from the Directivity, and applying an equalizer (EQ) to the metadata of the RI.
Here below, when it is referred to “points”, it is referred to the “discrete positions” defined above.
Data Elements and Variables:
Decoding Process
Once the directivity payload is received by the renderer, before the Directivity Stage initialization, the decoding process begins. Each Cover has an associated frequency; direcFreqQuantType indicates how the frequency is decoded, i.e. it determines the width of the frequency band, which is done in readQuantFreq( ). The variable dbStep determines the quantization step size for the gain coefficients; its value lies within a range between 0.5 and 3.0 with increments of 0.5. intPer90 is the number of azimuth points around a quadrant of the equator and is the key variable used for the Sphere Grid generation (this integer also determines the number of elevation points on the Cover). direcUseRawBaseline determines which of the two decoding modes is chosen for the gain coefficients: the available decoding modes are the "Baseline Mode" and the "Optimized Mode". The baseline mode simply codes each decibel index arithmetically using a uniform probability distribution, whereas the optimized mode uses residual compression in conjunction with an adaptive probability estimator alongside five different prediction orders. Finally, after the completion of decoding, the directivities are passed to the Scene State, where other Scene Objects can refer to them.
Sphere Grid Generation
The Sphere Grid determines the spatial resolution of a Cover, which may differ across Covers. The Sphere Grid of a Cover has a number of different points. Along the equator, there are at least 4 points, possibly more depending on the intPer90 value. At the north and south poles, there is exactly one point each. At the other elevations, the number of points is equal to or less than the number of points along the equator, decreasing as the elevation approaches the poles. On each elevation layer, the first azimuth point is at 0°, creating a line of evenly spaced points from the south pole, through the equator, to the north pole. This property is not guaranteed for the rest of the azimuth points across different elevations. The following is a description in pseudocode format:
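(The normative pseudocode is not reproduced in this text.) As a non-normative substitute, the following sketch reconstructs the grid from the properties listed above; the per-elevation rounding rule is an assumption:

import math

def generate_sphere_grid(int_per_90):
    # Returns a list of (elevation_deg, [azimuth_deg, ...]) layers from
    # the south pole (-90) to the north pole (+90); int_per_90 corresponds
    # to the intPer90 variable above.
    deg_step = 90.0 / int_per_90
    layers = []
    for k in range(-int_per_90, int_per_90 + 1):
        elev = k * deg_step
        if abs(k) == int_per_90:
            n_az = 1  # exactly one point at each pole
        else:
            # fewer points towards the poles; exact rounding is an assumption
            n_az = max(1, round(4 * int_per_90 * math.cos(math.radians(elev))))
        layers.append((elev, [360.0 * j / n_az for j in range(n_az)]))
    return layers

For intPer90 = 1 this yields the coarsest grid: one point at each pole and 4 points on the equator, with azimuth 0° present on every layer.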
Baseline Mode
The baseline mode uses a range decoder with a uniform probability distribution to decode quantized decibel values. The maximum and minimum possible values that can be stored (i.e., maxPosVal, minPosVal) are 127.0 and −128.0, respectively. The alphabet size can be found using dbStep and the actual maximum and minimum possible values (maxVal, minVal). After decoding the decibel index, a simple rescaling is done to find the actual dB value. This can be seen in Table.
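As a non-normative sketch of the rescaling (names mirror the description above; the exact formula is an assumption):

def decode_baseline_value(index, min_val, db_step):
    # The uniform range decoder draws `index` from an alphabet of size
    # int((max_val - min_val) / db_step) + 1; the dB value is then
    # recovered by a simple rescaling.
    return min_val + index * db_step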
Optimized Mode
The optimized mode decoding uses a sequential prediction scheme, which traverses the Cover in a special order. This scheme is determined by predictionOrder, whose value can be an integer between 1 and 5 inclusive. predictionOrder dictates which linear prediction order (1 or 2) to use: when predictionOrder==1 || predictionOrder==3, the linear prediction order is 1, and when predictionOrder==2 || predictionOrder==4, the linear prediction order is 2. The traversal is composed of four different sequences:
The first sequence goes vertically, from the value at the South Pole to the North Pole, all with azimuth 0. The first value of the sequence (coverResiduals[0][0]), at the South Pole, is not predicted. This value serves as the basis from which the rest of the values are predicted.
This prediction uses either a linear prediction of order 1 or 2. A prediction order of 1 uses the previous elevation value, whereas a prediction order of 2 uses the two previous elevation values as a basis for prediction.
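A minimal sketch of the two fixed predictors follows (the order-2 coefficients 2 and −1 correspond to linear extrapolation and are an assumption here, not normative):

def predict(prev, prev2, order):
    # Order 1 repeats the previous value; order 2 extrapolates linearly
    # from the two previous values.
    if order == 1 or prev2 is None:
        return prev
    return 2.0 * prev - prev2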
The second sequence goes horizontally, at the equator, from the value next to the one at azimuth 0 degrees (which was already predicted during the first sequence), until the value previous to it at azimuth close to 360 degrees. The values are predicted from previous values, also using linear prediction of order 1 or 2. Similarly to the first sequence, a prediction order of 1 uses the previous azimuth value, whereas a prediction order of 2 uses the previous two azimuth values as a basis for prediction.
The third sequence goes horizontally, in order for each elevation, starting from the one next to the equator towards the North Pole until the one previous to the North Pole. Each horizontal subsequence starts from the value next to the one at azimuth 0 degrees (which was already predicted during the first sequence), until the value previous to it at azimuth close to 360 degrees. When (predictionOrder==1 || predictionOrder==2 || predictionOrder==3 || predictionOrder==4), the values are predicted from previous values using either linear prediction of order 1 or 2, as explained above. Furthermore, when (predictionOrder==3 || predictionOrder==4), in addition to the previous values on the current Cover, the values from the previously predicted elevation are also used. Since the number of points ne upon the Sphere Grid at the current elevation generally differs from the number of points at the previously predicted elevation, the values of the previously predicted elevation are linearly interpolated so as to match the number of points of the current elevation.
The fourth sequence also goes horizontally, in order for each elevation, exactly like the third sequence, however starting from the one next to the equator towards the South Pole until the one previous to the South Pole.
The following pseudocode describes the aforementioned algorithm:
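(The normative pseudocode is not reproduced in this text.) As a non-normative substitute, the following sketch illustrates the traversal, reusing the predict and interpolate_line helpers sketched earlier; the grid layout, the handling of prediction order 5, and the combination rule of the special prediction mode are assumptions:

def lin_order(prediction_order):
    # Orders 1 and 3 use first-order linear prediction; 2 and 4 use
    # second-order (order 5 is not detailed in this sketch).
    return 1 if prediction_order in (1, 3) else 2

def decode_line(vals, res, prev_line, prediction_order):
    # Decode one horizontal (sub)sequence; vals[0] is already decoded.
    order = lin_order(prediction_order)
    for a in range(1, len(vals)):
        p = predict(vals[a - 1], vals[a - 2] if a >= 2 else None, order)
        if prev_line is not None:
            # special mode (orders 3/4): also use the interpolated previous
            # elevation; the exact combination rule is an assumption here
            p = 0.5 * (p + prev_line[a])
        vals[a] = p + res[a]

def decode_cover(residuals, grid, prediction_order):
    # Four-sequence traversal of one Cover (decoder side). `grid` is a
    # list of (elevation, azimuth list) layers from the south pole to the
    # north pole; `residuals` mirrors that layout.
    n_elev = len(grid)
    eq = n_elev // 2                         # index of the equator layer
    values = [[0.0] * len(az) for _, az in grid]

    # Sequence 1: south pole -> north pole, all at azimuth 0 (index 0).
    values[0][0] = residuals[0][0]           # first value is not predicted
    for e in range(1, n_elev):
        p = predict(values[e - 1][0],
                    values[e - 2][0] if e >= 2 else None,
                    lin_order(prediction_order))
        values[e][0] = p + residuals[e][0]

    # Sequence 2: along the equator.
    decode_line(values[eq], residuals[eq], None, prediction_order)

    # Sequence 3 (towards the north pole) and 4 (towards the south pole).
    for e in list(range(eq + 1, n_elev - 1)) + list(range(eq - 1, 0, -1)):
        ref = e - 1 if e > eq else e + 1     # previously predicted parallel
        prev_line = (interpolate_line(values[ref], len(values[e]))
                     if prediction_order in (3, 4) else None)
        decode_line(values[e], residuals[e], prev_line, prediction_order)
    return values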
Stage Description
The stage iterates over all RIs in the update thread, checks whether Directivity can be applied and, if so, takes the relative position between the Listener and the RI and queries the Directivity for filter coefficients. Finally, the stage applies these filter gain coefficients to the central EQ metadata field of the RI, to be finally auralized in the EQ stage.
Update Thread Processing
Directivity is applied to all RIs with a value of true in the data elements objectSourceHasDirectivity and loudspeakerHasDirectivity (and to secondary RIs derived from such RIs in the Early Reflections and Diffraction stages) by using the central EQ metadata field that accumulates all EQ effects before they are applied to the audio signals by the EQ stage. The listener's relative position in polar coordinates to the RI is needed to query the Directivity. This can be obtained, e.g., using Cartesian-to-polar coordinate conversion, homogeneous matrix transforms, or quaternions. In the case of secondary RIs, their relative position with respect to their parents must be used to correctly auralize the Directivity. For consistent frequency resolution, the directivity data is linearly interpolated to match the EQ bands of the metadata field, which can differ from the bitstream representation, depending on the bitstream compression configuration. For each frequency band, directiveness (available from objectSourceDirectiveness or loudspeakerDirectiveness) is applied according to the formula Ceq = exp(d·log m), where d is the directiveness value, m is the interpolated magnitude derived from the Covers adjacent to the requested frequency band, and Ceq is the coefficient used for the EQ.
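As a small non-normative illustration of this formula:

import math

def directiveness_eq_gain(d, m):
    # Ceq = exp(d * log(m)) = m ** d: d is the directiveness value, m the
    # magnitude interpolated from the Covers adjacent to the requested band.
    return math.exp(d * math.log(m))

With d = 1 the directivity magnitude is applied unchanged (Ceq = m), while d = 0 yields Ceq = 1, i.e. an omnidirectional response.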
Audio Thread Processing
The directivity stage has no additional processing in the audio thread. The application of the filter coefficients is done in the EQ stage.
A Bitstream Syntax
In environments that need byte alignment, MPEG-I Immersive Audio configuration elements or payload elements that are not an integer number of bytes in length are padded at the end to achieve an integer byte count. This is indicated by the function ByteAlign( ).
Renderer Payloads Syntax (to be Inserted in the Bitstream 104)
The new approach is composed of five main stages. The first stage generates a quasi-uniform covering of the unit sphere, using an encoder selectable density. The second stage converts the values to the dB scale and quantizes them, using an encoder selectable precision. The third stage is used to remove possible redundancy between consecutive frequencies, by converting the values to differences relative to the previous frequency, useful especially at lower frequencies and when using relatively coarse sphere covering. The fourth stage is a sequential prediction scheme, which traverses the sphere covering in a special order. The fifth stage is entropy coding of the prediction residuals, using an adaptive estimator of its distribution and optimally coding it using a range encoder.
A first stage of the new approach may be to sample quasi-uniformly the unit sphere 1 using a number of points (discrete positions), using further interpolation over the fine or very fine spherical grid available in the directivity file. The quasi-uniform sphere covering, using an encoder-selectable density, has a number of desirable properties: elevation 0 (the equator) is present; at every elevation level present there is a sphere point at azimuth 0; and both determining the closest sphere point and performing bilinear interpolation can be done in constant time for a given arbitrary elevation and azimuth. The parameter controlling the density of the sphere covering is the angle between two consecutive points on the equator, the degree step. Because of the constraints implied by the desirable properties, the degree step must be a divisor of 90 degrees. The coarsest sphere covering, with a degree step of 90 degrees, corresponds to a total of 6 sphere points: 2 points at the poles and 4 points on the equator. On the other end, a degree step of 2 degrees corresponds to a total of 10318 sphere points, with 180 points on the equator. This sphere covering is very similar to the one used for the quantization of azimuth and elevation for DirAC direction metadata in IVAS, except that it is less constrained. In comparison, there is no requirement that the number of points at every elevation level other than the equator is a multiple of 4, which was chosen in DirAC in order to ensure that there are sphere points at azimuths of 90, 180, and 270 degrees.
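The total number of points for a given degree step may be sketched as follows (the per-level rounding rule is an assumption; it reproduces the 6-point covering exactly and comes close to the 10318 points quoted for a 2-degree step):

import math

def covering_size(deg_step):
    # deg_step must be a divisor of 90 (in degrees).
    assert 90 % deg_step == 0
    n_equator = 360 // deg_step
    total = 2 + n_equator                       # the two poles + the equator
    for elev in range(deg_step, 90, deg_step):  # levels of one hemisphere
        total += 2 * max(1, round(n_equator * math.cos(math.radians(elev))))
    return total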
A second stage may convert the linear-domain values, which are positive but not limited to a maximum value of 1, into the dB domain. Depending on the normalization convention chosen for the directivity (i.e., an average value of 1 on the sphere, a value of 1 on the equator at azimuth 0, etc.), values can be larger than 1. The quantization is done linearly in the dB domain using an encoder-selectable precision, typically using a quantization step size from very fine at 0.25 dB to very coarse at 6 dB.
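A minimal sketch of this stage, assuming the usual 20·log10 magnitude convention for the dB conversion (an assumption) and rounding to the nearest step:

import math

def quantize_db(linear_value, step_db):
    # linear_value > 0; returns the quantization index.
    value_db = 20.0 * math.log10(linear_value)
    return round(value_db / step_db)

def dequantize_db(index, step_db):
    # Back from the index to the linear domain.
    return 10.0 ** (index * step_db / 20.0)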
A third stage (differentiation) may be used to remove possible redundancy between consecutive frequencies. This is done by converting the values on the sphere covering for the current frequency to differences relative to values on the sphere covering of the previous frequency. This approach is especially advantageous at lower frequencies, where the variations across frequency for a given elevation and azimuth tend to be smaller than at high frequencies. Additionally, when using quite coarse sphere coverings, e.g., with a degree step of 22.5 degrees or more, there is less correlation available between neighboring consecutive sphere points, when compared to the correlation across consecutive frequencies.
A fourth stage is a sequential prediction scheme, which traverses the sphere covering for one frequency in a special order. This order was chosen to increase the predictability of the values, based on the neighborhood of previously predicted values. It is composed of 4 different sequences 10, 20, 30, 40. The first sequence 10 goes vertically, e.g. from the value at the South Pole to the North Pole, all with azimuth 0°. The first value of the sequence, at the South Pole 2 is not predicted, and the rest are predicted from the previous values using linear prediction of order 1 or 2. The second sequence 20 goes horizontally, at the equator, from the value next to the one at azimuth 0 degrees (which was already predicted during the first sequence), until the value previous to it at azimuth close to 360 degrees. The values are predicted from previous values also using linear prediction of order 1 or 2. One option is to use fixed linear prediction coefficients, with the encoder selecting the best prediction order, the one producing the smallest entropy of the prediction error (prediction residual).
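This selection may be sketched as follows (non-normative; predict is the helper sketched earlier, and the empirical zero-order entropy of the residuals is used as the criterion):

import math
from collections import Counter

def residual_entropy(residuals):
    # Empirical zero-order entropy (bits/symbol) of a residual sequence.
    counts = Counter(residuals)
    n = len(residuals)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def select_best_order(values, candidate_orders=(1, 2)):
    # Pick the order whose residuals have the smallest empirical entropy.
    best_order, best_h = None, float("inf")
    for order in candidate_orders:
        res = [values[i] - predict(values[i - 1],
                                   values[i - 2] if i >= 2 else None, order)
               for i in range(1, len(values))]
        h = residual_entropy(res)
        if h < best_h:
            best_order, best_h = order, h
    return best_order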
The third sequence 30 goes horizontally, in order for each elevation, starting from the one next to the equator towards the North Pole until the one previous to the North Pole. Each horizontal subsequence starts from the value next to the one at azimuth 0 degrees (which was already predicted during the first sequence), until the value previous to it at azimuth close to 360 degrees. The values are predicted from previous values using either linear prediction of order 1 or 2, or a special prediction mode using also the values available at the previously predicted elevation. Because the number of points ne at the current elevation generally differs from the number of points at the previously predicted elevation, the values of the previously predicted elevation are linearly interpolated to obtain the same number of points as at the current elevation before being used for the special prediction mode.
The fourth sequence 40 also goes horizontally, in order for each elevation, exactly like the third sequence 30, however starting from the one next to the equator towards the South Pole 2 until the one previous to the South Pole 2. For the third and fourth sequences 30 and 40, the encoder 100 may select the best prediction mode among order 1 prediction, order 2 prediction, and special prediction, the one producing the smallest entropy of the prediction error (prediction residual).
It is to be mentioned here that all alternatives or aspects as discussed before and all aspects as defined by independent claims in the following claims can be used individually, i.e., without any other alternative or object than the contemplated alternative, object or independent claim. However, in other embodiments, two or more of the alternatives or the aspects or the independent claims can be combined with each other and, in other embodiments, all aspects, or alternatives, and all independent claims can be combined with each other.
An inventively encoded signal can be stored on a digital storage medium or a non-transitory storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier or a non-transitory storage medium.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
While this invention has been described in terms of several advantageous embodiments, there are alterations, permutations, and equivalents, which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
This application is a continuation of copending International Application No. PCT/EP2022/064343, filed May 25, 2022, which is incorporated herein by reference in its entirety, and additionally claims priority from European Application No. 21176342.0, filed May 27, 2021, which is also incorporated herein by reference in its entirety. There are here disclosed apparatuses and methods for encoding and decoding audio signals having directivity.
Related U.S. application data: parent application PCT/EP2022/064343, filed May 2022; child application U.S. Ser. No. 18/519,335.