This disclosure relates to processing of media data, such as audio data.
The evolution of surround sound has made available many output formats for entertainment. Examples of such consumer surround sound formats are mostly ‘channel’ based in that they implicitly specify feeds to loudspeakers in certain geometrical coordinates. The consumer surround sound formats include the popular 5.1 format (which includes the following six channels: front left (FL), front right (FR), center or front center, back left or surround left, back right or surround right, and low frequency effects (LFE)), the growing 7.1 format, and various formats that include height speakers, such as the 7.1.4 format and the 22.2 format (e.g., for use with the Ultra High Definition Television standard). Non-consumer formats can span any number of speakers (in symmetric and non-symmetric geometries) often termed ‘surround arrays’. One example of such an array includes 32 loudspeakers positioned on coordinates on the corners of a truncated icosahedron.
This disclosure describes techniques for new object metadata to represent more precise beam patterns using object-based audio.
According to one example, a device configured for processing coded audio includes a memory configured to store an audio object and audio object metadata associated with the audio object, wherein the audio object meta data comprises frequency dependent beam pattern metadata, and one or more processors electronically coupled to the memory, the one or more processors configured to apply, based on the frequency dependent beam pattern metadata, a renderer to the audio object to obtain one or more first speaker feeds and output the one or more speaker feeds.
According to another example, a method for processing coded audio includes storing an audio object and audio object metadata associated with the audio object, wherein the audio object meta data comprises frequency dependent beam pattern metadata; applying, based on the frequency dependent beam pattern metadata, a renderer to the audio object to obtain one or more first speaker feeds; and outputting the one or more speaker feeds.
According to another example, a computer-readable storage medium stores instructions that when executed by one or more processors cause the one or more processors to store an audio object and audio object metadata associated with the audio object, wherein the audio object meta data comprises frequency dependent beam pattern metadata; apply, based on the frequency dependent beam pattern metadata, a renderer to the audio object to obtain one or more first speaker feeds; and output the one or more speaker feeds.
According to another example, an apparatus for processing coded audio includes means for storing an audio object and audio object metadata associated with the audio object, wherein the audio object meta data comprises frequency dependent beam pattern metadata; means for applying, based on the frequency dependent beam pattern metadata, a renderer to the audio object to obtain one or more first speaker feeds; and means for outputting the one or more speaker feeds.
The details of one or more examples of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.
Audio encoders may receive input in one of three possible formats: (i) traditional channel-based audio (as discussed above), which is meant to be played through loudspeakers at pre-specified positions; (ii) object-based audio, which involves discrete pulse-code-modulation (PCM) data for single audio objects with associated metadata containing their location coordinates (amongst other information); and (iii) scene-based audio, which involves representing the soundfield using coefficients of spherical harmonic basis functions (also called “spherical harmonic coefficients” or SHC, “Higher-order Ambisonics” or HOA, and “HOA coefficients”).
This disclosure describes techniques for new object metadata to represent more precise beam patterns using object-based audio. More specifically, a common set of metadata for object-based audio data includes azimuth, elevation, distance, gain, and diffuseness, and this disclosure introduces weighting values that may enable the rendering of more precise beam patterns. Each beam pattern (whether frequency dependent or not) may use a set of weighting values and a set of metadata. For example, if N=3, 3 weighting values and 3 sets of {azimuth, elevation, distance, gain, and diffuseness} metadata can be used to generate a beam pattern B. This B can be used to locate an audio object.
3D Audio has three audio elements, typically referred to as channel-, object-, and scene-based audio. The object-based audio is described with audio and associated metadata. A common set of metadata includes azimuth, elevation, distance, gain, and diffuseness. This disclosure introduces new object metadata to describe more precise beam patterns. More specifically, according to one example, the proposed object audio metadata includes weighting values, in addition to set(s) of azimuth, elevation, distance, gain, and diffuseness, with the weighting values enabling a content consumer device to model complex beam patterns (as shown in the examples of
According to one example of the techniques of this disclosure, for a given object audio signal, a content consumer device can model a beam pattern with a weighted summation of multiple single-directional beams, according to equation (1A):
$$\hat{B} = \sum_{i=1}^{N} \omega_i \, B(\theta_i, \varphi_i) \qquad (1A)$$
Equation (1A) can be used for each frequency band. If there are two bands, for example, then 2×N weighting values and 2×N sets of {azimuth, elevation, distance, gain, and diffuseness} metadata may be used. An audio object may be bandpass filtered into A_1st_band and A_2nd_band. A_1st_band is rendered with the first set of weighting values and the first set of metadata. A_2nd_band is rendered with the second set of weighting values and the second set of metadata. The final output is the sum of the two renderings.
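The two-band case can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the one-pole low-pass split, the function names, and the per-band speaker gains are all hypothetical. The gains stand in for whatever a renderer derives from each band's weighting values and metadata.

```python
from typing import List, Sequence, Tuple


def split_two_bands(signal: Sequence[float],
                    alpha: float = 0.2) -> Tuple[List[float], List[float]]:
    """Split a signal into a low band and a complementary high band.

    Assumption: a one-pole low-pass stands in for the bandpass filters.
    The high band is the residual, so low + high reproduces the input.
    """
    low, high, state = [], [], 0.0
    for x in signal:
        state += alpha * (x - state)  # one-pole low-pass
        low.append(state)
        high.append(x - state)        # complementary residual
    return low, high


def render_band(band: Sequence[float],
                speaker_gains: Sequence[float]) -> List[List[float]]:
    """Hypothetical per-band renderer: one gain per loudspeaker."""
    return [[g * x for x in band] for g in speaker_gains]


def render_two_bands(signal: Sequence[float],
                     gains_band1: Sequence[float],
                     gains_band2: Sequence[float]) -> List[List[float]]:
    """Render each band with its own gains and sum the two renderings."""
    a_1st_band, a_2nd_band = split_two_bands(signal)
    out1 = render_band(a_1st_band, gains_band1)
    out2 = render_band(a_2nd_band, gains_band2)
    return [[p + q for p, q in zip(f1, f2)] for f1, f2 in zip(out1, out2)]
```

When both bands use the same gains, the summed output reduces to the full-band signal scaled by those gains, which is a convenient sanity check for any band-splitting scheme used here.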
Thus, equation 1A can be extended to multiple audio objects to describe a single audio scene, using equation (1B).
$$B_k^{m} = \sum_{i=1}^{N} \omega_k^{m,i} \, B(\Lambda_k^{m,i}) \qquad (1B)$$
where i = 1 to N, with N corresponding to the number of weightings and metadata sets; m = 1 to M, with M corresponding to the number of frequency bands; and k = 1 to K, with K corresponding to the number of audio objects.
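As an illustration of the weighted summation, the sketch below evaluates a beam pattern for each audio object and frequency band in a given direction. The shape of the single-directional beam B is not specified in this description, so the code assumes a cardioid-like beam purely for demonstration. The function names, the nested-list layout of the scene, and the angle convention (elevation measured from the horizontal plane) are likewise assumptions.

```python
import math
from typing import List, Sequence, Tuple

Direction = Tuple[float, float]  # (azimuth_deg, elevation_deg)


def unit_vector(azimuth_deg: float, elevation_deg: float) -> Tuple[float, float, float]:
    """Cartesian unit vector; elevation is measured from the horizontal plane."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))


def basis_beam(look: Direction, direction: Direction) -> float:
    """Assumed single-directional beam B: a cardioid aimed at `look`."""
    u, v = unit_vector(*look), unit_vector(*direction)
    cos_angle = sum(a * b for a, b in zip(u, v))
    return 0.5 * (1.0 + cos_angle)


def beam_pattern(weights: Sequence[float],
                 looks: Sequence[Direction],
                 direction: Direction) -> float:
    """Weighted sum of N single-directional beams for one object and one band."""
    return sum(w * basis_beam(look, direction) for w, look in zip(weights, looks))


def scene_patterns(scene, direction: Direction) -> List[List[float]]:
    """scene[k][m] is a (weights, looks) pair for object k and band m."""
    return [[beam_pattern(w, looks, direction) for (w, looks) in bands]
            for bands in scene]
```

For instance, a single beam with weight 1 aimed at azimuth 0 evaluates to 1 in its look direction and to 0 directly behind it, while two half-weighted beams aimed at azimuths +90 and -90 degrees evaluate to 0.5 straight ahead.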
The content consumer device can perform rendering, using for example VBAP (described in more detail below). The content consumer device can receive an input audio S, N weightings, and N sets of metadata, with each set including some or all of azimuth, elevation, distance, gain, and diffuseness. For i=1:N, the content consumer device can obtain weighted audio according to equation (2) below:
$$WS_i = w_i \, S \qquad (2)$$
The content consumer device can render WSi using VBAP using an i-th set of azimuth, elevation, distance, gain, diffuseness. The content consumer device may also render WSi using another object renderer, such as SPH or a beam pattern codebook. The content consumer device can provide speaker feeds LSin(i,j) where j is the speaker index, by calculating the j-th speaker contribution according to equation (3):
$$LS_{out}(j) = \sum_{i=1}^{N} LS_{in}(i, j) \qquad (3)$$
In some implementations, in order to reduce complexity, the weighted audio (WSi) may be obtained by calculating the contributions of each loudspeaker. As the same audio source S may be panned with N metadata, for each speaker, the contributions from N metadata can be summed into a single contribution value, li. For each speaker, the content consumer device can use liS as a speaker feed.
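The equivalence between the per-set rendering of equations (2) and (3) and the reduced-complexity form can be sketched as below. The panning gains are treated as given inputs: `panning_gains[i][j]` is a hypothetical name for the gain that an object renderer such as VBAP would assign to speaker j for the i-th metadata set. The combined contribution is indexed per speaker in the code (one value per speaker j), which is an interpretation of the contribution value described above.

```python
from typing import List, Sequence


def weighted_render(signal: Sequence[float],
                    weights: Sequence[float],
                    panning_gains: Sequence[Sequence[float]]) -> List[List[float]]:
    """Render each weighted copy of the signal, then sum per speaker."""
    num_speakers = len(panning_gains[0])
    out = [[0.0] * len(signal) for _ in range(num_speakers)]
    for w_i, gains_i in zip(weights, panning_gains):
        ws_i = [w_i * s for s in signal]        # equation (2): WS_i = w_i * S
        for j, g in enumerate(gains_i):
            for n, x in enumerate(ws_i):
                out[j][n] += g * x              # accumulate LS_in(i, j), equation (3)
    return out


def combined_render(signal: Sequence[float],
                    weights: Sequence[float],
                    panning_gains: Sequence[Sequence[float]]) -> List[List[float]]:
    """Reduced-complexity form: one combined gain per speaker, applied once."""
    num_speakers = len(panning_gains[0])
    contributions = [
        sum(w * gains[j] for w, gains in zip(weights, panning_gains))
        for j in range(num_speakers)
    ]
    return [[l_j * s for s in signal] for l_j in contributions]
```

Both functions produce the same speaker feeds up to floating-point rounding. The second needs only one multiplication per sample per speaker, regardless of N.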
According to other aspects of this disclosure, a content consumer device may be configured to change a beam pattern with frequency, using, for example, a flag in the metadata. The content consumer device may, for example, make the beam pattern become more directive at higher frequencies. The beam pattern can, for instance, be specified at frequencies or ERB/Bark/Gammatone scale frequency division.
In one example, frequency dependent beam pattern metadata may include a Freq_dep_beampattern syntax element, where a value of 0 indicates the beam pattern is the same at all frequencies, and a value of 1 indicates the beam pattern changes with frequency. The metadata may also include a Freq_scale syntax element, where one value of the syntax element indicates normal, another value of the syntax element indicates Bark, another value of the syntax element indicates ERB, and another value of the syntax element indicates Gammatone. In one example, frequencies between 0-100 Hz may use one type of beam pattern, determined by a codebook or spherical harmonic coefficients, for example, while 12 kHz to 20 kHz uses a different beam pattern. Other frequency ranges may also use different beam patterns.
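One possible in-memory representation of these syntax elements is sketched below. The numeric codes assigned to the Freq_scale values, the `Band` structure, the `beam_pattern_id` field, and the fallback to a default pattern are all assumptions made for illustration. The description above does not define them.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class FreqScale(Enum):
    # Hypothetical code values; the description only names the four scales.
    NORMAL = 0
    BARK = 1
    ERB = 2
    GAMMATONE = 3


@dataclass
class Band:
    freq_start_hz: float
    freq_end_hz: float
    beam_pattern_id: int  # hypothetical index, e.g. into a beam pattern codebook


@dataclass
class BeamPatternMetadata:
    freq_dep_beampattern: int  # 0: same pattern at all frequencies, 1: varies
    freq_scale: FreqScale = FreqScale.NORMAL
    bands: List[Band] = field(default_factory=list)
    default_pattern_id: int = 0

    def pattern_for(self, freq_hz: float) -> int:
        """Return the beam pattern to use at the given frequency."""
        if self.freq_dep_beampattern == 0:
            return self.default_pattern_id
        for band in self.bands:
            if band.freq_start_hz <= freq_hz < band.freq_end_hz:
                return band.beam_pattern_id
        # Assumption: frequencies outside every band use the default pattern.
        return self.default_pattern_id
```

With bands covering 0-100 Hz and 12 kHz to 20 kHz, `pattern_for(50.0)` and `pattern_for(15000.0)` would return the two band-specific patterns, and any other frequency would fall back to the default.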
The content creator system 4 may be operated by various content creators, such as movie studios, television studios, internet streaming services, or other entity that may generate audio content for consumption by operators of content consumer systems, such as the content consumer system 6. Often, the content creator generates audio content in conjunction with video content. The content consumer system 6 may be operated by an individual. In general, the content consumer system 6 may refer to any form of audio playback system capable of outputting multi-channel audio content.
The content creator system 4 includes audio encoding device 14, which may be capable of encoding received audio data into a bitstream. The audio encoding device 14 may receive the audio data from various sources. For instance, the audio encoding device 14 may obtain live audio data 10 and/or pre-generated audio data 12. The audio encoding device 14 may receive the live audio data 10 and/or the pre-generated audio data 12 in various formats. As one example, audio encoding device 14 includes one or more microphones 8 configured to capture one or more audio signals. For instance, the audio encoding device 14 may receive the live audio data 10 from one or more microphones 8 as audio objects. As another example, the audio encoding device 14 may receive the pre-generated audio data 12 as audio objects.
As stated above, the audio encoding device 14 may encode the received audio data into a bitstream, such as bitstream 20, for transmission, as one example, across a transmission channel, which may be a wired or wireless channel, a data storage device, or the like. In some examples, the content creator system 4 directly transmits the encoded bitstream 20 to content consumer system 6. In other examples, the encoded bitstream may also be stored onto a storage medium or a file server for later access by the content consumer system 6 for decoding and/or playback.
Content consumer system 6 may generate loudspeaker feeds 26 based on bitstream 20. As shown in
The audio encoding device 14 and the audio decoding device 22 each may be implemented as any of a variety of suitable circuitry, such as one or more integrated circuits including microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), discrete logic, software, hardware, firmware, or any combinations thereof. When the techniques are implemented partially in software, a device may store instructions for the software in a suitable, non-transitory computer-readable medium and execute the instructions in hardware such as integrated circuitry using one or more processors to perform the techniques of this disclosure.
In the example of
The metadata encoding unit 48 determines encoded metadata 412 for the audio object based on the obtained audio object metadata information.
The audio encoding unit 56 encodes audio signal 50A to generate encoded audio signal 50B. In some examples, the audio encoding unit 56 may encode audio signal 50A using a known audio compression format, such as MP3, AAC, Vorbis, FLAC, and Opus. In some instances, the audio encoding unit 56 may transcode the audio signal 50A from one compression format to another. In some examples, the audio encoding device 14 may include an audio encoding unit to compress and/or transcode audio signal 50A.
Bitstream mixing unit 52 mixes the encoded audio signal 50B with the encoded metadata to generate bitstream 56. In the example of
Thus, the audio encoding device 14 includes a memory configured to store an audio signal of an audio object (e.g., audio signals 50A and 50B and bitstream 56) for a time interval and store metadata (e.g., audio object metadata information 350). Furthermore, the audio encoding device 14 includes one or more processors electrically coupled to the memory.
VBAP uses a geometrical approach to calculate gain factors 416. In examples, such as
$$L_{kmn} = (l_k, l_m, l_n) \qquad (4)$$
The desired direction $\Omega = (\theta, \varphi)$ of the audio object may be given as azimuth angle $\varphi$ and elevation angle $\theta$. The unit-length position vector $p(\Omega)$ of the virtual source in Cartesian coordinates is therefore defined by:
$$p(\Omega) = (\cos\varphi \sin\theta,\ \sin\varphi \sin\theta,\ \cos\theta)^T \qquad (5)$$
A virtual source position can be represented with the vector base and the gain factors $g(\Omega) = (\tilde{g}_k, \tilde{g}_m, \tilde{g}_n)^T$ by
$$p(\Omega) = L_{kmn} \, g(\Omega) = \tilde{g}_k l_k + \tilde{g}_m l_m + \tilde{g}_n l_n \qquad (6)$$
By inverting the vector base matrix, the required gain factors can be computed by:
$$g(\Omega) = L_{kmn}^{-1} \, p(\Omega) \qquad (7)$$
The vector base to be used is determined according to Equation (7). First, the gains are calculated according to Equation (7) for all vector bases. Subsequently, for each vector base, the minimum over the gain factors is evaluated by $\tilde{g}_{\min} = \min\{\tilde{g}_k, \tilde{g}_m, \tilde{g}_n\}$. The vector base where $\tilde{g}_{\min}$ has the highest value is used. In general, the gain factors are not permitted to be negative. Depending on the listening room acoustics, the gain factors may be normalized for energy preservation.
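The base-selection procedure can be sketched as follows. This is an illustrative sketch under stated assumptions rather than a reference implementation: the function names are hypothetical, the 3×3 system is solved with Cramer's rule instead of an explicit matrix inverse, negative gains are clamped to zero, and the energy normalization is applied unconditionally. The angle convention follows the position-vector formula as written above.

```python
import math
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def position_vector(phi_deg: float, theta_deg: float) -> Vec3:
    """Unit vector (cos(phi) sin(theta), sin(phi) sin(theta), cos(theta))."""
    phi, theta = math.radians(phi_deg), math.radians(theta_deg)
    return (math.cos(phi) * math.sin(theta),
            math.sin(phi) * math.sin(theta),
            math.cos(theta))


def det3(a: Vec3, b: Vec3, c: Vec3) -> float:
    """Determinant of the 3x3 matrix whose columns are a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            + a[1] * (b[2] * c[0] - b[0] * c[2])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def base_gains(base: Sequence[Vec3], p: Vec3) -> Optional[Vec3]:
    """Solve L g = p for one vector base; None if the base is degenerate."""
    lk, lm, ln = base
    d = det3(lk, lm, ln)
    if abs(d) < 1e-12:
        return None
    return (det3(p, lm, ln) / d, det3(lk, p, ln) / d, det3(lk, lm, p) / d)


def select_base(bases: Sequence[Sequence[Vec3]], p: Vec3) -> Tuple[int, Vec3]:
    """Pick the base whose smallest gain is largest, then clamp and normalize."""
    best = None
    for idx, base in enumerate(bases):
        g = base_gains(base, p)
        if g is None:
            continue
        g_min = min(g)
        if best is None or g_min > best[1]:
            best = (idx, g_min, g)
    if best is None:
        raise ValueError("no invertible vector base")
    idx, _, g = best
    g = tuple(max(x, 0.0) for x in g)            # gains may not be negative
    norm = math.sqrt(sum(x * x for x in g))
    if norm > 0.0:
        g = tuple(x / norm for x in g)           # energy-preserving scaling
    return idx, g
```

A source placed midway between two loudspeakers of the selected base receives equal gains on those two loudspeakers and approximately zero gain on the third.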
The memory 200 may obtain encoded audio data, such as the bitstream 56. In some examples, the memory 200 may directly receive the encoded audio data (i.e., the bitstream 56) from an audio encoding device. In other examples, the encoded audio data may be stored, and the memory 200 may obtain the encoded audio data (i.e., the bitstream 56) from a storage medium or a file server. The memory 200 may provide access to the bitstream 56 to one or more components of the audio decoding device 22, such as the demultiplexing unit 202.
The demultiplexing unit 202 may obtain encoded metadata 71 and audio signal 62 from the bitstream 56. The encoded metadata 71 includes, for example, the frequency dependent beam pattern metadata described above. Thus, the demultiplexing unit 202 may obtain, from the bitstream 56, data representing an audio signal of an audio object and may obtain, from the bitstream 56, metadata for rendering M frequency bands using M different beam patterns in response to the number of frequency bands being equal to M.
The audio decoding unit 204 may be configured to decode the coded audio signal 62 into audio signal 70. For instance, the audio decoding unit 204 may dequantize, deformat, or otherwise decompress audio signal 62 to generate the audio signal 70. In some examples, the audio decoding unit 204 may be referred to as an audio CODEC. The audio decoding unit 204 may provide the decoded audio signal 70 to one or more components of the audio decoding device 22, such as format generation unit 208.
The metadata decoding unit 207 may decode the encoded metadata 71 to determine the frequency dependent beam pattern metadata described above.
The format generation unit 208 may be configured to generate a soundfield, in a specified format, based on multi-channel audio data and the frequency dependent beam pattern metadata described above. For instance, the format generation unit 208 may generate renderer input 212 based on the decoded audio signal 70 and the decoded metadata 72. The renderer input 212 may, for example, include a set of audio objects and decoded metadata.
The format generation unit 208 may provide the generated renderer input 212 to one or more other components. For instance, as shown in the example of
The rendering unit 210 may be configured to render a soundfield. In some examples, the rendering unit 210 may render a renderer input 212 to generate audio signals 26 for playback at a plurality of local loudspeakers, such as the loudspeakers 24 of
The rendering unit 210 may generate the audio signals 26 based on local loudspeaker setup information 28, which may represent positions of the plurality of local loudspeakers. The rendering unit 210 may generate a plurality of audio signals 26 by applying a rendering format (e.g., a local rendering matrix) to the audio objects. Each respective audio signal of the plurality of audio signals 26 may correspond to a respective loudspeaker in a plurality of loudspeakers, such as the loudspeakers 24 of
In some examples, the local loudspeaker setup information 28 may be in the form of a local rendering format $\tilde{D}$. In some examples, local rendering format $\tilde{D}$ may be a local rendering matrix. In some examples, such as where the local loudspeaker setup information 28 is in the form of an azimuth and an elevation of each of the local loudspeakers, the rendering unit 210 may determine local rendering format $\tilde{D}$ based on the local loudspeaker setup information 28. In some examples, the local rendering format $\tilde{D}$ may be different than the source rendering format $D$ used to determine spatial positioning vectors. As one example, positions of the plurality of local loudspeakers may be different than positions of the plurality of source loudspeakers. As another example, a number of loudspeakers in the plurality of local loudspeakers may be different than a number of loudspeakers in the plurality of source loudspeakers. As another example, both the positions and the number of loudspeakers in the plurality of local loudspeakers may be different than those of the plurality of source loudspeakers.
In some examples, the rendering unit 210 may adapt the local rendering format based on information 28 indicating locations of a local loudspeaker setup. The rendering unit 210 may adapt the local rendering format in the manner described below with regard to
The memory 254 may store a metadata codebook 262. The memory 254 may be separate from the metadata decoding unit 207A and may form part of a general memory of the audio decoding device 22. The metadata codebook 262 includes a set of entries, each of which maps an index to a value for a metadata entry. The metadata codebook 262 may match a codebook used by the metadata encoding unit 48 of
The listener location unit 610 may be configured to determine a location of a listener of a plurality of loudspeakers, such as loudspeakers 24 of
The loudspeaker position unit 612 may be configured to obtain a representation of positions of a plurality of local loudspeakers, such as the loudspeakers 24 of
The rendering format unit 614 may be configured to generate local rendering format 622 based on a representation of positions of a plurality of local loudspeakers (e.g., a local reproduction layout) and a position of a listener of the plurality of local loudspeakers. In some examples, the rendering format unit 614 may generate the local rendering format 622 such that, when the audio objects or HOA coefficients of renderer input 212 are rendered into loudspeaker feeds and played back through the plurality of local loudspeakers, the acoustic “sweet spot” is located at or near the position of the listener. In some examples, to generate the local rendering format 622, the rendering format unit 614 may generate a local rendering matrix $\tilde{D}$. The rendering format unit 614 may provide the local rendering format 622 to one or more other components of rendering unit 210, such as loudspeaker feed generation unit 616 and/or memory 615.
The memory 615 may be configured to store a local rendering format, such as the local rendering format 622. Where the local rendering format 622 comprises local rendering matrix $\tilde{D}$, the memory 615 may be configured to store local rendering matrix $\tilde{D}$.
The loudspeaker feed generation unit 616 may be configured to render audio objects or HOA coefficients into a plurality of output audio signals that each correspond to a respective local loudspeaker of the plurality of local loudspeakers. In the example of
The audio decoding device 22, using various combinations of the components described in more detail above, represents an example of a device configured to store an audio object and audio object metadata associated with the audio object, where the audio object metadata includes frequency dependent beam pattern metadata. The device applies, based on the frequency dependent beam pattern metadata, a renderer to the audio object to obtain one or more first speaker feeds and obtains, based on the one or more speaker feeds, output speaker feeds. The frequency dependent beam pattern metadata may, for example, be defined for a number of frequency bands. The number of frequency bands may, for example, be equal to M, with M being an integer value greater than 1. The device may render the M frequency bands using M different beam patterns in response to the number of frequency bands being equal to M.
The audio object metadata may, for example, include M sets of weighting values and at least M sets of metadata representative of M directional beams, with each of the M directional beams corresponding to one of the M frequency bands. The device may apply the M sets of weighting values to audio signals of the audio object to obtain weighted audio objects; sum the weighted audio objects to determine a weighted summation of audio objects; apply the renderer to the weighted summation of audio objects to obtain the one or more speaker feeds; and obtain, based on the one or more speaker feeds, the output speaker feeds.
Each of the M sets of metadata may include an azimuth value, an elevation value, a distance value, a gain value, and a diffuseness value. In some implementations, some of the metadata values, such as distance, gain, and diffuseness may be optional and not always included in the metadata.
While these techniques are presented in a particular order, the techniques may not necessarily be performed in that order.
The audio encoding unit 56 encodes audio data from one or more mono audio sources. The audio decoding unit 204 decodes the encoded audio data to generate one or more decoded mono audio sources ($S_1, S_2, \ldots, S_K$). Metadata encoding unit 48 outputs metadata for frequency-dependent beam patterns (e.g., $M_1, M_2, \ldots, M_K$; $\omega_1^{m,i}, \omega_2^{m,i}, \ldots, \omega_K^{m,i}$; $\Lambda_1^{m,i}, \Lambda_2^{m,i}, \ldots, \Lambda_K^{m,i}$).
The audio rendering unit 210 generates speaker outputs C1 through CL according to the following process:
The frequency-independent rendering unit 516 generates frequency-independent beam patterns according to $B_k = \sum_{i=1}^{N} \omega_k^{1,i} \, B(\Lambda_k^{1,i})$. Using $B_k$, frequency-independent rendering unit 516 performs object rendering of $S_k$ to obtain the speaker outputs $C_{k1}, C_{k2}, \ldots, C_{kL}$.
The frequency-dependent rendering unit 514 initializes speaker outputs $C_{k1} = C_{k2} = \ldots = C_{kL} = 0$. For m equals 1 to $M_k$, the frequency-dependent rendering unit 514 generates frequency-dependent beam patterns according to $B_k^{m} = \sum_{i=1}^{N} \omega_k^{m,i} \, B(\Lambda_k^{m,i})$. The frequency-dependent rendering unit 514 performs bandpass filtering of $S_k$ using {FreqStart_m, FreqEnd_m} and then obtains $S_k^{m}$. Using $B_k^{m}$, frequency-dependent rendering unit 514 performs object rendering of $S_k^{m}$ to obtain the m-th band speaker feeds $C_{k1,m}, C_{k2,m}, \ldots, C_{kL,m}$, where:
Various aspects of the techniques of this disclosure may enable one or more of the devices described above to perform the examples listed below.
A device configured for processing coded audio, the device comprising: a memory configured to store an audio object and audio object metadata associated with the audio object, wherein the audio object meta data comprises frequency dependent beam pattern metadata; and one or more processors electronically coupled to the memory, the one or more processors configured to: apply, based on the frequency dependent beam pattern metadata, a renderer to the audio object to obtain one or more first speaker feeds; and output the one or more speaker feeds.
The device of example 1, wherein the frequency dependent beam pattern metadata is defined for a number of frequency bands.
The device of example 2, wherein the number of frequency bands is equal to 1.
The device of example 3, wherein the one or more processors are configured to render all frequencies of the audio object using a same beam pattern in response to the number of frequency bands being equal to 1.
The device of any of examples 1-4, wherein: the audio object metadata further comprises a first set of weighting values and at least a first set of metadata representative of a first directional beam for the audio object; and the one or more processors are further configured to: apply the first set of weighting values to the audio object to obtain a weighted audio object; and apply, based on the first set of metadata representative of the first directional beam, the renderer to the weighted audio object to obtain the one or more first speaker feeds.
The device of example 5, wherein the first set of metadata to describe the first directional beam for the audio object comprises an azimuth value.
The device of example 5 or 6, wherein the first set of metadata to describe the first directional beam for the audio object comprises an elevation value.
The device of any of examples 5-7, wherein the first set of metadata to describe the first directional beam for the audio object comprises a distance value.
The device of any of examples 5-8, wherein the first set of metadata to describe the first directional beam for the audio object comprises a gain value.
The device of any of examples 5-9, wherein the first set of metadata to describe the first directional beam for the audio object comprises a diffuseness value.
The device of any of examples 1, 2, or 5-10, wherein the number of frequency bands is greater than 1.
The device of example 11, wherein the one or more processors are configured to render a first frequency band of the audio object using a first beam pattern and render a second frequency band of the audio object using a second beam pattern in response to the number of frequency bands being greater than 1.
The device of any of examples 1, 2, or 5-12, wherein: the audio object metadata further comprises a first set of weighting values and at least a first set of metadata representative of a first directional beam for the first frequency band of the audio object and a second set of weighting values and at least a second set of metadata representative of a second directional beam for the second frequency band of the audio object; and the one or more processors are further configured to: apply the first set of weighting values to audio signals of the audio object within the first frequency band to obtain a first weighted audio object; apply the second set of weighting values to audio signals of the audio object within the second frequency band to obtain a second weighted audio object; sum the first weighted audio object and the second weighted audio object to determine a weighted summation of audio objects; and apply the renderer to the weighted summation of audio objects to obtain the one or more speaker feeds.
The device of example 13, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first azimuth value and the second set of metadata to describe the second directional beam for the audio object comprises a second azimuth value.
The device of example 13 or 14, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first elevation value and the second set of metadata to describe the second directional beam for the audio object comprises a second elevation value.
The device of any of examples 13-15, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first distance value and the second set of metadata to describe the second directional beam for the audio object comprises a second distance value.
The device of any of examples 13-16, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first gain value and the second set of metadata to describe the second directional beam for the audio object comprises a second gain value.
The device of any of examples 13-17, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first diffuseness value and the second set of metadata to describe the second directional beam for the audio object comprises a second diffuseness value.
The device of any of examples 1, 2, or 5-18 wherein the number of frequency bands is equal to M, M being an integer value greater than 1.
The device of example 19, wherein the one or more processors are configured to render the M frequency bands using M different beam patterns in response to the number of frequency bands being equal to M.
The device of example 19 or 20, wherein: the audio object metadata further comprises M sets of weighting values and at least M sets of metadata representative of M directional beams, each of the M directional beams corresponding to one of the M frequency bands; and the one or more processors are further configured to: apply the M sets of weighting values to audio signals of the audio object to obtain weighted audio objects; sum the weighted audio objects to determine a weighted summation of audio objects; and apply the renderer to the weighted summation of audio objects to obtain the one or more speaker feeds.
The device of example 21, wherein each of the M sets of metadata comprises an azimuth value.
The device of example 21 or 22, wherein each of the M sets of metadata comprises an elevation value.
The device of any of examples 21-23, wherein each of the M sets of metadata comprises a distance value.
The device of any of examples 21-24, wherein each of the M sets of metadata comprises a gain value.
The device of any of examples 21-25, wherein each of the M sets of metadata comprises a diffuseness value.
The device of any of examples 1-26, wherein to apply the renderer, the one or more processors are configured to perform vector-based amplitude panning with respect to the weighted audio object.
The device of any of examples 1-27, further comprising: one or more speakers configured to reproduce, based on the output speaker feeds, a soundfield.
The device of any of examples 1-28, wherein the device comprises a vehicle.
The device of any of examples 1-29, wherein the device comprises an unmanned vehicle.
The device of any of examples 1-30, wherein the device comprises a robot.
The device of any of examples 1-28, wherein the device comprises a handset.
The device of any of examples 1-32, wherein the one or more processors comprise processing circuitry.
The device of example 33, wherein the processing circuitry comprises one or more application specific integrated circuits.
A method for processing coded audio, the method comprising: storing an audio object and audio object metadata associated with the audio object, wherein the audio object meta data comprises frequency dependent beam pattern metadata; applying, based on the frequency dependent beam pattern metadata, a renderer to the audio object to obtain one or more first speaker feeds; and outputting the one or more speaker feeds.
The method of example 35, wherein the frequency dependent beam pattern metadata is defined for a number of frequency bands.
The method of example 36, wherein the number of frequency bands is equal to 1.
The method of example 37, further comprising: rendering all frequencies of the audio object using a same beam pattern in response to the number of frequency bands being equal to 1.
The method of any of examples 35-38, wherein the audio object metadata further comprises a first set of weighting values and at least a first set of metadata representative of a first directional beam for the audio object, wherein the method further comprises: applying the first set of weighting values to the audio object to obtain a weighted audio object; and applying, based on the first set of metadata representative of the first directional beam, the renderer to the weighted audio object to obtain the one or more first speaker feeds.
The method of example 39, wherein the first set of metadata to describe the first directional beam for the audio object comprises an azimuth value.
The method of example 39 or 40, wherein the first set of metadata to describe the first directional beam for the audio object comprises an elevation value.
The method of any of examples 39-41, wherein the first set of metadata to describe the first directional beam for the audio object comprises a distance value.
The method of any of examples 39-42, wherein the first set of metadata to describe the first directional beam for the audio object comprises a gain value.
The method of any of examples 39-43, wherein the first set of metadata to describe the first directional beam for the audio object comprises a diffuseness value.
The method of any of examples 35, 36, or 39-44, wherein the number of frequency bands is greater than 1.
The method of example 45, further comprising: rendering a first frequency band of the audio object using a first beam pattern and rendering a second frequency band of the audio object using a second beam pattern in response to the number of frequency bands being greater than 1.
The method of any of examples 35, 36, or 39-46, wherein the audio object metadata further comprises a first set of weighting values and at least a first set of metadata representative of a first directional beam for the first frequency band of the audio object and a second set of weighting values and at least a second set of metadata representative of a second directional beam for the second frequency band of the audio object, the method further comprising: applying the first set of weighting values to audio signals of the audio object within the first frequency band to obtain a first weighted audio object; applying the second set of weighting values to audio signals of the audio object within the second frequency band to obtain a second weighted audio object; summing the first weighted audio object and the second weighted audio object to determine a weighted summation of audio objects; and applying the renderer to the weighted summation of audio objects to obtain the one or more speaker feeds.
The method of example 47, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first azimuth value and the second set of metadata to describe the second directional beam for the audio object comprises a second azimuth value.
The method of example 47 or 48, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first elevation value and the second set of metadata to describe the second directional beam for the audio object comprises a second elevation value.
The method of any of examples 47-49, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first distance value and the second set of metadata to describe the second directional beam for the audio object comprises a second distance value.
The method of any of examples 47-50, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first gain value and the second set of metadata to describe the second directional beam for the audio object comprises a second gain value.
The method of any of examples 47-51, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first diffuseness value and the second set of metadata to describe the second directional beam for the audio object comprises a second diffuseness value.
The method of any of examples 35, 36, or 39-52, wherein the number of frequency bands is equal to M, M being an integer value greater than 1.
The method of example 53, the method further comprising: rendering the M frequency bands using M different beam patterns in response to the number of frequency bands being equal to M.
The method of example 53 or 54, wherein the audio object metadata further comprises M sets of weighting values and at least M sets of metadata representative of M directional beams, each of the M directional beams corresponding to one of the M frequency bands, the method further comprising: applying the M sets of weighting values to audio signals of the audio object to obtain weighted audio objects; summing the weighted audio objects to determine a weighted summation of audio objects; and applying the renderer to the weighted summation of audio objects to obtain the one or more speaker feeds.
The method of example 55, wherein each of the M sets of metadata comprises an azimuth value.
The method of example 55 or 56, wherein each of the M sets of metadata comprises an elevation value.
The method of any of examples 55-57, wherein each of the M sets of metadata comprises a distance value.
The method of any of examples 55-58, wherein each of the M sets of metadata comprises a gain value.
The method of any of examples 55-59, wherein each of the M sets of metadata comprises a diffuseness value.
The method of any of examples 35-60, wherein applying the renderer comprises performing vector-based amplitude panning with respect to the weighted audio object.
The method of any of examples 35-61, further comprising: reproducing, based on the output speaker feeds, a soundfield using one or more speakers.
The method of any of examples 35-62, wherein the method is performed by a vehicle.
The method of any of examples 35-63, wherein the method is performed by an unmanned vehicle.
The method of any of examples 35-64, wherein the method is performed by a robot.
The method of any of examples 35-62, wherein the method is performed by a handset.
The method of any of examples 35-66, wherein the method is performed by one or more processors comprising processing circuitry.
The method of example 67, wherein the processing circuitry comprises one or more application specific integrated circuits.
A computer-readable storage medium storing instructions that when executed by one or more processors cause the one or more processors to perform the method of any of examples 35-68.
An apparatus for processing coded audio, the apparatus comprising: means for storing an audio object and audio object metadata associated with the audio object, wherein the audio object metadata comprises frequency dependent beam pattern metadata; means for applying, based on the frequency dependent beam pattern metadata, a renderer to the audio object to obtain one or more first speaker feeds; and means for outputting the one or more speaker feeds.
The apparatus of example 70, wherein the frequency dependent beam pattern metadata is defined for a number of frequency bands.
The apparatus of example 71, wherein the number of frequency bands is equal to 1.
The apparatus of example 72, further comprising: means for rendering all frequencies of the audio object using a same beam pattern in response to the number of frequency bands being equal to 1.
The apparatus of any of examples 70-73, wherein the audio object metadata further comprises a first set of weighting values and at least a first set of metadata representative of a first directional beam for the audio object, the apparatus further comprising: means for applying the first set of weighting values to the audio object to obtain a weighted audio object; and means for applying, based on the first set of metadata representative of the first directional beam, the renderer to the weighted audio object to obtain the one or more first speaker feeds.
The apparatus of example 74, wherein the first set of metadata to describe the first directional beam for the audio object comprises an azimuth value.
The apparatus of example 74 or 75, wherein the first set of metadata to describe the first directional beam for the audio object comprises an elevation value.
The apparatus of any of examples 74-76, wherein the first set of metadata to describe the first directional beam for the audio object comprises a distance value.
The apparatus of any of examples 74-77, wherein the first set of metadata to describe the first directional beam for the audio object comprises a gain value.
The apparatus of any of examples 74-78, wherein the first set of metadata to describe the first directional beam for the audio object comprises a diffuseness value.
The apparatus of any of examples 70, 71, or 74-79, wherein the number of frequency bands is greater than 1.
The apparatus of example 80, further comprising: means for rendering a first frequency band of the audio object using a first beam pattern and rendering a second frequency band of the audio object using a second beam pattern in response to the number of frequency bands being greater than 1.
The apparatus of any of examples 70, 71, or 74-81, wherein the audio object metadata further comprises a first set of weighting values and at least a first set of metadata representative of a first directional beam for the first frequency band of the audio object and a second set of weighting values and at least a second set of metadata representative of a second directional beam for the second frequency band of the audio object, the apparatus further comprising: means for applying the first set of weighting values to audio signals of the audio object within the first frequency band to obtain a first weighted audio object; means for applying the second set of weighting values to audio signals of the audio object within the second frequency band to obtain a second weighted audio object; means for summing the first weighted audio object and the second weighted audio object to determine a weighted summation of audio objects; and means for applying the renderer to the weighted summation of audio objects to obtain the one or more speaker feeds.
The apparatus of example 82, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first azimuth value and the second set of metadata to describe the second directional beam for the audio object comprises a second azimuth value.
The apparatus of example 82 or 83, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first elevation value and the second set of metadata to describe the second directional beam for the audio object comprises a second elevation value.
The apparatus of any of examples 82-84, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first distance value and the second set of metadata to describe the second directional beam for the audio object comprises a second distance value.
The apparatus of any of examples 82-85, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first gain value and the second set of metadata to describe the second directional beam for the audio object comprises a second gain value.
The apparatus of any of examples 82-86, wherein the first set of metadata to describe the first directional beam for the audio object comprises a first diffuseness value and the second set of metadata to describe the second directional beam for the audio object comprises a second diffuseness value.
The apparatus of any of examples 70, 71, or 74-87, wherein the number of frequency bands is equal to M, M being an integer value greater than 1.
The apparatus of example 88, the apparatus further comprising: means for rendering the M frequency bands using M different beam patterns in response to the number of frequency bands being equal to M.
The apparatus of example 88 or 89, wherein the audio object metadata further comprises M sets of weighting values and at least M sets of metadata representative of M directional beams, each of the M directional beams corresponding to one of the M frequency bands, the apparatus further comprising: means for applying the M sets of weighting values to audio signals of the audio object to obtain weighted audio objects; means for summing the weighted audio objects to determine a weighted summation of audio objects; and means for applying the renderer to the weighted summation of audio objects to obtain the one or more speaker feeds.
The apparatus of example 90, wherein each of the M sets of metadata comprises an azimuth value.
The apparatus of example 90 or 91, wherein each of the M sets of metadata comprises an elevation value.
The apparatus of any of examples 90-92, wherein each of the M sets of metadata comprises a distance value.
The apparatus of any of examples 90-93, wherein each of the M sets of metadata comprises a gain value.
The apparatus of any of examples 90-94, wherein each of the M sets of metadata comprises a diffuseness value.
The apparatus of any of examples 70-95, wherein the means for applying the renderer comprises means for performing vector-based amplitude panning with respect to the weighted audio object.
The apparatus of any of examples 70-96, further comprising: means for reproducing, based on the output speaker feeds, a soundfield using one or more speakers.
The apparatus of any of examples 70-97, wherein the apparatus comprises a vehicle.
The apparatus of any of examples 70-98, wherein the apparatus comprises an unmanned vehicle.
The apparatus of any of examples 70-99, wherein the apparatus comprises a robot.
The apparatus of any of examples 70-97, wherein the apparatus comprises a handset.
The apparatus of any of examples 70-101, wherein the apparatus comprises one or more processors comprising processing circuitry.
The apparatus of example 102, wherein the processing circuitry comprises one or more application specific integrated circuits.
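As a non-limiting illustration of the M-band examples above, the following minimal Python sketch applies weighting values to band-limited signals of one audio object, sums the weighted signals, and applies a simple two-dimensional vector-based amplitude panning renderer to the summation. Several details are assumptions made only for illustration and are not taken from this disclosure: the function names (apply_band_weights, vbap_gains_2d, render_object) are hypothetical, each "set of weighting values" is reduced to one scalar per band, the band signals are assumed to come from an upstream filter bank, a single azimuth drives the renderer, and the loudspeaker pair is assumed to sit at +30 and -30 degrees with azimuth measured counterclockwise from the front.

```python
import math


def apply_band_weights(bands, weights):
    """Scale each of the M band signals by its weight and sum them per sample."""
    if not bands or len(bands) != len(weights):
        raise ValueError("need at least one band and exactly one weight per band")
    length = len(bands[0])
    if any(len(band) != length for band in bands):
        raise ValueError("all band signals must have the same length")
    summed = [0.0] * length
    for band, weight in zip(bands, weights):
        for i, sample in enumerate(band):
            summed[i] += weight * sample
    return summed


def vbap_gains_2d(azimuth_deg, left_deg=30.0, right_deg=-30.0):
    """Return power-normalized (left, right) gains for one loudspeaker pair.

    Solves p = g_left * l + g_right * r, where p, l and r are unit vectors
    toward the source and the two loudspeakers (assumed layout).
    Gains go negative if the source lies outside the loudspeaker arc.
    """
    px, py = math.cos(math.radians(azimuth_deg)), math.sin(math.radians(azimuth_deg))
    lx, ly = math.cos(math.radians(left_deg)), math.sin(math.radians(left_deg))
    rx, ry = math.cos(math.radians(right_deg)), math.sin(math.radians(right_deg))
    det = lx * ry - rx * ly
    if abs(det) < 1e-12:
        raise ValueError("loudspeaker directions must not be collinear")
    g_left = (px * ry - rx * py) / det
    g_right = (lx * py - px * ly) / det
    norm = math.hypot(g_left, g_right)
    return g_left / norm, g_right / norm


def render_object(bands, weights, azimuth_deg):
    """Weight and sum the bands, then pan the summation to two speaker feeds."""
    mono = apply_band_weights(bands, weights)
    g_left, g_right = vbap_gains_2d(azimuth_deg)
    return [g_left * s for s in mono], [g_right * s for s in mono]


if __name__ == "__main__":
    # Two bands of a two-sample object; the second band is emphasized.
    left_feed, right_feed = render_object([[1.0, 0.0], [0.0, 1.0]], [0.5, 2.0], 0.0)
    print(left_feed, right_feed)
```

In this sketch the renderer runs once on the weighted summation, which mirrors the order of operations recited in the examples; a per-band renderer, or weighting values that vary with direction, would be equally plausible readings and are not shown.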
In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which correspond to tangible media such as data storage media. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and/or data structures for implementation of the techniques described in this disclosure. A computer program product may include a computer-readable medium.
Likewise, in each of the various instances described above, it should be understood that the audio decoding device 22 may perform a method or otherwise comprise means to perform each step of the method that the audio decoding device 22 is configured to perform. In some instances, the means may comprise one or more processors. In some instances, the one or more processors may represent a special purpose processor configured by way of instructions stored to a non-transitory computer-readable storage medium. In other words, various aspects of the techniques in each of the sets of encoding examples may provide for a non-transitory computer-readable storage medium having stored thereon instructions that, when executed, cause the one or more processors to perform the method that the audio decoding device 24 has been configured to perform.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term “processor,” as used herein, may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chip set). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
Various aspects of the techniques have been described. These and other aspects of the techniques are within the scope of the following claims.
This application claims the benefit of U.S. Provisional Application No. 62/784,239 filed Dec. 21, 2018, the entire content of which is hereby incorporated by reference.
Number | Date | Country
---|---|---
62784239 | Dec 2018 | US