The disclosure relates to the encoding and decoding of audio data and, more particularly, audio data coding techniques for virtual reality and augmented reality environments.
Various technologies have been developed that allow a person to sense and interact with a computer-generated environment, often through visual and sound effects provided to the person or persons by the devices providing the computer-generated environment. These computer-generated environments are sometimes referred to as “virtual reality” or “VR” environments. For example, a user may access a VR experience using one or more wearable devices, such as a headset. A VR headset may include various output components, such as a display screen that provides visual images to the user, and speakers that output sounds. In some examples, a VR headset may provide additional sensory effects, such as tactile sensations provided by way of movement or vibrations. In some examples, the computer-generated environment may provide audio effects to a user or users through speakers or other devices not necessarily worn by the user, but rather positioned within audible range of the user. Similarly, head-mounted displays (HMDs) exist that allow a user to see the real world in front of the user (as the lenses are transparent) and to see graphic overlays (e.g., from projectors embedded in the HMD frame), as a form of “augmented reality” or “AR.” Similarly, systems exist that allow a user to experience the real world with the addition of VR elements, as a form of “mixed reality” or “MR.”
VR, MR, and AR systems may incorporate capabilities to render higher-order ambisonics (HOA) signals, which are often represented by a plurality of spherical harmonic coefficients (SHC) or other hierarchical elements. That is, the HOA signals that are rendered by a VR, MR, or AR system may represent a three-dimensional (3D) soundfield. The HOA or SHC representation may represent the 3D soundfield in a manner that is independent of the local speaker geometry used to play back a multi-channel audio signal rendered from the SHC signal. The SHC signal may also facilitate backward compatibility, as the SHC signal may be rendered to well-known and widely adopted multi-channel formats, such as a 5.1 audio channel format or a 7.1 audio channel format. The SHC representation may therefore enable a better representation of a soundfield that also accommodates backward compatibility.
In general, techniques are described by which audio decoding devices and audio encoding devices may leverage video data from a computer-generated environment's video feed, to provide a more accurate representation of the 3D soundfield associated with the computer-generated reality experience. Generally, the techniques of this disclosure may enable various systems to adjust audio objects in the HOA domain to generate a more accurate representation of the energies and directional components of the audio data upon rendering. As one example, the techniques may enable rendering the 3D soundfield to accommodate a six degree-of-freedom (6-DOF) capability of the computer-generated reality system. Moreover, the techniques of this disclosure enable the rendering devices to use data represented in the HOA domain to alter audio data based on characteristics of the video feed being provided for the computer-generated reality experience.
For instance, according to the techniques described herein, the audio rendering device of the computer-generated reality system may adjust foreground audio objects for parallax-related changes that stem from “silent objects” that may attenuate the foreground audio objects. As another example, the techniques of this disclosure may enable the audio rendering device of the computer-generated reality system to determine relative distances between the user and a particular foreground audio object. As another example, the techniques of this disclosure may enable the audio rendering device to apply transmission factors to render the 3D soundfield to provide a more accurate computer-generated reality experience to a user.
In one example, this disclosure is directed to an audio decoding device. The audio decoding device may include processing circuitry and a memory device coupled to the processing circuitry. The processing circuitry is configured to receive, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, to receive metadata associated with the bitstream, to obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and to apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield. The memory device is configured to store at least a portion of the received bitstream, the received metadata, or the parallax-adjusted audio objects of the 3D soundfield.
In another example, this disclosure is directed to a method that includes receiving, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, and receiving metadata associated with the bitstream. The method may further include obtaining, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and applying the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield.
In another example, this disclosure is directed to an audio decoding apparatus. The audio decoding apparatus may include means for receiving, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, and means for receiving metadata associated with the bitstream. The audio decoding apparatus may further include means for obtaining, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and means for applying the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield.
In another example, this disclosure is directed to a non-transitory computer-readable storage medium encoded with instructions. The instructions, when executed, cause processing circuitry of an audio decoding device to receive, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, and to receive metadata associated with the bitstream. The instructions, when executed, further cause the processing circuitry of the audio decoding device to obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and to apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield.
The details of one or more aspects of the techniques are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of these techniques will be apparent from the description and drawings, and from the claims.
In some aspects, this disclosure describes techniques by which audio decoding devices and audio encoding devices may leverage video data from a VR, MR, or AR video feed to provide a more accurate representation of the 3D soundfield associated with the VR/MR/AR experience. For instance, techniques of this disclosure may enable various systems to adjust audio objects in the HOA domain to generate a more accurate representation of the energies and directional components of the audio data upon rendering. As one example, the techniques may enable rendering the 3D soundfield to accommodate a six degree-of-freedom (6-DOF) capability of the VR system.
Moreover, the techniques of this disclosure enable the rendering devices to use HOA domain data to alter audio data based on characteristics of the video feed being provided for the VR experience. For instance, according to the techniques described herein, the audio rendering device of the VR system may adjust foreground audio objects for parallax-related changes that stem from “silent objects” that may attenuate the foreground audio objects. As another example, the techniques of this disclosure may enable the audio rendering device of the VR system to determine relative distances between the user and a particular foreground audio object.
Surround sound technology may be particularly suited to incorporation into VR systems. For instance, the immersive audio experience provided by surround sound technology complements the immersive video and sensory experience provided by other aspects of VR systems. Moreover, augmenting the energy of audio objects with directional characteristics as provided by ambisonics technology provides for a more realistic simulation by the VR environment. For instance, the combination of realistic placement of visual objects in combination with corresponding placement of audio objects via the surround sound speaker array may more accurately simulate the environment that is being replicated.
There are various ‘surround-sound’ channel-based formats in the market. They range, for example, from the 5.1 home theatre system (which has been the most successful in terms of making inroads into living rooms beyond stereo) to the 22.2 system developed by NHK (Nippon Hoso Kyokai or Japan Broadcasting Corporation). Content creators (e.g., Hollywood studios) would like to produce the soundtrack for a movie once, and not spend effort to remix it for each speaker configuration. The Moving Picture Experts Group (MPEG) has released a standard allowing for soundfields to be represented using a hierarchical set of elements (e.g., Higher-Order Ambisonic—HOA—coefficients) that can be rendered to speaker feeds for most speaker configurations, including 5.1 and 22.2 configurations, whether in locations defined by various standards or in non-uniform locations.
MPEG released the standard as the MPEG-H 3D Audio standard, formally entitled “Information technology—High efficiency coding and media delivery in heterogeneous environments—Part 3: 3D audio,” set forth by ISO/IEC JTC 1/SC 29, with document identifier ISO/IEC DIS 23008-3, and dated Jul. 25, 2014. MPEG also released a second edition of the 3D Audio standard, entitled “Information technology—High efficiency coding and media delivery in heterogeneous environments—Part 3: 3D audio,” set forth by ISO/IEC JTC 1/SC 29, with document identifier ISO/IEC 23008-3:201x(E), and dated Oct. 12, 2016. Reference to the “3D Audio standard” in this disclosure may refer to one or both of the above standards.
As noted above, one example of a hierarchical set of elements is a set of spherical harmonic coefficients (SHC). The following expression demonstrates a description or representation of a soundfield using SHC:

p_i(t, r_r, θ_r, φ_r) = Σ_{ω=0}^{∞} [ 4π Σ_{n=0}^{∞} j_n(k r_r) Σ_{m=−n}^{n} A_n^m(k) Y_n^m(θ_r, φ_r) ] e^{jωt}.

The expression shows that the pressure p_i at any point {r_r, θ_r, φ_r} of the soundfield, at time t, can be represented uniquely by the SHC, A_n^m(k). Here, k = ω/c,
c is the speed of sound (~343 m/s), {r_r, θ_r, φ_r} is a point of reference (or observation point), j_n(·) is the spherical Bessel function of order n, and Y_n^m(θ_r, φ_r) are the spherical harmonic basis functions (which may also be referred to as spherical basis functions) of order n and suborder m. It can be recognized that the term in square brackets is a frequency-domain representation of the signal (i.e., S(ω, r_r, θ_r, φ_r)), which can be approximated by various time-frequency transformations, such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT), or a wavelet transform. Other examples of hierarchical sets include sets of wavelet transform coefficients and other sets of coefficients of multiresolution basis functions.
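The lowest-order terms of this expansion can be sketched numerically. The following is a minimal Python sketch (illustrative only, not part of any standard) that uses the closed forms of the order-0 and order-1 spherical Bessel functions and the zeroth-order spherical harmonic; the helper names are assumptions for illustration.

```python
import math

# Closed-form spherical Bessel functions of the first kind (orders 0 and 1);
# these appear as the radial term j_n(k * r_r) in the SHC expansion above.
def spherical_j0(x: float) -> float:
    return 1.0 if x == 0.0 else math.sin(x) / x

def spherical_j1(x: float) -> float:
    if x == 0.0:
        return 0.0
    return math.sin(x) / (x * x) - math.cos(x) / x

# The zeroth-order spherical harmonic Y_0^0 is the constant 1/sqrt(4*pi),
# i.e., the omnidirectional (W) component of an ambisonic signal.
Y_0_0 = 1.0 / math.sqrt(4.0 * math.pi)

# An order-N HOA representation carries (N + 1)^2 coefficients per sample.
def num_shc(order: int) -> int:
    return (order + 1) ** 2
```

For example, `num_shc(4)` yields 25, matching the fourth-order representation discussed below.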
The SHC A_n^m(k) can either be physically acquired (e.g., recorded) by various microphone array configurations or, alternatively, can be derived from channel-based or object-based descriptions of the soundfield. The SHC (which also may be referred to as higher-order ambisonic (HOA) coefficients) represent scene-based audio, where the SHC may be input to an audio encoder to obtain encoded SHC that may promote more efficient transmission or storage. For example, a fourth-order representation involving (1+4)² (i.e., 25) coefficients may be used.
As noted above, the SHC may be derived from a microphone recording using a microphone array. Various examples of how SHC may be derived from microphone arrays are described in Poletti, M., “Three-Dimensional Surround Sound Systems Based on Spherical Harmonics,” J. Audio Eng. Soc., Vol. 53, No. 11, 2005 November, pp. 1004-1025.
To illustrate how the SHC may be derived from an object-based description, consider the following equation. The coefficients A_n^m(k) for the soundfield corresponding to an individual audio object may be expressed as:

A_n^m(k) = g(ω)(−4πik) h_n^(2)(k r_s) Y_n^m*(θ_s, φ_s),

where i is √−1, h_n^(2)(·) is the spherical Hankel function (of the second kind) of order n, and {r_s, θ_s, φ_s} is the location of the object. Knowing the object source energy g(ω) as a function of frequency (e.g., using time-frequency analysis techniques, such as performing a fast Fourier transform on the PCM stream) allows us to convert each PCM object and the corresponding location into the SHC A_n^m(k). Further, it can be shown (since the above is a linear and orthogonal decomposition) that the A_n^m(k) coefficients for each object are additive. In this manner, a number of PCM objects can be represented by the A_n^m(k) coefficients (e.g., as a sum of the coefficient vectors for the individual objects). Essentially, the coefficients contain information about the soundfield (the pressure as a function of 3D coordinates), and the above represents the transformation from individual objects to a representation of the overall soundfield in the vicinity of the observation point {r_r, θ_r, φ_r}. The remaining figures are described below in the context of SHC-based audio coding.
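The point-source equation and its additivity property can be sketched in Python for the simplest case (order n = 0, suborder m = 0), where the spherical Hankel function of the second kind has the closed form h_0^(2)(x) = (sin x + i cos x)/x. The function names are illustrative assumptions, not part of any codec API.

```python
import math

def hankel2_0(x: float) -> complex:
    """Spherical Hankel function of the second kind, order 0:
    h_0^(2)(x) = j_0(x) - i*y_0(x) = (sin x + i*cos x) / x."""
    return complex(math.sin(x), math.cos(x)) / x

# Y_0^0 is real-valued, so its complex conjugate is itself.
Y_0_0 = 1.0 / math.sqrt(4.0 * math.pi)

def shc_00_point_source(g: complex, k: float, r_s: float) -> complex:
    """A_0^0(k) for a point source of strength g(w) at radius r_s,
    following the equation above for n = 0, m = 0."""
    return g * (-4.0 * math.pi * 1j * k) * hankel2_0(k * r_s) * Y_0_0

# Because the decomposition is linear, coefficients for several sources add.
def shc_00_mixture(sources, k: float) -> complex:
    return sum(shc_00_point_source(g, k, r_s) for g, r_s in sources)
```

The mixture of two sources equals the sum of their individual coefficients, illustrating the additivity noted above.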
The content creator device 12 may be operated by a movie studio, a game programmer, a manufacturer of VR systems, or any other entity that may generate multi-channel audio content for consumption by operators of content consumer devices, such as the content consumer device 14. In some examples, the content creator device 12 may be operated by an individual user who would like to compress HOA coefficients 11. Often, the content creator device 12 generates audio content in conjunction with video content and/or content that can be expressed via tactile or haptic output. For instance, the content creator device 12 may include, be, or be part of a system that generates VR, MR, or AR environment data. The content consumer device 14 may be operated by an individual. The content consumer device 14 may include an audio playback system 16, which may refer to any form of audio playback system capable of rendering SHC for playback as multi-channel audio content.
For instance, the content consumer device 14 may include, be, or be part of a system that provides a VR, MR, or AR environment or experience to a user. As such, the content consumer device 14 may also include components for output of video data, for the output and input of tactile or haptic communications, etc. For ease of illustration purposes only, the content creator device 12 and the content consumer device 14 are illustrated in
The content creator device 12 includes an audio editing system 18. The content creator device 12 obtains live recordings 7 in various formats (including directly as HOA coefficients) and audio objects 9, which the content creator device 12 may edit using audio editing system 18. Two or more microphones or microphone arrays (hereinafter, “microphones 5”) may capture the live recordings 7. The content creator device 12 may, during the editing process, render HOA coefficients 11 from audio objects 9, listening to the rendered speaker feeds in an attempt to identify various aspects of the soundfield that require further editing. The content creator device 12 may then edit the HOA coefficients 11 (potentially indirectly through manipulation of different ones of the audio objects 9 from which the source HOA coefficients may be derived in the manner described above). The content creator device 12 may employ the audio editing system 18 to generate the HOA coefficients 11. The audio editing system 18 represents any system capable of editing audio data and outputting the audio data as one or more source spherical harmonic coefficients.
When the editing process is complete, the content creator device 12 may generate a bitstream 21 based on the HOA coefficients 11. That is, the content creator device 12 includes an audio encoding device 20 that represents a device configured to encode or otherwise compress HOA coefficients 11 in accordance with various aspects of the techniques described in this disclosure to generate the bitstream 21. The audio encoding device 20 may generate the bitstream 21 for transmission, as one example, across a transmission channel, which may be a wired or wireless channel, a data storage device, or the like. The bitstream 21 may represent an encoded version of the HOA coefficients 11 and may include a primary bitstream and another side bitstream, which may be referred to as side channel information. As shown in
According to techniques of this disclosure, the audio encoding device 20 may include, in the metadata 23, one or more of directional vector information, silent object information, and transmission factors for the HOA coefficients 11. For instance, the audio encoding device 20 may include transmission factors that, when applied, attenuate the energy of one or more of the HOA coefficients 11 communicated via the bitstream 21. In accordance with various aspects of this disclosure, the audio encoding device 20 may derive the transmission factors using object locations in video frames corresponding to the audio frames represented by the particular coefficients of the HOA coefficients 11. For instance, the audio encoding device 20 may determine that a silent object represented in the video data has a location that would interfere with the volume of certain foreground audio objects represented by the HOA coefficients 11, in a real-life scenario. In turn, the audio encoding device 20 may generate transmission factors that, when applied by the audio decoding device 24, would attenuate the energies of the HOA coefficients 11 to more accurately simulate the way the 3D soundfield would be heard by a listener in the corresponding video scene.
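The attenuation step described above can be sketched as element-wise gains applied to the coefficients of each foreground object. This is a hypothetical illustration: the dictionary layout and function name are assumptions, not the MPEG-H bitstream syntax or metadata format.

```python
# Hypothetical sketch: transmission factors modeled as per-object gains in
# [0, 1] that attenuate foreground HOA coefficients before rendering.
def apply_transmission_factors(fg_hoa, factors):
    """fg_hoa: {object_id: [HOA coefficient values]}
    factors: {object_id: transmission factor}; a missing entry means no
    occlusion or masking (factor of 1.0)."""
    adjusted = {}
    for obj_id, coeffs in fg_hoa.items():
        t = factors.get(obj_id, 1.0)
        adjusted[obj_id] = [t * c for c in coeffs]
    return adjusted
```

A factor below 1.0 attenuates the object's energy, simulating the interference of the silent object; a factor of 1.0 leaves the object unchanged.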
According to the techniques of this disclosure, the audio encoding device 20 may classify the audio objects 9, as expressed by the HOA coefficients 11, into foreground objects and background objects. For instance, the audio encoding device 20 may implement aspects of this disclosure to identify a silence object or silent object based on a determination that the object is represented in the video data, but does not correspond to a pre-identified audio object. Although described with respect to the audio encoding device 20 performing the video analysis, a video encoding device (not shown) or a dedicated visual analysis device or unit may perform the classification of the silent object, providing the classification and transmission factors to audio encoding device 20 for purposes of generating the metadata 23.
In the context of captured video and audio, the audio encoding device 20 may determine that an object does not correspond to a pre-identified audio object if the object is not equipped with a sensor. As used herein, the term “equipped with a sensor” may include scenarios where a sensor is attached (permanently or detachably) to an audio source, or placed within earshot of (though not attached to) an audio source. If the sensor is not attached to the audio source but is positioned within earshot, then, in applicable scenarios, multiple audio sources that are within earshot of the sensor are considered to be “equipped” with the sensor. In a synthetic VR environment, the audio encoding device 20 may implement techniques of this disclosure to determine that an object does not correspond to a pre-identified audio object if the object in question does not map to any audio object in a predetermined list. In a combination recorded-synthesized VR or AR environment, the audio encoding device 20 may implement techniques of this disclosure to determine that an object does not correspond to a pre-identified audio object using one or both of the techniques described above.
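The two classification rules above (sensor-based for captured content, list-based for synthetic content) can be sketched as a small predicate. The function and parameter names are illustrative assumptions.

```python
# Hypothetical sketch of silent-object classification: for captured content,
# an object with no sensor attached or within earshot is treated as silent;
# for synthetic content, an object absent from the predetermined list of
# audio objects is treated as silent.
def is_silent_object(object_id: str,
                     sensor_equipped: bool,
                     known_audio_objects: set,
                     synthetic: bool) -> bool:
    if synthetic:
        return object_id not in known_audio_objects
    return not sensor_equipped
```

A combination recorded-synthesized environment could apply both tests and treat the object as silent if either one succeeds.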
Moreover, the audio encoding device 20 may determine relative foreground location information that reflects a relationship between the location of the listener and the respective locations of the foreground audio objects represented by the HOA coefficients 11 in the bitstream 21. For instance, the audio encoding device 20 may identify the “first person” aspect of the video capture or video synthesis for the VR experience, and may determine the relationship between the location of the “first person” and the respective video object corresponding to each respective foreground audio object of the 3D soundfield.
In some examples, the audio encoding device 20 may also use the relative foreground location information to determine relative location information between the listener location and a silent object that attenuates the energy of the foreground object. For instance, the audio encoding device 20 may apply a scaling factor to the relative foreground location information, to derive the distance between the listener location and the silent object that attenuates the energy of the foreground audio object. The scaling factor may range in value from zero to one, with a zero value indicating that the silent object is co-located or substantially co-located with the listener location, and with the value of one indicating that the silent object is co-located or substantially co-located with the foreground audio object.
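The scaling-factor convention described above can be sketched directly: a factor of zero places the silent object at the listener location, and a factor of one places it at the foreground audio object. The function name is an illustrative assumption.

```python
# Sketch of the scaling-factor convention: the listener-to-silent-object
# distance is a fraction (in [0, 1]) of the listener-to-foreground distance.
def listener_to_silent_object_distance(listener_to_fg_distance: float,
                                       scaling_factor: float) -> float:
    if not 0.0 <= scaling_factor <= 1.0:
        raise ValueError("scaling factor must lie in [0, 1]")
    return scaling_factor * listener_to_fg_distance
```

Signaling only the scaling factor (rather than an absolute distance) keeps the metadata compact while remaining tied to the relative foreground location information already conveyed.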
In some instances, the audio encoding device 20 may signal the relative foreground location information and/or the listener location-to-silent object distance information to the audio decoding device 24. In other examples, the audio encoding device 20 may signal the listener location information and the foreground audio object location information to the audio decoding device 24, thereby enabling the audio decoding device 24 to derive the relative foreground location information and/or the distance from the listener location to the silent object that attenuates the energy/directional data of the foreground audio object. While the metadata 23 and the bitstream 21 are illustrated in
While shown in
Alternatively, the content creator device 12 may store the bitstream 21 to a storage medium, such as a compact disc, a digital video disc, a high definition video disc or other storage media, most of which are capable of being read by a computer and therefore may be referred to as computer-readable storage media or non-transitory computer-readable storage media. In this context, the transmission channel may refer to the channels by which content stored to the media is transmitted (and may include retail stores and other store-based delivery mechanisms). In any event, the techniques of this disclosure should not therefore be limited in this respect to the example of
As further shown in the example of
The audio playback system 16 may further include an audio decoding device 24. The audio decoding device 24 may represent a device configured to decode HOA coefficients 11′ from the bitstream 21, where the HOA coefficients 11′ may be similar to the HOA coefficients 11 but differ due to lossy operations (e.g., quantization) and/or transmission via the transmission channel. The audio playback system 16 may, after decoding the bitstream 21 to obtain the HOA coefficients 11′, render the HOA coefficients 11′ to output loudspeaker feeds 25. The loudspeaker feeds 25 may drive one or more loudspeakers (which are not shown in the example of
While described with respect to loudspeaker feeds 25, the audio playback system 16 may render headphone feeds from either the loudspeaker feeds 25 or directly from the HOA coefficients 11′, outputting the headphone feeds to headphone speakers. The headphone feeds may represent binaural audio speaker feeds, which the audio playback system 16 renders using a binaural audio renderer.
To select the appropriate renderer or, in some instances, generate an appropriate renderer, the audio playback system 16 may obtain loudspeaker information 13 indicative of a number of loudspeakers and/or a spatial geometry of the loudspeakers. In some instances, the audio playback system 16 may obtain the loudspeaker information 13 using a reference microphone and driving the loudspeakers in such a manner as to dynamically determine the loudspeaker information 13. In other instances or in conjunction with the dynamic determination of the loudspeaker information 13, the audio playback system 16 may prompt a user to interface with the audio playback system 16 and input the loudspeaker information 13.
The audio playback system 16 may then select one of the audio renderers 22 based on the loudspeaker information 13. In some instances, the audio playback system 16 may, when none of the audio renderers 22 are within some threshold similarity measure (in terms of the loudspeaker geometry) to the loudspeaker geometry specified in the loudspeaker information 13, generate the one of audio renderers 22 based on the loudspeaker information 13. The audio playback system 16 may, in some instances, generate one of the audio renderers 22 based on the loudspeaker information 13 without first attempting to select an existing one of the audio renderers 22. One or more speakers 3 may then playback the rendered loudspeaker feeds 25.
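The renderer-selection logic above can be sketched as follows. This is a hypothetical illustration: geometries are modeled as lists of (x, y, z) speaker positions, and the similarity measure (summed Euclidean mismatch against a threshold) is an assumption, not the measure defined by any standard.

```python
import math

def geometry_mismatch(a, b):
    """Assumed similarity measure: sum of per-speaker position errors.
    Geometries with differing speaker counts are treated as incomparable."""
    if len(a) != len(b):
        return math.inf
    return sum(math.dist(p, q) for p, q in zip(a, b))

def select_renderer(renderers, measured_geometry, threshold):
    """Pick the stored renderer closest to the measured loudspeaker geometry,
    or generate a new renderer when none is within the threshold."""
    best = min(renderers,
               key=lambda r: geometry_mismatch(r["geometry"], measured_geometry))
    if geometry_mismatch(best["geometry"], measured_geometry) <= threshold:
        return best
    return {"name": "generated", "geometry": measured_geometry}
```

The fallback branch corresponds to the case in which the audio playback system 16 generates a renderer from the loudspeaker information 13 rather than selecting a pre-stored one.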
The audio decoding device 24 may implement various techniques of this disclosure to perform parallax-based adjustments for the encoded representations of the audio objects received via the bitstream 21. For instance, the audio decoding device 24 may apply transmission factors included in the metadata 23 to one or more audio objects conveyed as encoded representations in the bitstream 21. In various examples, the audio decoding device 24 may attenuate the energies and/or adjust directional information with respect to the foreground audio objects, based on the transmission factors. In some examples, the audio decoding device 24 may also use the metadata 23 to obtain silence object location information and/or relative foreground location information that relates a listener's location to the foreground audio objects' respective locations. By attenuating the energy of the foreground audio objects and/or adjusting the directional information of the foreground audio objects using the transmission factors, the audio decoding device 24 may enable the content consumer device 14 to render audio data over the speakers 3 that provides a more realistic auditory experience as part of a VR experience that also provides video data and, optionally, other sensory data as well.
In some examples, the audio decoding device 24 may locally derive the relative foreground location information using information included in the metadata 23. For instance, the audio decoding device 24 may receive listener location information and foreground audio object locations in the metadata 23. In turn, the audio decoding device 24 may derive the relative foreground location information, such as by calculating a displacement between the listener location and the foreground audio location.
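The decoder-side derivation described above can be sketched as a displacement and distance computation. Cartesian coordinates are assumed here for simplicity; the metadata could equally carry spherical coordinates {r, θ, φ}, and the function name is illustrative.

```python
import math

# Minimal sketch: given listener and foreground-object coordinates obtained
# from the metadata, derive the relative foreground location information as
# a displacement vector plus a Euclidean distance.
def relative_foreground_location(listener, fg_object):
    displacement = tuple(f - l for f, l in zip(fg_object, listener))
    distance = math.sqrt(sum(d * d for d in displacement))
    return displacement, distance
```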
For example, the audio decoding device 24 may use a coordinate system to calculate the relative foreground location information, by using the coordinates of the listener location and the foreground audio locations as operands in a distance calculation function. In some examples, the audio decoding device 24 may also receive, as part of the metadata 23, a scaling factor that is applicable to the relative foreground location information. In some such examples, the audio decoding device 24 may apply the scaling factor to the relative foreground location information to calculate the distance between the listener location and a silence object that attenuates the energy or alters the directional information of the foreground audio object(s). While the metadata 23 and the bitstream 21 are illustrated in
The system 10B shown in
The system 10C shown in
The system 10D shown in
As shown in
The different occlusion/masking characteristics at each of virtual positions A, B, and C are illustrated in the left column of
At virtual position B, the lion is roaring directly behind the running person. That is, the foreground audio objects related to the lion's roar are masked, to some degree, by the occlusion caused by the running person as well as by the masking caused by the yelling of the running person. The audio encoding device 20 may perform the masking based on the relative position of the listener (at the virtual position B) and the lion, as well as the distance between the running person and the listener (at the virtual position B).
For instance, the closer the running person is to the lion, the lesser the masking that the audio encoding device 20 may apply to the foreground audio objects of the lion's roar. The closer the running person is to the virtual position B where the listener is positioned, the greater the masking that the audio encoding device 20 may apply to the foreground audio objects of the lion's roar. The audio encoding device 20 may limit the masking to preserve some predetermined minimum energy with respect to the foreground audio objects of the lion's roar. That is, techniques of this disclosure enable the audio encoding device 20 to assign at least a minimum energy to the foreground audio objects of the lion's roar, regardless of how close the running person is to virtual position B, to accommodate some level of the lion's roar that will be heard at virtual position B.
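The distance-dependent masking with a minimum-energy floor can be sketched as follows. The linear distance ratio is an illustrative modeling assumption, as are the function name and the default floor value; the disclosure does not prescribe a particular masking curve.

```python
# Hypothetical masking model for the lion/runner example: the transmission
# factor shrinks as the occluder approaches the listener and grows as it
# approaches the source, but never drops below a floor, so some of the
# roar always remains audible at the listener position.
def masking_transmission_factor(dist_occluder_to_listener: float,
                                dist_source_to_listener: float,
                                min_factor: float = 0.1) -> float:
    ratio = dist_occluder_to_listener / dist_source_to_listener
    return max(min_factor, min(1.0, ratio))
```

An occluder standing at the source yields a factor of 1.0 (no masking), while an occluder at the listener yields the floor value rather than full silence.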
In turn, the audio decoding device 24 may apply the transmission factors in rendering the foreground audio objects associated with the lion's roar, to attenuate the loudness of the lion's roar based on the audio masking and physical occlusion caused by the running person. Additionally, the audio decoding device 24 may adjust the directional data of the foreground audio objects of the lion's roar, to account for the occlusion. For instance, the audio decoding device 24 may adjust the foreground audio objects of the lion's roar to simulate an experience at virtual position B in which the lion's roar is heard, at an attenuated loudness, from above and around the position of the running person's body.
The wall represents a “silent object” in the context of the techniques of this disclosure. As such, the presence of the wall is not directly indicated by audio objects captured by the microphones 5. Instead, the audio encoding device 20 may infer the locations of occlusion caused by the wall by leveraging video data captured by one or more cameras of (or coupled to) the content creator device 12. For instance, the audio encoding device 20 may translate the video scene position of the wall to audio position data, to represent the silent object (“SO”) using HOA coefficients. Using the positional information of the SO derived in this fashion, the audio encoding device 20 may form transmission factors with respect to the foreground audio objects of the lion's roar, relative to the virtual position B.
Moreover, based on the relative positioning of the running person to the virtual position B and the SO, the audio encoding device 20 may not form transmission factors with respect to foreground audio objects of the yell of the running person. As shown, the SO is not positioned in such a way as to occlude the foreground audio objects of the running person with respect to the virtual position B. The audio encoding device 20 may signal the transmission factors (with respect to the foreground audio objects of the lion's roar) in the metadata 23 to the audio decoding device 24.
In turn, the audio decoding device 24 may apply the transmission factors received in the metadata 23 to the foreground audio objects associated with the lion's roar, with respect to a “sweet spot” position at virtual position B. By applying the transmission factors to the foreground audio objects of the lion's roar at the virtual position B, the audio decoding device 24 may attenuate the energy assigned to the foreground audio objects of the lion's roar, thereby simulating the occlusion caused by the presence of the SO. In this manner, the audio decoding device 24 may implement the techniques of this disclosure to apply transmission factors to render the 3D soundfield to provide a more accurate VR experience to a user of the content consumer device 14.
The audio encoding device 20 may identify an FG object as an audio object that is represented in an audio frame and is also associated with a pre-identified audio object. The audio encoding device 20 may identify a BG object as an audio object that is represented in an audio frame, but is not associated with any pre-identified audio object. As used herein, an audio object may be associated with a pre-identified audio object if the audio object is associated with an object that is equipped with a sensor (in the case of captured audio/video) or maps to an object in a predetermined list (e.g., in the case of synthetic audio/video). The BG audio objects may not change or translate based on the listener moving between virtual positions A-C. As discussed above, the SO may not generate audio objects of its own, but is used by the audio encoding device 20 to determine transmission factors for the attenuation of the FG objects. As such, the audio encoding device 20 may represent the FG and BG objects separately in the bitstream 21. As discussed above, the audio encoding device 20 may represent the transmission factors derived from the SO in the metadata 23.
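The FG/BG/SO classification described above can be sketched as follows; the class and field names here are illustrative assumptions for purposes of explanation, not terminology from this disclosure:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioObject:
    name: str
    pre_identified: bool  # e.g., the source wore a sensor, or matched a known list

@dataclass
class VideoObject:
    name: str
    has_audio_counterpart: bool
    pre_identified: bool

def classify_audio(obj: AudioObject) -> str:
    # An audio object associated with a pre-identified audio object is FG;
    # otherwise it is treated as BG ambience.
    return "FG" if obj.pre_identified else "BG"

def classify_video(obj: VideoObject) -> Optional[str]:
    # A video object with no corresponding audio object and no pre-identified
    # audio association is a silent object (SO), like the wall in the example.
    if not obj.has_audio_counterpart and not obj.pre_identified:
        return "SO"
    return None
```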
As shown in the legend 52 of
Fi: ith FG audio signal (the person and the lion), where i=1, …, I
V(ri, θi, ϕi): ith directional vector (distance ri, azimuth θi, elevation ϕi)
Bj: jth BG audio signal (ambient sound from the safari), where j=1, …, J
Sk: location of the kth SO, where k=1, …, K
In various examples, the audio encoding device 20 may transmit one or more of the V vector calculations (with their parameters/arguments) and the Sk values in the metadata 23. The audio encoding device 20 may transmit the values of Fi and Bj in the bitstream 21.
{F1, …, FI}
{V(r1, θ1, ϕ1), …, V(rI, θI, ϕI)}
{B1, …, BJ}
{S1, …, SK}
In turn, the audio decoding device 24 may combine data indicating the user location estimation with the FG object location and directional vector calculations, the FG object attenuation (via application of the transmission factors), and the BG object translation calculations. In
As shown, the audio decoding device 24 may calculate one summation with respect to FG objects, and a second summation with respect to BG objects. With respect to the FG object summation, the audio decoding device 24 may apply the transmission factor ρ for an ith object to a product of the FG audio signal for the ith object and the directional vector calculation for the ith object. In turn, the audio decoding device 24 may perform a summation of the resulting product values for a series of values of i.
With respect to the BG objects, the audio decoding device 24 may calculate a product of the jth BG audio signal and the corresponding translation factor for the jth BG audio signal. In turn, the audio decoding device 24 may add the FG object-related summation value and the BG object-related summation value to calculate H, for rendering of the 3D soundfield.
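The two summations described above can be sketched as follows, with the FG contribution formed as a sum over i of the transmission factor ρi, the FG signal Fi, and the directional vector, and the BG contribution formed as a sum over j of each BG signal and its translation factor. The array shapes and the function name are illustrative assumptions:

```python
import numpy as np

def render_soundfield(F, V, rho, B, W):
    """Combine the FG and BG summations into an output H for rendering.

    F:   (I, T) FG audio signals, one row per FG object
    V:   (I, N) directional vectors (N coefficients per object)
    rho: (I,)   transmission factors (1.0 = no occlusion)
    B:   (J, T) BG audio signals
    W:   (J, N) translation factors for the BG signals
    Returns H with shape (N, T).
    """
    # FG summation: sum over i of rho_i * V_i * F_i
    h_fg = sum(rho[i] * np.outer(V[i], F[i]) for i in range(F.shape[0]))
    # BG summation: sum over j of W_j * B_j
    h_bg = sum(np.outer(W[j], B[j]) for j in range(B.shape[0]))
    return h_fg + h_bg
```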
Again, the specific example of
As shown in
That is, the equations presented above are applicable to FG and BG object-based calculations, such as the foreground and background signals applicable for a particular location i. In terms of the directional vectors and the silent objects at various locations, the audio decoding device 24 may perform the interpolation operations of this disclosure according to the following equations:
{V(r1, θ1, ϕ1), …, V(rI, θI, ϕI)}
{S1, …, SK}
Aspects of the silent object interpolation may be calculated by the following operations, as illustrated in
(sin θ1)/L1=(sin θ2)/L2=(sin θ3)/L3
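A minimal sketch of solving this law-of-sines relation for the unknown side lengths, given the three angles and one known length, might look as follows (the function name and argument order are illustrative assumptions):

```python
import math

def solve_sides(theta1, theta2, theta3, L1):
    """Given the three angles (radians) of the triangle used in the silent
    object interpolation, and one known side L1 opposite theta1, recover the
    remaining sides from (sin t1)/L1 == (sin t2)/L2 == (sin t3)/L3."""
    ratio = math.sin(theta1) / L1
    return math.sin(theta2) / ratio, math.sin(theta3) / ratio
```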
The processing circuitry of the audio encoding device may determine that the video object is not associated with any pre-identified audio object (1308). In turn, responsive to the determinations that the video object is not represented by any corresponding audio object in the first audio scene and that the video object is not associated with any pre-identified audio object, the processing circuitry of the audio encoding device may identify the video object as a silent object (1310).
As such, in some examples of this disclosure, an audio encoding device of this disclosure includes a memory device configured to store audio objects obtained from one or more microphone arrays with respect to a three-dimensional (3D) soundfield, wherein each obtained audio object is associated with a respective audio scene, and to store video data obtained from one or more video capture devices, the video data comprising one or more video scenes, each respective video scene being associated with a respective audio scene of the obtained audio data. The device further includes processing circuitry coupled to the memory device, the processing circuitry being configured to determine that a video object included in a first video scene is not represented by any corresponding audio object in a first audio scene that corresponds to the first video scene, to determine that the video object is not associated with any pre-identified audio object, and to identify, responsive to the determinations that the video object is not represented by any corresponding audio object in the first audio scene and that the video object is not associated with any pre-identified audio object, the video object as a silent object.
In some examples, the processing circuitry is further configured to determine that a first audio object included in obtained audio data is associated with a pre-identified audio object, and to identify, responsive to the determination that the audio object is associated with the pre-identified audio object, the first audio object as a foreground audio object. In some examples, the processing circuitry is further configured to determine that a second audio object included in obtained audio data is not associated with any pre-identified audio object, and to identify, responsive to the determination that the second audio object is not associated with any pre-identified audio object, the second audio object as a background audio object.
In some examples, the processing circuitry is configured to determine that the first audio object is associated with a pre-identified audio object by determining that the first audio object is associated with an audio source that is equipped with one or more sensors. In some examples, the audio encoding device further includes the one or more microphone arrays coupled to the processing circuitry, the one or more microphone arrays being configured to capture the audio objects associated with the 3D soundfield. In some examples, the audio encoding device further includes the one or more video capture devices coupled to the processing circuitry, the one or more video capture devices being configured to capture the video data. The video capture devices may include, be, or be part of, the cameras illustrated in the drawings and described above with respect to the drawings. For example, the video capture devices may represent multiple (e.g., dual) cameras positioned such that the cameras capture video data or images of a scene from different perspectives. In some examples, the foreground audio object is included in the first audio scene that corresponds to the first video scene, and the processing circuitry is further configured to determine whether positional information of the silent object with respect to the first video scene causes attenuation of the foreground audio object.
In some examples, the processing circuitry is further configured to generate, responsive to determining that the silent object causes the attenuation of the foreground audio object, one or more transmission factors with respect to the foreground audio object, wherein the generated transmission factors represent adjustments with respect to the foreground audio object. In some examples, the generated transmission factors represent adjustments with respect to an energy of the foreground audio object. In some examples, the generated transmission factors represent adjustments with respect to directional characteristics of the foreground audio object. In some examples, the processing circuitry is further configured to transmit the transmission factors out of band with respect to a bitstream that includes the foreground audio object. In some examples, the generated transmission factors represent metadata with respect to the bitstream.
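One way the occlusion determination and transmission-factor generation might be approximated is a two-dimensional line-of-sight test between the listener position and the FG source against the SO's footprint. The geometry, the circular SO model, and the 0.3 attenuation value below are illustrative assumptions, not part of this disclosure:

```python
import math

def occludes(listener, source, so_center, so_radius):
    """Return True if a circular silent object blocks the straight segment
    from the listener position to the FG source (2-D sketch)."""
    (lx, ly), (sx, sy), (cx, cy) = listener, source, so_center
    dx, dy = sx - lx, sy - ly
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return False
    # Project the SO center onto the segment, clamped to the endpoints.
    t = max(0.0, min(1.0, ((cx - lx) * dx + (cy - ly) * dy) / seg_len2))
    px, py = lx + t * dx, ly + t * dy
    return math.hypot(cx - px, cy - py) <= so_radius

def transmission_factor(listener, source, so_center, so_radius):
    # 1.0 leaves the FG object unattenuated; the occluded value is arbitrary.
    return 0.3 if occludes(listener, source, so_center, so_radius) else 1.0
```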
The processing circuitry of the audio decoding device may obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects (1406). In addition, the processing circuitry of the audio decoding device may apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield (1408). The audio decoding device may further comprise a memory coupled to the processing circuitry. The memory device may store at least a portion of the received bitstream, the received metadata, or the parallax-adjusted audio objects of the 3D soundfield. The processing circuitry of the audio decoding device may render the parallax-adjusted audio objects of the 3D soundfield to one or more speakers (1410). For instance, the processing circuitry of the audio decoding device may render the parallax-adjusted audio objects of the 3D soundfield into one or more speaker feeds that drive the one or more speakers.
In some examples of this disclosure, an audio decoding device includes processing circuitry configured to receive, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield, to receive metadata associated with the bitstream, to obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects, and to apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield. The device further includes a memory device coupled to the processing circuitry, the memory device being configured to store at least a portion of the received bitstream, the received metadata, or the parallax-adjusted audio objects of the 3D soundfield. In some examples, the processing circuitry is further configured to determine listener location information, and to apply the listener location information in addition to applying the transmission factors to the one or more audio objects. In some examples, the processing circuitry is further configured to apply relative foreground location information between the listener location information and respective locations associated with foreground audio objects of the one or more audio objects. In some examples, the processing circuitry is further configured to apply background translation factors that are calculated using respective locations associated with background audio objects of the one or more audio objects.
In some examples, the processing circuitry is further configured to apply foreground attenuation factors to respective foreground audio objects of the one or more audio objects. In some examples, the processing circuitry is further configured to determine a minimum transmission value for the respective foreground audio objects, to determine whether applying the transmission factors to the respective foreground audio objects produces an adjusted transmission value that is lower than the minimum transmission value, and to render, responsive to determining that the adjusted transmission value is lower than the minimum transmission value, the respective foreground audio objects using the minimum transmission value. In some examples, the processing circuitry is further configured to adjust an energy of the respective foreground audio objects. In some examples, the processing circuitry is further configured to attenuate respective energies of the respective foreground audio objects. In some examples, the processing circuitry is further configured to adjust directional characteristics of the respective foreground audio objects. In some examples, the processing circuitry is further configured to adjust parallax information of the respective foreground audio objects. In some examples, the processing circuitry is further configured to adjust the parallax information to account for one or more silent objects represented in a video stream associated with the 3D soundfield. In some examples, the processing circuitry is further configured to receive the metadata within the bitstream.
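The minimum-transmission-value check described above amounts to a floor on the adjusted transmission value. A minimal sketch follows; the 0.05 default floor is an illustrative assumption:

```python
def clamp_transmission(base_gain, transmission_factor, min_transmission=0.05):
    """Apply a transmission factor to an FG object's gain, but never render
    below the minimum transmission value (the 0.05 floor is illustrative)."""
    adjusted = base_gain * transmission_factor
    return max(adjusted, min_transmission)
```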
In some examples, the processing circuitry is further configured to receive the metadata out of band with respect to the bitstream. In some examples, the processing circuitry is further configured to output video data associated with the 3D soundfield to one or more displays. In some examples, the device further includes the one or more displays, the one or more displays being configured to receive the video data from the processing circuitry, and to output the received video data in visual form.
The memory, in turn, may be configured to store the listener location and respective locations associated with the one or more foreground audio objects of the 3D soundfield. The respective locations associated with the one or more foreground audio objects may be obtained from video data associated with the 3D soundfield. In turn, the processing circuitry of the audio decoding device may render the 3D soundfield to one or more speakers (1504). For instance, the processing circuitry of the audio decoding device may render the 3D soundfield into one or more speaker feeds that drive one or more loudspeakers, headphones, etc. that are communicatively coupled to the audio decoding device.
In some examples of this disclosure, an audio decoding device includes a memory device configured to store a listener location and respective locations associated with one or more foreground audio objects of a three-dimensional (3D) soundfield, the respective locations associated with the one or more foreground audio objects being obtained from video data associated with the 3D soundfield, and also includes processing circuitry coupled to the memory device, the processing circuitry being configured to determine relative foreground location information between the listener location and the respective locations associated with the one or more foreground audio objects of the 3D soundfield. In some examples, the processing circuitry is further configured to apply a coordinate system to determine the relative foreground location information. In some examples, the processing circuitry is further configured to determine the listener location information by detecting a device. In some examples, the detected device includes a virtual reality (VR) headset. In some examples, the processing circuitry is further configured to determine the listener location information by detecting a person. In some examples, the processing circuitry is further configured to determine the listener location using a point cloud based interpolation process. In some examples, the processing circuitry is further configured to obtain a plurality of listener location candidates, and to interpolate the listener location between at least two listener location candidates of the obtained plurality of listener location candidates.
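The interpolation between listener location candidates can be sketched as a simple linear blend between two candidates. This disclosure also mentions point-cloud based interpolation, which this sketch does not attempt to reproduce:

```python
def interpolate_listener(candidate_a, candidate_b, weight=0.5):
    """Linearly interpolate the listener location between two of the obtained
    listener location candidates (weight=0.0 yields candidate_a)."""
    return tuple((1.0 - weight) * a + weight * b
                 for a, b in zip(candidate_a, candidate_b))
```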
The processing circuitry of the audio encoding device may generate a bitstream that includes the encoded representations of the audio objects of the 3D soundfield (1606). The processing circuitry of the audio encoding device may generate metadata associated with the bitstream that includes the encoded representations of the audio objects of the 3D soundfield (1608). The metadata may include one or more of transmission factors with respect to the audio objects, relative foreground location information between listener location information and respective locations associated with foreground audio objects of the audio objects, or location information for one or more silent objects of the audio objects. Although steps 1606 and 1608 of process 1600 are illustrated in a particular order for ease of illustration and discussion, it will be appreciated that the processing circuitry of the audio encoding device may generate the bitstream and the metadata in any order, including the reverse order of the order illustrated in
The processing circuitry of the audio encoding device may signal the bitstream (1610). The processing circuitry of the audio encoding device may signal the metadata associated with the bitstream (1612). For instance, the processing circuitry may use a communication unit or other communication interface hardware of the audio encoding device to signal the bitstream and/or the metadata. Although the signaling operations (steps 1610 and 1612) of process 1600 are illustrated in a particular order for ease of illustration and discussion, it will be appreciated that the processing circuitry of the audio encoding device may signal the bitstream and the metadata in any order, including the reverse order of the order illustrated in
In some examples of this disclosure, an audio encoding device includes a memory device configured to store encoded representations of audio objects of a three-dimensional (3D) soundfield, and further includes processing circuitry coupled to the memory device and configured to generate metadata associated with a bitstream that includes the encoded representations of the audio objects of the 3D soundfield, the metadata including one or more of transmission factors with respect to the audio objects, relative foreground location information between listener location information and respective locations associated with foreground audio objects of the audio objects, or location information for one or more silent objects of the audio objects. In some examples, the processing circuitry is configured to generate the transmission factors based on attenuation information associated with the silent objects and the foreground audio objects.
In some examples, the transmission factors represent energy attenuation information with respect to the foreground audio objects based on the location information for the silent objects. In some examples, the transmission factors represent directional attenuation information with respect to the foreground audio objects based on the location information for the silent objects. In some examples, the processing circuitry is further configured to determine the transmission factors based on the listener location information and the location information for the silent objects. In some examples, the processing circuitry is further configured to determine the transmission factors based on the listener location information and location information for the foreground audio objects. In some examples, the processing circuitry is further configured to generate the bitstream that includes the encoded representations of the audio objects of the 3D soundfield, and to signal the bitstream. In some examples, the processing circuitry is configured to signal the metadata within the bitstream. In some examples, the processing circuitry is configured to signal the metadata out-of-band with respect to the bitstream.
In some examples of this disclosure, an audio decoding device includes a memory device configured to store one or more audio objects of a three-dimensional (3D) soundfield, and also includes processing circuitry coupled to the memory device. The processing circuitry is configured to obtain metadata that includes transmission factors with respect to the one or more audio objects of the 3D soundfield, and to apply the transmission factors to audio signals associated with the one or more audio objects of the 3D soundfield. In some examples, the processing circuitry is further configured to attenuate energy information for the one or more audio signals. In some examples the one or more audio objects include foreground audio objects of the 3D soundfield.
The processing circuitry of the audio decoding device may render the foreground audio signal to one or more speakers (1704). In some instances, the processing circuitry of the audio decoding device may also render a background audio signal (associated with a background audio object of the 3D soundfield) to the one or more speakers (1704). For instance, the processing circuitry of the audio decoding device may render the foreground audio signal (and optionally, the background audio signal) into one or more speaker feeds that drive one or more loudspeakers, headphones, etc. that are communicatively coupled to the audio decoding device.
Additionally, the processing circuitry of the audio decoding device may calculate a respective product of a respective set of a transmission factor, a background audio signal, and a directional vector (1806). The memory may be configured to store the plurality of background audio objects (which may be part of the same 3D soundfield as the plurality of foreground audio objects stored to the memory). The processing circuitry of the audio decoding device may calculate a summation of the respective products for all background audio objects of the plurality of background audio objects (1808). In turn, the processing circuitry of the audio decoding device may render the 3D soundfield to one or more speakers based on a sum of both calculated summations (1810).
That is, the processing circuitry of the audio decoding device may calculate a summation of (i) the calculated summation of the respective products calculated for all of the stored foreground audio objects, and (ii) the calculated summation of the respective products calculated for all of the stored background audio objects. In turn, the processing circuitry of the audio decoding device may render the 3D soundfield into one or more speaker feeds that drive one or more loudspeakers, headphones, etc. that are communicatively coupled to the audio decoding device.
In some examples of this disclosure, an audio decoding device includes a memory device configured to store a foreground audio object of a three-dimensional (3D) soundfield, and processing circuitry coupled to the memory device. The processing circuitry is configured to apply a transmission factor to a foreground audio signal for a foreground audio object to attenuate one or more characteristics of the foreground audio signal. In some examples, the processing circuitry is configured to attenuate an energy of the foreground audio signal. In some examples, the processing circuitry is configured to apply a translation factor to a background audio object.
In some examples of this disclosure, an audio decoding device includes a memory device configured to store a plurality of foreground audio objects of a three-dimensional (3D) soundfield. The device also includes processing circuitry coupled to the memory device, the processing circuitry being configured to calculate, for each respective foreground audio object of the plurality of foreground audio objects, a respective product of a respective set of a transmission factor, a foreground audio signal, and a directional vector, and to calculate a summation of the respective products for all foreground audio objects of the plurality of foreground audio objects. In some examples, the memory device is further configured to store a plurality of background audio objects, and the processing circuitry is further configured to calculate, for each respective background audio object of a plurality of background audio objects, a respective product of a respective background audio signal and a respective translation factor, and to calculate a summation of the respective products for all background audio objects of the plurality of background audio objects. In some examples, the processing circuitry is further configured to add the summation of the products for the foreground audio objects to the summation of the products for the background audio objects. In some examples, the processing circuitry is further configured to perform all calculations in a higher order ambisonics (HOA) domain.
In some instances, the techniques may provide for a non-transitory computer-readable storage medium having stored thereon instructions that, when executed, cause one or more processors to obtain an audio object, obtain a video object, associate the audio object and the video object, compare the audio object to the associated video object, and render the audio object based on the comparison between the audio object and the associated video object.
Various aspects of the techniques described in this disclosure may also be performed by a device that generates an audio output signal. The device may comprise means for identifying a first audio object associated with a first video object counterpart based on a first comparison of a data component of the first audio object and a data component of the first video object, and means for identifying a second audio object not associated with a second video object counterpart based on a second comparison of a data component of the second audio object and a data component of the second video object. The device may additionally comprise means for rendering the first audio object in a first zone, means for rendering the second audio object in a second zone, and means for generating the audio output signal based on combining the rendered first audio object in the first zone and the rendered second audio object in the second zone. The various means described herein may comprise one or more processors configured to perform the functions described with respect to each of the means.
In some instances, the data component of the first audio object comprises one of a location and a size. In some instances, the data component of the first video object data comprises one of a location and a size. In some instances, the data component of the second audio object comprises one of a location and a size. In some instances, the data component of the second video object comprises one of a location and a size.
In some instances, the first zone and second zone are different zones within an audio foreground or different zones within an audio background. In some instances, the first zone and second zone are a same zone within an audio foreground or a same zone within an audio background. In some instances, the first zone is within an audio foreground and the second zone is within an audio background. In some instances, the first zone is within an audio background and the second zone is within an audio foreground.
In some instances, the data component of the first audio object, the data component of the second audio object, the data component of the first video object, and the data component of the second video object each comprises metadata.
In some instances, the device further comprises means for determining whether the first comparison is outside a confidence interval, and means for weighting the data component of the first audio object and the data component of first video object based on the determination of whether the first comparison is outside the confidence interval. In some instances, the means for weighting comprises means for averaging the data component of the first audio object data and the data component of the first video object. In some instances, the device may also include means for allocating a different number of bits based on one or more of the first comparison and the second comparison.
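The confidence-interval weighting of the audio and video data components described above might be sketched as follows; the threshold and the 0.25/0.75 weights are illustrative assumptions:

```python
def fuse_location(audio_loc, video_loc, confidence_threshold=1.0):
    """Weight the audio-derived and video-derived location components of one
    object. If the comparison falls outside the confidence interval, fall
    back to a plain average; otherwise weight the video estimate more
    heavily (the threshold and 0.25/0.75 weights are illustrative)."""
    diff = max(abs(a - v) for a, v in zip(audio_loc, video_loc))
    if diff > confidence_threshold:
        w_audio, w_video = 0.5, 0.5  # outside the confidence interval: average
    else:
        w_audio, w_video = 0.25, 0.75
    return tuple(w_audio * a + w_video * v
                 for a, v in zip(audio_loc, video_loc))
```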
In some instances, the techniques may provide for a non-transitory computer-readable storage medium having stored thereon instructions that, when executed, cause one or more processors to identify a first audio object associated with a first video object counterpart based on a first comparison of a data component of the first audio object and a data component of the first video object, identify a second audio object not associated with a second video object counterpart based on a second comparison of a data component of the second audio object and a data component of the second video object, render the first audio object in a first zone, render the second audio object in a second zone, and generate the audio output signal based on combining the rendered first audio object in the first zone and the rendered second audio object in the second zone.
Various examples of this disclosure are described below. In accordance with some of the examples described below, a “device” such as an audio encoding device may include, be, or be part of one or more of a flying device, a robotic device, or an automobile. In accordance with some of the examples described below, the operation of “rendering” or a configuration causing processing circuitry to “render” may include rendering to loudspeaker feeds, or rendering to headphone feeds to headphone speakers, such as by using binaural audio speaker feeds. For instance, an audio decoding device of this disclosure may render binaural audio speaker feeds by invoking or otherwise using a binaural audio renderer.
A method comprising: obtaining, from one or more microphone arrays, audio objects of a three-dimensional (3D) soundfield, wherein each obtained audio object is associated with a respective audio scene; obtaining, from one or more video capture devices, video data comprising one or more video scenes, each respective video scene being associated with a respective audio scene of the obtained audio data; determining that a video object included in a first video scene is not represented by any corresponding audio object in a first audio scene that corresponds to the first video scene; determining that the video object is not associated with any pre-identified audio object; and responsive to the determinations that the video object is not represented by any corresponding audio object in the first audio scene and that the video object is not associated with any pre-identified audio object, identifying the video object as a silent object.
The method of example 1a, further comprising: determining that a first audio object included in obtained audio data is associated with a pre-identified audio object; and responsive to the determination that the audio object is associated with the pre-identified audio object, identifying the first audio object as a foreground audio object.
The method of any of examples 1a or 2a, further comprising: determining that a second audio object included in obtained audio data is not associated with any pre-identified audio object; and responsive to the determination that the second audio object is not associated with any pre-identified audio object, identifying the second audio object as a background audio object.
The method of any of examples 2a or 3a, wherein determining that the first audio object is associated with a pre-identified audio object comprises determining that the first audio object is associated with an audio source that is equipped with one or more sensors.
The method of any of examples 1a-4a, wherein the foreground audio object is included in the first audio scene that corresponds to the first video scene, the method further comprising: determining whether positional information of the silent object with respect to the first video scene causes attenuation of the foreground audio object.
The method of example 5a, further comprising: responsive to determining that the silent object causes the attenuation of the foreground audio object, generating one or more transmission factors with respect to the foreground audio object, wherein the generated transmission factors represent adjustments with respect to the foreground audio object.
The method of example 6a, wherein the generated transmission factors represent adjustments with respect to an energy of the foreground audio object.
The method of any of examples 6a or 7a, wherein the generated transmission factors represent adjustments with respect to directional characteristics of the foreground audio object.
The method of any of examples 6a-8a, further comprising transmitting the transmission factors out of band with respect to a bitstream that includes the foreground audio object.
The method of example 9a, wherein the generated transmission factors represent metadata with respect to the bitstream.
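The encoder-side flow of examples 5a through 10a (an occlusion check followed by transmission-factor generation carried as metadata) can be sketched as follows. This is a minimal illustration: the record layout, the distance threshold, and the attenuation model are all assumptions for the sketch, not taken from the examples themselves.

```python
import math

def generate_transmission_factors(silent_obj, foreground_obj, threshold=1.0):
    """Sketch of examples 5a-8a: if a silent object's position indicates it
    attenuates a foreground audio object, emit transmission factors.
    Positions are (x, y, z) tuples; the distance-based occlusion test and
    the attenuation model below are illustrative assumptions only."""
    distance = math.dist(silent_obj["position"], foreground_obj["position"])
    if distance > threshold:
        return None  # silent object too far away to cause attenuation (example 5a)
    # Examples 6a-8a: transmission factors represent adjustments to the
    # foreground object's energy and directional characteristics.
    sx, sy, sz = silent_obj["position"]
    fx, fy, fz = foreground_obj["position"]
    return {
        "object_id": foreground_obj["id"],
        "energy_factor": distance / threshold,  # closer => stronger attenuation (example 7a)
        "direction_factor": (fx - sx, fy - sy, fz - sz),  # directional adjustment (example 8a)
    }

# Examples 9a-10a: the factors travel as metadata, out of band with respect
# to the bitstream that carries the foreground audio object itself.
metadata = generate_transmission_factors(
    {"id": "pillar", "position": (0.0, 0.5, 0.0)},
    {"id": "speaker", "position": (0.0, 1.0, 0.0)},
)
```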
An audio encoding device comprising: a memory device configured to: store audio objects obtained from one or more microphone arrays with respect to a three-dimensional (3D) soundfield, wherein each obtained audio object is associated with a respective audio scene; and store video data obtained from one or more video capture devices, the video data comprising one or more video scenes, each respective video scene being associated with a respective audio scene of the obtained audio data; and processing circuitry coupled to the memory device, the processing circuitry being configured to: determine that a video object included in a first video scene is not represented by any corresponding audio object in a first audio scene that corresponds to the first video scene; determine that the video object is not associated with any pre-identified audio object; and identify, responsive to the determinations that the video object is not represented by any corresponding audio object in the first audio scene and that the video object is not associated with any pre-identified audio object, the video object as a silent object.
The audio encoding device of example 11a, the processing circuitry being further configured to: determine that a first audio object included in obtained audio data is associated with a pre-identified audio object; and identify, responsive to the determination that the audio object is associated with the pre-identified audio object, the first audio object as a foreground audio object.
The audio encoding device of any of examples 11a or 12a, the processing circuitry being further configured to: determine that a second audio object included in obtained audio data is not associated with any pre-identified audio object; and identify, responsive to the determination that the second audio object is not associated with any pre-identified audio object, the second audio object as a background audio object.
The audio encoding device of any of examples 12a or 13a, the processing circuitry being further configured to: determine that the first audio object is associated with a pre-identified audio object by determining that the first audio object is associated with an audio source that is equipped with one or more sensors.
The audio encoding device of example 14a, further comprising the one or more microphone arrays coupled to the processing circuitry, the one or more microphone arrays being configured to capture the audio objects associated with the 3D soundfield.
The audio encoding device of any of examples 11a-14a(i), further comprising the one or more video capture devices coupled to the processing circuitry, the one or more video capture devices being configured to capture the video data.
The audio encoding device of any of examples 11a-14a, wherein the foreground audio object is included in the first audio scene that corresponds to the first video scene, the processing circuitry being further configured to: determine whether positional information of the silent object with respect to the first video scene causes attenuation of the foreground audio object.
The audio encoding device of example 15a, the processing circuitry being further configured to: generate, responsive to determining that the silent object causes the attenuation of the foreground audio object, one or more transmission factors with respect to the foreground audio object, wherein the generated transmission factors represent adjustments with respect to the foreground audio object.
The audio encoding device of example 16a, wherein the generated transmission factors represent adjustments with respect to an energy of the foreground audio object.
The audio encoding device of any of examples 16a or 17a, wherein the generated transmission factors represent adjustments with respect to directional characteristics of the foreground audio object.
The audio encoding device of any of examples 16a-18a, the processing circuitry being further configured to transmit the transmission factors out of band with respect to a bitstream that includes the foreground audio object.
The audio encoding device of example 19a, wherein the generated transmission factors represent metadata with respect to the bitstream.
An audio encoding apparatus comprising: means for obtaining, from one or more microphone arrays, audio objects of a three-dimensional (3D) soundfield, wherein each obtained audio object is associated with a respective audio scene; means for obtaining, from one or more video capture devices, video data comprising one or more video scenes, each respective video scene being associated with a respective audio scene of the obtained audio data; means for determining that a video object included in a first video scene is not represented by any corresponding audio object in a first audio scene that corresponds to the first video scene; means for determining that the video object is not associated with any pre-identified audio object; and means for identifying, responsive to the determinations that the video object is not represented by any corresponding audio object in the first audio scene and that the video object is not associated with any pre-identified audio object, the video object as a silent object.
A non-transitory computer-readable storage medium encoded with instructions that, when executed, cause processing circuitry of an audio encoding device to: obtain, from one or more microphone arrays, audio objects of a three-dimensional (3D) soundfield, wherein each obtained audio object is associated with a respective audio scene; obtain, from one or more video capture devices, video data comprising one or more video scenes, each respective video scene being associated with a respective audio scene of the obtained audio data; determine that a video object included in a first video scene is not represented by any corresponding audio object in a first audio scene that corresponds to the first video scene; determine that the video object is not associated with any pre-identified audio object; and identify, responsive to the determinations that the video object is not represented by any corresponding audio object in the first audio scene and that the video object is not associated with any pre-identified audio object, the video object as a silent object.
An audio decoding device comprising: processing circuitry configured to: receive, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield; receive metadata associated with the bitstream; obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects; and apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield; and a memory device coupled to the processing circuitry, the memory device being configured to store at least a portion of the received bitstream, the received metadata, or the parallax-adjusted audio objects of the 3D soundfield.
The audio decoding device of example 1b, the processing circuitry being further configured to: determine listener location information; and apply the listener location information in addition to applying the transmission factors to the one or more audio objects.
The audio decoding device of example 2b, the processing circuitry being further configured to apply relative foreground location information between the listener location information and respective locations associated with foreground audio objects of the one or more audio objects.
The audio decoding device of example 3b, the processing circuitry being further configured to apply a coordinate system to determine the relative foreground location information.
The audio decoding device of example 2b, the processing circuitry being further configured to determine the listener location information by detecting a device.
The audio decoding device of example 5b, wherein the detected device comprises one or more of a virtual reality (VR) headset, a mixed reality (MR) headset, or an augmented reality (AR) headset.
The audio decoding device of example 2b, the processing circuitry being further configured to determine the listener location information by detecting a person.
The audio decoding device of example 2b, the processing circuitry being further configured to determine the listener location using a point cloud based interpolation process.
The audio decoding device of example 8b, the processing circuitry being further configured to: obtain a plurality of listener location candidates; and interpolate the listener location between at least two listener location candidates of the obtained plurality of listener location candidates.
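The point-cloud interpolation of a listener location between candidates, as described above, might be sketched as a weighted average of candidate positions. The weighting scheme below is an assumption for illustration; the examples do not mandate a specific interpolation method.

```python
def interpolate_listener_location(candidates, weights=None):
    """Sketch: interpolate a listener location between listener-location
    candidates drawn from a point cloud. Candidates are (x, y, z) tuples;
    a simple weighted average is assumed here for illustration."""
    if weights is None:
        weights = [1.0] * len(candidates)  # unweighted => plain average
    total = sum(weights)
    return tuple(
        sum(w * c[axis] for w, c in zip(weights, candidates)) / total
        for axis in range(3)
    )

# Interpolate between two candidate locations; with equal weights the
# result is the midpoint of the two candidates.
listener = interpolate_listener_location([(0.0, 0.0, 0.0), (2.0, 2.0, 0.0)])
```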
The audio decoding device of example 1b, the processing circuitry being further configured to apply background translation factors that are calculated using respective locations associated with background audio objects of the one or more audio objects.
The audio decoding device of example 1b, the processing circuitry being further configured to apply foreground attenuation factors to respective foreground audio objects of the one or more audio objects.
The audio decoding device of example 1b, the processing circuitry being further configured to: determine a minimum transmission value for the respective foreground audio objects; determine whether applying the transmission factors to the respective foreground audio objects produces an adjusted transmission value that is lower than the minimum transmission value; and render, responsive to determining that the adjusted transmission value is lower than the minimum transmission value, the respective foreground audio objects using the minimum transmission value.
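The minimum-transmission-value safeguard described above amounts to a clamp: if applying a transmission factor would drive a foreground object's transmission below a floor, the object is rendered using the floor instead. A minimal sketch, in which the 0.05 floor and the scalar gain model are illustrative assumptions:

```python
def apply_transmission_factor(signal_gain, factor, minimum=0.05):
    """Apply a transmission factor to a foreground object's gain, but never
    let the adjusted transmission value fall below the minimum."""
    adjusted = signal_gain * factor
    return max(adjusted, minimum)  # clamp to the minimum transmission value

above_floor = apply_transmission_factor(1.0, 0.5)   # adjusted value is used
below_floor = apply_transmission_factor(1.0, 0.01)  # clamped to the floor
```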
The audio decoding device of example 1b, the processing circuitry being further configured to adjust an energy of the respective foreground audio objects.
The audio decoding device of example 13b, the processing circuitry being further configured to attenuate respective energies of the respective foreground audio objects.
The audio decoding device of example 12b, the processing circuitry being further configured to adjust directional characteristics of the respective foreground audio objects.
The audio decoding device of example 12b, the processing circuitry being further configured to adjust parallax information of the respective foreground audio objects.
The audio decoding device of example 16b, the processing circuitry being further configured to adjust the parallax information to account for one or more silent objects represented in a video stream associated with the 3D soundfield.
The audio decoding device of example 1b, the processing circuitry being further configured to receive the metadata within the bitstream.
The audio decoding device of example 1b, the processing circuitry being further configured to receive the metadata out of band with respect to the bitstream.
The audio decoding device of example 1b, the processing circuitry being further configured to output video data associated with the 3D soundfield to one or more displays.
The audio decoding device of example 20b, further comprising the one or more displays, the one or more displays being configured to: receive the video data from the processing circuitry; and output the received video data in visual form.
The audio decoding device of example 1b, the processing circuitry being further configured to attenuate an energy of a foreground audio object of the one or more audio objects.
The audio decoding device of example 1b, the processing circuitry being further configured to apply a translation factor to a background audio object.
The audio decoding device of example 1b, the processing circuitry being further configured to: calculate, for each respective background audio object of a plurality of background audio objects of the one or more audio objects, a respective product of a respective background audio signal and a respective translation factor; and calculate a summation of the respective products for all background audio objects of the plurality of background audio objects.
The audio decoding device of example 24b, the processing circuitry being further configured to add the summation of the products for the foreground audio objects to the summation of the products for the background audio objects.
A method comprising: receiving, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield; receiving metadata associated with the bitstream; obtaining, from the received metadata, one or more transmission factors associated with one or more of the audio objects; and applying the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield.
The method of example 26b, wherein applying the transmission factors comprises applying background translation factors that are calculated using respective locations associated with background audio objects of the one or more audio objects.
The method of example 26b, wherein applying the transmission factors comprises applying foreground attenuation factors to respective foreground audio objects of the one or more audio objects.
The method of example 26b, further comprising: determining a minimum transmission value for the respective foreground audio objects; determining whether applying the transmission factors to the respective foreground audio objects produces an adjusted transmission value that is lower than the minimum transmission value; and responsive to determining that the adjusted transmission value is lower than the minimum transmission value, rendering the respective foreground audio objects using the minimum transmission value.
The method of example 26b, wherein applying the transmission factors comprises adjusting an energy of the respective foreground audio objects.
The method of example 30b, wherein adjusting the energy comprises attenuating respective energies of the respective foreground audio objects.
The method of example 26b, wherein applying the transmission factors comprises adjusting directional characteristics of the respective foreground audio objects.
The method of example 26b, wherein applying the transmission factors comprises adjusting parallax information of the respective foreground audio objects.
The method of example 33b, wherein adjusting the parallax information comprises adjusting the parallax information to account for one or more silent objects represented in a video stream associated with the 3D soundfield.
The method of example 26b, wherein receiving the metadata comprises receiving the metadata within the bitstream.
The method of example 26b, wherein receiving the metadata comprises receiving the metadata out of band with respect to the bitstream.
A non-transitory computer-readable storage medium encoded with instructions that, when executed, cause processing circuitry of an audio decoding device to: receive, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield; receive metadata associated with the bitstream; obtain, from the received metadata, one or more transmission factors associated with one or more of the audio objects; and apply the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield.
An audio decoding apparatus comprising: means for receiving, in a bitstream, encoded representations of audio objects of a three-dimensional (3D) soundfield; means for receiving metadata associated with the bitstream; means for obtaining, from the received metadata, one or more transmission factors associated with one or more of the audio objects; and means for applying the transmission factors to the one or more audio objects to obtain parallax-adjusted audio objects of the 3D soundfield.
A method comprising: determining relative foreground location information between a listener location and respective locations associated with one or more foreground audio objects of a three-dimensional (3D) soundfield, the respective locations associated with the one or more foreground audio objects being obtained from video data associated with the 3D soundfield.
The method of example 1c, further comprising applying a coordinate system to determine the relative foreground location information.
The method of any of examples 1c or 2c, further comprising determining the listener location information by detecting a device.
The method of example 3c, wherein the device comprises a virtual reality (VR) headset.
The method of any of examples 1c or 2c, further comprising determining the listener location information by detecting a person.
The method of any of examples 1c or 2c, further comprising determining the listener location using a point cloud based interpolation process.
The method of example 6c, wherein using the point cloud based interpolation process comprises: obtaining a plurality of listener location candidates; and interpolating the listener location between at least two listener location candidates of the obtained plurality of listener location candidates.
An audio decoding device comprising: a memory device configured to store a listener location and respective locations associated with one or more foreground audio objects of a three-dimensional (3D) soundfield, the respective locations associated with the one or more foreground audio objects being obtained from video data associated with the 3D soundfield; and processing circuitry coupled to the memory device, the processing circuitry being configured to determine relative foreground location information between the listener location and the respective locations associated with the one or more foreground audio objects of the 3D soundfield.
The audio decoding device of example 8c, the processing circuitry being further configured to apply a coordinate system to determine the relative foreground location information.
The audio decoding device of any of examples 8c or 9c, the processing circuitry being further configured to determine the listener location information by detecting a device.
The audio decoding device of example 10c, wherein the detected device comprises one or more of a virtual reality (VR) headset, a mixed reality (MR) headset, or an augmented reality (AR) headset.
The audio decoding device of any of examples 8c or 9c, the processing circuitry being further configured to determine the listener location information by detecting a person.
The audio decoding device of any of examples 8c or 9c, the processing circuitry being further configured to determine the listener location using a point cloud based interpolation process.
The audio decoding device of example 13c, the processing circuitry being further configured to: obtain a plurality of listener location candidates; and interpolate the listener location between at least two listener location candidates of the obtained plurality of listener location candidates.
An audio decoding apparatus comprising: means for determining relative foreground location information between a listener location and respective locations associated with one or more foreground audio objects of a three-dimensional (3D) soundfield, the respective locations associated with the one or more foreground audio objects being obtained from video data associated with the 3D soundfield.
A non-transitory computer-readable storage medium encoded with instructions that, when executed, cause processing circuitry of an audio decoding device to: determine relative foreground location information between a listener location and respective locations associated with one or more foreground audio objects of a three-dimensional (3D) soundfield, the respective locations associated with the one or more foreground audio objects being obtained from video data associated with the 3D soundfield.
A method comprising: generating metadata associated with a bitstream that includes encoded representations of audio objects of a three-dimensional (3D) soundfield, the metadata including one or more of transmission factors with respect to the audio objects, relative foreground location information between listener location information and respective locations associated with foreground audio objects of the audio objects, or location information for one or more silent objects of the audio objects.
The method of example 1d, wherein generating the metadata comprises generating the transmission factors based on attenuation information associated with the silent objects and the foreground audio objects.
The method of example 2d, wherein the transmission factors represent energy attenuation information with respect to the foreground audio objects based on the location information for the silent objects.
The method of any of examples 2d or 3d, wherein the transmission factors represent directional attenuation information with respect to the foreground audio objects based on the location information for the silent objects.
The method of any of examples 2d-4d, further comprising determining the transmission factors based on the listener location information and the location information for the silent objects.
The method of any of examples 2d-5d, further comprising determining the transmission factors based on the listener location information and location information for the foreground audio objects.
The method of any of examples 1d-6d, further comprising: generating the bitstream that includes the encoded representations of the audio objects of the 3D soundfield; and signaling the bitstream.
The method of example 7d, further comprising signaling the metadata within the bitstream.
The method of example 7d, further comprising signaling the metadata out-of-band with respect to the bitstream.
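The in-band versus out-of-band signaling options above can be pictured with a toy container layout. The field names and structure below are hypothetical illustrations, not drawn from any codec or transport specification:

```python
# Illustrative only: how transmission-factor metadata might travel either
# within the bitstream or as a separate (out-of-band) channel.
encoded_audio = b"<encoded foreground/background payload>"
transmission_metadata = {"speaker": {"energy_factor": 0.6}}

# In-band: the metadata is multiplexed into the same bitstream that carries
# the encoded audio objects.
in_band_bitstream = {"audio": encoded_audio, "metadata": transmission_metadata}

# Out-of-band: the bitstream carries only the audio; the metadata is
# signaled separately and re-associated at the decoder, here by a
# hypothetical stream identifier.
out_of_band_bitstream = {"audio": encoded_audio}
side_channel = {"stream_id": 1, "metadata": transmission_metadata}
```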
A method comprising: obtaining metadata that includes transmission factors with respect to one or more audio objects of a three-dimensional (3D) soundfield; and applying the transmission factors to audio signals associated with the one or more audio objects of the 3D soundfield.
The method of example 10d, wherein applying the transmission factors to the audio signals comprises attenuating energy information for the one or more audio signals.
The method of any of examples 10d or 11d, wherein the one or more audio objects comprise foreground audio objects of the 3D soundfield.
An audio encoding device comprising: a memory device configured to store encoded representations of audio objects of a three-dimensional (3D) soundfield; and processing circuitry coupled to the memory device and configured to generate metadata associated with a bitstream that includes the encoded representations of the audio objects of the 3D soundfield, the metadata including one or more of transmission factors with respect to the audio objects, relative foreground location information between listener location information and respective locations associated with foreground audio objects of the audio objects, or location information for one or more silent objects of the audio objects.
The audio encoding device of example 13d, the processing circuitry being configured to generate the transmission factors based on attenuation information associated with the silent objects and the foreground audio objects.
The audio encoding device of example 14d, wherein the transmission factors represent energy attenuation information with respect to the foreground audio objects based on the location information for the silent objects.
The audio encoding device of any of examples 14d or 15d, wherein the transmission factors represent directional attenuation information with respect to the foreground audio objects based on the location information for the silent objects.
The audio encoding device of any of examples 14d-16d, the processing circuitry being further configured to determine the transmission factors based on the listener location information and the location information for the silent objects.
The audio encoding device of any of examples 14d-17d, the processing circuitry being further configured to determine the transmission factors based on the listener location information and location information for the foreground audio objects.
The audio encoding device of any of examples 13d-18d, the processing circuitry being further configured to: generate the bitstream that includes the encoded representations of the audio objects of the 3D soundfield; and signal the bitstream.
The audio encoding device of example 19d, the processing circuitry being configured to signal the metadata within the bitstream.
The audio encoding device of example 19d, the processing circuitry being configured to signal the metadata out-of-band with respect to the bitstream.
An audio decoding device comprising: a memory device configured to store one or more audio objects of a three-dimensional (3D) soundfield; and processing circuitry coupled to the memory device, and configured to: obtain metadata that includes transmission factors with respect to the one or more audio objects of the 3D soundfield; and apply the transmission factors to audio signals associated with the one or more audio objects of the 3D soundfield.
The audio decoding device of example 22d, the processing circuitry being further configured to attenuate energy information for the one or more audio signals.
The audio decoding device of any of examples 22d or 23d, wherein the one or more audio objects comprise foreground audio objects of the 3D soundfield.
An audio encoding apparatus comprising: means for generating metadata associated with a bitstream that includes encoded representations of audio objects of a three-dimensional (3D) soundfield, the metadata including one or more of transmission factors with respect to the audio objects, relative foreground location information between listener location information and respective locations associated with foreground audio objects of the audio objects, or location information for one or more silent objects of the audio objects.
An audio decoding apparatus comprising: means for obtaining metadata that includes transmission factors with respect to one or more audio objects of a three-dimensional (3D) soundfield; and means for applying the transmission factors to audio signals associated with the one or more audio objects of the 3D soundfield.
An integrated device comprising: the audio encoding device of example 13d; and the audio decoding device of example 22d.
A method of rendering a three-dimensional (3D) soundfield, the method comprising: applying a transmission factor to a foreground audio signal for a foreground audio object to attenuate one or more characteristics of the foreground audio signal.
The method of example 1e, wherein attenuating the characteristics of the foreground audio signal comprises attenuating an energy of the foreground audio signal.
The method of any of examples 1e or 2e, further comprising applying a translation factor to a background audio object.
An audio decoding device comprising: a memory device configured to store a foreground audio object of a three-dimensional (3D) soundfield; and processing circuitry coupled to the memory device and configured to apply a transmission factor to a foreground audio signal for a foreground audio object to attenuate one or more characteristics of the foreground audio signal.
The audio decoding device of example 4e, the processing circuitry being configured to attenuate an energy of the foreground audio signal.
The audio decoding device of any of examples 4e or 5e, the processing circuitry being configured to apply a translation factor to a background audio object.
An audio decoding apparatus comprising: means for applying a transmission factor to a foreground audio signal for a foreground audio object of a three-dimensional (3D) soundfield to attenuate one or more characteristics of the foreground audio signal.
A method of rendering a three-dimensional (3D) soundfield, the method comprising: calculating, for each respective foreground audio object of a plurality of foreground audio objects, a respective product of a respective set of a transmission factor, a foreground audio signal, and a directional vector; and calculating a summation of the respective products for all foreground audio objects of the plurality of foreground audio objects.
The method of example 1f, further comprising: calculating, for each respective background audio object of a plurality of background audio objects, a respective product of a respective background audio signal and a respective translation factor; and calculating a summation of the respective products for all background audio objects of the plurality of background audio objects.
The method of example 2f, further comprising adding the summation of the products for the foreground audio objects to the summation of the products for the background audio objects.
The method of any of examples 1f-3f, further comprising performing all calculations in a higher order ambisonics (HOA) domain.
An audio decoding device comprising: a memory device configured to store a plurality of foreground audio objects of a three-dimensional (3D) soundfield; and processing circuitry coupled to the memory device, and being configured to: calculate, for each respective foreground audio object of the plurality of foreground audio objects, a respective product of a respective set of a transmission factor, a foreground audio signal, and a directional vector; and calculate a summation of the respective products for all foreground audio objects of the plurality of foreground audio objects.
The audio decoding device of example 5f, the memory device being further configured to store a plurality of background audio objects, the processing circuitry being further configured to: calculate, for each respective background audio object of the plurality of background audio objects, a respective product of a respective background audio signal and a respective translation factor; and calculate a summation of the respective products for all background audio objects of the plurality of background audio objects.
The audio decoding device of example 6f, the processing circuitry being further configured to add the summation of the products for the foreground audio objects to the summation of the products for the background audio objects.
The audio decoding device of any of examples 5f-7f, the processing circuitry being further configured to perform all calculations in a higher order ambisonics (HOA) domain.
An audio decoding apparatus comprising: means for calculating, for each respective foreground audio object of a plurality of foreground audio objects of a three-dimensional (3D) soundfield, a respective product of a respective set of a transmission factor, a foreground audio signal, and a directional vector; and means for calculating a summation of the respective products for all foreground audio objects of the plurality of foreground audio objects.
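The foreground and background summations in the preceding examples (transmission factor times foreground signal times directional vector, summed over foreground objects, plus background signal times translation factor, summed over background objects) can be sketched as follows. Plain Python lists stand in for HOA-domain coefficient vectors, and all values are illustrative:

```python
def render_soundfield(foregrounds, backgrounds, num_coeffs=4):
    """Sketch of the rendering summation: sum over foreground objects of
    (transmission factor * signal * directional vector), plus sum over
    background objects of (signal * translation factor). Toy scalars and
    vectors stand in for HOA-domain quantities."""
    out = [0.0] * num_coeffs
    for obj in foregrounds:  # foreground contribution (example 1f)
        for i in range(num_coeffs):
            out[i] += obj["transmission"] * obj["signal"] * obj["direction"][i]
    for obj in backgrounds:  # background contribution (example 2f)
        for i in range(num_coeffs):
            out[i] += obj["signal"] * obj["translation"][i]
    return out  # the two summations added together (example 3f)

rendered = render_soundfield(
    foregrounds=[{"transmission": 0.5, "signal": 2.0,
                  "direction": [1.0, 0.0, 0.0, 0.0]}],
    backgrounds=[{"signal": 1.0, "translation": [0.0, 1.0, 0.0, 0.0]}],
)
```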
It should be understood that, depending on the example, certain acts or events of any of the methods described herein can be performed in a different sequence, may be added, merged, or left out altogether (e.g., not all described acts or events are necessary for the practice of the method). Moreover, in certain examples, acts or events may be performed concurrently, e.g., through multi-threaded processing, interrupt processing, or multiple processors, rather than sequentially. In addition, while certain aspects of this disclosure are described as being performed by a single module or unit for purposes of clarity, it should be understood that the techniques of this disclosure may be performed by a combination of units or modules associated with an audio coder.
In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium, as one or more instructions or code, and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol.
In this manner, computer-readable media generally may correspond to (1) tangible computer-readable storage media which is non-transitory or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure. A computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transient media, but are instead directed to non-transient, tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term “processor,” as used herein, may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. Such a processor may be formed in one or more microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), digital signal processors (DSPs), processing circuitry (including fixed function circuitry and/or programmable processing circuitry), or other equivalent integrated or discrete logic circuitry. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chip set). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
Various embodiments of the techniques have been described. These and other embodiments are within the scope of the following claims.
This application claims the benefit of U.S. Provisional Application No. 62/446,324 filed 13 Jan. 2017, the entire content of which is incorporated by reference herein.
Number | Name | Date | Kind |
---|---|---|---|
8805561 | Wilcock et al. | Aug 2014 | B2 |
20110200196 | Disch | Aug 2011 | A1 |
20110249821 | Jaillet et al. | Oct 2011 | A1 |
20110316967 | Etter | Dec 2011 | A1 |
20120177204 | Hellmuth | Jul 2012 | A1 |
20120206452 | Geisner | Aug 2012 | A1 |
20140233917 | Xiang | Aug 2014 | A1 |
20150055937 | Van Hoff et al. | Feb 2015 | A1 |
20150070274 | Morozov | Mar 2015 | A1 |
20150230040 | Squires et al. | Aug 2015 | A1 |
20160124707 | Ermilov et al. | May 2016 | A1 |
20160134988 | Gorzel | May 2016 | A1 |
20160227340 | Peters et al. | Aug 2016 | A1 |
20160241980 | Najaf-Zadeh et al. | Aug 2016 | A1 |
20160269712 | Ostrover et al. | Sep 2016 | A1 |
20160337630 | Raghoebardajal et al. | Nov 2016 | A1 |
20160373640 | Van Hoff et al. | Dec 2016 | A1 |
20170098453 | Wright et al. | Apr 2017 | A1 |
20170318360 | Tran et al. | Nov 2017 | A1 |
20170332186 | Riggs et al. | Nov 2017 | A1 |
20180220251 | Brettle et al. | Aug 2018 | A1 |
20180253275 | Helwani et al. | Sep 2018 | A1 |
20180300940 | Sakthivel et al. | Oct 2018 | A1 |
20190005986 | Peters et al. | Jan 2019 | A1 |
Number | Date | Country |
---|---|---|
2017098949 | Jun 2017 | WO |
2018031123 | Feb 2018 | WO |
Entry |
---|
“Call for Proposals for 3D Audio,” International Organisation for Standardisation Organisation Internationale de Normalisation ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Audio, ISO/IEC JTC1/SC29/WG11/N13411, Geneva, CH, Jan. 2013, pp. 1-20. |
“Information technology—High efficiency coding and media delivery in heterogeneous environments—Part 3: 3D Audio,” ISO/IEC JTC 1/SC 29N, ISO/IEC 23008-3:201x(E), Oct. 12, 2016, 797 pp. |
Rébillat M., et al., “SMART-I2: Spatial Multi-users Audio-visual Real Time Interactive Interface, a broadcast application context,” 3DTV Conference: The True Vision—Capture, Transmission and Display of 3D Video, May 2009, Potsdam, Germany, pp. 1-4. |
Response to Written Opinion dated Mar. 6, 2018, from International Application No. PCT/US2018/013526, filed on Sep. 12, 2018, 18 pp. |
Second Written Opinion, dated Oct. 29, 2018, for International Application No. PCT/US2018/013526, 8 pp. |
International Search Report and Written Opinion—PCT/US2018/013526—ISA/EPO—Mar. 6, 2018, 16 pp. |
“Call for Proposals for 3D Audio,” ISO/IEC JTC1/SC29/WG11/N13411, Jan. 2013, 20 pp. |
“Information technology—High efficiency coding and media delivery in heterogeneous environments—Part 3: 3D Audio, Amendment 3: MPEG-H 3D Audio Phase 2,” ISO/IEC JTC 1/SC 29N, Jul. 25, 2015, 208 pp. |
“Information technology—High efficiency coding and media delivery in heterogeneous environments—Part 3: 3D Audio,” ISO/IEC JTC 1/SC 29N, Apr. 4, 2014, 337 pp. |
“Information technology—High efficiency coding and media delivery in heterogeneous environments—Part 3: 3D Audio,” ISO/IEC JTC 1/SC 29, Jul. 25, 2014, 433 pp. |
“Information technology—High efficiency coding and media delivery in heterogeneous environments—Part 3: 3D audio,” ISO/IEC JTC 1/SC 29, ISO/IEC DIS 23008-3, Jul. 25, 2014, 311 pp. |
“Information technology—High efficiency coding and media delivery in heterogeneous environments—Part 3: 3D Audio,” ISO/IEC JTC 1/SC 29, ISO/IEC 23008-3:201x(E), Oct. 12, 2016, 797 pp. |
Herre et al., “MPEG-H 3D Audio—The New Standard for Coding of Immersive Spatial Audio,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, No. 5, Aug. 2015, pp. 770-779. |
Hollerweger et al., “An Introduction to Higher Order Ambisonic,” Oct. 2008, 13 pp. |
Poletti, “Three-Dimensional Surround Sound Systems Based on Spherical Harmonics,” J. Audio Eng. Soc., vol. 53, No. 11, Nov. 2005, pp. 1004-1025. |
Schonefeld, “Spherical Harmonics,” Jul. 1, 2005, Accessed online [Jul. 9, 2013] at URL:http://videoarch1.s-inf.de/˜volker/prosem_paper.pdf., 25 pp. |
Sen et al., “RM1-HOA Working Draft Text,” MPEG Meeting; Jan. 13-17, 2014; San Jose, CA; (Motion Picture Expert Group or ISO/IEC JTC1/SC29/WG11), No. M31827, XP030060280, 83 pp. |
Sen et al., “Technical Description of the Qualcomm's HOA Coding Technology for Phase II,” ISO/IEC JTC/SC29/WG11 MPEG 2014/M34104, Jul. 2014, 4 pp. |
Tingvall J., “Interior Design and Navigation in Virtual Reality,” Information Coding, last updated Nov. 30, 2015, accessed from [http://liu.divaportal.org/smash/record.jsf?pid=diva2%3A8750948,dswid=6015], 88 pp. |
Van Gelderen M., “The shift operators and translations of spherical harmonics,” DEOS Progress Letter 1998.1:57-67, accessed on Jul. 11, 2017, 11 pp. |
Lavalle S.M., et al., “Head Tracking for the Oculus Rift,” Oculus VR, Inc., accessed on Aug. 17, 2017, 8 pp. |
Peterson et al., “Virtual Reality, Augmented Reality, and Mixed Reality Definitions,” EMA, version 1.0, Jul. 7, 2017, 4 pp. |
U.S. Appl. No. 15/672,058, filed by Nils Gunther Peters, filed Aug. 8, 2017. |
U.S. Appl. No. 15/782,252, filed by Nils Gunther Peters, filed Oct. 12, 2017. |
Porschmann C., et al., “3-D Audio in Mobile Communication Devices: Methods for Mobile Head-Tracking,” Journal of Virtual Reality and Broadcasting, 2007, vol. 4, No. 13, 14 pages. |
Tylka J.G., et al., “Comparison of Techniques for Binaural Navigation of Higher-order Ambisonic Soundfields,” AES Convention 139; Oct. 2015, AES, 60 East 42nd Street, Room 2520 New York 10165-2520, USA, Oct. 23, 2015, XP040672273, 13 pages. |
International Preliminary Report on Patentability dated Mar. 11, 2019 from International Application No. PCT/US2018/013526, filed Sep. 12, 2018, 24 pages. |
Number | Date | Country | |
---|---|---|---|
20180206057 A1 | Jul 2018 | US |
Number | Date | Country | |
---|---|---|---|
62446324 | Jan 2017 | US |