This application claims the benefit of Greece Patent Application No. 20200100088, filed Feb. 20, 2020, the entire contents of which are hereby incorporated by reference in its entirety.
This disclosure relates to processing of audio data.
Computer-mediated reality systems are being developed to allow computing devices to augment or add to, remove or subtract from, or generally modify existing reality experienced by a user. Computer-mediated reality systems may include, as a couple of examples, virtual reality (VR) systems, augmented reality (AR) systems, and mixed reality (MR) systems. The perceived success of computer-mediated reality systems are generally related to the ability of such computer-mediated reality systems to provide a realistically immersive experience in terms of both the video and audio experience where the video and audio experience align in ways expected by the user. Although the human visual system is more sensitive than the human auditory systems (e.g., in terms of perceived localization of various objects within the scene), ensuring a adequate auditory experience is an increasingly import factor in ensuring a realistically immersive experience, particularly as the video experience improves to permit better localization of video objects that enable the user to better identify sources of audio content.
This disclosure generally relates to techniques for signaling the use of renders and associated parameters for audio data. The techniques may improve the listener experience, as the use or disuse of renderers may better preserve the artistic intent of the content producer.
As one example, an audio decoding device includes a memory configured to store at least a portion of a coded audio bitstream; and one or more processors configured to: decode, based on the coded audio bitstream, a representation of a soundfield; decode, based on the coded audio bitstream, a syntax element indicating a selection of either a head-related transfer function (HRTF) or a binaural room impulse response (BRIR); and render, using the selected HRTF or BRIR, speaker feeds from the soundfield.
As another example, an audio encoding device includes a memory configured to store at least a portion of a coded audio bitstream; and one or more processors configured to: encode, in the coded audio bitstream, a representation of a soundfield; and encode, in the coded audio bitstream, a syntax element indicating a selection of either a head-related transfer function (HRTF) or a binaural room impulse response (BRIR) for rendering the soundfield.
The details of one or more examples of this disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of various aspects of the techniques will be apparent from the description and drawings, and from the claims.
There are various ‘surround-sound’ channel-based formats in the market. They range, for example, from the 5.1 home theatre system (which has been the most successful in terms of making inroads into living rooms beyond stereo) to the 22.2 system developed by NHK (Nippon Hoso Kyokai or Japan Broadcasting Corporation). Content creators (e.g., Hollywood studios) would like to produce the soundtrack for a movie once, and not spend effort to remix it for each speaker configuration. A Moving Pictures Expert Group (MPEG) has released a standard allowing for soundfields to be represented using a hierarchical set of elements (e.g., Higher-Order Ambisonic—HOA—coefficients) that can be rendered to speaker feeds for most speaker configurations, including 5.1 and 22.2 configuration whether in location defined by various standards or in non-uniform locations.
MPEG released the standard as MPEG-H 3D Audio standard, formally entitled “Information technology—High efficiency coding and media delivery in heterogeneous environments—Part 3: 3D audio,” set forth by ISO/IEC JTC 1/SC 29, with document identifier ISO/IEC DIS 23008-3, and dated Jul. 25, 2014. MPEG also released a second edition of the 3D Audio standard, entitled “Information technology—High efficiency coding and media delivery in heterogeneous environments—Part 3: 3D audio, set forth by ISO/IEC JTC 1/SC 29, with document identifier ISO/IEC 23008-3:201x(E), and dated Oct. 12, 2016. Reference to the “3D Audio standard” in this disclosure may refer to one or both of the above standards.
As noted above, one example of a hierarchical set of elements is a set of spherical harmonic coefficients (SHC). The following expression demonstrates a description or representation of a soundfield using SHC:
The expression shows that the pressure pi at any point {rr, θt, φr} of the soundfield, at time t, can be represented uniquely by the SHC, Anm(k). Here,
c is the speed of sound (˜343 m/s), {rr, θr, φr} is a point of reference (or observation point), jn(⋅) is the spherical Bessel function of order n, and Ynm(θr, φr) are the spherical harmonic basis functions (which may also be referred to as a spherical basis function) of order n and suborder m. It can be recognized that the term in square brackets is a frequency-domain representation of the signal (i.e., S(ω, rr, θr, φr)) which can be approximated by various time-frequency transformations, such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT), or a wavelet transform. Other examples of hierarchical sets include sets of wavelet transform coefficients and other sets of coefficients of multiresolution basis functions.
The SHC Anm(k) can either be physically acquired (e.g., recorded) by various microphone array configurations or, alternatively, they can be derived from channel-based or object-based descriptions of the soundfield. The SHC (which also may be referred to as higher order ambisonic—HOA—coefficients) represent scene-based audio, where the SHC may be input to an audio encoder to obtain encoded SHC that may promote more efficient transmission or storage. For example, a fourth-order representation involving (1+4)2 (25, and hence fourth order) coefficients may be used.
As noted above, the SHC may be derived from a microphone recording using a microphone array. Various examples of how SHC may be derived from microphone arrays are described in Poletti, M., “Three-Dimensional Surround Sound Systems Based on Spherical Harmonics,” J. Audio Eng. Soc., Vol. 53, No. 11, 2005 November, pp. 1004-1025.
To illustrate how the SHCs may be derived from an object-based description, consider the following equation. The coefficients Anm(k) for the soundfield corresponding to an individual audio object may be expressed as:
Anm(k)=g(ω)(−4πik)hn(2)(krs)Ynm*(θs,φs),
where i is √{square root over (−1)}, hn(2)(⋅) is the spherical Hankel function (of the second kind) of order n, and {rs, θs, φs} is the location of the object. Knowing the object source energy g(ω) as a function of frequency (e.g., using time-frequency analysis techniques, such as performing a fast Fourier transform on the PCM stream) allows us to convert each PCM object and the corresponding location into the SHC Anm(k). Further, it can be shown (since the above is a linear and orthogonal decomposition) that the Anm(k) coefficients for each object are additive. In this manner, a number of PCM objects can be represented by the Anm(k) coefficients (e.g., as a sum of the coefficient vectors for the individual objects). Essentially, the coefficients contain information about the soundfield (the pressure as a function of 3D coordinates), and the above represents the transformation from individual objects to a representation of the overall soundfield, in the vicinity of the observation point {rr, θr, φr}.
Scene-based audio formats, such as the above noted SHC, represent one way by which to represent a soundfield. Other possible formats include channel-based audio formats and object-based audio formats. Channel-based audio formats refer to the 5.1 surround sound format, 7.1 surround sound formats, 22.2 surround sound formats, or any other channel-based format that localizes audio channels to particular locations around the listener in order to recreate a soundfield.
Object-based audio formats may refer to formats in which audio objects, often encoded using pulse-code modulation (PCM) and referred to as PCM audio objects, are specified in order to represent the soundfield. Such audio objects may include metadata identifying a location of the audio object relative to a listener or other point of reference in the soundfield, such that the audio object may be rendered to one or more speaker channels for playback in an effort to recreate the soundfield. The techniques described in this disclosure may apply to any of the foregoing formats, including scene-based audio formats, channel-based audio formats, object-based audio formats, or any combination thereof.
The source device 12 may be operated by an entertainment company or other entity that may generate multi-channel audio content for consumption by operators of content consumer devices, such as the content consumer device 14. In many VR scenarios, the source device 12 generates audio content in conjunction with video content. The source device 12 includes a content capture device 300 and a soundfield representation generator 302. The content capture device 300 may be configured to interface or otherwise communicate with a microphone 5. The microphone 5 may represent an Eigenmike® or other type of 3D audio microphone capable of capturing and representing the soundfield as audio data 11, which may refer to one or more of the above noted scene-based audio data (such as HOA coefficients), object-based audio data, and channel-based audio data.
The content capture device 300 may, in some examples, include an integrated microphone 5 that is integrated into the housing of the content capture device 300. The content capture device 300 may interface wirelessly or via a wired connection with the microphone 5. Rather than capture, or in conjunction with capturing, audio data via microphone 5, the content capture device 300 may process the audio data 11 after the audio data 11 is input via some type of removable storage, wirelessly and/or via wired input processes. As such, various combinations of the content capture device 300 and the microphone 5 are possible in accordance with this disclosure.
The content capture device 300 may also be configured to interface or otherwise communicate with the soundfield representation generator 302. The soundfield representation generator 302 may include any type of hardware device capable of interfacing with the content capture device 300. The soundfield representation generator 302 may use the audio data 11 provided by the content capture device 300 to generate various representations of the same soundfield represented by the audio data 11.
For instance, to generate the different representations of the soundfield using HOA coefficients (which again is one example of the audio data 11), soundfield representation generator 302 may use a coding scheme for ambisonic representations of a soundfield, referred to as Mixed Order Ambisonics (MOA) as discussed in more detail in U.S. application Ser. No. 15/672,058, entitled “MIXED-ORDER AMBISONICS (MOA) AUDIO DATA FO COMPUTER-MEDIATED REALITY SYSTEMS,” and filed Aug. 8, 2017.
To generate a particular MOA representation of the soundfield, the soundfield representation generator 302 may generate a partial subset of the full set of HOA coefficients. For instance, each MOA representation generated by the soundfield representation generator 302 may provide precision with respect to some areas of the soundfield, but less precision in other areas. In one example, an MOA representation of the soundfield may include eight (8) uncompressed HOA coefficients of the HOA coefficients, while the third order HOA representation of the same soundfield may include sixteen (16) uncompressed HOA coefficients of the HOA coefficients. As such, each MOA representation of the soundfield that is generated as a partial subset of the HOA coefficients may be less storage-intensive and less bandwidth intensive (if and when transmitted as part of the bitstream 21 over the illustrated transmission channel) than the corresponding third order HOA representation of the same soundfield generated from the HOA coefficients.
Although described with respect to MOA representations, the techniques of this disclosure may also be performed with respect to full-order ambisonic (FOA) representations in which all of the HOA coefficients for a given order N are used to represent the soundfield. In other words, rather than represent the soundfield using a partial, non-zero subset of the HOA coefficients, the soundfield representation generator 302 may represent the soundfield using all of the HOA coefficients for a given order N, resulting in a total of HOA coefficients equaling (N+1)2.
In this respect, the higher order ambisonic audio data (which is another way to refer to HOA coefficients in either MOA representations or FOA representations) may include higher order ambisonic coefficients associated with spherical basis functions having an order of one or less (which may be referred to as “1st order ambisonic audio data”), higher order ambisonic coefficients associated with spherical basis functions having a mixed order and suborder (which may be referred to as the “MOA representation” discussed above), or higher order ambisonic coefficients associated with spherical basis functions having an order greater than one (which is referred to above as the “FOA representation”).
The content capture device 300 may also be configured to interface or otherwise communicate with the soundfield representation generator 302. The soundfield representation generator 302 may include any type of hardware device capable of interfacing with the content capture device 300. The soundfield representation generator 302 may the use audio data 11 provided by the content capture device 300 to generate various representations of the same soundfield represented by the audio data 11. For instance, to generate the different representations of the soundfield using the audio data 11, soundfield representation generator 302 may use a coding scheme for ambisonic representations of a soundfield, referred to as Mixed Order Ambisonics (MOA) as discussed in more detail in U.S. application Ser. No. 15/672,058, entitled “MIXED-ORDER AMBISONICS (MOA) AUDIO DATA FOR COMPUTER-MEDIATED REALITY SYSTEMS,” and filed Aug. 8, 2017.
To generate a particular MOA representation of the soundfield, the soundfield representation generator 302 may generate a partial subset of the full set of HOA coefficients (e.g., audio data 11). For instance, each MOA representation generated by the soundfield representation generator 302 may provide precision with respect to some areas of the soundfield, but less precision in other areas. In one example, an MOA representation of the soundfield may include eight (8) uncompressed HOA coefficients of the HOA coefficients, while the third order HOA representation of the same soundfield may include sixteen (16) uncompressed HOA coefficients of the HOA coefficients. As such, each MOA representation of the soundfield that is generated as a partial subset of the HOA coefficients may be less storage-intensive and less bandwidth intensive (if and when transmitted as part of the bitstream 21 over the illustrated transmission channel) than the corresponding third order HOA representation of the same soundfield generated from the HOA coefficients.
Although described with respect to MOA representations, the techniques of this disclosure may also be performed with respect to full-order ambisonic (FOA) representations in which all of the HOA coefficients for a given order N are used to represent the soundfield. In other words, rather than represent the soundfield using a partial, non-zero subset of the HOA coefficients, the soundfield representation generator 302 may represent the soundfield using all of the HOA coefficients for a given order N, resulting in a total of HOA coefficients equaling (N+1)2.
In this respect, the audio data 11 may include higher order ambisonic coefficients associated with spherical basis functions having an order of one or less (which may be referred to as “1st order ambisonic audio data”), higher order ambisonic coefficients associated with spherical basis functions having a mixed order and suborder (which may be referred to as the “MOA representation” discussed above), or higher order ambisonic coefficients associated with spherical basis functions having an order greater than one (which is referred to above as the “FOA representation”).
The content capture device 300 may, in some examples, be configured to wirelessly communicate with the soundfield representation generator 302. In some examples, the content capture device 300 may communicate, via one or both of a wireless connection or a wired connection, with the soundfield representation generator 302. Via the connection between the content capture device 300 and the soundfield representation generator 302, the content capture device 300 may provide content in various forms of content, which, for purposes of discussion, are described herein as being portions of the HOA coefficients.
In some examples, the content capture device 300 may leverage various aspects of the soundfield representation generator 302 (in terms of hardware or software capabilities of the soundfield representation generator 302). For example, the soundfield representation generator 302 may include dedicated hardware configured to (or specialized software that when executed causes one or more processors to) perform psychoacoustic audio encoding (such as a unified speech and audio coder denoted as “USAC” set forth by the Motion Picture Experts Group (MPEG) or the MPEG-H 3D audio coding standard). The content capture device 300 may not include the psychoacoustic audio encoder dedicated hardware or specialized software and instead provide audio aspects of the content 301 in a non-psychoacoustic-audio-coded form. The soundfield representation generator 302 may assist in the capture of content 301 by, at least in part, performing psychoacoustic audio encoding with respect to the audio aspects of the content 301.
The soundfield representation generator 302 may also assist in content capture and transmission by generating one or more bitstreams 21 based, at least in part, on the audio content (e.g., MOA representations and/or third order HOA representations) generated from the HOA coefficients. The bitstream 21 may represent a compressed version of the HOA coefficients (and/or the partial subsets thereof used to form MOA representations of the soundfield) and any other different types of the content 301 (such as a compressed version of spherical video data, image data, or text data).
The soundfield representation generator 302 may generate the bitstream 21 for transmission, as one example, across a transmission channel, which may be a wired or wireless channel, a data storage device, or the like. The bitstream 21 may represent an encoded version of the HOA coefficients (and/or the partial subsets thereof used to form MOA representations of the soundfield) and may include a primary bitstream and another side bitstream, which may be referred to as side channel information. In some instances, the bitstream 21 representing the compressed version of the HOA coefficients may conform to bitstreams produced in accordance with the MPEG-H 3D audio coding standard.
The content consumer device 14 may be operated by an individual, and may represent a VR client device. Although described with respect to a VR client device, content consumer device 14 may represent other types of devices, such as an augmented reality (AR) client device, a mixed reality (MR) client device (or any other type of head-mounted display device), a standard computer, a headset, headphones, or any other device capable of tracking head movements and/or general translational movements of the individual operating the client consumer device 14. As shown in the example of
While shown in
Alternatively, the source device 12 may store the bitstream 21 to a storage medium, such as a compact disc, a digital video disc, a high definition video disc or other storage media, most of which are capable of being read by a computer and therefore may be referred to as computer-readable storage media or non-transitory computer-readable storage media. In this context, the transmission channel may refer to the channels by which content stored to the mediums are transmitted (and may include retail stores and other store-based delivery mechanism). In any event, the techniques of this disclosure should not therefore be limited in this respect to the example of
As noted above, the content consumer device 14 includes the audio playback system 16. The audio playback system 16 may represent any system capable of playing back multi-channel audio data. The audio playback system 16 may include a number of different audio renderers 22. The renderers 22 may each provide for a different form of audio rendering, where the different forms of rendering may include one or more of the various ways of performing vector-base amplitude panning (VBAP), and/or one or more of the various ways of performing soundfield synthesis. As used herein, “A and/or B” means “A or B”, or both “A and B”.
The audio playback system 16 may further include an audio decoding device 24. The audio decoding device 24 may represent a device configured to decode bitstream 21 to output reconstructed audio data 11A′-11N′ (which may form the full third order HOA representation or a subset thereof that forms an MOA representation of the same soundfield or decompositions thereof, such as the predominant audio signal, ambient HOA coefficients, and the vector based signal described in the MPEG-H 3D Audio Coding Standard).
As such, the audio data 11A′-11N′ (“audio data 11′”) may be similar to a full set or a partial subset of the HOA coefficients, but may differ due to lossy operations (e.g., quantization) and/or transmission via the transmission channel. The audio playback system 16 may, after decoding the bitstream 21 to obtain the HOA coefficients 11′, obtain HOA audio data 15 from the different streams of HOA coefficients 11′, and render the HOA audio data 15 to output speaker feeds 25. The speaker feeds 25 may drive one or more speakers (which are not shown in the example of
To select the appropriate renderer or, in some instances, generate an appropriate renderer, the audio playback system 16 may obtain loudspeaker information 13 indicative of a number of loudspeakers and/or a spatial geometry of the loudspeakers. In some instances, the audio playback system 16 may obtain the loudspeaker information 13 using a reference microphone and driving the loudspeakers in such a manner as to dynamically determine the loudspeaker information 13. In other instances, or in conjunction with the dynamic determination of the loudspeaker information 13, the audio playback system 16 may prompt a user to interface with the audio playback system 16 and input the loudspeaker information 13.
The audio playback system 16 may select one of the audio renderers 22 based on the loudspeaker information 13. In some instances, the audio playback system 16 may, when none of the audio renderers 22 are within some threshold similarity measure (in terms of the loudspeaker geometry) to the loudspeaker geometry specified in the loudspeaker information 13, generate the one of audio renderers 22 based on the loudspeaker information 13. The audio playback system 16 may, in some instances, generate one of the audio renderers 22 based on the loudspeaker information 13 without first attempting to select an existing one of the audio renderers 22.
When outputting the speaker feeds 25 to headphones, the audio playback system 16 may utilize one of the renderers 22 that provides for binaural rendering using head-related transfer functions (HRTF) or other functions capable of rendering to left and right speaker feeds 25 for headphone speaker playback. The terms “speakers” or “transducer” may generally refer to any speaker, including loudspeakers, headphone speakers, etc. One or more speakers may then playback the rendered speaker feeds 25.
Although described as rendering the speaker feeds 25 from the HOA audio data 11′, reference to rendering of the speaker feeds 25 may refer to other types of rendering, such as rendering incorporated directly into the decoding of the HOA audio data 15 from the bitstream 21. An example of the alternative rendering can be found in Annex G of the MPEG-H 3D audio coding standard, where rendering occurs during the predominant signal formulation and the background signal formation prior to composition of the soundfield. As such, reference to rendering of the HOA audio data 15 should be understood to refer to both rendering of the actual HOA audio data 15 or decompositions or representations thereof of the HOA audio data 15 (such as the above noted predominant audio signal, the ambient HOA coefficients, and/or the vector-based signal—which may also be referred to as a V-vector).
As described above, the content consumer device 14 may represent a VR device in which a human wearable display is mounted in front of the eyes of the user operating the VR device.
Video, audio, and other sensory data may play important roles in the VR experience. To participate in a VR experience, the user 402 may wear the VR device 400 (which may also be referred to as a VR headset 400) or other wearable electronic device. The VR client device (such as the VR headset 400) may track head movement of the user 402, and adapt the video data shown via the VR headset 400 to account for the head movements, providing an immersive experience in which the user 402 may experience a virtual world shown in the video data in visual three dimensions.
While VR (and other forms of AR and/or MR, which may generally be referred to as a computer mediated reality device) may allow the user 402 to reside in the virtual world visually, often the VR headset 400 may lack the capability to place the user in the virtual world audibly. In other words, the VR system (which may include a computer responsible for rendering the video data and audio data—that is not shown in the example of
The audio aspects of VR have been classified into three separate categories of immersion. The first category provides the lowest level of immersion, and is referred to as three degrees of freedom (3DOF). 3DOF refers to audio rendering that accounts for movement of the head in the three degrees of freedom (yaw, pitch, and roll), thereby allowing the user to freely look around in any direction. 3DOF, however, cannot account for translational head movements in which the head is not centered on the optical and acoustical center of the soundfield.
The second category, referred to 3DOF plus (3DOF+), provides for the three degrees of freedom (yaw, pitch, and roll) in addition to limited spatial translational movements due to the head movements away from the optical center and acoustical center within the soundfield. 3DOF+ may provide support for perceptual effects such as motion parallax, which may strengthen the sense of immersion.
The third category, referred to as six degrees of freedom (6DOF), renders audio data in a manner that accounts for the three degrees of freedom in term of head movements (yaw, pitch, and roll) but also accounts for translation of the user in space (x, y, and z translations). The spatial translations may be induced by sensors tracking the location of the user in the physical world or by way of an input controller.
3DOF rendering is the current state of the art for VR. As such, the audio aspects of VR are less immersive than the video aspects, thereby potentially reducing the overall immersion experienced by the user, and introducing localization errors (such as when the auditory playback does not match or correlate exactly to the visual scene).
Various ways by which to perform interpolation with respect to the existing audio streams 11 and thereby allow for 6DOF immersion. As described below, the techniques may improve the listener experience, while also reducing soundfield reproduction localization errors, as the interpolated audio stream may better reflect a location of a listener relative to the existing audio streams, thereby improving the operation of a playback device (that performs the techniques to reproduce the soundfield) itself.
In operation, the audio playback system 16 may include an interpolation device 30 (“INT DEVICE 30”), e.g., as shown in
The interpolation device 30 may first obtain one or more microphone locations, each of the one or more microphone locations identifying a location of a microphone that captured the one or more audio streams 11. More information regarding operation of the interpolation device 30 is described with respect to the examples of
The systems layer 502, which may also be referred to as a transport layer, may carry various parameters for use by the decoder system 500 when decoding and rendering audio data. As shown in
The MPEG-H decoder 504 may receive the MPEG-I audio bitstream and decode said bitstream into channels, objects, and/or HOA coefficients. The MPEG-H decoder 504 may provide the decoded data to the audio renderer 510.
The 6DoF metadata processor 506 may process 6DoF metadata from the MPEG-I audio bitstream. The 6DoF metadata processor 506 may the 6DoF metadata to the audio renderer 510.
The low-delay decoder 508 may decode an audio bitstream into one or more audio channels. The low-delay decoder 508 may provide the audio channels to the audio renderer 510.
The audio renderer 510 may perform one or more operations to render audio data for playback at a device (e.g., a VR client device). As shown in
Some versions of MPEG-H support (e.g., via the systems layer 502) the transmission of: downmix matrices (e.g., for format conversion), HOA rendering matrices, and EQ settings. However, in some examples, it may be desirable for the decoder system 500 to be able to utilize additional tools (e.g., in order to better preserve the artistic intent of a content producer).
In accordance with one or more techniques of this disclosure, the decoder system 500 may be configured to receive, and a corresponding encoder system may be configured to transmit, syntax elements indicating the use/disuse and any relevant parameters of additional audio tools. For instance, the decoder system 500 may receive the syntax elements to better preserve artistic intent in AR/VR applications.
As one example, the decoder system 500 may obtain syntax elements representing a plurality of room reverb coefficient sets, each of the room reverb coefficient sets corresponding to a different candidate position (e.g., multiple room reverb coefficients (RIRs) at variable candidate positions). The audio renderer 510 may then select one of the room reverb coefficient sets based on obtained metadata (e.g., user position/orientation).
As another example, the decoder system 500 may obtain a syntax element (e.g., a flag) indicating a selection of either a head-related transfer function (HRTF) or a binaural room impulse response (BRIR). The audio renderer 510 may then use the selected HRTF or BRIR when rendering the audio data. In some examples, the syntax element may indicate HRTF for outdoor and BRIR for indoor.
As another example, the decoder system 500 may obtain one or more syntax elements to recreate BRIR (HRTF+room reverb characteristics (RIR)→BRIR). Some example, room reverb characteristics/parameters include, but are not limited to, T60, direct to reverberant ratio at specific position, change in direct to reverberant position with distance, and first reflection time. In some examples, the decoder system 500 may utilize these parameters together with a convolution based renderer coefficients (e.g., for high bandwidth TX).
As another example, the decoder system 500 may obtain one or more syntax elements to indicate which audio tools shall be used or shall not be used. Some example audio tools include, but are not limited to, reverb (e.g., for synthetic or recorded scenes), doppler, occlusion, etc. In some examples, the one or more syntax elements that indicate the use of audio tools include a separate flag for each audio tool.
In this way, decoder system 500 may support transmission of one or more of the following:
The audio playback system 16 may output the left and right speaker feeds 103 to headphones 104, which may represent another example of a wearable device and which may be coupled to additional wearable devices to facilitate reproduction of the soundfield, such as a watch, the VR headset noted above, smart glasses, smart clothing, smart rings, smart bracelets or any other types of smart jewelry (including smart necklaces), and the like. The headphones 104 may couple wirelessly or via wired connection to the additional wearable devices.
Additionally, the headphones 104 may couple to the audio playback system 16 via a wired connection (such as a standard 3.5 mm audio jack, a universal system bus (USB) connection, an optical audio jack, or other forms of wired connection) or wirelessly (such as by way of a Bluetooth™ connection, a wireless network connection, and the like). The headphones 104 may recreate, based on the left and right speaker feeds 103, the soundfield represented by the HOA coefficients 11. The headphones 104 may include a left headphone speaker and a right headphone speaker which are powered (or, in other words, driven) by the corresponding left and right speaker feeds 103.
The wearable device 800 may represent other types of devices, such as a watch (including so-called “smart watches”), glasses (including so-called “smart glasses”), headphones (including so-called “wireless headphones” and “smart headphones”), smart clothing, smart jewelry, and the like. Whether representative of a VR device, a watch, glasses, and/or headphones, the wearable device 800 may communicate with the computing device supporting the wearable device 800 via a wired connection or a wireless connection.
In some instances, the computing device supporting the wearable device 800 may be integrated within the wearable device 800 and as such, the wearable device 800 may be considered as the same device as the computing device supporting the wearable device 800. In other instances, the wearable device 800 may communicate with a separate computing device that may support the wearable device 800. In this respect, the term “supporting” should not be understood to require a separate dedicated device but that one or more processors configured to perform various aspects of the techniques described in this disclosure may be integrated within the wearable device 800 or integrated within a computing device separate from the wearable device 800.
For example, when the wearable device 800 represents the VR device 400, a separate dedicated computing device (such as a personal computer including the one or more processors) may render the audio and visual content, while the wearable device 800 may determine the translational head movement upon which the dedicated computing device may render, based on the translational head movement, the audio content (as the speaker feeds) in accordance with various aspects of the techniques described in this disclosure. As another example, when the wearable device 800 represents smart glasses, the wearable device 800 may include the one or more processors that both determine the translational head movement (by interfacing with one or more sensors of the wearable device 800) and render, based on the determined translational head movement, the speaker feeds.
As shown, the wearable device 800 includes a rear camera, one or more directional speakers, one or more tracking and/or recording cameras, and one or more light-emitting diode (LED) lights. In some examples, the LED light(s) may be referred to as “ultra bright” LED light(s). In addition, the wearable device 800 includes one or more eye-tracking cameras, high sensitivity audio microphones, and optics/projection hardware. The optics/projection hardware of the wearable device 800 may include durable semi-transparent display technology and hardware.
The wearable device 800 also includes connectivity hardware, which may represent one or more network interfaces that support multimode connectivity, such as 4G communications, 5G communications, etc. The wearable device 800 also includes ambient light sensors, and bone conduction transducers. In some instances, the wearable device 800 may also include one or more passive and/or active cameras with fisheye lenses and/or telephoto lenses. Various devices of this disclosure, such as the content consumer device 14 of
Furthermore, the tracking and recording cameras and other sensors may facilitate the determination of translational distance. Although not shown in the example of
Although described with respect to particular examples of wearable devices, such as the VR device 400 discussed above with respect to the examples of
In the example of
The headphones 104 may couple to the audio playback system 16 via a wired connection (such as a standard 3.5 mm audio jack, a universal system bus (USB) connection, an optical audio jack, or other forms of wired connection) or wirelessly (such as by way of a Bluetooth™ connection, a wireless network connection, and the like). The headphones 104 may recreate, based on the left and right speaker feeds 103, the soundfield represented by the HOA coefficients 11. The headphones 104 may include a left headphone speaker and a right headphone speaker which are powered (or, in other words, driven) by the corresponding left and right speaker feeds 103.
It is to be recognized that depending on the example, certain acts or events of any of the techniques described herein can be performed in a different sequence, may be added, merged, or left out altogether (e.g., not all described acts or events are necessary for the practice of the techniques). Moreover, in certain examples, acts or events may be performed concurrently, e.g., through multi-threaded processing, interrupt processing, or multiple processors, rather than sequentially.
In some examples, the VR device (or the streaming device) may communicate, using a network interface coupled to a memory of the VR/streaming device, exchange messages to an external device, where the exchange messages are associated with the multiple available representations of the soundfield. In some examples, the VR device may receive, using an antenna coupled to the network interface, wireless signals including data packets, audio packets, video packets, or transport protocol data associated with the multiple available representations of the soundfield. In some examples, one or more microphone arrays may capture the soundfield.
In some examples, the multiple available representations of the soundfield stored to the memory device may include a plurality of object-based representations of the soundfield, higher order ambisonic representations of the soundfield, mixed order ambisonic representations of the soundfield, a combination of object-based representations of the soundfield with higher order ambisonic representations of the soundfield, a combination of object-based representations of the soundfield with mixed order ambisonic representations of the soundfield, or a combination of mixed order representations of the soundfield with higher order ambisonic representations of the soundfield.
In some examples, one or more of the soundfield representations of the multiple available representations of the soundfield may include at least one high-resolution region and at least one lower-resolution region, and wherein the selected presentation based on the steering angle provides a greater spatial precision with respect to the at least one high-resolution region and a lesser spatial precision with respect to the lower-resolution region.
An audio encoding device may encode, in a coded audio bitstream, a representation of a soundfield (602). For instance, the soundfield representation generator 302 of source device 12 may generate one or more bitstreams 21 based, at least in part, on audio content (e.g., MOA representations and/or third order HOA representations) generated from HOA coefficients. The bitstream 21 may represent a compressed version of the HOA coefficients (and/or the partial subsets thereof used to form MOA representations of the soundfield) and any other different types of the content 301 (such as a compressed version of spherical video data, image data, or text data).
The audio encoding device may encode, in the coded audio bitstream, a syntax element indicating a selection of either a head-related transfer function (HRTF) or a binaural room impulse response (BRIR) for rendering the soundfield (604). For instance, soundfield representation generator 302 may encode, in the bitstream 21, a flag indicating whether the HRTF or the BRIR should be used to render the soundfield represented in the bitstream 21.
In some examples, the audio encoding device may select the HRTF or the BRIR based on a current location of one or more microphones (e.g., that generated signals on which the representation of the soundfield is based, such as microphone 5). For instance, source device 12 may determine a current location of the one or more microphones, such as microphone 5. Source device 12 may determine the current location based on signals generated by one or more sensors. Such as one or more ambient light sensors, one or more global positioning system (GPS), or any other such sensor or combination of sensors. Source device 12 may select, based the current location of the one or more microphones, the HRTF or the BRIR. As one example, source device 12 may select the HRTF responsive to determining that the current location is outdoor. As another example, source device 12 may select the BRIR responsive to determining that the current location is indoor.
An audio decoding device may obtain a coded audio bitstream (702). For instance, audio decoding device 24 of audio playback system 16 may obtain the bitstream 21.
The audio decoding device may decode based on the coded audio bitstream, a representation of a soundfield (704). For instance, audio decoding device 24 may decode bitstream 21 to output reconstructed audio data 11A′-11N′ (which may form the full third order HOA representation or a subset thereof that forms an MOA representation of the same soundfield or decompositions thereof, such as the predominant audio signal, ambient HOA coefficients, and the vector based signal described in the MPEG-H 3D Audio Coding Standard).
The audio decoding device may decode, based on the coded audio bitstream, a syntax element indicating a selection of either a head-related transfer function (HRTF) or a binaural room impulse response (BRIR) (706). For instance, audio decoding device 24 may obtain, from the bitstream 21, a flag indicating whether the HRTF or the BRIR should be used to render the soundfield represented in the bitstream 21.
The audio decoding device 24 may render, using the selected HRTF or BRIR, speaker feeds from the soundfield (708). For instance, audio playback system 16 may select renderers 22 or binaural render 102 to render speaker feeds. As one example, where the flag indicates the use of HRTF, audio playback system 16 may select binaural renderer 102 to perform binaural rendering using one or more HRTFs to render to left and right speaker feeds 103. As another example, where the flag indicates the user of BRIR, audio playback system 16 may select renderers 22 to render speaker feeds 25.
The following numbered examples may illustrate one or more aspects of the disclosure:
In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, computer-readable media generally may correspond to (1) tangible computer-readable storage media which is non-transitory or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure. A computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
Instructions may be executed by one or more processors, including fixed function processing circuitry and/or programmable processing circuitry, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term “processor,” as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chip set). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
Various examples have been described. These and other examples are within the scope of the following claims.
Number | Date | Country | Kind |
---|---|---|---|
20200100088 | Feb 2020 | GR | national |
Number | Name | Date | Kind |
---|---|---|---|
10405126 | Peters et al. | Sep 2019 | B2 |
20110249821 | Jaillet et al. | Oct 2011 | A1 |
20130202129 | Kraemer | Aug 2013 | A1 |
20140355796 | Xiang | Dec 2014 | A1 |
20170188174 | Lee | Jun 2017 | A1 |
20180075764 | Bachani | Mar 2018 | A1 |
20180192177 | Krisztal | Jul 2018 | A1 |
20190007781 | Peters | Jan 2019 | A1 |
20210168550 | Terentiv | Jun 2021 | A1 |
Number | Date | Country |
---|---|---|
WO-2019197404 | Oct 2019 | WO |
Entry |
---|
“Information Technology, High Efficiency Coding and Media Delivery in Heterogeneous Environments, Part 3, 3D audio”, ISO/IEC JTC 1/SC 29, 23008-3:201x(E), Oct. 12, p. 797, IDS filed on Apr. 14, 2021 (Year: 2016). |
Audio: “Call for Proposals for 3D Audio”, International Organisation for Standardisation Organisation Internationale De Normalisation, ISO/IEC JTC1/SC29/WG11, Coding of Moving Pictures and Audio, ISO/IEC JTC1/SC29/WG11/N13411, Geneva, Jan. 2013, pp. 1-20. |
ETSI TS 103 589 V1.1.1, “Higher Order Ambisonics (HOA) Transport Format”, Jun. 2018, 33 pages. |
Herre J., et al., “MPEG-H 3D Audio—The New Standard for Coding of Immersive Spatial Audio”, IEEE Journal of Selected Topics in Signal Processing, vol. 9, No. 5, Aug. 1, 2015 (Aug. 1, 2015), XP055243182, pp. 770-779, US ISSN: 1932-4553, DOI: 10.1109/JSTSP.2015.2411578. |
Hollerweger F., “An Introduction to Higher Order Ambisonic”, Oct. 2008, pp. 1-13, Accessed online [Jul. 8, 2013]. |
“Information Technology—High Efficiency Coding and Media Delivery in Heterogeneous Environments—Part 3: 3D Audio”, ISO/IEC JTC 1/SC 29, ISO/IEC DIS 23008-3, Jul. 25, 2014, 433 Pages. |
“Information Technology—High Efficiency Coding and Media Delivery in Heterogeneous Environments—Part 3: 3D Audio”, ISO/IEC JTC 1/SC 29/WG11, ISO/IEC 23008-3, 201x(E), Oct. 12, 2016, 797 Pages. |
“Information Technology—High Efficiency Coding and Media Delivery in Heterogeneous Environments—Part 3: Part 3: 3D Audio, Amendment 3: MPEG-H 3D Audio Phase 2,” ISO/IEC JTC 1/SC 29N, ISO/IEC 23008-3:2015/PDAM 3, Jul. 25, 2015, 208 Pages. |
ISO/IEC/JTC: “ISO/IEC JTC 1/SC 29 N ISO/IEC CD 23008-3 Information Technology—High Efficiency Coding and Media Delivery in Heterogeneous Environments—Part 3: 3D Audio”, Apr. 4, 2014 (Apr. 4, 2014), 337 Pages, XP055206371, Retrieved from the Internet: URL:http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_to_browse.htm?commid=45316 [retrieved on Aug. 5, 2015]. |
Kim M-Y., et al., “Towards a Unified 6DOF VR Evaluation Framework for MPEG-I”, Coding of Moving Pictures and Audio, ISO/IEC JTC1/SC29/WG11 MPEG2018/M44878, Macau, China, Oct. 2018, 4 pages. |
Peterson J., et al., “Virtual Reality, Augmented Reality, and Mixed Reality Definitions”, EMA, version 1.0, Jul. 7, 2017, 4 Pages. |
Poletti M.A., “Three-Dimensional Surround Sound Systems Based on Spherical Harmonics”, The Journal of the Audio Engineering Society, vol. 53, No. 11, Nov. 2005, pp. 1004-1025. |
Schonefeld V., “Spherical Harmonics”, Jul. 1, 2005, XP002599101, 25 Pages, Accessed online [Jul. 9, 2013] at URL:http://heim.c-otto.de/˜volker/prosem_paper.pdf. |
Sen D., et al., “RM1-HOA Working Draft Text”, 107. MPEG Meeting, Jan. 13, 2014-Jan. 17, 2014, San Jose, (Motion Picture Expert Group or ISO/IEC JTC1/SC29/WG11), No. M31827, Jan. 11, 2014 (Jan. 11, 2014), San Jose, USA, XP030060280, 83 Pages, p. 11, paragraph 5.2.4-paragraph 5.2.5 p. 16, paragraph 6.1.10-p. 17; Figure 4 p. 18, paragraph 6.3-p. 22, Paragraph 6.3.2.2 p. 64, paragraph B.1-p. 66, Paragraph B.2.1; figures B.1, B.2 p. 70, paragraph B.2.1.3-p. 71 p. 74, paragraph B.2.4.1-p. 75, Paragraph B.2.4.2. |
Sen D., et al., “Technical Description of the Qualcomm's HoA Coding Technology for Phase II”, 109th MPEG Meeting, Jul. 7, 2014-Jul. 11, 2014, Sapporo, (Motion Picture Expert Group or ISO/IEC JTC1/SC29/WG11), No. M34104, Jul. 2, 2014 (Jul. 2, 2014), XP030062477, 4 Pages, figure 1. |
Number | Date | Country | |
---|---|---|---|
20210264927 A1 | Aug 2021 | US |