Disclosed are embodiments related to rendering of occluded audio elements.
Spatial audio rendering is a process used for presenting audio within an extended reality (XR) scene (e.g., a virtual reality (VR), augmented reality (AR), or mixed reality (MR) scene) in order to give a listener the impression that sound is coming from physical sources within the scene at a certain position and having a certain size and shape (i.e., extent). The presentation can be made through headphone speakers or other speakers. If the presentation is made via headphone speakers, the processing used is called binaural rendering and uses spatial cues of human spatial hearing that make it possible to determine from which direction sounds are coming. The cues involve inter-aural time delay (ITD), inter-aural level difference (ILD), and/or spectral difference.
The most common form of spatial audio rendering is based on the concept of point sources, where each sound source is defined to emanate sound from one specific point. Because each sound source emanates sound from a single point, the sound source has no size or shape. In order to render a sound source having an extent (i.e., a size and shape), different methods have been developed.
One such known method is to create multiple copies of a mono audio element at positions around the audio element. This arrangement creates the perception of a spatially homogeneous object with a certain size. This concept is used, for example, in the “object spread” and “object divergence” features of the MPEG-H 3D Audio standard (see references [1] and [2]), and in the “object divergence” feature of the EBU Audio Definition Model (ADM) standard (see reference [4]). This idea of using a mono audio source has been developed further as described in reference [7], where the area-volumetric geometry of a sound object is projected onto a sphere around the listener and the sound is rendered to the listener using a pair of head-related (HR) filters that is evaluated as the integral of all HR filters covering the geometric projection of the object on the sphere. For a spherical volumetric source this integral has an analytical solution. For an arbitrary area-volumetric source geometry, however, the integral is evaluated by sampling the projected source surface on the sphere using Monte Carlo ray sampling.
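For illustration only, the following minimal sketch (not the implementation of reference [7]) averages a pair of left/right filter gains over the projection of an extended source by Monte Carlo sampling. The hrtf_pair function is a hypothetical stand-in for a real HR-filter lookup, and the one-dimensional azimuth-only projection is a simplifying assumption.

```python
import numpy as np

def hrtf_pair(azimuth_rad):
    # Placeholder: crude broadband ILD model, NOT a measured HRTF.
    left = 0.5 * (1.0 - np.sin(azimuth_rad))
    right = 0.5 * (1.0 + np.sin(azimuth_rad))
    return left, right

def average_hr_filters(azimuth_range, num_samples=1000, rng=None):
    """Monte Carlo average of HR filter gains over the azimuth span
    covered by the projection of an extended source (1-D simplification)."""
    rng = rng or np.random.default_rng(0)
    az = rng.uniform(azimuth_range[0], azimuth_range[1], num_samples)
    gains = np.array([hrtf_pair(a) for a in az])  # shape (num_samples, 2)
    return gains.mean(axis=0)                      # averaged (left, right) gains

# Source whose projection spans -30..+10 degrees of azimuth:
left_g, right_g = average_hr_filters((np.radians(-30), np.radians(10)))
```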
Another rendering method renders a spatially diffuse component in addition to a mono audio signal, which creates the perception of a somewhat diffuse object that, in contrast to the original mono audio element, has no distinct pin-point location. This concept is used, for example, in the “object diffuseness” feature of the MPEG-H 3D Audio standard (see reference [3]) and the “object diffuseness” feature of the EBU ADM (see reference [5]).
Combinations of the above two methods are also known. For example, the “object extent” feature of the EBU ADM combines the creation of multiple copies of a mono audio element with the addition of diffuse components (see reference [6]).
In many cases the actual shape of an audio element can be described well enough with a basic shape (e.g., a sphere or a box). But sometimes the actual shape is more complicated and needs to be described in a more detailed form (e.g., a mesh structure or a parametric description format).
In the case of heterogeneous audio elements, as are described in reference [8], the audio element comprises at least two audio channels (i.e., audio signals) to describe a spatial variation over its extent.
In some XR scenes there may be an object that blocks at least part of an audio element in the XR scene. In such a scenario the audio element is said to be at least partially occluded.
That is, occlusion happens when, from the viewpoint of a listener at a given listening position, an audio element is completely or partly hidden behind some object, such that little or no direct sound from the occluded part of the audio element reaches the listener. Depending on the material of the occluding object, the occlusion effect might be either complete occlusion (e.g., when the occluding object is a thick wall) or soft occlusion, where some of the audio energy from the audio element passes through the occluding object (e.g., when the occluding object is made of thin fabric such as a curtain).
Certain challenges presently exist. For example, available occlusion rendering techniques deal with point sources, where the occurrence of occlusion can easily be detected using raytracing between the listener position and the position of the point source. For an audio element with an extent, however, the situation is more complicated, since an occluding object may occlude only a part of the extended audio element. A more elaborate occlusion detection technique is therefore needed (e.g., one that determines which part of the extended audio element is occluded). For a heterogeneous extended audio element (i.e., an audio element with an extent that has non-homogeneous spatial audio information distributed over its extent, e.g., an extended audio element represented by a stereo signal), the situation is even more complicated, because the rendering of a partly occluded object of this type should take into account the expected effect of the partial occlusion on the spatial audio information that reaches the listener. A special version of the latter problem appears when a heterogeneous extended audio element is rendered by means of a discrete number of virtual loudspeakers. If traditional occlusion is applied to the individual virtual loudspeakers, then, in the case of two virtual loudspeakers (e.g., a left (L) and a right (R) speaker), basically all spatial information is lost whenever either the L or the R virtual loudspeaker is occluded. More generally, for extended objects that are rendered using a discrete number of virtual loudspeakers (thus also including non-heterogeneous audio elements, e.g., homogeneous or diffuse extended audio elements), there is the problem that the amount of occlusion changes in a step-wise manner when the audio element, the occluding object, and/or the listener move relative to each other.
Accordingly, in one aspect there is provided a method for rendering an audio element that is at least partially occluded, where the audio element is represented using a set of two or more virtual loudspeakers, the set comprising a first virtual loudspeaker. In one embodiment, the method includes modifying a first virtual loudspeaker signal for the first virtual loudspeaker, thereby producing a first modified virtual loudspeaker signal. The method also includes using the first modified virtual loudspeaker signal to render the audio element (e.g., generate an output signal using the first modified virtual loudspeaker signal). In another embodiment the method includes moving the first virtual loudspeaker from an initial position to a new position. The method also includes generating a first virtual loudspeaker signal for the first virtual loudspeaker based on the new position of the first virtual loudspeaker. The method also includes using the first virtual loudspeaker signal to render the audio element.
In another aspect there is provided a computer program comprising instructions which, when executed by processing circuitry of an audio renderer, cause the audio renderer to perform either of the above described methods. In one embodiment, there is provided a carrier containing the computer program, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium. In another aspect there is provided a rendering apparatus that is configured to perform either of the above described methods. The rendering apparatus may include memory and processing circuitry coupled to the memory.
An advantage of the embodiments disclosed herein is that the rendering of an audio element that is at least partially occluded is done in a way that preserves the quality of the spatial information of the audio element.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.
The occurrence of occlusion may be detected using raytracing methods where the direct path between the listener position and the position of the audio element is searched for any occluding objects.
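As a minimal sketch of such detection, the following assumes spherical occluders and tests the straight segment between the listener and the audio element for intersections; a real renderer would typically test against mesh geometry using an acceleration structure.

```python
import numpy as np

def segment_hits_sphere(p0, p1, center, radius):
    """True if the segment p0->p1 intersects the sphere (center, radius)."""
    d = p1 - p0
    f = p0 - center
    a = d @ d
    b = 2.0 * (f @ d)
    c = f @ f - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return False
    sq = np.sqrt(disc)
    # Intersection counts only if it lies between listener and element.
    for t in ((-b - sq) / (2.0 * a), (-b + sq) / (2.0 * a)):
        if 0.0 <= t <= 1.0:
            return True
    return False

def is_occluded(listener_pos, element_pos, occluders):
    return any(segment_hits_sphere(listener_pos, element_pos, c, r)
               for c, r in occluders)

listener = np.array([0.0, 0.0, 0.0])
element = np.array([0.0, 0.0, 5.0])
occluders = [(np.array([0.0, 0.0, 2.5]), 1.0)]
occluded = is_occluded(listener, element, occluders)  # True
```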
One strategy for solving the occlusion problem for an audio element having an extent (see, e.g., audio element 302) is to apply occlusion processing to each virtual loudspeaker individually, but this strategy suffers from the drawbacks discussed above (e.g., loss of spatial information and step-wise changes in the amount of occlusion).
Accordingly, this disclosure describes additional embodiments that do not suffer from these drawbacks. In one aspect, a method according to one embodiment comprises the steps described below.
Given knowledge of which sub-areas of the audio element (more precisely, of a projection of the audio element) are at least partially occluded, and given knowledge about the occluding object (e.g., a parameter indicating the amount of audio energy from the audio element that passes through the occluding object), an amount of occlusion can be calculated for each such sub-area. In a scenario where the parameter indicates that no energy from the audio element passes through the occluding object, the amount of occlusion can be calculated as the percentage of the sub-area that is occluded from the listening position.
The sub-areas of the projection of the audio element can be defined in many different ways. In one embodiment, there are as many sub-areas as there are virtual loudspeakers used for the rendering, and each sub-area corresponds to one virtual loudspeaker. In another embodiment, the sub-areas are defined independently from the number and/or positions of the virtual loudspeakers used for the rendering. The sub-areas may be equal in size. The sub-areas may be directly adjacent to each other. The sub-areas together may completely fill the surface area of the projected extent of the audio element, i.e. the total size of the projected extent is equal to the sum of the surface areas of all the sub-areas.
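A minimal sketch of the first of these definitions, assuming the projected extent can be modeled as a horizontal span split into equal, directly adjacent strips, one per virtual loudspeaker:

```python
def make_subareas(extent_left, extent_right, num_speakers):
    """Split the projected extent horizontally into equal adjacent strips,
    one per virtual loudspeaker; returns (left, right) bounds per strip.
    Together the strips cover the full projected extent."""
    width = (extent_right - extent_left) / num_speakers
    return [(extent_left + i * width, extent_left + (i + 1) * width)
            for i in range(num_speakers)]

# Three strips for SpL, SpC, SpR over a projected extent from -2.0 to +2.0:
subareas = make_subareas(-2.0, 2.0, 3)
# -> [(-2.0, -0.667), (-0.667, 0.667), (0.667, 2.0)] (approximately)
```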
For each sub-area, a gain factor can be calculated depending on the amount of occlusion for that area. For example, in some scenarios where the occluding object is a thick brick wall or the like, a sub-area that is completely occluded (i.e., the occlusion amount is 100%) by the brick wall may be completely muted, and the gain factor should therefore be set to 0.0. For a sub-area where the occlusion amount is 0, the gain factor should be set to 1.0. For other amounts of occlusion, the gain factor should be somewhere in between 0.0 and 1.0, but the exact behavior may depend on the spatial character of the audio element. In one embodiment the gain factor is calculated as:
g=(1.0−0.01*O), where O is the occlusion amount in percent.
In one embodiment, O for a given sub-area is a function of a frequency-dependent occlusion factor (OF) and a value P, where P is the percentage of the sub-area that is covered by the occluding object (i.e., the percentage of the sub-area that cannot be seen by the listener because the occluding object is located between the listener and the sub-area). For example, O=OF*P, where OF=Of1 for frequencies below f1, OF=Of2 for frequencies between f1 and f2, and OF=Of3 for frequencies above f2. That is, for a given frequency, different types of occluding objects may have different occlusion factors. For instance, for a first frequency, a brick wall may have an occlusion factor of 1, whereas a thin curtain of cotton may have an occlusion factor of 0.2; for a second frequency, the brick wall may have an occlusion factor of 0.8, whereas the thin curtain of cotton may have an occlusion factor of 0.1.
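A sketch of this calculation, treating OF as a three-band lookup per material. The brick-wall and curtain factors are the example values above, while the band edges f1 and f2 and the mapping of the example frequencies onto bands are assumptions (the text does not specify them):

```python
# (Of1, Of2, Of3) = occlusion factor below f1, between f1 and f2, above f2.
MATERIALS = {
    "brick_wall": (1.0, 1.0, 0.8),
    "cotton_curtain": (0.2, 0.2, 0.1),
}

def occlusion_amount(material, freq_hz, covered_percent, f1=500.0, f2=4000.0):
    """O = OF * P, with OF chosen per frequency band."""
    of1, of2, of3 = MATERIALS[material]
    if freq_hz < f1:
        of = of1
    elif freq_hz <= f2:
        of = of2
    else:
        of = of3
    return of * covered_percent  # occlusion amount O, in percent

# 60% of a sub-area hidden behind a curtain, evaluated at 8 kHz:
o = occlusion_amount("cotton_curtain", 8000.0, 60.0)  # 0.1 * 60 = 6.0
```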
In another embodiment, the gain factor is calculated using the assumption that the audio element is mostly diffuse in its spatial information, such that a 50% occlusion amount should give a −3 dB reduction in audio energy from that sub-area. The gain factor can then be calculated as:
g=sqrt(1.0−0.01*O), where O is the occlusion amount in percent.
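Both example gain functions side by side, as a minimal sketch; the −3 dB behavior at 50% occlusion follows from sqrt(0.5) ≈ 0.707:

```python
import math

def gain_linear(o_percent):
    """Linear gain function: g = 1 - 0.01 * O."""
    return 1.0 - 0.01 * o_percent

def gain_diffuse(o_percent):
    """Energy-based gain for diffuse content: g = sqrt(1 - 0.01 * O)."""
    return math.sqrt(1.0 - 0.01 * o_percent)

# 50% occlusion: linear -> 0.5 (about -6 dB); diffuse -> ~0.707 (about -3 dB).
g = gain_diffuse(50.0)
db = 20.0 * math.log10(g)  # ~ -3.01 dB
```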
The embodiments are not limited to the above examples, as other gain functions for calculating the gain of a sub-area are possible. As exemplified by the two embodiments described above, the effect of the occlusion can be gradual when the audio element is partly occluded, so that the signal from a virtual loudspeaker is not necessarily completely muted whenever the virtual loudspeaker is occluded for the listener. This prevents, for example, the situation in a stereo rendering with two virtual loudspeakers where no sound at all is received from the left half of the audio element whenever the left virtual loudspeaker is occluded. Additionally, it prevents the undesirable step-wise occlusion effect when the occluding object, the audio element, and/or the listener move relative to each other.
When a part of the audio element is occluded, the positions of the virtual loudspeakers representing the audio element can be moved so that they better represent the non-occluded part. If one of the edges of the extent of the audio element is occluded, the virtual loudspeaker(s) representing this edge should be moved to the edge where the occlusion is happening, as illustrated in the figures.
The case where an occluding object covers the middle of the audio element is shown in the figures.
In the case that the audio element is only represented by virtual loudspeakers in the horizontal plane, an occlusion that covers either the bottom or top part can be rendered by changing the vertical position of the virtual loudspeakers so that their vertical position corresponds to the middle of the non-occluded part of the extent.
In another embodiment, the vertical position of each virtual loudspeaker is controlled by the ratio of occlusion amount in the upper sub-area and the lower sub-area. An example of how this position can be calculated is given by:
PY=(OU*PYB+OL*PYT)/(OU+OL), where PY is the vertical coordinate of the loudspeaker, OU and OL are the occlusion amounts of the upper part and the lower part of the extent, and PYT and PYB are the vertical coordinates of the top and bottom edges of the extent.
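A sketch of this repositioning, using the interpolation formula above (which is itself a reconstruction consistent with the stated symbol definitions; the original equation did not survive extraction):

```python
def vertical_position(o_upper, o_lower, py_top, py_bottom):
    """Move the virtual loudspeaker toward the less-occluded half; with no
    occlusion at all, fall back to the vertical middle of the extent."""
    total = o_upper + o_lower
    if total == 0.0:
        return 0.5 * (py_top + py_bottom)
    # Heavier upper occlusion pulls the speaker toward the bottom edge.
    return (o_upper * py_bottom + o_lower * py_top) / total

# Upper half fully occluded, lower half clear: speaker moves to the bottom edge.
py = vertical_position(100.0, 0.0, py_top=1.0, py_bottom=-1.0)  # -> -1.0
```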
In some embodiments, the process further includes obtaining information indicating that the audio element is at least partially occluded, wherein the modifying is performed as a result of obtaining the information.
In some embodiments, the process further includes detecting that the audio element is at least partially occluded, wherein the modifying is performed as a result of the detection.
In some embodiments, modifying the first virtual loudspeaker signal comprises adjusting the gain of the first virtual loudspeaker signal.
In some embodiments, the process further includes moving the first virtual loudspeaker from an initial position (e.g., default position) to a new position and then generating the first virtual loudspeaker signal using information indicating the new position.
In some embodiments, the process further includes determining an occlusion amount (O) associated with the first virtual loudspeaker and the step of modifying the first virtual loudspeaker signal for the first virtual loudspeaker comprises modifying the first virtual loudspeaker signal based on O. In some embodiments, modifying the first virtual loudspeaker signal based on O comprises modifying the first virtual loudspeaker signal VS1 such that the modified loudspeaker signal equals (g*VS1), where g is a gain factor that is calculated using O and VS1 is the first virtual loudspeaker signal. In one embodiment, g=1−0.01*O or g=sqrt(1−0.01*O). In one embodiment determining O comprises obtaining a particular occlusion factor (Of) for the occluding object and determining a percentage of a sub-area of a projection of the audio element that is covered by the occluding object, where the first virtual loudspeaker is associated with the sub-area.
Orientation sensing unit 1201 is configured to detect a change in the orientation of the listener and provides information regarding the detected change to processing unit 1203. In some embodiments, processing unit 1203 determines the absolute orientation (in relation to some coordinate system) given the change in orientation detected by orientation sensing unit 1201. There could also be different systems for determination of orientation and position, e.g., a system using lighthouse trackers (lidar). In one embodiment, orientation sensing unit 1201 may determine the absolute orientation (in relation to some coordinate system) given the detected change in orientation. In this case, processing unit 1203 may simply multiplex the absolute orientation data from orientation sensing unit 1201 and the positional data from position sensing unit 1202. In some embodiments, orientation sensing unit 1201 may comprise one or more accelerometers and/or one or more gyroscopes.
Directional mixer 1404 receives audio input 1261, which in this example includes a pair of audio signals 1401 and 1402 associated with an audio element (e.g. audio element 602), and produces a set of k virtual loudspeaker signals (VS1, VS2, . . . , VSk) based on the audio input and control information 1471. In one embodiment, the signal for each virtual loudspeaker can be derived by, for example, the appropriate mixing of the signals that comprise the audio input 1261. For example: VS1=α×L+β×R, where L is input audio signal 1401, R is input audio signal 1402, and α and β are factors that are dependent on, for example, the position of the listener relative to the audio element and the position of the virtual loudspeaker to which VS1 corresponds.
In the example where audio element 602 is associated with three virtual loudspeakers (SpL, SpC, and SpR), k will equal 3 for the audio element, and VS1 may correspond to SpL, VS2 to SpC, and VS3 to SpR. The control information 1471 used by directional mixer 1404 to produce the virtual loudspeaker signals may include the positions of each virtual loudspeaker relative to the audio element. In some embodiments, controller 1301 is configured such that, when the audio element is occluded, controller 1301 may adjust the position of one or more of the virtual loudspeakers associated with the audio element and provide the position information to directional mixer 1404, which then uses the updated position information to produce the signals for the virtual loudspeakers (i.e., VS1, VS2, . . . , VSk).
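A minimal sketch of this mixing step for the VS=α×L+β×R case; the specific rule of deriving α and β from a normalized speaker position across the extent is an illustrative assumption, not the renderer's actual weighting:

```python
import numpy as np

def mix_virtual_speakers(left, right, speaker_positions):
    """left/right: input channel sample arrays; speaker_positions: values
    in [0, 1] across the extent (0 = left edge, 1 = right edge).
    Returns one signal per virtual loudspeaker, VS = alpha*L + beta*R."""
    signals = []
    for x in speaker_positions:
        alpha, beta = 1.0 - x, x  # assumed linear crossfade of L and R
        signals.append(alpha * np.asarray(left) + beta * np.asarray(right))
    return signals

# Three virtual loudspeakers (SpL, SpC, SpR) spread across the extent:
vs1, vs2, vs3 = mix_virtual_speakers([0.2, 0.4], [0.1, -0.3], [0.0, 0.5, 1.0])
```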
Gain adjuster 1406 may adjust the gain of any one or more of the virtual loudspeaker signals based on control information 1472, which may include the above described gain factors as calculated by controller 1301. That is, for example, when the audio element is at least partially occluded, controller 1301 may control gain adjuster 1406 to adjust the gain of one or more of the virtual loudspeaker signals by providing one or more gain factors to gain adjuster 1406. For instance, if the entire left portion of the audio element is occluded, then controller 1301 may provide to gain adjuster 1406 control information 1472 that causes gain adjuster 1406 to reduce the gain of VS1 by 100% (i.e., gain factor=0 so that VS1′=0). As another example, if only 50% of the left portion of the audio element is occluded and 0% of the center portion is occluded, then controller 1301 may provide to gain adjuster 1406 control information 1472 that causes gain adjuster 1406 to reduce the gain of VS1 by 50% (i.e., VS1′=50% VS1) and to not reduce the gain of VS2 at all (i.e., gain factor=1 so that VS2′=VS2).
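A sketch of this gain-adjustment step, reproducing the example values above:

```python
import numpy as np

def apply_occlusion_gains(speaker_signals, gain_factors):
    """VSk' = gk * VSk for each virtual loudspeaker signal."""
    return [g * np.asarray(s) for g, s in zip(gain_factors, speaker_signals)]

# Example from the text: left 50% occluded, center clear, right fully muted.
vs = [np.array([0.2, 0.4]), np.array([0.1, -0.3]), np.array([0.5, 0.1])]
vs_prime = apply_occlusion_gains(vs, [0.5, 1.0, 0.0])
```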
Using virtual loudspeaker signals VS1′, VS2′, . . . , VSk′, speaker signal producer 1408 produces output signals (e.g., output signal 1281 and output signal 1282) for driving speakers (e.g., headphone speakers or other speakers). In one embodiment where the speakers are headphone speakers, speaker signal producer 1408 may perform conventional binaural rendering to produce the output signals. In embodiments where the speakers are not headphone speakers, speaker signal producer 1408 may perform conventional speaker panning to produce the output signals.
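For the non-headphone case, a minimal equal-power panning sketch (the actual panning law used by speaker signal producer 1408 is not specified in the text; binaural rendering would instead convolve each signal with HR filters):

```python
import numpy as np

def pan_to_stereo(speaker_signals, azimuths_deg, max_az=90.0):
    """Equal-power pan of each virtual loudspeaker signal by its azimuth
    (degrees, negative = left) into a stereo output pair."""
    out_l = out_r = 0.0
    for sig, az in zip(speaker_signals, azimuths_deg):
        pan = 0.5 * (1.0 + np.clip(az / max_az, -1.0, 1.0))  # 0 = left, 1 = right
        theta = 0.5 * np.pi * pan  # equal-power law: cos^2 + sin^2 = 1
        out_l = out_l + np.cos(theta) * np.asarray(sig)
        out_r = out_r + np.sin(theta) * np.asarray(sig)
    return out_l, out_r

# Virtual loudspeakers at -30, 0, +30 degrees:
sigs = [np.array([0.1, 0.2]), np.array([0.3, 0.0]), np.array([0.05, -0.1])]
left_out, right_out = pan_to_stereo(sigs, [-30.0, 0.0, 30.0])
```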
A1. A method for rendering an at least partially occluded audio element (602, 902) represented using a set of two or more virtual loudspeakers (e.g., SpL and SpR), the set comprising a first virtual loudspeaker (e.g., any one of SpL, SpC, SpR), the method comprising: modifying a first virtual loudspeaker signal (e.g., VS1, VS2, or . . . ) for the first virtual loudspeaker, thereby producing a first modified virtual loudspeaker signal, and using the first modified virtual loudspeaker signal to render the audio element (e.g., generate an output signal using the first modified virtual loudspeaker signal).
A2. The method of embodiment A1, further comprising obtaining information indicating that the audio element is at least partially occluded, wherein the modifying is performed as a result of obtaining the information.
A3. The method of embodiment A1 or A2, further comprising detecting that the audio element is at least partially occluded, wherein the modifying is performed as a result of the detection.
A4. The method of any one of embodiments A1-A3, wherein modifying the first virtual loudspeaker signal comprises adjusting the gain of the first virtual loudspeaker signal.
A5. The method of any one of embodiments A1-A4, further comprising moving the first virtual loudspeaker from an initial position (e.g., default position) to a new position and then generating the first virtual loudspeaker signal using information indicating the new position.
A6. The method of any one of embodiments A1-A5, further comprising determining a first occlusion amount (OA1), wherein the step of modifying the first virtual loudspeaker signal for the first virtual loudspeaker comprises modifying the first virtual loudspeaker signal based on OA1.
A7. The method of embodiment A6, wherein modifying the first virtual loudspeaker signal based on OA1 comprises modifying the first virtual loudspeaker signal such that the modified loudspeaker signal is equal to: g1*VS1, where g1 is a gain factor that is calculated using OA1 and VS1 is the first virtual loudspeaker signal.
A8. The method of embodiment A7, wherein g1 is a function of OA1 (e.g., g1=(1−(0.01*OA1)) or g1=sqrt(1−0.01*OA1)).
A9. The method of embodiment A6, A7, or A8, wherein the audio element is at least partially occluded by an occluding object, and determining OA1 comprises obtaining an occlusion factor for the occluding object and determining a percentage of a first sub-area of a projection of the audio element that is covered by the occluding object, where the first virtual loudspeaker is associated with the first sub-area.
A10. The method of embodiment A9, wherein obtaining the occlusion factor comprises selecting the occlusion factor from a set of occlusion factors, wherein the selection is based on a frequency associated with the audio element. For example, each occlusion factor (OF) included in the set of occlusion factors is associated with a different frequency range, and the selection is based on a frequency associated with the audio element such that the selected OF is associated with a frequency range that encompasses the frequency associated with the audio element.
A11. The method of embodiment A9 or A10, wherein determining OA1 comprises calculating: OA1=Of1*P, where Of1 is the occlusion factor and P is the percentage.
A12. The method of any one of embodiments A1-A11, wherein the set further comprises a second virtual loudspeaker, the method further comprising: modifying a second virtual loudspeaker signal for the second virtual loudspeaker, thereby producing a second modified virtual loudspeaker signal, and using the first and second modified virtual loudspeaker signals to render the audio element.
A13. The method of embodiment A12, further comprising determining a second occlusion amount (OA2) associated with the second virtual loudspeaker, wherein the step of modifying the second virtual loudspeaker signal comprises modifying the second virtual loudspeaker signal based on OA2.
A14. The method of embodiment A13, wherein modifying the second virtual loudspeaker signal based on OA2 comprises modifying the second virtual loudspeaker signal such that the second modified loudspeaker signal is equal to: g2*VS2, where g2 is a gain factor that is calculated using OA2 and VS2 is the second virtual loudspeaker signal.
A15. The method of embodiment A13 or A14, wherein determining OA2 comprises determining a percentage of a second sub-area of the projection of the audio element that is covered by the occluding object, where the second virtual loudspeaker is associated with the second sub-area.
B1. A method for rendering an at least partially occluded audio element (602, 902) represented using a set of two or more virtual loudspeakers, the set comprising a first virtual loudspeaker and a second virtual loudspeaker, the method comprising: moving the first virtual loudspeaker from an initial position to a new position, generating a first virtual loudspeaker signal for the first virtual loudspeaker based on the new position of the first virtual loudspeaker, and using the first virtual loudspeaker signal to render the audio element.
B2. The method of embodiment B1, further comprising obtaining information indicating that the audio element is at least partially occluded, wherein the moving is performed as a result of obtaining the information.
B3. The method of embodiment B1 or B2, further comprising detecting that the audio element is at least partially occluded, wherein the moving is performed as a result of the detection.
C1. A computer program comprising instructions which, when executed by processing circuitry of an audio renderer, cause the audio renderer to perform the method of any one of the above embodiments.
C2. A carrier containing the computer program wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
D1. An audio rendering apparatus that is configured to perform the method of any one of the above embodiments.
D2. The audio rendering apparatus of embodiment D1, wherein the audio rendering apparatus comprises memory and processing circuitry coupled to the memory.
While various embodiments are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of this disclosure should not be limited by any of the above described exemplary embodiments. Moreover, any combination of the above-described objects in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.
Filing Document: PCT/EP2022/059762 | Filing Date: Apr. 12, 2022 | Kind Code: WO
Related U.S. Provisional Application: No. 63/174,727, filed April 2021