This disclosure relates to methods and apparatus for configuring virtual loudspeakers.
Spatial audio rendering is the process of presenting an audio element within virtual reality (VR), augmented reality (AR), or mixed reality (MR) in order to give the listener the impression that the sound is coming from one or more physical sources located at certain positions and having a certain size and shape (i.e., extent).
The presentation can be made using headphones or speakers. If the presentation is made using headphones, the rendering process is called binaural rendering. Binaural rendering uses the spatial cues of human spatial hearing that enable the listener to recognize the direction from which sounds are coming. The spatial cues include the Inter-aural Time Difference (ITD), the Inter-aural Level Difference (ILD), and/or spectral differences.
The most common form of spatial audio rendering is based on the concept of point sources. A point source is defined to emanate sound from one specific point. Thus, a point-source does not have any extent. Accordingly, in order to render an audio source with an extent, different methods need to be used.
One of the methods for rendering an audio source with an extent is to create multiple duplicate copies of a mono audio object at positions around the mono audio object's position. This creates the perception of a spatially homogeneous object with a certain size. This concept is used, for example, in the “object spread” and “object divergence” features of the MPEG-H 3D Audio standard (clauses 8.4.4.7—“Spreading” and 18.1—“Element Metadata Preprocessing”), and in the “object divergence” feature of the EBU Audio Definition Model (ADM) standard (EBU ADM Renderer Tech 3388, Clause 7.3.6: “Divergence”).
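The duplicate-copies idea described above can be sketched as follows. This is a minimal illustration of the general concept of placing copies of a mono object on a circle around its position; the function name, the circle layout, and the parameter values are illustrative, not the MPEG-H or EBU ADM algorithm.

```python
import math

def spread_positions(center, radius, n_copies):
    """Place n_copies duplicate point sources on a circle around a mono
    audio object located at `center` (x, y, z). Illustrative sketch of
    the duplicate-copies idea, not a standardized spreading algorithm."""
    cx, cy, cz = center
    return [(cx + radius * math.cos(2 * math.pi * k / n_copies),
             cy + radius * math.sin(2 * math.pi * k / n_copies),
             cz)
            for k in range(n_copies)]

# Four duplicates on a 1 m circle around an object 2 m in front of the listener.
copies = spread_positions((0.0, 0.0, 2.0), 1.0, 4)
```

Rendering all copies with the same signal creates the perception of a spatially homogeneous object whose apparent size grows with the chosen radius.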
This idea of using a mono audio source has been developed further in “Efficient HRTF-based Spatial Audio for Area and Volumetric Sources”, IEEE Transactions on Visualization and Computer Graphics 22(4), January 2016. According to the article, the area-volumetric geometry of an audio object may be projected onto a sphere around the listener, and the sound can be rendered to the listener using a pair of head-related (HR) filters that is evaluated as the integral of all the HR filters covering the geometric projection of the audio object on the sphere. For a spherical volumetric source, this integral has an analytical solution, while for an arbitrary area-volumetric source geometry, the integral is evaluated by sampling the projected source surface on the sphere using Monte Carlo ray sampling.
Another method for rendering an audio source with an extent is to render a spatially diffuse component in addition to the mono audio signal. The spatially diffuse component creates the perception of a somewhat diffuse object that, in contrast to the original mono object, has no distinct pin-point location. This concept is used, for example, in the “object diffuseness” feature of the MPEG-H 3D Audio standard (clause 18.11) and the EBU ADM “object diffuseness” feature (EBU ADM Renderer Tech 3388, Clause 7.4: “Decorrelation Filters”).
The combination of the above two methods is also known, for example, in the EBU ADM “object extent” feature which combines the creation of multiple copies of a mono audio object with addition of diffuse components. See EBU ADM Renderer Tech 3388, Clause 7.3.7: “Extent Panner.”
These methods, however, do not provide a way of rendering audio elements that have a distinct “spatially-heterogeneous” character, i.e., audio elements that have a certain amount of spatial source variation within their spatial extent. Often such sources are made up of a sum of a multitude of sources, e.g., the sound of a forest or the sound of a cheering crowd. Most of the known solutions are only able to create objects with either a “spatially-homogeneous” character (i.e., with no spatial variation within the element) or a spatially diffuse character, which may be too limited for rendering some of the examples given above in a convincing way.
Other techniques exist for rendering these heterogeneous audio elements. For example, the audio element may be represented by a multi-channel audio recording, and the rendering may use several virtual loudspeakers to represent the extent of the audio element and the spatial variation within it. By placing the virtual loudspeakers at positions that correspond to the extent of the audio element, an illusion of audio emanating from the extent can be conveyed.
In many cases, the extent of an audio element can be described adequately using a basic shape (e.g., a sphere or a box). But sometimes the shape of the audio element may be more complicated, and thus needs to be described in a more detailed form, e.g., with a mesh structure or a parametric description format. In these cases, the real-time rendering needs to calculate how the extent of the audio element should be rendered depending on the current position of the audio element with respect to the listening position.
One existing solution for rendering an audio element with a defined spatial extent is described in WO 2021180820, which is hereby incorporated by reference in its entirety. This solution involves a method that simplifies the complex extent of an audio element into a one-dimensional (1D) representation or a two-dimensional (2D) representation that describes the width and/or height of the extent, as seen from the listening position. In this disclosure, the simplified form of the complex extent (i.e., the 1D representation or the 2D representation) is referred to as a simplified extent.
Certain challenges exist. Generally, the number of virtual loudspeakers is predefined. This could be problematic since experiments show that, depending on the extent of the audio object and the position of the listener relative to the audio object, different numbers of virtual loudspeakers may be required to render the audio object (i.e., producing an audio signal representing the audio object) in an optimal way.
For example, if two or more virtual loudspeakers are used for producing an audio signal representing an audio object, then, depending on the extent of the audio object and the position of the listener relative to the audio object, the virtual loudspeakers may in some situations be so close to each other that a pronounced comb-filtering effect occurs, degrading the overall quality of the rendered audio object.
For example, suppose (i) a white noise source is rendered using a virtual loudspeaker placed at a front-middle position relative to a listener, and (ii) the same white noise source is rendered using a virtual loudspeaker that moves from a front-right position towards a front-left position, passing the front-middle position. As the moving virtual loudspeaker passes the front-middle position at which the stationary virtual loudspeaker is located, the audio from the two virtual loudspeakers will mix, creating notches in the audio spectrum that change as the moving virtual loudspeaker moves. In some scenarios, the changes may be stepwise. The stepwise changes may result from the use of a head related transfer function (HRTF) dataset with a limited spatial resolution and without interpolation between the HRTF sample points.
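The comb-filtering effect described above can be illustrated with a short sketch. Summing a signal with a delayed copy of itself yields the response H(f) = 1 + e^(-j2πfτ), whose magnitude has notches at odd multiples of 1/(2τ); the 1 ms delay below is an illustrative value, not taken from this disclosure.

```python
import cmath
import math

def comb_magnitude(freq_hz, delay_s):
    """Magnitude of H(f) = 1 + exp(-j*2*pi*f*tau), i.e., the response of
    mixing a signal with a delayed copy of itself. |H| = 2*|cos(pi*f*tau)|,
    so notches fall at f = (2k + 1) / (2*tau)."""
    return abs(1.0 + cmath.exp(-2j * math.pi * freq_hz * delay_s))

tau = 1e-3                                  # ~34 cm path difference
deep_notch = comb_magnitude(500.0, tau)     # first notch at 1 / (2 * 1 ms)
peak = comb_magnitude(1000.0, tau)          # constructive sum
```

As the inter-speaker delay changes with the moving loudspeaker, the notch frequencies slide through the spectrum, which is what the listener perceives as a time-varying coloration.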
If the extent of the audio object is large and/or the listener is close to the audio object, a higher number of virtual loudspeakers may be needed to properly render all the spatial information of the audio object. This is especially true if the audio object is represented with a multi-channel audio signal which provides spatial information in both height and width dimensions.
On the other hand, if the size of the audio object is small or the distance between the listener and the audio object is large, using multiple virtual loudspeakers to generate an audio signal representing the audio object may not be the most efficient solution.
Accordingly, in one aspect, there is provided a method for rendering an audio element. The method comprises obtaining size information indicating a size of a representation of the audio element and/or distance information indicating a distance between the audio element and a listener; and based on the size information and/or the distance information, determining a number of virtual loudspeakers to use for rendering the audio element.
In another aspect, there is provided a computer program comprising instructions which, when executed by processing circuitry, cause the processing circuitry to perform the method of any one of the embodiments described above.
In another aspect, there is provided an apparatus for rendering an audio element. The apparatus is configured to obtain size information indicating a size of a representation of the audio element and/or distance information indicating a distance between the audio element and a listener; and based on the size information and/or the distance information, determine a number of virtual loudspeakers to use for rendering the audio element.
In another aspect, there is provided an apparatus comprising a memory and processing circuitry coupled to the memory. The apparatus is configured to perform the method of any one of the embodiments described above.
Some embodiments of this disclosure provide an efficient method of rendering a heterogeneous audio element by adaptively deciding the number of virtual loudspeakers needed for rendering the audio element based on the size of the audio element and/or the distance between the audio element and the listening position. By reducing the number of virtual loudspeakers used for the rendering, the comb-filtering effects resulting from the use of two or more loudspeakers that are too close to each other can be avoided. Also, by reducing the number of virtual loudspeakers, the embodiments avoid the excessive complexity of using too many virtual loudspeakers to render an audio element that has a small extent or that is far away from the listener.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.
The 1D representation 202 and/or the 2D representation 204 may be used for rendering the audio element 102. Here, a multi-channel audio signal may be generated and used for audio rendering such that the perceived spatial extent matches the simplified extent. To render the 1D representation 202, virtual loudspeakers 222, 224, and 226 may be used. Similarly, to render the 2D representation 204, virtual loudspeakers 232, 234, 236, and 238 may be used. The positions and/or the locations of the virtual loudspeakers are shown in
If either the width or the height of the 2D representation 204 becomes negligible, the representation of the audio element 102 may be switched from the 2D representation 204 to the 1D representation 202. Similarly, if both the width and the height of the 2D representation 204 become negligible, the representation of the audio element 102 may be switched from the 2D representation 204 to a point source representation.
According to some embodiments, the corner points may be used to place virtual loudspeakers. If the width and the height of the 2D representation shown in
Some embodiments of this disclosure provide a solution for adjusting the number of virtual loudspeakers used for rendering the audio element 102 (a.k.a., an audio object or an audio source) based on the extent of the audio element 102 and the position of the listener 104 relative to the audio element 102. More specifically, in some embodiments, a method is provided for monitoring an azimuth angle (a.k.a., a width angle) and an elevation angle (a.k.a., a height angle) from the listener 104's point of view towards (the simplified extent corresponding to) the audio element 102, and for determining (i) the number of virtual loudspeakers that is optimal for rendering the current frame of the audio signal and (ii) the positions of the virtual loudspeakers (e.g., where to put the virtual loudspeakers on (the simplified extent corresponding to) the audio element 102).
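The monitored angles can be sketched as follows. This assumes, for simplicity, a rectangular simplified extent whose center the listener is facing; the function and parameter names are illustrative and not part of this disclosure.

```python
import math

def extent_angles(width_m, height_m, distance_m):
    """Width (azimuth) and height (elevation) angles subtended by a
    rectangular extent, as seen from a listener facing its center.
    A simplifying assumption for illustration; a real renderer would
    derive the angles from the actual simplified-extent geometry."""
    width_angle = 2.0 * math.atan2(width_m / 2.0, distance_m)
    height_angle = 2.0 * math.atan2(height_m / 2.0, distance_m)
    return width_angle, height_angle

# A 2 m x 2 m extent viewed from 1 m subtends 90 degrees in each dimension.
a, e = extent_angles(2.0, 2.0, 1.0)
```

Both angles shrink as the listener moves away and grow as the listener approaches, which is why they are suitable inputs for deciding the number of virtual loudspeakers per frame.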
Rendering an audio element with an extent may involve placing a number of virtual loudspeakers on the audio element such that audio signal(s) for rendering the audio element produce a plausible representation of the audio element. Depending on the degree (i.e., size) (e.g., height, width, etc.) of the extent (or the corresponding simplified extent) of the audio element and the distance between the listener and the audio element, a different number of virtual loudspeakers may be needed to produce a subjectively convincing representation of the audio element.
For example, for an audio element with a small extent, a smaller number of virtual loudspeakers may be preferred, since a large number of virtual loudspeakers may cause comb-filtering effects if the generated audio signals have some amount of correlation. On the other hand, when the listener is close to a large audio element (i.e., an audio element with a large extent), a higher number of virtual loudspeakers may be needed to avoid the problem of a psychoacoustical hole in front of the listener.
Accordingly, based on the extent of an audio element and a distance between the audio element and the listener, a different number of virtual loudspeakers may be needed in order to properly render the audio element. Also, in case an audio element is represented by multiple audio channels, it may be better to render the audio element with a higher number of virtual loudspeakers so that all spatial information indicated by the multiple audio channels is rendered.
For example, in case an audio element has audio channels representing the spatial information in the vertical dimension, then the rendering setup needs virtual loudspeakers positioned so they can render the vertical as well as horizontal spatial information.
Therefore, it is desirable to adjust the number of virtual loudspeakers to use for rendering an audio element during the audio rendering process, based on the extent of the audio element (e.g., the height and/or the width of the audio element) and/or the position of the listener with respect to the audio element, in order to provide a plausible representation of the audio element.
In order to set or adjust the number of virtual loudspeakers to use for rendering an audio element based on the extent of the audio element and/or the position of the listener relative to the audio element, the azimuth angle (a.k.a., the width angle) and the elevation angle (a.k.a., the height angle) may be monitored and used to determine the number of virtual loudspeakers, for example as follows:
where NSP(i) is the number of virtual loudspeakers to use for rendering the audio element in the ith frame.
As discussed above, to provide a plausible representation of the audio element 102, it may be desirable to adjust the number of virtual loudspeakers to use for audio rendering based on the extent of the audio element and/or the position of the listener with respect to the audio element for every audio frame.
However, changing the number of virtual loudspeakers between frames may have a negative impact on gain stability between those frames. To overcome this negative impact, the overall gain of all virtual loudspeakers may follow a constant gain rule. In other words, regardless of whether and/or how the number of virtual loudspeakers is changed, the sum of the gains of all virtual loudspeakers should remain the same in each frame.
For example, in a scenario where there is one virtual loudspeaker in frame #1, which has a gain value of 1, if, in frame #2, the number of virtual loudspeakers is changed to three, then the sum of the gains of the three virtual loudspeakers should be 1. This constant-sum concept may be formulated as follows:
OVGi = SG1,i + SG2,i + . . . + SGNSP(i),i = OVGi−1  (1)

where i is an index of the current frame, OVGi is the overall gain (i.e., the sum of the gains of all virtual loudspeakers) in the ith frame, and SGn,i is the gain of the nth virtual loudspeaker in the ith frame.
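The constant-gain rule can be sketched as follows. Splitting the previous overall gain evenly over the new speakers is an illustrative choice; any split with the same sum satisfies the rule.

```python
def redistribute_gains(prev_gains, new_count):
    """Constant-gain rule: when the speaker count changes between frames,
    split the previous overall gain evenly over the new speakers so that
    the sum of all gains is unchanged (amplitude-preserving)."""
    overall = sum(prev_gains)
    return [overall / new_count] * new_count

frame1 = [1.0]                          # one speaker with gain 1
frame2 = redistribute_gains(frame1, 3)  # three speakers, gains summing to 1
```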
The above equation assumes that the signals going to each virtual loudspeaker are correlated. If the signals are completely uncorrelated, the gains may instead be adjusted according to a constant power rule. In other words, the gains may be adjusted in a way that preserves the energy rather than the amplitude. In most cases, the signals will be at least partly correlated, which means that preserving the amplitude may be desirable.
A more elaborate solution is to calculate the gain according to both the amplitude-preserving and the energy-preserving rules and to use a gain that is a balance between the two, depending on the actual amount of correlation between the channels of the signal.
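Such a balance can be sketched as follows. The linear interpolation between the two rules, weighted by a correlation measure in [0, 1], is a hypothetical blend; the disclosure does not prescribe a specific interpolation.

```python
import math

def blended_gain(n_speakers, correlation):
    """Per-speaker gain interpolated between the amplitude-preserving rule
    (1/N, appropriate for fully correlated feeds) and the power-preserving
    rule (1/sqrt(N), appropriate for fully uncorrelated feeds)."""
    amplitude_rule = 1.0 / n_speakers
    power_rule = 1.0 / math.sqrt(n_speakers)
    return correlation * amplitude_rule + (1.0 - correlation) * power_rule
```

With full correlation the gains sum to 1 (constant amplitude); with zero correlation the squared gains sum to 1 (constant energy).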
The gain adjustment method described above may be a complementary step and does not undermine the necessity of further gain adjustments in other steps of the renderer.
In some embodiments, the virtual loudspeakers setup may be further optimized by adapting the positions of the virtual loudspeakers to the horizontal and height angles.
where PSPn,i denotes the position of the nth virtual loudspeaker in the ith frame.
In some embodiments, the number of virtual loudspeakers to use for audio rendering may be selected from a group of predetermined values (e.g., 1, 3, 5, etc.), the selection depending on the width angle and the height angle.
When both the width angle and the height angle are very small, e.g., less than one or more threshold values, a single virtual loudspeaker may be used, i.e., the audio element may be rendered as a point source.
On the other hand, if only one of the width angle and the height angle is very small, e.g., less than one or more threshold values, three virtual loudspeakers may be used, corresponding to a 1D representation of the audio element.
When both the width angle and the height angle are large enough, five virtual loudspeakers may be used, corresponding to a 2D representation of the audio element.
The terms “too small” and “large enough” may be defined in terms of reducing or preventing the comb-filtering effect and the psychoacoustical hole. The terms may be defined mathematically as follows:
hc(i) = 1 if sin(α) ≥ Chthr and hc(i) = 0 otherwise; vc(i) = 1 if sin(β) ≥ Cvthr and vc(i) = 0 otherwise,

where hc(i) and vc(i) are flags in the ith frame and they are used for deciding the number of virtual loudspeakers.
Here, α = a/2, where a is the horizontal (width) angle, and β = e/2, where e is the vertical (height) angle; Chthr∈(0,1] and Cvthr∈(0,1] are the constants defining the ranges of the horizontal and height angles that are considered to be “too small” and/or “large enough.”
The reason why half of the width angle and half of the height angle are used to obtain hc(i) and vc(i) is that, theoretically, each of the width angle and the height angle can be any value greater than 0 but less than or equal to π (i.e., a & e∈(0, π]). Since sin(x) increases monotonically with x as long as x is between 0 and 90 degrees, dividing each of the width angle and the height angle by 2 ensures that α and β are within the range between 0 and 90 degrees (i.e., α & β∈(0, π/2]).
In some embodiments, the number of virtual loudspeakers in the ith frame may be formulated as below:

NSP(i) = 1 if hc(i) + vc(i) = 0; NSP(i) = 3 if hc(i) + vc(i) = 1; and NSP(i) = 5 if hc(i) + vc(i) = 2.
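The flag logic and the 1/3/5 selection can be sketched as follows. The threshold values of 0.2 are illustrative only; the disclosure leaves Chthr and Cvthr as tunable constants in (0, 1].

```python
import math

def speaker_count(width_angle, height_angle, c_h_thr=0.2, c_v_thr=0.2):
    """Flags hc/vc compare the sine of the half-angles against constants
    in (0, 1]; the 1/3/5 mapping mirrors the point / 1D / 2D
    representations described in the text."""
    alpha = width_angle / 2.0
    beta = height_angle / 2.0
    hc = 1 if math.sin(alpha) >= c_h_thr else 0
    vc = 1 if math.sin(beta) >= c_v_thr else 0
    return {0: 1, 1: 3, 2: 5}[hc + vc]
```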
Also, in some embodiments, the position of each virtual loudspeaker, PSPn,i, may be set as follows, depending on the representation of the audio element:
where PSPn,i is the position of the nth virtual loudspeaker in the ith frame,
centerpoint(x, y, z) is the position of the center point of the (point/1D/2D) representation 902, 904, 906, or 908 of the audio element 102, leftpoint(x, y, z) is the position of the left corner of the 1D representation 904, rightpoint(x, y, z) is the position of the right corner of the 1D representation 904, toppoint(x, y, z) is the position of the top corner of the 1D representation 906, and bottompoint(x, y, z) is the position of the bottom corner of the 1D representation 906. bottomleftpoint(x, y, z) is the position of the bottom left corner of the 2D representation 908, bottomrightpoint(x, y, z) is the position of the bottom right corner of the 2D representation 908, topleftpoint(x, y, z) is the position of the top left corner of the 2D representation 908, and toprightpoint(x, y, z) is the position of the top right corner of the 2D representation 908.
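The mapping from representation type to the named anchor points can be sketched as follows. The speaker ordering and the inclusion of the center speaker in the 2D case are assumptions for illustration.

```python
def speaker_positions(representation, pts):
    """Map a representation type to the anchor points named in the text
    (center point, end points, corners). Sketch only; a real renderer
    may order or select the speakers differently."""
    if representation == "point":
        return [pts["center"]]
    if representation == "1d_horizontal":
        return [pts["center"], pts["left"], pts["right"]]
    if representation == "1d_vertical":
        return [pts["center"], pts["top"], pts["bottom"]]
    if representation == "2d":
        return [pts["center"], pts["bottomleft"], pts["bottomright"],
                pts["topleft"], pts["topright"]]
    raise ValueError(representation)
```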
The gain adjustment of each virtual loudspeaker may be determined using the equation (1) discussed above.
The 2D representation of the audio element 102 may be made by combining the 1D horizontal representation 904 and the 1D vertical representation 906.
As discussed above, the number and/or the positions of virtual loudspeakers to use for rendering the audio element 102 may vary based on the size of the representation of the audio element 102 and/or a distance between the audio element 102 and the listener 104.
However, a sudden change in the number and/or the positions of the virtual loudspeakers may result in an undesirable artifact in the audio signal output for rendering the audio element. To reduce and/or prevent such undesirable artifact, it is desirable to provide a smooth transition from one virtual loudspeaker setup (that is associated with a particular number and particular positions of the virtual loudspeakers) to another virtual loudspeaker setup (that is associated with a different number and/or different positions of the virtual loudspeakers). Some embodiments of this disclosure provide a way to achieve a smooth transition between the different virtual loudspeaker setups.
A transition from the point representation 902 to the 1D representation 904 or 906 and a transition from the 1D representation 904 or 906 to the 2D representation 908 may be achieved by either transition scheme #1—transitioning from the point representation 902 to the 1D representation 904 (“1D horizontal representation”) and then to the 2D representation 908—or transition scheme #2—transitioning from the point representation 902 to the 1D representation 906 (“1D vertical representation”) and then to the 2D representation 908.
Thus, in some embodiments, an appropriate transition scheme for switching the representation of the audio element may be selected from the two transition schemes based on the width angle (e.g., 706) and/or the height angle (e.g., 704).
For example, in the VR environment 100, if the listener 104 moves closer to the audio element 102 in a particular direction, there may be a scenario where the width angle (706) changes at a rate faster than the rate at which the height angle (704) changes, and thus the width angle (706) will pass a threshold before the height angle (704) does. In such a scenario, the transition scheme #1 (transitioning from the point representation 902 to the 2D representation 908 via the 1D horizontal representation 904) may be applied.
On the other hand, if the listener 104 moves closer to the audio element 102 in a particular direction, there may be a scenario where the height angle (704) changes at a rate faster than the rate at which the width angle (706) changes, and thus, the height angle (704) will pass a threshold before the width angle (706) angle passes the threshold. In such scenario, the transition scheme #2—transitioning from the point representation 902 to the 2D representation 908 via the 1D vertical representation 906—may be applied.
There may also be a rare scenario where, as the listener 104 moves closer to the audio element 102, the height angle 704 and the width angle 706 change at the same rate, and thus pass the threshold at substantially the same time. In such a scenario, both transition schemes are applicable.
Once the transition scheme is selected, the selected transition scheme is continuously applied, regardless of whether there is a change as to which one of the height angle and the width angle changes faster, as long as the current height angle and the current width angle continue to be greater than or equal to a respective threshold. For example, because the rate of the width angle change is higher than the rate of the height angle change, the transition scheme #1 may be selected at time t=t0. There may then be a scenario where, at time t=t1, the rate of the height angle change becomes greater than the rate of the width angle change. In such a scenario, according to one embodiment, the transition scheme #1 continues to be applied as long as the current height angle and the current width angle remain greater than or equal to a respective threshold.
On the other hand, if, after time t=t0, the distance between the audio element 102 and the listener 104 increases such that, at time t=t1, the width angle is less than a width angle threshold and the height angle is less than a height angle threshold, then the transition scheme selected at time t=t0 is no longer applicable, and a new transition scheme will be selected according to the method described above.
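The scheme selection can be sketched as follows. This stateless version approximates "which angle crossed the threshold first" by which angle currently exceeds the threshold; a real renderer would track the crossing order across frames, as described above.

```python
def select_transition_scheme(width_angle, height_angle, threshold):
    """Choose which 1D representation to pass through on the way from the
    point representation to the 2D representation: scheme 1 goes via the
    1D horizontal representation, scheme 2 via the 1D vertical one."""
    if width_angle >= threshold and height_angle < threshold:
        return 1
    if height_angle >= threshold and width_angle < threshold:
        return 2
    return 1  # both (or neither) crossed: either scheme is applicable
```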
In scenarios where the width (e.g., 950) of the audio element 102 is greater than the height, the transition from the point representation (e.g., 902) to the 2D representation (e.g., 908) may be performed via the 1D horizontal representation (e.g., 904).
In such scenarios, if the initial representation of the audio element 102 was a point representation (e.g., 902), the representation may first be switched to the 1D horizontal representation 904 before being switched to the 2D representation 908.
In some embodiments, one way to increase the number of virtual loudspeakers to use for rendering the audio element 102 from one to three is by maintaining the virtual loudspeaker (e.g., 942) used for the point representation 902 at its position and adding two new virtual loudspeakers (e.g., 944 and 946) at the left and right ends of the 1D horizontal representation 904.
In order to make a smooth transition from the point representation 902 to the 1D horizontal representation 904, the gain of each of the newly added virtual loudspeakers 944 and 946 may be increased gradually. For example, in some embodiments, the gain of each of the newly added virtual loudspeakers 944 and 946 may be determined based on the width angle 972: SG2,i=ƒ(α)*SG2,i0 and SG3,i=ƒ(α)*SG3,i0, where SG2,i is the adjusted gain of the virtual loudspeaker 944 and SG3,i is the adjusted gain of the virtual loudspeaker 946. SG2,i0 and SG3,i0 are default gains and may be predefined. In some embodiments, the default gains may be 1. ƒ(α) is a gain adjustment factor which may vary between 0 and 1 (i.e., ƒ(α)∈[0,1]) based on α∈[0, π/2].
In some embodiments, ƒ(α) may be set to a constant value if α is less than a start threshold angle value (αst) but starts to increase (e.g., linearly, exponentially, etc.) from that constant value as α increases. When α reaches an end threshold angle value (αend), ƒ(α) may be set to another constant value. For example, in the linear case with the constants 0 and 1: ƒ(α)=0 for α≤αst; ƒ(α)=(α−αst)/(αend−αst) for αst<α<αend; and ƒ(α)=1 for α≥αend.
αst and αend may be adjustable between 0 and 90 degrees but may always need to satisfy the condition αst<αend.
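The piecewise ramp described above can be sketched as follows. The linear shape is one of the options mentioned in the text (exponential ramps would also fit); the function name is illustrative.

```python
def gain_ramp(alpha, alpha_st, alpha_end):
    """Piecewise-linear gain adjustment factor f(alpha): 0 below alpha_st,
    rising linearly to 1 at alpha_end, then held at 1. Requires
    alpha_st < alpha_end."""
    if alpha <= alpha_st:
        return 0.0
    if alpha >= alpha_end:
        return 1.0
    return (alpha - alpha_st) / (alpha_end - alpha_st)
```

The same shape, rescaled to a maximum of 0.5, can serve for the g(β) and g(α) factors used later in the transitions to the 2D representation.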
In other embodiments, gain adjustment factor ƒ(α) may also be a trigonometric function of α. For example, ƒ(α)=k*sin(α), where k is a constant controlling the pace of the transition.
After the representation of the audio source 102 is transitioned from the point source representation 902 to the 1D horizontal representation 904, there may be a scenario where the height angle 974 becomes greater. As the height angle 974 becomes greater, β (which is equal to the height angle/2) becomes greater, thereby becoming more significant. Once β becomes sufficiently significant, the representation of the audio element 102 may further be transitioned from the 1D horizontal representation 904 to the 2D representation 908.
The transition from the 1D horizontal representation 904 to the 2D representation 908 may begin by determining the boundary of the 2D representation 908 of the audio element 102. After determining the boundary of the 2D representation 908, two new virtual loudspeakers 947 and 948 may be added to the top left corner and the top right corner of the 2D representation 908.
Also the two virtual loudspeakers 944 and 946 that existed in the 1D horizontal representation 904 may be moved from their initial positions in the 1D horizontal representation 904 towards the bottom left corner and the bottom right corner of the 2D representation 908.
That is:
where PSP2,i and PSP3,i are the positions of the virtual loudspeakers 944 and 946 in the ith frame, and
PSP4,i and PSP5,i are the positions of the newly added virtual loudspeakers 947 and 948 in the ith frame.
βst and βend may be adjustable between 0 and 90 degrees but may always need to satisfy the condition βst<βend.
When transitioning from the 1D representation 904 to the 2D representation 908, initially, when the height angle 974 is substantially low, the positions of the virtual loudspeakers 944 and 946 remain the same with respect to the position of the virtual loudspeaker 942. However, as the height of the representation of the audio element 102 increases, the position of the virtual loudspeaker 944 moves toward the bottom left corner of the 2D representation 908. Similarly, as the height of the representation of the audio element 102 increases, the position of the virtual loudspeaker 946 moves toward the bottom right corner of the 2D representation 908.
For example, SG4,i=g(β)*SG4,i0 and SG5,i=g(β)*SG5,i0, where SG4,i and SG5,i are the gains of the newly added virtual loudspeakers 1114 and 1116 respectively, SG4,i0 and SG5,i0 are default gains that may be predefined, and g(β) is a gain adjustment factor function which varies between 0 and 0.5 (g(β)∈[0,0.5]) based on β∈[0, π/2].
The gain adjustment factor function g(β) may cause the gain change to occur at a particular height (elevation) angle. That is, at β=βst, g(β) starts to increase (e.g., linearly, exponentially, etc.) from 0, and at β=βend, g(β) reaches 0.5. In the linear case: g(β)=0 for β≤βst; g(β)=0.5*(β−βst)/(βend−βst) for βst<β<βend; and g(β)=0.5 for β≥βend.
Also, to preserve the stability of the overall gain of all virtual loudspeakers, as the gains of the two new virtual loudspeakers 1114 and 1116 increase (e.g., during the transition from the intermediate 2D representation 1106 to the 2D representation 1108), the gains of the two virtual loudspeakers that existed in the 1D representation 1104 (the virtual loudspeakers 1112 and 1118) may be attenuated gradually using: SG2,i=(1−g(β))*SG2,i0 and SG3,i=(1−g(β))*SG3,i0,
where SG2,i and SG3,i are the gains of the existing virtual loudspeakers 1112 and 1118 respectively. SG2,i0 and SG3,i0 are default gains that may be predefined.
As discussed above, this gain adjustment method may be a complementary step and does not undermine the necessity of further gain adjustments in other steps of the renderer.
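The crossfade between the existing and the newly added speaker pair can be sketched as follows. Default per-speaker gains of 1 are assumed (as the text allows), so the overall gain of the four outer speakers stays constant throughout the transition.

```python
def transition_gains(g_beta):
    """Gains of the four outer speakers during the 1D -> 2D transition:
    the two new speakers fade in with g(beta) in [0, 0.5] while the two
    existing ones are attenuated by (1 - g(beta))."""
    existing = 1.0 - g_beta   # each of the two pre-existing speakers
    added = g_beta            # each of the two newly added speakers
    return [existing, existing, added, added]
```

For any g(β), the sum of the four gains equals 2, so the constant-gain rule of equation (1) is satisfied without any further normalization.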
In scenarios where the height of the audio element is greater than or equal to the width (i.e., width<height or width=height), the transition from the point representation (e.g., 902) to the 2D representation (e.g., 908) may be performed via the 1D vertical representation (e.g., 906).
That is, for the transition from the point representation 902 to the 1D vertical representation 906, the positions of the two newly added virtual loudspeakers 982 and 984 may be set as follows: PSP2,i = toppoint(x, y, z) and PSP3,i = bottompoint(x, y, z),
where PSP2,i is the position of the virtual loudspeaker 982 and PSP3,i is the position of the virtual loudspeaker 984 in the ith frame.
To make the transition from the point representation 902 to the 1D vertical representation 906 smooth, the gain of each of the newly added virtual loudspeakers 982 and 984 may gradually increase. This gain adjustment may be determined based on the height (elevation) angle, for example: SG2,i=ƒ(β)*SG2,i0 and SG3,i=ƒ(β)*SG3,i0,
where ƒ(β) is a gain adjustment factor which varies between 0 and 1 (ƒ(β)∈[0,1]) based on β∈[0, π/2], SG2,i0 is the default gain of the virtual loudspeaker 982, and SG3,i0 is the default gain of the virtual loudspeaker 984.
The gain adjustment factor function ƒ(β) may cause the gain change to occur at a particular height (elevation) angle. That is, at β=βst, ƒ(β) starts to increase (e.g., linearly, exponentially, etc.) from 0, and at β=βend, ƒ(β) reaches 1. In the linear case: ƒ(β)=0 for β≤βst; ƒ(β)=(β−βst)/(βend−βst) for βst<β<βend; and ƒ(β)=1 for β≥βend.
βst and βend can vary between 0 and 90 degrees, with the condition βst<βend.
As α becomes significant, the transition from the 1D representation 906 to the 2D representation 908 may begin to occur by adding two virtual loudspeakers 986 and 988 at the top left and bottom left corners of the 2D representation 908 and moving the two already added virtual loudspeakers 982 and 984 from their initial positions towards the top right and bottom right corners of the 2D representation 908, respectively. That is: PSP4,i = topleftpoint(x, y, z) and PSP5,i = bottomleftpoint(x, y, z),
where PSP4,i is the position of the virtual loudspeaker 986 and PSP5,i is the position of the virtual loudspeaker 988 in the ith frame.
PSP2,i and PSP3,i, the positions of the virtual loudspeakers 982 and 984, may be moved gradually from their initial positions towards the top right corner and the bottom right corner of the 2D representation 908, respectively, as α increases.
In one example, the gain of each of the virtual loudspeakers 1226 and 1228 may be set as follows: SG4,i=g(α)*SG4,i0 and SG5,i=g(α)*SG5,i0,
where g(α) is a gain adjustment factor which may vary between 0 and 0.5 (g(α)∈[0,0.5]) based on α∈[0, π/2].
SG4,i0 is the default gain of the virtual loudspeaker 1226, and SG5,i0 is the default gain of the virtual loudspeaker 1228.
An example function for the gain adjustment factor g(α) is shown below: g(α)=0 for α≤αst; g(α)=0.5*(α−αst)/(αend−αst) for αst<α<αend; and g(α)=0.5 for α≥αend.
As shown above, the gain adjustment factor remains 0 until α reaches a lower threshold value αst, i.e., until the width angle reaches a certain threshold angle. Once the width angle reaches the threshold angle, and thus α reaches the lower threshold value αst, g(α) increases (e.g., linearly, exponentially, etc.) from 0 to 0.5 as α increases from the lower threshold value αst to the higher threshold value αend. Once α reaches the higher threshold value αend, g(α) is set to 0.5 regardless of whether α further increases beyond αend.
In order to preserve the stability of the overall gain of all virtual loudspeakers, as the gain of the virtual loudspeakers 1226 and 1228 increases, the gain of the pre-existing two virtual loudspeakers 1222 and 1224 may be attenuated gradually using: SG2,i=(1−g(α))*SG2,i0 and SG3,i=(1−g(α))*SG3,i0,
where SG2,i is the gain of the virtual loudspeaker 1222 and SG3,i is the gain of the virtual loudspeaker 1224. Similarly, SG2,i0 is the default gain of the virtual loudspeaker 1222 and SG3,i0 is the default gain of the virtual loudspeaker 1224. The default gains may be predetermined.
The transition methods explained above are not limited to performing the transition from the point representation 1202 to the 1D representation 1204 and then from the 1D representation 1204 to the 2D representation 1208. They are also applicable to the scenario where, during the transition from the point representation to the 1D horizontal representation, the transition from the 1D horizontal representation to the 2D representation starts.
In the embodiments shown in
where PSP
Also as shown in
The number of virtual loudspeakers shown in
The point representation 1302 of the audio element 102 may be achieved by setting the gain of each of the virtual loudspeakers 1322, 1324, 1326, and 1328 low while setting the gain of the center virtual loudspeaker 1330 high relative to the gain of the remaining loudspeakers. For example, the gain of each of the virtual loudspeakers 1322, 1324, 1326, and 1328 may be set to zero or close to zero. By setting the gain of the center virtual loudspeaker 1330 high while setting the gain of the remaining four loudspeakers low, the audio element 102 will be perceived by the listener as a point source.
In order to switch from the point representation 1302 to the 2D representation 1308, there is no need to change the number of the virtual loudspeakers because the point source representation 1302 of the audio element 102 includes the number of virtual loudspeakers (e.g., in
Thus, only the gain of each of the virtual loudspeakers needs to be adjusted to switch the representation of the audio element 102 from the point representation 1302 to the 2D representation 1308. However, increasing the gain of each of the virtual loudspeakers 1322, 1324, 1326, and 1328 suddenly to create the 2D representation 1308 may result in an undesirable artifact in the audio signal output for rendering the audio element 102. Thus, to smooth the transition from the point source representation 1302 to the 2D representation 1308, the gain of each of the virtual loudspeakers 1322, 1324, 1326, and 1328 may be increased gradually, thereby going through the first and second intermediate representations 1304 and 1306.
In some embodiments, the degree of adjusting the gains may depend on the width (azimuth) angle 706 and the height (elevation) angle 704 (e.g., linearly, exponentially or trigonometrically). For example,
where SG2,i is the gain of the virtual loudspeaker 1322, SG3,i is the gain of the virtual loudspeaker 1324, SG4,i is the gain of the virtual loudspeaker 1326, SG5,i is the gain of the virtual loudspeaker 1328, SG2,i0 is the default gain of the virtual loudspeaker 1322, SG3,i0 is the default gain of the virtual loudspeaker 1324, SG4,i0 is the default gain of the virtual loudspeaker 1326, and SG5,i0 is the default gain of the virtual loudspeaker 1328.
As explained above,
Also, r is a constant that controls the transition rate (i.e., how fast or slow the transition from the point representation 1302 to the 2D representation 1308 occurs). In one example, r may be set such that 0≤r*sin(α)*sin(β)≤1.
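The disclosure's exact gain expressions are not reproduced above; the following sketch assumes only that a surrounding loudspeaker's gain scales its default by the coefficient r*sin(α)*sin(β), with the coefficient clamped so that it stays within [0, 1] as stated:

```python
import math

def surround_gain(default_gain, alpha, beta, r):
    """Sketch: fade a surrounding virtual loudspeaker with both angles.

    Assumes the gain scales the default gain by r*sin(alpha)*sin(beta);
    r should be chosen so that 0 <= r*sin(alpha)*sin(beta) <= 1, and the
    coefficient is clamped to [0, 1] here for safety.
    """
    coeff = r * math.sin(alpha) * math.sin(beta)
    coeff = min(max(coeff, 0.0), 1.0)  # keep the coefficient in [0, 1]
    return coeff * default_gain
```

When either angle is zero the coefficient vanishes, which reproduces the point representation.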
Even though
In another alternative embodiment, the transition from the point representation to the 2D representation may be made using nine virtual loudspeakers—1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430—as shown in
where PSP
centerpoint(x, y, z) is the center point of the 2D representation 1400 of the audio element 102, leftedgepoint(x, y, z) is the center point of the left side of the 2D representation 1400, rightedgepoint(x, y, z) is the center point of the right side of the 2D representation 1400, topedgepoint(x, y, z) is the center point of the top side of the 2D representation 1400, bottomedgepoint(x, y, z) is the center point of the bottom side of the 2D representation 1400, topleftpoint(x, y, z) is the position of the top left corner of the 2D representation 1400, bottomleftpoint(x, y, z) is the position of the bottom left corner of the 2D representation 1400, toprightpoint(x, y, z) is the position of the top right corner of the 2D representation 1400, and bottomrightpoint(x, y, z) is the position of the bottom right corner of the 2D representation 1400.
Like the embodiments shown in
Thus, to smooth the transition from the point source representation to the 2D representation, the gain of each of the virtual loudspeakers may be adjusted gradually, thereby going through the first and second intermediate representations 1404 and 1406.
In some embodiments, the degree of adjusting the gains may depend on the azimuth angle 122 and the elevation angle 124 (e.g., linearly, exponentially or trigonometrically). For example,
where SG1,i is the gain of the virtual loudspeaker 1430, SG2,i is the gain of the virtual loudspeaker 1422, SG3,i is the gain of the virtual loudspeaker 1423, SG4,i is the gain of the virtual loudspeaker 1424, SG5,i is the gain of the virtual loudspeaker 1425, SG6,i is the gain of the virtual loudspeaker 1426, SG7,i is the gain of the virtual loudspeaker 1427, SG8,i is the gain of the virtual loudspeaker 1428, and SG9,i is the gain of the virtual loudspeaker 1429.
Similarly, SG1,i0 is the default gain of the virtual loudspeaker 1430, SG2,i0 is the default gain of the virtual loudspeaker 1422, SG3,i0 is the default gain of the virtual loudspeaker 1423, SG4,i0 is the default gain of the virtual loudspeaker 1424, SG5,i0 is the default gain of the virtual loudspeaker 1425, SG6,i0 is the default gain of the virtual loudspeaker 1426, SG7,i0 is the default gain of the virtual loudspeaker 1427, SG8,i0 is the default gain of the virtual loudspeaker 1428, and SG9,i0 is the default gain of the virtual loudspeaker 1429. Each of the default gains may be predetermined.
d may be a variable that controls how fast/slow to fade-in and/or fade-out the virtual loudspeakers 1426-1429 and p may be a variable that controls how fast/slow to fade-in and/or fade-out the virtual loudspeakers 1422-1425. In some embodiments, both d and p are chosen such that:
In the above embodiments, the gain of the virtual loudspeakers 1422-1429 that surround the center virtual loudspeaker 1430 is faded-in as either the width angle or the height angle increases (by using the coefficient p*sin(α) or p*sin(β)) and faded-out as both the width angle and the height angle decrease (by using the coefficient (1−d*sin(α)*sin(β))).
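The two coefficient families can be sketched together. The mapping below of coefficients to speaker groups is an assumption for illustration (the text does not state it verbatim): horizontal edge speakers follow p*sin(α), vertical edge speakers follow p*sin(β), and the complementary term (1−d*sin(α)*sin(β)) fades the earlier configuration out as both angles grow:

```python
import math

def fade_coefficients(alpha, beta, p, d):
    """Sketch of the fade coefficients for the nine-loudspeaker layout.

    Returns the fade-in coefficients for the edge speakers and the
    complementary fade-out coefficient; all values are clamped to [0, 1],
    consistent with choosing p and d so the coefficients stay bounded.
    """
    clamp = lambda x: min(max(x, 0.0), 1.0)
    return {
        "edge_horizontal": clamp(p * math.sin(alpha)),   # left/right speakers
        "edge_vertical": clamp(p * math.sin(beta)),      # top/bottom speakers
        "fade_out": clamp(1.0 - d * math.sin(alpha) * math.sin(beta)),
    }
```

At α=β=0 the edge coefficients vanish and the fade-out coefficient is 1, reproducing the point representation.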
As shown in
Orientation sensing unit 1501 is configured to detect a change in the orientation of the listener and provides information regarding the detected change to processing unit 1503. In some embodiments, processing unit 1503 determines the absolute orientation (in relation to some coordinate system) given the detected change in orientation detected by orientation sensing unit 1501. There could also be different systems for determining orientation and position, e.g., a system using lighthouse trackers (lidar). In one embodiment, orientation sensing unit 1501 may determine the absolute orientation (in relation to some coordinate system) given the detected change in orientation. In this case the processing unit 1503 may simply multiplex the absolute orientation data from orientation sensing unit 1501 and positional data from position sensing unit 1502. In some embodiments, orientation sensing unit 1501 may comprise one or more accelerometers and/or one or more gyroscopes.
Audio renderer 1551 produces the audio output signals based on input audio signals 1561, metadata 1562 regarding the XR scene the listener is experiencing, and information 1563 about the location and orientation of the listener. The metadata 1562 for the XR scene may include metadata for each object and audio element included in the XR scene, and the metadata for an object may include information about the dimensions of the object. The metadata 1562 may also include control information, such as a reverberation time value, a reverberation level value, and/or an absorption parameter. Audio renderer 1551 may be a component of XR device 1510 or it may be remote from the XR device 1510 (e.g., audio renderer 1551, or components thereof, may be implemented in the so called “cloud”).
Directional mixer receives audio input 1561, which in this example includes a pair of audio signals 1701 and 1702 associated with an audio element (e.g. the audio element associated with extent), and produces a set of k virtual loudspeaker signals (VS1, VS2, . . . , VSk) based on the audio input and control information 1791. In one embodiment, the signal for each virtual loudspeaker can be derived by, for example, the appropriate mixing of the signals that comprise the audio input 1561. For example: VS1=α×L+β×R, where L is input audio signal 1701, R is input audio signal 1702, and α and β are factors that are dependent on, for example, the position of the listener relative to the audio element and the position of the virtual loudspeaker to which VS1 corresponds.
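The mixing relation VS1=α×L+β×R can be sketched directly. The factors depend on listener and loudspeaker positions in the disclosure; here they are passed in as plain parameters for illustration:

```python
def mix_virtual_speaker(left, right, a, b):
    """Derive one virtual loudspeaker signal as a weighted mix of the
    stereo input: VS = a*L + b*R, computed sample by sample."""
    return [a * l + b * r for l, r in zip(left, right)]
```

One such mix would be computed per virtual loudspeaker, each with its own pair of factors.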
Gain adjuster 1706 may adjust the gain of any one or more of the virtual loudspeaker signals based on control information 1792, which may include the above described gain factors as calculated by controller 1601. That is, for example, when the middle speaker is placed close to another speaker (e.g., left speaker 202 as shown in
Using virtual loudspeaker signals VS1, VS2, . . . , VSk, speaker signal producer 1508 produces output signals (e.g., output signal 1581 and output signal 1582) for driving speakers (e.g., headphone speakers or other speakers). In one embodiment where the speakers are headphone speakers, speaker signal producer 1508 may perform conventional binaural rendering to produce the output signals. In embodiments where the speakers are not headphone speakers, speaker signal producer 1508 may perform conventional speaker panning to produce the output signals.
In some embodiments, the size of the representation is a width of the representation and/or a height of the representation, the method comprises determining (i) a width angle value associated with the width of the representation and the distance and/or (ii) a height angle value associated with the height of the representation and the distance, and the number of the virtual loudspeakers to use for rendering the audio element is determined based on the width angle value and/or the height angle value.
In some embodiments, the method further comprises (i) comparing the width angle value with a first threshold value; and (ii) comparing the height angle value with a second threshold value, wherein the number of the virtual loudspeakers to use for rendering the audio element is determined based on the comparison (i) and/or the comparison (ii).
In some embodiments, the number of the virtual loudspeakers to use for rendering the audio element is determined to be a first value if (i) the width angle value is less than the first threshold value and (ii) the height angle value is less than the second threshold value. The number of the virtual loudspeakers to use for rendering the audio element is determined to be a second value if (i) the width angle value is greater than or equal to the first threshold value and (ii) the height angle value is less than the second threshold value. The number of the virtual loudspeakers to use for rendering the audio element is determined to be the second value if (i) the width angle value is less than the first threshold value and (ii) the height angle value is greater than or equal to the second threshold value. The number of the virtual loudspeakers to use for rendering the audio element is determined to be a third value if (i) the width angle value is greater than or equal to the first threshold value and (ii) the height angle value is greater than or equal to the second threshold value.
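The threshold logic above reduces to two comparisons. A minimal sketch; the concrete counts (`first`, `second`, `third`) are illustrative placeholders, since the disclosure leaves the actual values open:

```python
def speaker_count(width_angle, height_angle, width_thresh, height_thresh,
                  first=1, second=3, third=5):
    """Choose the number of virtual loudspeakers from the two comparisons.

    first/second/third are illustrative counts (e.g. point, 1D, and 2D
    configurations).
    """
    wide = width_angle >= width_thresh
    tall = height_angle >= height_thresh
    if wide and tall:
        return third   # both angles exceed their thresholds: 2D layout
    if wide or tall:
        return second  # exactly one angle exceeds its threshold: 1D layout
    return first       # neither does: point representation
```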
In some embodiments, the width angle value is determined based on sin(c×a/2) or the height angle value is determined based on sin(c×e/2), where c is a constant, a is an angle formed by a line between the listener and a first point on a first side of the representation and a line between the listener and a second point on a second side of the representation, the first side being opposite to the second side, and e is an angle formed by a line between the listener and a third point on a third side of the representation and a line between the listener and a fourth point on a fourth side of the representation, the third side being opposite to the fourth side.
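The angle subtended at the listener by two opposite-side points can be computed with standard vector geometry. A sketch (the function name and argument layout are illustrative, not from the disclosure):

```python
import math

def subtended_angle(listener, point_a, point_b):
    """Angle at the listener between the lines to two opposite-side points,
    computed from the dot product of the two direction vectors."""
    va = [a - l for a, l in zip(point_a, listener)]
    vb = [b - l for b, l in zip(point_b, listener)]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    # Clamp to avoid domain errors from floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))
```

Applying this to the left/right side points yields a, and to the top/bottom side points yields e.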
In some embodiments, the method further comprises determining positions of the virtual loudspeakers, wherein the positions of the virtual loudspeakers are determined based on a boundary of the representation.
In some embodiments, the determined number of the virtual loudspeakers is one, and the position of the virtual loudspeaker is the center of the representation.
In some embodiments, the determined number of the virtual loudspeakers is more than two, and the virtual loudspeakers comprise a first virtual loudspeaker, a second virtual loudspeaker, and third virtual loudspeaker. A position of the first virtual loudspeaker is the center of the representation, and a position of the second virtual loudspeaker and a position of the third virtual loudspeaker are symmetric with respect to a line through the position of the first virtual loudspeaker. For example, the position of the first virtual speaker is a center point between the position of the second virtual loudspeaker and the position of the third virtual loudspeaker.
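The symmetric three-speaker placement can be sketched as follows, assuming the outer pair is offset from the center along a single direction vector (the vector representation is an illustrative choice):

```python
def three_speaker_positions(center, half_width_vector):
    """Place three virtual loudspeakers: one at the center of the
    representation and two placed symmetrically about it, so the center
    speaker is the midpoint of the outer pair."""
    left = tuple(c - h for c, h in zip(center, half_width_vector))
    right = tuple(c + h for c, h in zip(center, half_width_vector))
    return left, center, right
```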
In some embodiments, the method further comprises obtaining changed distance information indicating a changed distance between the audio element and the listener, and based on the size information and the changed distance information, re-determining a number of virtual loudspeakers to use for rendering the audio element.
In some embodiments, the determined number of the virtual loudspeakers is 1 and the virtual loudspeakers of which the number is determined includes a first virtual loudspeaker, the redetermined number of the virtual loudspeakers is 3 and the virtual loudspeakers of which the number is redetermined includes the first virtual loudspeaker, a second virtual loudspeaker, and a third virtual loudspeaker, and an audio gain associated with the second virtual loudspeaker and/or an audio gain associated with the third virtual loudspeaker is a function of an angle (a or e) formed by a line between the listener and a position of the second virtual loudspeaker and a line between the listener and a position of the third virtual loudspeaker.
In some embodiments, the function is equal to
where each of c1 and c2 is a constant.
In some embodiments, the method further comprises obtaining changed distance information indicating a changed distance between the audio element and the listener; and based on the size information and the changed distance information, obtaining an updated representation of the audio element and determining an updated number of virtual loudspeakers to use for the updated representation of the audio element.
In some embodiments, the determined representation of the audio element is a one-dimensional, 1D, representation of the audio element, and the determined updated representation of the audio element is a two-dimensional, 2D, representation of the audio element.
In some embodiments, the 1D representation of the audio element comprises a first virtual loudspeaker, a second virtual loudspeaker, and a third virtual loudspeaker, the 2D representation of the audio element comprises the first virtual loudspeaker, the second virtual loudspeaker, and the third virtual loudspeaker, a fourth virtual loudspeaker, and a fifth virtual loudspeaker, and the method further comprises (i) moving the second virtual loudspeaker from a first coordinate towards a first boundary coordinate of the updated representation of the audio element and (ii) moving the third virtual loudspeaker from a second coordinate towards a second boundary coordinate of the updated representation of the audio element.
In some embodiments, a current coordinate of the second virtual loudspeaker depends on (the first coordinate×(1−f(e))+(the first boundary coordinate×f(e)), a current coordinate of the third virtual loudspeaker depends on (the second coordinate×(1−f(e))+(the second boundary coordinate×f(e)), and e is a value of an angle related to a width or a height of the 2D representation. f(e) is a function of the value e. One example of f(e) is
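The coordinate update above is a per-axis linear interpolation between the starting coordinate and the boundary coordinate, weighted by f(e). A minimal sketch of that relation:

```python
def interpolated_coordinate(start, boundary, f_e):
    """Move a virtual loudspeaker from its starting coordinate toward a
    boundary coordinate of the updated representation:
    current = start*(1 - f(e)) + boundary*f(e), applied per axis."""
    return tuple(s * (1.0 - f_e) + b * f_e for s, b in zip(start, boundary))
```

As f(e) grows from 0 to 1, the loudspeaker moves from its 1D-representation position to the boundary of the 2D representation.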
In some embodiments, the method further comprises determining an audio gain associated with the fourth virtual loudspeaker and/or an audio gain associated with the fifth virtual loudspeaker, wherein the audio gain associated with the fourth virtual loudspeaker and/or the audio gain associated with the fifth virtual loudspeaker is a function, ƒ, of (i) a width angle associated with the width of the updated representation of the audio element and the distance and/or (ii) a height angle associated with the height of the updated representation of the audio element and the distance.
In some embodiments, the function is
p is equal to (c1×the width angle or the height angle), pst is a lower threshold value, pend is a higher threshold value, c1 is a constant, and g(p) is a function whose output value increases as p increases. g(p) is greater than 0 but less than or equal to 0.5.
In some embodiments, the audio gain associated with the second virtual loudspeaker and/or the audio gain associated with the third virtual loudspeaker is set based on (1−f(p)).
In some embodiments, the determined representation of the audio element is a point representation of the audio element, and the determined updated representation of the audio element is a two-dimensional, 2D, representation of the audio element.
In some embodiments, the point representation of the audio element comprises a first virtual loudspeaker, and the 2D representation of the audio element comprises the first virtual loudspeaker, a second virtual loudspeaker, a third virtual loudspeaker, a fourth virtual loudspeaker, and a fifth virtual loudspeaker. The method further comprises moving one or more of the second virtual loudspeaker, the third virtual loudspeaker, the fourth virtual loudspeaker, and the fifth virtual loudspeaker using a moving path function, and the moving path function is a function of (i) a width angle associated with the width of the updated representation of the audio element and the distance and (ii) a height angle associated with the height of the updated representation of the audio element and the distance.
In some embodiments, the moving path function is a function of
where each of c1 and c2 is a constant.
While various embodiments are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of this disclosure should not be limited by any of the above described exemplary embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/EP2022/078163 | 10/11/2022 | WO |

Number | Date | Country
---|---|---
63254389 | Oct 2021 | US