The present invention relates generally to audio processing.
Immersive audio is an essential media component of extended reality (XR) applications, which include augmented reality (AR), mixed reality (MR) and virtual reality (VR). To enhance the user experience, immersive audio may support adjusting the presented immersive audio/visual scene in response to motion of the user. For example, it may be desirable to track a user's head position and head movement during audio rendering and to adjust the audio accordingly. Thus, an immersive audio experience may process head movements using models with three degrees of freedom (3DoF) or six degrees of freedom (6DoF).
Various immersive audio services, e.g., immersive voice and audio services (IVAS), may be used to render high quality audio renditions at the XR device that include awareness of pose information, which may include metadata for head positions with relative or absolute movements of the user. However, making such adjustments according to pose information may require significant computational processing capabilities to achieve a high-quality immersive audio experience.
The computational complexity requirements for immersive audio may be problematic for small form factor devices such as AR glasses. To make them as practical and user-friendly as possible, such AR glasses may avoid using powerful processors and heavy batteries, which may otherwise result in bulky, more expensive, and heavy weight user-worn devices that consume more power and generate a significant amount of heat. Consequently, to enable reasonable form factor low power operation with low latency, such AR devices tend to have processors with reduced complexity and constrained numerical operations.
The present disclosure recognizes the above noted problems and explores potential solutions. One potential solution is to reduce audio rendering requirements at the end-device (e.g., the AR device operated by the user) with a split-rendering topology that leverages processing from some other entity of the mobile/wireless network (e.g., a network based device) to which the end-device is connected or tethered (e.g., via a network or cloud-based connection). For example, a powerful network entity such as mobile user equipment (e.g., UE, a device used by an end-user, a portable multi-function device, a gaming console, a cloud-based resource, etc.) may be connected to the end-device to assist in split-rendering of immersive audio. Pose information based on the user movement may be gathered at the end-device and transmitted to the network entity. The end-device may then only receive the already rendered audio from the network entity, while the high complexity calculations such as processing 3DoF/6DoF pose information (e.g., head-tracking metadata) may be performed by the rendering entity (e.g., network entity). One problem with the described split-rendering topology is that the latency for transmissions between the end-device and the network entity may be on the order of 100 ms, which means the network entity may be relying on outdated pose/head-tracking information. Because of this delay, the rendered audio from the network entity may not match the current head pose/head position of the user at the end-device. If the motion-to-sound latency is too large, the end user will experience a perceivable loss of quality in the immersive experience.
Document U.S. 63/340,181 discloses a novel approach to interactive headtracking. The described approach generates multiple binaural representations corresponding to various head poses at the main device or pre-renderer and computes metadata which can be used along with a reference binaural signal to reconstruct binaural output corresponding to any given pose at the post-renderer. The reference binaural signal and the metadata are sent to a post-rendering device. Based on the received binaural signal and metadata, and on a difference between a reference pose and a detected current head pose of the user, the post-renderer determines binaural audio corresponding to the current head pose. The present disclosure appreciates that the metadata requirements for head-pose information required in this type of solution may be significant. For example, if the current head pose deviates significantly from the reference head-pose, a large amount of metadata would be sent to the post-rendering device to cover all possible head poses.
It is with respect to these and other considerations that the disclosure made herein is presented.
Disclosed herein are techniques for split-rendering of immersive audio.
It is an object of the present invention to overcome this problem, and to enable efficient split rendering also in a situation where the head pose of the user is expected to change considerably.
In some embodiments, a method of processing audio in a main device is described, the method comprising receiving a first bitstream, decoding the first bitstream to obtain decoded immersive audio content, receiving a second bitstream, decoding the second bitstream to obtain pose information relating to a user of a lightweight processing device, determining a first head-pose, based on the pose information, rendering a downmix representation of the immersive audio content corresponding to the first head pose, selecting a second set of head poses with respect to the first head pose, rendering a set of binaural representations of the immersive audio content, the binaural representations corresponding to the second set of poses, computing reconstruction metadata enabling reconstruction of the set of binaural representations from the downmix representation, the metadata including the first head pose, encoding the downmix representation and the reconstruction metadata in a third bitstream, and outputting the third bitstream.
In some additional embodiments, a method of processing audio in a lightweight processing device is described, the method comprising receiving a bitstream from a main device, decoding the bitstream to obtain a downmix representation of an immersive audio content associated with a first head pose, and first reconstruction metadata, enabling reconstruction of a set of binaural representations from the downmix representation, the set of binaural representations being associated with a set of second head poses, the reconstruction metadata including the first head pose, and obtaining the set of second head poses with which the first reconstruction metadata is associated. The method further comprises detecting a current head pose of a user of the lightweight processing device, transmitting the current head pose to the main device, and computing output binaural audio based on the downmix representation, the first reconstruction metadata, the set of second head poses, and a relationship between the first head pose and the current head pose.
In still some embodiments, the downmix representation is a first binaural representation. In other embodiments, the downmix representation includes a mono signal formed by a combination of channels in a multichannel representation of the immersive audio content.
A “lightweight processing device” is intended to include any user device that has limited capabilities, and therefore may be unsuitable for binaural rendering in real time. In some examples, a “lightweight processing device” refers to the physical weight of the device. In other examples, a “lightweight processing device” refers to the processing capabilities of the device. A typical example lightweight device may have limited battery capacity and limited processing capabilities so that the physical device may be maintained in a small form factor.
Existing techniques for head-tracked split rendering require more processing resources than necessary, wasting device energy and requiring costly physical components (e.g., powerful processors requiring large heatsinks or active cooling components), which often result in heavy and cumbersome devices. These considerations are particularly important in battery operated devices and wearable devices.
Accordingly, the herein disclosed techniques provide electronic devices with faster, more efficient methods for head-tracked split rendering. Such methods optionally complement or replace other methods for head-tracked split rendering. For battery-operated and wearable computing devices, such methods conserve power, increase the time between battery charges, and enable construction of more comfortable devices at reduced cost.
In accordance with some embodiments, a method performed at one or more electronic devices is described. The method comprises: receiving, by a first, main processing device, an immersive audio, obtaining (current) user pose information; determining, by the first device, from the immersive audio, a downmixed signal including at least one channel; determining, by the first device, a set of N (e.g., N≥1) predicted poses based on the obtained user pose information; determining, by the first device, from the immersive audio, a set of binaural representations corresponding to the set of N predicted poses; generating, by the first device, from the downmix signal and from at least one of the set of binaural representations and a metadata model, a metadata; and providing, by the first device to a second, lightweight processing device different from the first device. In accordance with some embodiments, obtaining user pose information is performed at least in part by a second device, and includes providing (e.g., transmitting) data corresponding to the obtained user pose information from the second device to the first device.
In accordance with some embodiments, the method includes rendering, by a renderer of the second device, the downmixed signal into output binaural audio based at least in part on the metadata, the obtained user pose information, and updated user pose information. In accordance with some embodiments, the downmix signal is a binaural signal generated using: a set of HRTFs or a set of BRIRs; and the obtained user pose information. In accordance with some embodiments, determining a set of predicted poses includes calculating N poses corresponding to N predicted angles along the yaw axis, herein referred to as yaw angles, by: modifying a head pose yaw angle derived from the obtained user pose information by a first pre-determined value (e.g., an angle specified in degrees or radians) in a first direction to obtain a first predicted yaw angle of the N predicted yaw angles. In accordance with some embodiments, the method includes modifying the pose yaw angle derived from the obtained user pose information by a second pre-determined value in a second direction (e.g., anti-clockwise, clockwise) to obtain a second predicted yaw angle of the N predicted yaw angles.
In accordance with some embodiments, a non-transitory computer-readable storage medium is described. The non-transitory computer-readable storage medium stores one or more computer programs configured to be executed by one or more processors of a computing apparatus, the one or more computer programs including instructions for: receiving, by a first device, an immersive audio, obtaining user pose information; determining, by the first device, from the immersive audio, a downmixed signal including at least one channel; determining, by the first device, a set of N (e.g., N≥1) predicted poses based on the obtained user pose information; determining, by the first device, from the immersive audio, a set of binaural representations corresponding to the set of N predicted poses; generating, by the first device, from the downmix signal and from at least one of the set of binaural representations and a metadata model, a metadata; and providing, by the first device to a second device different from the first device. In accordance with some embodiments, obtaining user pose information is performed at least in part by a second device, and includes providing (e.g., transmitting) data corresponding to the obtained user pose information from the second device to the first device.
In accordance with some embodiments, the one or more computer programs include instructions for rendering, by a renderer of the second device, the downmixed signal into output binaural audio based at least in part on the metadata, the obtained user pose information, and updated user pose information. In accordance with some embodiments, the downmix signal is a binaural signal generated using: a set of HRTFs or a set of BRIRs; and the obtained user pose information. In accordance with some embodiments, the one or more computer programs include instructions for determining a set of predicted poses, including calculating N poses corresponding to N predicted yaw angles by: modifying a pose yaw angle derived from the obtained user pose information by a first pre-determined value (e.g., an angle specified in degrees or radians) in a first direction to obtain a first predicted yaw angle of the N predicted yaw angles. In accordance with some embodiments, the one or more computer programs include instructions for modifying the pose yaw angle derived from the obtained user pose information by a second pre-determined value in a second direction (e.g., anti-clockwise, clockwise) to obtain a second predicted yaw angle of the N predicted yaw angles.
In accordance with some embodiments, an apparatus is described. The apparatus comprises one or more processors; and memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including instructions for: receiving, by a first device, an immersive audio, obtaining user pose information; determining, by the first device, from the immersive audio, a downmixed signal including at least one channel; determining, by the first device, a set of N (e.g., N≥1) predicted poses based on the obtained user pose information; determining, by the first device, from the immersive audio, a set of binaural representations corresponding to the set of N predicted poses; generating, by the first device, from the downmix signal and from at least one of the set of binaural representations and a metadata model, a metadata; and providing, by the first device to a second device different from the first device. In accordance with some embodiments, obtaining user pose information is performed at least in part by a second device, and includes providing (e.g., transmitting) data corresponding to the obtained user pose information from the second device to the first device.
In accordance with some embodiments, the one or more computer programs include instructions for rendering, by a renderer of the second device, the downmixed signal into output binaural audio based at least in part on the metadata, the obtained user pose information, and updated user pose information. In accordance with some embodiments, the downmix signal is a binaural signal generated using: a set of HRTFs or a set of BRIRs; and the obtained user pose information. In accordance with some embodiments, the one or more computer programs include instructions for determining a set of predicted poses, including calculating N poses corresponding to N predicted yaw angles by: modifying a pose yaw angle derived from the obtained user pose information by a first pre-determined value (e.g., an angle specified in degrees or radians) in a first direction to obtain a first predicted yaw angle of the N predicted yaw angles. In accordance with some embodiments, the one or more computer programs include instructions for modifying the pose yaw angle derived from the obtained user pose information by a second pre-determined value in a second direction (e.g., anti-clockwise, clockwise) to obtain a second predicted yaw angle of the N predicted yaw angles.
The embodiments described herein may be generally described as techniques, where the term “technique” may refer to system(s), device(s), method(s), computer-readable instruction(s), module(s), component(s), hardware logic, and/or operation(s) as suggested by the context as applied herein.
Features and technical benefits other than those explicitly described above will be apparent from a reading of the following Detailed Description and a review of the associated drawings. This Summary is provided to introduce a selection of techniques in a simplified form, and is not intended to identify key or essential features of the claimed subject matter, which are defined by the appended claims.
The present invention will be described in more detail with reference to the appended drawings.
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof, and in which are shown, by way of illustration, specific example configurations in which the concepts can be practiced. These configurations are described in sufficient detail to enable those skilled in the art to practice the techniques disclosed herein, and it is to be understood that other configurations can be utilized, and other changes may be made, without departing from the spirit or scope of the presented concepts. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the presented concepts is defined only by the appended claims.
Embodiments of the invention disclosed herein assume compatibility and consistency with usage of an immersive audio codec such as IVAS in an XR application. In particular, the inventive concepts described in detail below are applicable to systems, devices, architectures, methods, and techniques where main decoding and pre-rendering are performed by a main device (UE) with high resources, such as powerful computational processing (or processor) resources with significant power or battery capabilities (e.g., an edge or other network node/server of a 5G system, a high performance mobile device, etc.), and final decoding and post-rendering are performed by a different device with lower resources relative to the main device (e.g., a lightweight device, a wearable device, AR glasses, head-mounted display, heads-up-display, etc.).
Embodiments of the proposed techniques, systems, devices, methods, and computer-readable instructions for low complexity low bitrate prediction-based split rendering, which may include operations such as:
Note, one or more aspects of the proposed techniques, systems, devices, methods and computer-readable instructions described herein, including those listed above, do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the descriptions herein, and other components may be added to, or removed from, the described systems. Accordingly, other embodiments are within the scope of the claims following the main description.
The first device 10 (or main processing device, or pre-renderer) includes a decoder 11, a downmixer 12, a head pose decoder 13, a binaural renderer 14, a metadata generator 15, a first encoder 16, a second encoder 17, and a multiplexer 18. The decoder 11, e.g., an IVAS decoder, is configured to receive and decode a bitstream b1 to obtain an immersive audio content A. The downmixer 12 is configured to receive the immersive audio content and provide a downmix representation, Dmx, of the audio content. The head pose decoder 13 is configured to receive and decode a bitstream bp, which includes head pose information, and generate a first head pose P′. The binaural renderer 14 is configured to receive the first head pose P′ and the immersive audio content A and responsively render one or several binaural representations corresponding to the first head pose P′. The metadata generator 15 is configured to receive the downmix Dmx and binaural representations, and responsively generate reconstruction metadata M allowing reconstruction of the binaural representations from the downmix. The metadata M includes the first pose P′. The first encoder 16 is configured to receive downmix Dmx, and responsively encode the downmix Dmx as encoded bitstream b11. The second encoder 17 is configured to receive reconstruction metadata M (including the first pose P′), and responsively encode the reconstruction metadata as encoded bitstream b12. The multiplexer 18 is configured to receive the encoded bitstreams b11 and b12 from the outputs of the two encoders, and responsively combine the encoded bits into a bitstream b2. The main device may also include an interface to output the bitstream b2, whereby the bitstream may be subsequently transmitted or otherwise made available to another device that is external to the main device 10.
The second device 20 (lightweight processing device or post-renderer device) includes a demultiplexer 21, a first decoder 22, a second decoder 23, a head tracker 24, a pose information encoder 25, and a binaural reconstruction block 26. The second or lightweight processing device 20 may be a user-held device. The demultiplexer 21 is configured to receive bitstream b2 and responsively separate the received bitstream b2 into two encoded bitstreams b21 and b22. The decoder 22 is configured to receive encoded bitstream b21, and responsively decode bitstream b21 into a downmix signal Dmx′. The decoder 23 is configured to receive encoded bitstream b22, and responsively decode bitstream b22 into metadata M′, including the first pose P′. The head tracker 24 is configured to sense a user head position, and responsively generate pose information, e.g., including a current (actual) user head pose P. The pose information encoder 25 is configured to receive the pose information from the head tracker 24, and responsively encode the pose information in a bitstream bp. The binaural reconstruction block 26 is configured to receive the current user head pose P, the downmix signal Dmx′, and metadata M′, including the first pose P′, and responsively determine a binaural output based on the downmix Dmx′, the metadata M′, and the current head pose P in relation to the first head pose P′.
In
In some example implementations, the pose information received by light weight/post-renderer device 20 (depicted as right side block of
In some example implementations, the pose information received by light weight/post-renderer device 20 (depicted as right-side block of
In some example implementations, a light weight/post-renderer device 20 (depicted as right-side block of
As depicted in
Renderer 14 generates one or more binaural representations BINn from audio content A, the one or more binaural representations corresponding to one or more poses Pn that are estimated from pose P′, where 1≤n≤N and N≥1. The one or more poses (set of second head poses) may be determined based on a set of predefined offsets with respect to the first head pose P′. A metadata generator (e.g., generator 15) generates metadata M based on the Dmx signal and binaural signals BINn such that any of the BINn binaural signals can be reconstructed using the Dmx signal and metadata M. The downmix representation Dmx is coded by encoder 16, which generates a bitstream b11. Metadata M is quantized and coded (e.g., by encoder 17), generating a bitstream b12. Bitstreams b11 and b12 are combined into bitstream b2 by multiplexer 18.
In some embodiments, the downmix representation includes two signals. In this case, the metadata should allow a reconstruction from two signals (the downmix) to two signals (the binaural output). A two-by-two matrix is an efficient way to enable such a reconstruction. In some embodiments, the metadata M includes a two-by-two matrix for each time unit and for each frequency band, i.e., for each time-frequency tile.
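As a rough illustration of such a two-by-two reconstruction, the following Python sketch applies one complex matrix per time-frequency tile to a two-channel downmix. The tile layout (a dictionary keyed by time slot and band) and the names `apply_tile_matrices`, `Matrix2x2` and `Tile` are assumptions made for illustration only and are not taken from the disclosure.

```python
from typing import Dict, Tuple

# Hypothetical layout: one 2x2 complex matrix per (time_slot, band) tile.
Matrix2x2 = Tuple[Tuple[complex, complex], Tuple[complex, complex]]
Tile = Tuple[int, int]


def apply_tile_matrices(
    dmx: Dict[Tile, Tuple[complex, complex]],
    metadata: Dict[Tile, Matrix2x2],
) -> Dict[Tile, Tuple[complex, complex]]:
    """Map a two-channel downmix to a two-channel output, tile by tile."""
    out = {}
    for tile, (y_l, y_r) in dmx.items():
        (m00, m01), (m10, m11) = metadata[tile]
        # Each output channel is a linear mix of both downmix channels.
        out[tile] = (m00 * y_l + m01 * y_r, m10 * y_l + m11 * y_r)
    return out


# Usage: a single tile whose matrix simply swaps the two channels.
dmx = {(0, 0): (1 + 0j, 2 + 0j)}
metadata = {(0, 0): ((0j, 1 + 0j), (1 + 0j, 0j))}
result = apply_tile_matrices(dmx, metadata)
```

With one matrix per tile, the amount of metadata grows with the number of bands and time units, which is one reason the metadata is quantized and coded before transmission.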
As depicted in
In order to allow binaural reconstruction, the lightweight device obtains information about the set of second head poses to which the reconstruction metadata relates. In embodiments where the set of second head poses Pn is determined by applying a set of offsets to the first head pose, then these offsets may be known beforehand (and e.g., be applied by the reconstruction block 26). Alternatively, these offsets may be included in the metadata M received in the bitstream b2.
The reconstruction may involve first computing modified reconstruction metadata from the current head pose P and metadata M′ (e.g. by interpolation), and then applying this modified metadata to the downmix signal Dmx′.
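The first of these two stages can be sketched as follows, assuming plain linear interpolation over yaw. The function name `interpolate_metadata`, the flat list of parameters per pose, and the use of degrees are illustrative assumptions; yaw wrap-around is deliberately ignored to keep the sketch short.

```python
from typing import List, Sequence, Tuple


def interpolate_metadata(
    current_yaw: float,
    anchors: Sequence[Tuple[float, List[float]]],
) -> List[float]:
    """Linearly interpolate (or extrapolate) metadata parameters over yaw.

    anchors holds (yaw_in_degrees, parameter_list) pairs, for example the
    parameters associated with a first head pose and with offset poses.
    """
    pts = sorted(anchors, key=lambda a: a[0])
    if len(pts) < 2:
        return list(pts[0][1])
    # Find the segment containing current_yaw; the outermost segments are
    # reused for extrapolation when current_yaw lies outside all anchors.
    lo, hi = pts[0], pts[1]
    for left, right in zip(pts, pts[1:]):
        lo, hi = left, right
        if current_yaw <= right[0]:
            break
    w = (current_yaw - lo[0]) / (hi[0] - lo[0])
    return [(1.0 - w) * a + w * b for a, b in zip(lo[1], hi[1])]


# Usage: three anchor poses at -10, 0 and +10 degrees, one parameter each.
anchors = [(-10.0, [0.0]), (0.0, [1.0]), (10.0, [4.0])]
halfway = interpolate_metadata(5.0, anchors)
beyond = interpolate_metadata(-15.0, anchors)
```

The second stage then applies the interpolated parameters to the decoded downmix in the same way as unmodified metadata would be applied.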
In an example implementation with N=2, the downmixer 12 is a binaural renderer that generates the Dmx signal as a first (reference) binaural signal BINref using a set of HRTFs (or BRIRs) and the first head pose P′. Poses Pn are P′+X, P′−X′ where X and X′ are the assumed deviations in yaw angle between P′ and P. Renderer 14 generates two binaural outputs BINn corresponding to P′+X and P′−X′ poses. The reference binaural signal BINref and binaural signals BINn corresponding to poses Pn are then fed into metadata generator block 15 that generates metadata M corresponding to P′+X and P′−X′ poses. The metadata M is quantized and coded by MD quant and coding block 17. The BINref signal is coded by encoder 16. The multiplexed bitstream b2 is sent to the post-renderer device 20, which decodes the BINref signal and the metadata M and feeds them to the binaural reconstruction block 26. Reconstruction block 26 interpolates or extrapolates the metadata based on the difference between P′, P′+X and P′−X′ and the current head pose P. The interpolation may be linear or triangular or based on sine or cosine-based models, etc. Reconstruction block 26 applies interpolated metadata to BINref as proposed in U.S. Provisional Application 63/340,181 (hereby incorporated by reference) and generates the head-tracked binaural signal BINout. In an example implementation, the usage of decorrelators is avoided by directly using decorrelator coefficients with the sum of Left and Right channels of BINref as mentioned below:
Here, zl,p [n] and zr,p [n] are the nth samples of Left and Right channels of the reconstructed BIN signal as per current head pose P. Mp is the (two-by-two) prediction coefficients mixing matrix, yl,p
In some embodiments, downmixer 12 generates a combination of a mono channel (prototype signal) and zero or more diffused channels (diffused signal(s)) as Dmx signals. The mono signal, S, may be formed as a combination of channels of a multichannel representation of the immersive audio content A, e.g. combination of the signals of a first binaural representation. The diffused signal, D, may be formed as a combination of diffused components of the same multichannel representation of the immersive audio content A.
In some embodiments, such operations may be applied in time, CQMF, subband or frequency domain and all coefficients subject to or resulting from such operations may be complex. In some embodiments, the prototype signal is generated from the BINref signal as S=aL+bR, and the diffused signal is generated as D=cL+dR, wherein L and R are the left and right channels of the BINref signal, and a and b are gain parameters that are either dynamically computed or statically determined, e.g., a=0.5, b=0.5. The parameters c and d are dynamically computed using the covariance of the L and R channels of the BINref signal. S is the prototype signal and D is the diffused signal. In an embodiment, a, b, c and d are computed as follows:
Let the BINref covariance be
where is a unit vector and q is the absolute value of covariance of L and R channels. Assuming a mid-side conversion from L, R as:
covariance of MS channels can be easily computed from covariance of L and R channels as:
where û is a unit vector of length 1 and α is the absolute value of covariance of M and S channels.
It can be shown that an optimal solution to obtain prototype signal and diffused signal leads to the value of a, b, c and d as follows:
a=norm*(1+ûf)
b=norm*(1−ûf)
c=norm*(1−gû−gf)
d=norm*(gf−gû−1)
wherein
f=α/max(m,s)
g=(α+sf)/(sf²+2αf+m)
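The closed-form expressions above can be evaluated directly, as in the following Python sketch. Several inputs are assumptions, because their definitions are not fully recoverable from the text: m and s are taken to be the variances of the mid and side channels, û is taken to be a unit-magnitude complex value, and norm is passed in as a given normalization factor. The function name `downmix_gains` is hypothetical.

```python
from typing import Tuple


def downmix_gains(
    m: float, s: float, alpha: float, u_hat: complex, norm: float
) -> Tuple[complex, complex, complex, complex]:
    """Evaluate a, b, c, d from the expressions given in the text.

    m, s  : assumed to be the variances of the mid and side channels
    alpha : absolute value of the covariance of the M and S channels
    u_hat : unit-magnitude term (assumed here to be a complex value)
    norm  : normalization factor, treated as a given input
    """
    if max(m, s) <= 0.0:
        raise ValueError("m and s must not both be zero")
    f = alpha / max(m, s)
    g = (alpha + s * f) / (s * f * f + 2.0 * alpha * f + m)
    a = norm * (1 + u_hat * f)
    b = norm * (1 - u_hat * f)
    c = norm * (1 - g * u_hat - g * f)
    d = norm * (g * f - g * u_hat - 1)
    return a, b, c, d


# Usage with arbitrary illustrative values.
a, b, c, d = downmix_gains(m=1.0, s=0.5, alpha=0.25, u_hat=1 + 0j, norm=0.5)
```

When the M and S channels are uncorrelated (α=0), f becomes zero and the sketch gives a=b=norm, which matches the static choice of equal gains for the prototype signal.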
Renderer 14 generates two binaural outputs BINn corresponding to P′+X and P′−X′ poses. The prototype signal S, the diffused signal D and the BINn signals are then fed into metadata generator block 15 that generates metadata M corresponding to the P′+X and P′−X′ poses. If Lx and Rx are the left and right signals corresponding to P′+X, then metadata corresponding to the P′+X signals can be computed as follows:
From this metadata and downmix signals S and D, P′+X channels can be reconstructed by reconstruction block 26 as follows:
Lx=S*PredL+DiffL*D
Rx=S*PredR+DiffR*D
Similarly, metadata for P′−X′ can be computed, and P′−X′ binaural signals can be reconstructed from metadata and prototype signal S and diffused signal D.
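The reconstruction from a prototype signal and a diffused signal can be sketched in Python as below. The function name `reconstruct_pose` and the per-sample list representation are illustrative assumptions; in practice the same mixing would typically be applied per band in a filterbank domain with complex parameters.

```python
from typing import List, Sequence, Tuple


def reconstruct_pose(
    s: Sequence[complex],
    d: Sequence[complex],
    pred_l: complex,
    pred_r: complex,
    diff_l: complex,
    diff_r: complex,
) -> Tuple[List[complex], List[complex]]:
    """Rebuild left/right signals for one pose from S, D and its metadata."""
    l_x = [pred_l * sv + diff_l * dv for sv, dv in zip(s, d)]
    r_x = [pred_r * sv + diff_r * dv for sv, dv in zip(s, d)]
    return l_x, r_x


# Usage with two samples and arbitrary illustrative parameter values.
l_x, r_x = reconstruct_pose(
    s=[1.0, 2.0], d=[0.5, -0.5],
    pred_l=0.8, pred_r=0.6, diff_l=0.2, diff_r=-0.2,
)
```

Only four parameters per pose are needed in this form, since both output channels share the same two transmitted signals.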
In some implementations, it may be desired to code only one channel due to bitrate limitation. In that case, only the prototype signal is coded and metadata is generated as follows:
From this metadata and prototype signal, P′+X channels can be reconstructed by the reconstruction block 26 as follows:
Lx=S*PredL+DiffL*Decorr(S)
Rx=S*PredR+DiffR*Decorr(S)
wherein Decorr(S) is the decorrelated version of prototype signal S. Similarly, metadata for P′−X′ can be computed, and P′−X′ binaural signals can be reconstructed from metadata and prototype signal.
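A minimal Python sketch of the single-channel case follows. The decorrelator used here is a plain delay with zero padding, which is only a stand-in chosen for brevity; the text does not specify how Decorr(S) is implemented. The names `delay_decorrelator` and `reconstruct_from_prototype` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def delay_decorrelator(s: Sequence[float], delay: int = 3) -> List[float]:
    """Stand-in decorrelator: delays the signal by a few samples (assumption)."""
    delay = min(delay, len(s))
    return [0.0] * delay + list(s[: len(s) - delay])


def reconstruct_from_prototype(
    s: Sequence[float],
    pred_l: float,
    pred_r: float,
    diff_l: float,
    diff_r: float,
    decorr: Callable[[Sequence[float]], List[float]] = delay_decorrelator,
) -> Tuple[List[float], List[float]]:
    """Rebuild left/right signals from the prototype signal only."""
    ds = decorr(s)  # decorrelated copy replaces the untransmitted signal D
    l_x = [pred_l * sv + diff_l * dv for sv, dv in zip(s, ds)]
    r_x = [pred_r * sv + diff_r * dv for sv, dv in zip(s, ds)]
    return l_x, r_x


# Usage with four samples and arbitrary illustrative parameter values.
l_x, r_x = reconstruct_from_prototype(
    [1.0, 2.0, 3.0, 4.0], pred_l=1.0, pred_r=1.0, diff_l=0.5, diff_r=-0.5
)
```

Compared with the two-signal case, this trades one coded channel for a decorrelation step at the lightweight device.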
In some embodiments, the first head pose P′ may be transmitted to the lightweight processing device 20 for better synchronization of pose (e.g., as metadata). In case the current head pose P differs from P′, P′+X and P′−X′, reconstruction block 26 interpolates or extrapolates the metadata based on the difference between P′, P′+X and P′−X′ and the current head pose P. The interpolation may be, for example, linear or triangular or based on sine or cosine-based models, etc. Reconstruction block 26 applies interpolated metadata to BINref as proposed above and generates the head-tracked binaural signal BINout.
In some embodiments, X is equal to X′ and poses Pn are P′+X, P′−X, wherein X is the assumed deviation in yaw angle between P′ and P. In other example implementations, X is not equal to X′ and X′ may be smaller or greater than X based on, for example, angular velocity and acceleration or deceleration of the user's head rotation.
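The selection of the two predicted yaw angles can be sketched in Python as follows. The function names `wrap_degrees` and `predicted_yaw_poses`, the use of degrees, and the wrap to the range [-180, 180) are assumptions for illustration; how X and X′ are chosen from head-motion data is left open here.

```python
from typing import Optional, Tuple


def wrap_degrees(angle: float) -> float:
    """Wrap an angle in degrees to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def predicted_yaw_poses(
    yaw_ref: float, x: float, x_prime: Optional[float] = None
) -> Tuple[float, float]:
    """Return the yaw angles of P'+X and P'-X' for a reference yaw P'.

    If x_prime is omitted, the symmetric case X' = X is used.
    """
    if x_prime is None:
        x_prime = x
    return wrap_degrees(yaw_ref + x), wrap_degrees(yaw_ref - x_prime)


# Usage: a symmetric case near the wrap point and an asymmetric case.
symmetric = predicted_yaw_poses(170.0, 15.0)
asymmetric = predicted_yaw_poses(0.0, 10.0, 5.0)
```

An asymmetric pair could, for example, place the larger offset in the direction in which the head is currently turning.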
At step S11 (receive & decode bitstream, or receiving and decoding a first bitstream), a first bitstream is received and decoded (e.g., by a decoder 11) to obtain decoded immersive audio content A. At step S12 (receive & decode pose information, or receiving and decoding pose information), a second bitstream may be received and decoded (e.g., by a decoder 13) to obtain pose information associated with a user of a lightweight processing device (e.g., 20). At step S13 (determine P′, or determining P′), a first head pose, P′, may be determined (e.g., by head pose decoder 13) based on the pose information. At step S14 (downmix audio, or downmixing audio), a first downmix of the immersive audio content A may be determined (e.g., by a downmixer 12), where the first downmix is a representation of the immersive audio content corresponding to the first head pose. At step S15 (render BINn, or rendering BINn), a set of binaural representations of the immersive audio content is rendered (e.g., by renderer 14), where the set of binaural representations corresponds to a second set of poses. At step S16 (generate M, or generating M), reconstruction metadata is generated (or computed, e.g., by generator 15), where the reconstruction metadata enables reconstruction of the set of binaural representations from the first downmix representation. At step S17 (encode or encoding), the downmix representation is encoded (e.g., by encoder 16) and the reconstruction metadata, including the first head pose P′, is encoded (e.g., by encoder 17). At step S18 (output or outputting), a bitstream b2 is output that includes the first downmix representation Dmx and the reconstruction metadata M. The output step may include transmitting the bitstream b2 to the lightweight processing device from which the pose information was received (e.g., lightweight processing device 20).
The process includes, at step S21 (receive and decode bitstream), receiving and decoding a bitstream b2 from a main device 10 (e.g., by decoders 22, 23) to obtain a downmix representation Dmx of an immersive audio content A, a first head pose, P′, and first reconstruction metadata M′ enabling reconstruction of a set of binaural representations BINn from the downmix representation Dmx. Step S21 may optionally be preceded by a demultiplexing step, to divide (e.g., by demultiplexer 21) the bitstream into two or more bitstreams b21, b22. Step S22 (detect current head pose) involves detecting (e.g., by head tracker 24) a current head pose P of a user of the lightweight processing device 20. Step S23 (transmit head pose) involves transmitting (e.g., by pose information encoder 25) the current head pose P to the main device 10. Finally, step S25 (compute binaural audio) involves computing (e.g., by reconstruction block 26) output binaural audio BINout based on the downmix representation Dmx, the first reconstruction metadata M′, and a relationship between the first head pose P′ and the current head pose P. Optionally, step S25 is preceded by a step S24 (compute second reconstruction metadata) involving computing second reconstruction metadata based on the first reconstruction metadata, the first head pose and the current head pose. In this case, step S25 may use this second reconstruction metadata to obtain the binaural output.
Elements in
As depicted in
According to an example of a mathematical model 19, 27 for generating estimates of the predictive metadata, the predictive metadata parameters for a pose P′+X are obtained by applying delay and gain/shape operations, which correspond to multiplication with complex prediction parameters in the complex QMF domain. The input parameters of that model are direction of arrival (DOA) parameters of the dominant sound source in the given QMF band, the azimuth and elevation angles of the poses and possibly respective HRTF (or HRIR or BRIR) coefficients or at least related coefficients. It is notable that the parameters Mmod may be coded efficiently by indexing them in a codebook of HRTFs (or related codebook entries).
A further example implementation of low-complexity low bitrate prediction-based split rendering in accordance with some embodiments may rely on a mathematical model of how the metadata parameters evolve when the current head pose P differs by some amount Δ from P′. A more advanced technique compared to the above-mentioned linear or triangular interpolation may rely on certain mathematical properties of the parameter evolution. One such property is symmetry. In the related discussion to follow, it is assumed that there is a dominant sound source in a given frequency or QMF band and that the DOA of that source is known. In that case, it is possible to designate the azimuthal angle of that DOA as zero or 180 degrees, meaning that the x-axis of the assumed Cartesian coordinate system coincides with the DOA.
For instance, assuming that the HRIRs/BRIRs are left/right symmetrical and that pose P′ is aligned with the x-axis (i.e., the azimuth angle is 0 or 180 degrees), the metadata parameters applicable to left and right output channels for an azimuthal pose deviation X are identical to the applicable metadata parameters for swapped output channels (right, left) for a corresponding azimuthal pose deviation of −X.
Moreover, under the given assumptions, the parameters or intermediate parameters from which the metadata parameters are derived may exhibit an odd symmetry relative to the parameters for pose P′, i.e., M(P′+Δ)=−M(P′−Δ) (whereby a possible constant offset is not considered). This symmetry may be exploited if, for instance, the pre-rendering is done assuming an adjusted pose P′, which is aligned with the x-axis. The symmetry property will then allow limiting the pre-rendering to the poses P′ and P′+X while skipping pre-rendering for P′−X′. This will save the complexity for one rendering operation at the pre-renderer 110 and avoid transmission of metadata parameters for pose P′−X′.
Another case is when (adjusted) pose P′ coincides with the y-axis, i.e., pose P′ is such that the DOA of the dominant sound source is the left or the right direction. Changing the current head pose by a small amount Δ now means that the sound will virtually arrive from either slightly front or back but still essentially from the left or the right. A good approximation of this case is that the metadata parameters (or intermediate parameters) now exhibit an even symmetry, i.e., M(P′+X)=M(P′−X).
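The three symmetry rules described above can be sketched as small helper functions that derive the metadata for a deviation of −X from values that were actually pre-rendered. The helper names and the dictionary layout are hypothetical. The odd-symmetry helper reads "odd symmetry relative to the parameters for pose P′" as symmetry about the value M(P′), i.e., it assumes that the constant offset equals M(P′); this is an interpretation rather than a statement of the disclosure.

```python
def mirror_by_channel_swap(meta_plus):
    """Metadata for deviation -X obtained from the metadata for +X by swapping
    the output channels. Assumes left/right symmetric HRIRs/BRIRs and a pose P'
    aligned with the x-axis."""
    return {"left": meta_plus["right"], "right": meta_plus["left"]}


def mirror_odd(m_ref, m_plus):
    """Odd symmetry about the value at P' (assumed offset M(P')):
    M(P'-X) - M(P') = -(M(P'+X) - M(P'))."""
    return 2 * m_ref - m_plus


def mirror_even(m_plus):
    """Even symmetry (P' aligned with the y-axis): M(P'-X) = M(P'+X)."""
    return m_plus
```

In each case only the values for P′ and P′+X need to be rendered and transmitted, which is the saving the symmetry argument is after.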
The symmetry properties may further be exploited when modeling the metadata (or intermediate) parameter evolution as a function of the pose deviation Δ. For instance, this function can be represented as a Taylor series of type:
$$M(P'+\Delta)=\sum_{i=0}^{\infty} a_i\,\Delta^{i},\qquad a_i=\frac{M^{(i)}(P')}{i!},$$
where M(i)(P′) denotes the i-th derivative evaluated at point P′.
Further considering the symmetry properties in the Taylor series approach, it may be useful to force the pose P′ to coincide with the x-axis or the y-axis, i.e., P′ is replaced by an adjusted x- or y-axis aligned pose. In the first case, the even terms (except for i=0) disappear (coefficients a2j=0 for any positive integer j). Thus, the modeling with a linear (first-order) term becomes very accurate and, in many cases, higher order terms do not need to be considered. Likewise, if P′ coincides with the y-axis, the odd terms disappear due to the even symmetry (coefficients a2j−1=0 for any positive integer j). Thus, the modeling with a single second order term becomes very accurate and efficient.
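As a minimal sketch of the truncated series, the functions below evaluate a polynomial in the pose deviation with coefficients a_i and show the two reduced models that result from the symmetry arguments: a constant plus a linear term for an x-axis aligned pose, and a constant plus a second-order term for a y-axis aligned pose. The function names and the example coefficient values used for checking are hypothetical.

```python
def taylor_eval(coeffs, delta):
    """Evaluate sum_i coeffs[i] * delta**i using Horner's scheme."""
    result = 0.0
    for a in reversed(coeffs):
        result = result * delta + a
    return result


def x_axis_model(a0, a1, delta):
    """P' aligned with the x-axis: even terms other than i=0 are dropped,
    leaving a first-order model (odd about a0)."""
    return taylor_eval([a0, a1], delta)


def y_axis_model(a0, a2, delta):
    """P' aligned with the y-axis: odd terms are dropped, leaving a model
    with a single second-order term (even in delta)."""
    return taylor_eval([a0, 0.0, a2], delta)
```

With such a model, only the few retained coefficients need to be conveyed instead of separate metadata for each deviation.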
In summary, the described examples making use of symmetry properties may reduce the need to pre-render at 3 poses P′, P′+X and P′−X′ or at least reduce the amount of metadata to be transmitted. Effectively, rather than transmitting the metadata for P′+X and P′−X, it may be more efficient to transmit Taylor series coefficients and DOA angles to indicate the direction of the dominant sound.
Another mathematical property of the metadata parameters (or intermediate parameters) is 360° periodicity with respect to the azimuth angle:
M(P)=M(P+360°).
The interaural time differences for a rendered plane wave signal incident from a given azimuth angle α can be modeled by a sinusoidal expression as follows:
with de: interaural distance and c: speed of sound.
The interaural level differences can also be approximately modelled with a similar expression.
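For illustration only, one common sinusoidal form that is consistent with the symbols named above is ITD(α) ≈ (de/c)·sin α. The sketch below uses that form as an assumption; it is not necessarily the exact expression intended in the disclosure. The default interaural distance of 0.18 m and the speed of sound of 343 m/s are likewise assumed example values.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees C


def itd_seconds(azimuth_deg, interaural_distance_m=0.18, c=SPEED_OF_SOUND_M_S):
    """Assumed sinusoidal ITD model: (d_e / c) * sin(azimuth)."""
    return (interaural_distance_m / c) * math.sin(math.radians(azimuth_deg))
```

Under this assumed model the ITD is zero for a frontal source, reaches its largest magnitude de/c for a source directly to one side, and is 360° periodic in the azimuth angle.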
Thus, a possible approximation of the metadata (or intermediate) parameters involves applying a corresponding sinusoidal formulation. In a more general sense, these parameters can efficiently be represented by a few low-order harmonics of a discrete Fourier series:
where, e.g., K=2.
In this expression the 0th order term represents a constant (offset), while the first- and second- (and higher-) order sinusoids model the specific periodic metadata parameter evolution. The coefficients ck are generally complex valued and may for instance depend on the first head pose P′ and the DOA of a dominant sound direction as well as on other parameters such as the interaural distance of the assumed listener head. According to the model-based approach outlined above, the coefficients are determined at the pre-renderer 10, applied to generate approximate metadata parameters Mmod, quantized, coded and then transmitted to the post-renderer that decodes and applies them in its model.
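A low-order harmonic model of this kind can be sketched as follows. The sketch assumes a two-sided series with harmonic indices k from −K to K and complex coefficients c_k, evaluated as the sum of c_k·exp(j·k·α) over k; the index range, the function name and the coefficient values used for checking are assumptions made for illustration.

```python
import cmath
import math


def fourier_model(coeffs, azimuth_deg):
    """Evaluate sum_k c_k * exp(j * k * alpha).

    coeffs maps an integer harmonic index k (e.g., -2..2 for K=2) to a complex
    coefficient c_k. The result is real-valued when c_{-k} is the complex
    conjugate of c_k.
    """
    alpha = math.radians(azimuth_deg)
    return sum(c * cmath.exp(1j * k * alpha) for k, c in coeffs.items())
```

Because every term uses an integer multiple of the azimuth angle, the model is 360° periodic by construction, and with K=2 only five complex coefficients (or three, if conjugate symmetry is exploited) describe the parameter evolution over all azimuth angles.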
A further embodiment is to rely only on the model. In that case, the main/pre-renderer device 110 may only transmit model parameters and the first head pose to the post-renderer device 20, thereby significantly reducing the amount of transmitted metadata. Coded metadata parameters or residual metadata parameters are not transmitted in that case. The renderer 14 and generator 15 may still be used to generate metadata for poses Pn. However, in that case, the generated metadata may merely be used to optimize the accuracy of the model parameters. It is also possible to set N to zero meaning that the renderer 14 will not be used at all. In that case, the model parameters are solely calculated from the received immersive audio content A and associated metadata parameters such as DOA angles that may be part of the received immersive audio signal representation.
It is notable that in the above examples, the letter M may generally represent a metadata parameter of the above defined mixer matrices such as, e.g., prediction gains (PredL or PredR) or diffuseness gains (DiffL or DiffR). M may also represent intermediate parameters occurring in the calculation of the metadata parameters such as, e.g., covariances as used in the above embodiments.
Methods of quantizing metadata for the prediction-based split rendering technique are described below.
In some example implementations, a main device/pre-renderer 10 receives immersive audio signal/content A (e.g., output of an immersive decoder such as IVAS, a QMF signal, etc.). The audio content A is converted into a downmix signal Dmx (e.g., by downmixer 12) using P′. Main device 10 may receive the pose P′ from lightweight device 20, or it may assume P′ to be a certain pose value without any indication from lightweight device 20. In some embodiments, Dmx has one channel. In some embodiments, Dmx has more than one channel (e.g., two channels).
Renderer 14 generates one or more binaural representations BINn from A, the one or more binaural representations corresponding to one or more poses Pn (a set of second poses) that are estimated from pose P′, where 1≤n≤N and N≥1. A metadata generator (e.g., generator 15) generates metadata M based on the Dmx signal and the binaural signals BINn such that any of the BINn binaural signals can be reconstructed using the Dmx signal and metadata M. The Dmx signal is coded by an encoder 16, which generates a bitstream b11. Metadata M, including pose P′, is quantized and coded (e.g., by encoder 17), generating a bitstream b12. Bitstreams b11 and b12 are combined into bitstream b2 by multiplexer 18.
At the lightweight device/post-renderer 20, b2 is received and separated into bitstreams b21 and b22 by demultiplexer 21. Bitstream b21 is fed to a first decoder 22, which decodes the Dmx signal and generates a reconstructed downmix representation Dmx′. Bitstream b22 is fed to an MD decoding and dequantizing (unquant) block (e.g., second decoder 23), which decodes the metadata M, including pose P′, and generates reconstructed metadata M′. Dmx′ and M′ are then fed to a binaural reconstruction block 26, which generates head-tracked binaural output using Dmx′, metadata M′ and the current head pose P.
Given that the poses Pn are known to the metadata quantizer (encoder 17), it can make a few assumptions to quantize the metadata corresponding to these poses more efficiently. The metadata comprises a rotation matrix such that the binaural signal corresponding to poses Pn can be reconstructed from the Dmx signal. An example metadata representation, for a case where the Dmx signal is a binaural signal (BINref signal) that is generated by applying a set of HRTFs (or BRIRs) and pose P′ to audio signal A1, is given below:
Here, zl,p [n] and zr,p [n] are the nth samples of Left and Right channels of the reconstructed BIN signal as per pose Pn. Mp is the (two-by-two) prediction coefficients mixing matrix, yl,p
Techniques to efficiently quantize and code metadata Mp and gp,p are given below.
Depending on the combination of rotation angles in pose Pn and direction of arrival angles in the reference binaural signal, a rotation matrix Mr can be generated, which can then be used as the origin for quantizing the Mp matrix such that the distribution of quantization points is the same on either side of the origin. This allows for fine quantization around the rotation matrix Mr, limits the minimum and maximum values that need to be coded, and limits the number of quantization points. In an example implementation, if one or more of the poses Pn are close to the first head pose P′, then an identity matrix can be assumed as the origin of quantization.
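Quantizing around an origin can be sketched for a single real-valued matrix element as follows. The uniform step size, the clamping range and the function names are assumptions chosen for illustration; the disclosure does not specify a particular quantizer.

```python
def quantize_about_origin(value, origin, step, max_index):
    """Uniform quantizer with the same number of points on either side of origin.

    Returns an integer index in [-max_index, max_index]. Python's round()
    resolves exact .5 ties to the nearest even integer.
    """
    index = round((value - origin) / step)
    return max(-max_index, min(max_index, index))


def dequantize_about_origin(index, origin, step):
    """Reconstruct the value represented by a quantizer index."""
    return origin + index * step
```

If the origin is a good guess of the matrix element (for example, the corresponding element of Mr or of an identity matrix), most indices are small, so a small max_index and few quantization points suffice.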
Furthermore, if the azimuth angle (θ) and elevation angle (ϕ) of a source in the reference BINref signal are known, then the following rotation matrix can be used as the origin of quantization for azimuth angle (θ+Øn) if poses Pn differ from the reference pose P′ only by an angle Øn along the yaw axis:
Here, example values of x, x′, y, y′ are as follows: x=x′=f*(1+sin θ cos Øn+cos θ sin Øn), y=y′=f*(1−sin θ cos Øn−cos θ sin Øn), where f is a constant (e.g., 0.5). If the elevation angle (ϕ) of a source in the reference BINref signal is 90 degrees, then Mr can be assumed to be an identity matrix. In some implementations, for certain values of X, Mr may be approximated without prior knowledge of the direction of arrival angles of the source. Example values of x, x′, y, y′ are then as follows: x=y′=cos(Øn/2), x′=−sin(Øn/2), y=sin(Øn/2). It is to be noted that if Øn=0, then the matrix automatically becomes the identity matrix.
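The example values can be computed as in the sketch below. By the angle-addition identity, sin θ cos Øn + cos θ sin Øn = sin(θ+Øn), so the DOA-based values depend only on the rotated azimuth θ+Øn. The placement of x, x′, y, y′ within Mr is not reproduced here, so the functions simply return the four values as a tuple; the function names are hypothetical.

```python
import math


def origin_coeffs_from_doa(theta_deg, yaw_offset_deg, f=0.5):
    """DOA-based example values:
    x = x' = f * (1 + sin(theta + phi_n)), y = y' = f * (1 - sin(theta + phi_n)).
    Returns (x, x', y, y')."""
    s = math.sin(math.radians(theta_deg + yaw_offset_deg))
    x = f * (1.0 + s)
    y = f * (1.0 - s)
    return x, x, y, y


def origin_coeffs_without_doa(yaw_offset_deg):
    """Example values without DOA knowledge:
    x = y' = cos(phi_n / 2), x' = -sin(phi_n / 2), y = sin(phi_n / 2).
    Returns (x, x', y, y')."""
    half = math.radians(yaw_offset_deg) / 2.0
    return math.cos(half), -math.sin(half), math.sin(half), math.cos(half)
```

For a yaw offset of zero, the second function returns x=y′=1 and x′=y=0, matching the remark that the matrix then becomes the identity matrix.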
Typically, poses Pn are symmetrically placed around the first head pose P′. In an example implementation, N=2, and P′+X and P′−X are the poses for which metadata Mp, as given in eq. (1), is generated. Here, X can be a tuning parameter set based on the expected motion-to-sound delay of the system. Alternatively, X can be a constant (e.g., 15 degrees along the yaw axis, 0 degrees along the pitch axis and 0 degrees along the roll axis). If the metadata corresponding to P′+X is computed, then intermediate metadata corresponding to P′−X can be extrapolated using the first head pose and the P′+X pose, and can then be used to efficiently quantize and code the actual metadata of P′−X.
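The placement of the second poses can be sketched as follows. The rule that derives X from the motion-to-sound delay (delay multiplied by an assumed head rotation speed) is a hypothetical tuning heuristic and is not stated in the disclosure; the function names, the (yaw, pitch, roll) tuple layout in degrees, and the wrapping of every axis to [−180, 180) are likewise assumptions.

```python
def wrap_degrees(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def deviation_from_latency(latency_s, assumed_head_speed_deg_s):
    """Hypothetical heuristic: yaw deviation covered during the expected delay."""
    return latency_s * assumed_head_speed_deg_s


def second_poses(p_ref, x):
    """Return [P'+X, P'-X] for poses given as (yaw, pitch, roll) in degrees."""
    plus = tuple(wrap_degrees(p + d) for p, d in zip(p_ref, x))
    minus = tuple(wrap_degrees(p - d) for p, d in zip(p_ref, x))
    return [plus, minus]
```

For instance, a delay of 100 ms and an assumed head speed of 150 degrees per second would give a yaw deviation of about 15 degrees, which happens to match the constant mentioned above.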
The metadata matrix Mp usually has certain symmetries in its Left and Right channel entries, which can be used to quantize and code the metadata efficiently. One of the symmetries in an example implementation is that the sum of squares of the real parts of the elements of any row or column is assumed to be close to 1. Another symmetry that is used in an example implementation is that element mij is assumed to be close to mji for the real part of the Mp matrix, while mij is assumed to be close to −mji for the imaginary part of the Mp matrix. These symmetries are used to save quantization points in some implementations. Alternatively, these symmetries are used for differential coding, in which one set of elements of the Mp matrix is differentially coded with respect to a second set of elements of the Mp matrix, i.e., the difference between the two sets is coded. The difference values are likely to be close to 0 most of the time and can be efficiently coded using entropy coders.
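The differential coding based on the off-diagonal symmetry can be sketched for a two-by-two complex matrix. The sketch works on unquantized values to keep the idea visible; in practice the residuals would be quantized and entropy coded. The function names and the nested-list layout [[m11, m12], [m21, m22]] are hypothetical.

```python
def symmetry_residuals(m):
    """Residuals of m12 with respect to m21 for a 2x2 complex matrix.

    real_diff is near 0 when Re(m12) is close to Re(m21);
    imag_sum is near 0 when Im(m12) is close to -Im(m21).
    """
    real_diff = m[0][1].real - m[1][0].real
    imag_sum = m[0][1].imag + m[1][0].imag
    return real_diff, imag_sum


def restore_m12(m21, real_diff, imag_sum):
    """Rebuild m12 from m21 and the two residuals."""
    return complex(m21.real + real_diff, imag_sum - m21.imag)
```

Only m21 and the two small residuals need to be coded for the off-diagonal pair, instead of two full complex values.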
The metadata for binaural channels corresponding to poses Pn may be computed in the broadband or banded domain. Moreover, in some implementations it can be coded in the subband domain with a CLDFB (Complex Low Delay Filterbank). The time resolution of the metadata computed with the CLDFB filterbank can be much finer than the time resolution of the codec (e.g., IVAS) or renderer. In an example implementation, the time resolution of the renderer or codec is 20 ms, which is referred to as a frame, whereas the time resolution of the CLDFB domain metadata is 5 ms, which is referred to as a subframe. It can be assumed that the metadata does not change very frequently with time, and hence the metadata corresponding to one or more subframes in a frame is differentially coded with respect to one or more subframes of the same frame. Using subframes of the same frame as the reference for differential coding minimizes the impact of packet loss during transmission of data to the lightweight device. The difference values that are being coded are likely to be 0 in most cases and can be efficiently coded using an entropy coder. In some implementations, it has been realized that the metadata does not change very frequently across frequency, and hence the metadata corresponding to one or more frequency bands of a frame is differentially coded with respect to one or more frequency bands of the same frame. The difference values that are being coded are likely to be 0 in most cases and can be efficiently coded using an entropy coder.
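Intra-frame differential coding over time can be sketched for a 20 ms frame made of four 5 ms subframes. The sketch assumes that the metadata has already been quantized to integer indices and that each subframe is coded relative to the previous subframe of the same frame; the choice of reference subframe and the function names are assumptions, since the text only requires that the reference lies in the same frame.

```python
def encode_frame(subframe_indices):
    """Code the first subframe as is and each later subframe as a difference
    to the previous subframe of the same frame.

    subframe_indices is a list with one list of integer quantizer indices per
    subframe. No data from other frames is referenced.
    """
    first = list(subframe_indices[0])
    diffs = [
        [cur - prev for cur, prev in zip(subframe_indices[i], subframe_indices[i - 1])]
        for i in range(1, len(subframe_indices))
    ]
    return first, diffs


def decode_frame(first, diffs):
    """Invert encode_frame by accumulating the differences."""
    out = [list(first)]
    for d in diffs:
        out.append([prev + delta for prev, delta in zip(out[-1], d)])
    return out
```

Because a frame is decodable without reference to any other frame, the loss of one packet affects only the frame carried in that packet. The same scheme applies across frequency bands by differencing neighbouring band indices within a frame.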
Systems and methods disclosed in the present disclosure may be implemented as software, firmware, hardware or a combination thereof. In a hardware implementation, the division of tasks does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation.
The computer hardware may for example be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that computer hardware. Further, the present disclosure shall relate to any collection of computer hardware that individually or jointly execute instructions to perform any one or more of the concepts discussed herein.
The following components are connected to I/O interface 205: input unit 206, that may include a keyboard, a mouse, or the like; output unit 207 that may include a display such as a liquid crystal display (LCD) and one or more speakers; storage unit 208 including a hard disk, or another suitable storage device; and communication unit 209 which may include a network interface card such as a network card (e.g., wired or wireless).
In some implementations, input unit 206 includes one or more microphones in different positions (depending on the host device) enabling capture of audio signals in various formats (e.g., mono, stereo, spatial, immersive, and other suitable formats).
In some implementations, output unit 207 includes systems with various numbers of speakers. Output unit 207 (depending on the capabilities of the host device) can render audio signals in various formats (e.g., mono, stereo, immersive, binaural, and other suitable formats).
In some embodiments, communication unit 209 is configured to communicate with other devices (e.g., via a network). Drive 210 is also connected to I/O interface 205, as required. Removable medium 211, such as a magnetic disk, an optical disk, a magneto-optical disk, a flash drive or another suitable removable medium, is mounted on drive 210, so that a computer program read therefrom is installed into storage unit 208, as required. A person skilled in the art would understand that although apparatus 200 is described as including the above-described components, in real applications it is possible to add, remove, and/or replace some of these components, and all such modifications or alterations fall within the scope of the present disclosure.
In accordance with example embodiments of the present disclosure, the processes described above may be implemented as computer software programs or on a computer-readable storage medium. For example, embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program including program code for performing methods. In such embodiments, the computer program may be downloaded and mounted from the network via the communication unit 209, and/or installed from the removable medium 211, as shown in
Generally, various example embodiments of the present disclosure may be implemented in hardware or special purpose circuits (e.g., control circuitry), software, logic or any combination thereof. For example, the various elements of
Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, a processor and/or other computing device(s), which may include control circuitry. While various aspects of the example embodiments of the present disclosure are illustrated and described as block diagrams, flowcharts, or using some other pictorial representation, it will be appreciated that the blocks, apparatus, systems, techniques, or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Additionally, various blocks shown in the flowcharts may be viewed as method steps, and/or as operations that result from operation of computer program code, and/or as a plurality of coupled logic circuit elements constructed to carry out the associated function(s). For example, embodiments of the present disclosure include a computer program product including a computer program tangibly embodied on a machine readable medium, the computer program containing program codes configured to carry out the methods as described above.
Computer program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These computer program codes may be provided to one or more processors of a general-purpose computer, special purpose computer, or other programmable data processing apparatus that has control circuitry, such that the program codes, when executed by one or more processors of the computer or other programmable data processing apparatus, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server or distributed over one or more remote computers and/or servers.
The one or more processors may operate as a standalone device or may be connected, e.g., networked to other processor(s). Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
The software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person skilled in the art, the term computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, physical (non-transitory) storage media in various forms, such as ROM, PROM, EPROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Further, it is well known to the skilled person that communication media (transitory) typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
The implementation of the technologies disclosed in the figures are merely illustrative examples, and the invention is not so limited. For example, the illustrated partitions such as blocks in
Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the disclosure discussions utilizing terms such as “processing”, “computing”, “calculating”, “determining”, “analyzing” or the like, may refer to the function, action, steps and/or processes of a computer hardware or computing system, or similar electronic computing devices, that manipulate and/or transform data represented as physical, such as electronic, quantities into other data similarly represented as physical quantities.
It should be appreciated that in the above description of example embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention. Furthermore, while some embodiments described herein include some, but not other, features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as a method or combination of elements of a method that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with instructions for carrying out such a method or element of a method forms a means for carrying out the method or element of a method. Note that when the method includes several elements, e.g., several steps, no ordering of such elements is implied, unless specifically stated. Furthermore, an element described herein of an apparatus embodiment is an example of a means for carrying out the function performed by the element for the purpose of carrying out the embodiments of the invention. In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
The person skilled in the art realizes that the present invention by no means is limited to the preferred embodiments described above. On the contrary, many modifications and variations are possible within the scope of the appended claims. For example, various downmix representations may be employed, other than the ones mentioned above. Further, the number of second head poses may be any number, not necessarily two, like in the example mentioned above.
Various aspects and implementations of the present disclosure may also be appreciated from the following enumerated example embodiments (EEEs), which are not claims.
EEE1. A method of processing audio, comprising:
This application claims priority to U.S. Provisional Application 63/386,465 filed on Dec. 7, 2022, the contents of which are herein incorporated by reference in their entirety.
Number | Date | Country
---|---|---
63386465 | Dec 2022 | US