This disclosure pertains to systems, methods, and media for rotation of sound components for orientation-dependent coding schemes.
Coding techniques for scene-based audio may rely on downmixing paradigms that are orientation-dependent. For example, a scene-based audio signal that includes W, X, Y, and Z components (e.g., for three-dimensional sound localization) may be downmixed such that only a subset of the components are waveform encoded, and the remaining components are parametrically encoded and reconstructed by a decoder of a receiver device. This may result in a degradation in audio sound quality.
Throughout this disclosure, including in the claims, the terms “speaker,” “loudspeaker” and “audio reproduction transducer” are used synonymously to denote any sound-emitting transducer (or set of transducers). A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), which may be driven by a single, common speaker feed or multiple speaker feeds. In some examples, the speaker feed(s) may undergo different processing in different circuitry branches coupled to the different transducers.
Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X−M inputs are received from an external source) may also be referred to as a decoder system.
Throughout this disclosure including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
At least some aspects of the present disclosure may be implemented via methods. Some methods may involve determining, by an encoder, a spatial direction of a dominant sound component in a frame of an input audio signal. Some methods may involve determining, by the encoder, rotation parameters based on the determined spatial direction and a direction preference of a coding scheme to be used to encode the input audio signal. Some methods may involve rotating sound components of the frame of the input audio signal based on the rotation parameters such that, after being rotated, the dominant sound component has a spatial direction that aligns with the direction preference of the coding scheme. Some methods may involve encoding the rotated sound components of the frame of the input audio signal using the coding scheme in connection with an indication of the rotation parameters or an indication of the spatial direction of the dominant sound component.
In some examples, rotating the sound components comprises: determining a first rotation amount and optionally a second rotation amount for the sound components based on the spatial direction of the dominant sound component and the direction preference of the coding scheme; and rotating the sound components around a first axis by the first rotation amount and optionally around a second axis by said optional second rotation amount such that the sound components, after rotation, are aligned with a third axis corresponding to the direction preference of the coding scheme. In some examples, the first rotation amount is an azimuthal rotation amount and the optional second rotation amount is an elevational rotation amount. In some examples, the first axis or the second axis is perpendicular to a vector associated with the dominant sound component. In some examples, the first axis or the second axis is perpendicular to the third axis.
In some examples, some methods may involve determining whether to determine the rotation parameters based at least in part on a determination of a strength of the spatial direction of the dominant sound component, wherein determining the rotation parameters is responsive to determining that the strength of the spatial direction of the dominant sound component exceeds a predetermined threshold.
In some examples, some methods may involve: determining, for a second frame, a spatial direction of a dominant sound component in the second frame of the input audio signal; determining that a strength of the spatial direction of the dominant sound component in the second frame is below a predetermined threshold; and responsive to determining that the strength of the spatial direction of the dominant sound component in the second frame is below a predetermined threshold, determining that rotation parameters for the second frame are not to be determined. In some examples, the rotation parameters for the second frame are set to the rotation parameters for a preceding frame. In some examples, the sound components of the second frame are not rotated.
In some examples, determining the rotation parameters comprises smoothing at least one of: the determined spatial direction of the frame with a determined spatial direction of a previous frame or the determined rotation parameters of the frame with determined rotation parameters of the previous frame. In some examples, the smoothing comprises utilizing an autoregressive filter.
In some examples, the direction preference of the coding scheme depends at least in part on a bit rate at which the input audio signal is to be encoded.
In some examples, the spatial direction of the dominant sound component is determined using a direction of arrival (DOA) analysis.
In some examples, the spatial direction of the dominant sound component is determined using a principal components analysis (PCA).
In some examples, some methods involve quantizing at least one of the rotation parameters or the indication of the spatial direction of the dominant sound component, wherein the sound components are rotated using the quantized rotation parameters or the quantized indication of the spatial direction of the dominant sound component. In some examples, quantizing the rotation parameters or the indication of the spatial direction of the dominant sound component comprises encoding a numerical value corresponding to a point of a set of points uniformly distributed on a portion of a sphere. In some examples, some methods involve smoothing the rotation parameters relative to rotation parameters associated with a previous frame of the input audio signal prior to quantizing the rotation parameters or prior to quantizing the indication of the spatial direction of the dominant sound component.
In some examples, some methods involve smoothing a covariance matrix used to determine the spatial direction of the dominant sound component of the frame relative to a covariance matrix used to determine a spatial direction of a dominant sound component of a previous frame of the input audio signal.
In some examples, determining the rotation parameters comprises determining one or more rotation angles subject to a limit determined based at least in part on a rotation applied to a previous frame of the input audio signal. In some examples, the limit indicates a maximum rotation from an orientation of the dominant sound component based on the rotation applied to the previous frame of the input audio signal.
In some examples, rotating the sound components comprises interpolating from previous rotation parameters associated with a previous frame of the input audio signal to the determined rotation parameters for samples of the frame of the input audio signal. In some examples, the interpolation comprises a linear interpolation. In some examples, the interpolation comprises applying a faster rotation to samples at a beginning portion of the frame relative to samples at an ending portion of the frame.
In some examples, the rotated sound components and the indication of the rotation parameters are usable by a decoder to reverse the rotation of the sound components prior to rendering the sound components.
Some methods may involve receiving, by a decoder, information representing rotated audio components of a frame of an audio signal and a parameterization of rotation parameters used to generate the rotated audio components, wherein the rotated audio components were rotated, by an encoder, from an original orientation, and wherein the rotated audio components have been rotated to a rotated orientation that aligns with a spatial preference of a coding scheme used by the encoder and the decoder. Some methods may involve decoding the received information based at least in part on the coding scheme. Some methods may involve reversing a rotation of the audio components based at least in part on the parameterization of the rotation parameters to recover the original orientation. Some methods may involve rendering the audio components at least partly subject to the recovered original orientation.
In some examples, reversing the rotation of the audio components comprises rotating the audio components around a first axis by a first rotation amount and optionally around a second axis by a second rotation amount, and wherein the first rotation amount and the optional second rotation amount are indicated in the parameterization of the rotation parameters. In some examples, the first rotation amount is an azimuthal rotation amount and the optional second rotation amount is an elevational rotation amount. In some examples, the first axis or the second axis is perpendicular to a vector associated with a dominant sound component of the audio components. In some examples, the first axis or the second axis is perpendicular to a third axis that is associated with the spatial preference of the coding scheme.
In some examples, reversing the rotation of the audio components comprises rotating the audio components around an axis perpendicular to a plane formed by a dominant sound component of the audio components prior to the rotation and an axis corresponding to the spatial preference of the coding scheme, and wherein information indicating the axis perpendicular to the plane is included in the parameterization of the rotation parameters.
Some methods may involve determining, by an encoder, a spatial direction of a dominant sound component in a frame of an input audio signal. Some methods may involve determining, by the encoder, rotation parameters based on the determined spatial direction and a direction preference of a coding scheme to be used to encode the input audio signal. Some methods may involve modifying the direction preference of the coding scheme to generate an adapted coding scheme, wherein the modified direction preference is determined based on at least one of the rotation parameters or the determined spatial direction of the dominant sound component such that the spatial direction of the dominant sound component is aligned with the modified direction preference of the adapted coding scheme. Some methods may involve encoding sound components of the frame of the input audio signal using the adapted coding scheme in connection with an indication of the modified direction preference.
Some methods may involve receiving, by a decoder, information representing audio components of a frame of an audio signal and an indication of an adaptation of a coding scheme by an encoder to encode the audio components, wherein the coding scheme was adapted by the encoder such that a spatial direction of a dominant sound component of the audio components and a spatial preference of the coding scheme are aligned. Some methods may involve adapting the decoder based on the indication of the adaptation of the coding scheme. Some methods may involve decoding the audio components of the frame of the audio signal using the adapted decoder.
Some or all of the operations, functions and/or methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon.
At least some aspects of the present disclosure may be implemented via an apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some implementations, an apparatus is, or includes, an audio processing system having an interface system and a control system. The control system may include one or more general purpose single- or multi-chip processors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or combinations thereof.
The present disclosure provides various technical advantages. For example, by rotating sound components to align with a directional preference of a coding scheme, high sound quality may be preserved while encoding audio signals in a bit-rate efficient manner. This may allow accuracy in sound source positioning in scene-based audio, even when audio signals are encoded with relatively lower bit rates and when sound components are not positioned in alignment with a directional preference of the coding scheme.
Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Like reference numbers and designations in the various drawings indicate like elements.
Some coding techniques for scene-based audio (e.g., Ambisonics) rely on coding multiple Ambisonics component signals after a downmix operation. Downmixing may allow a reduced number of audio components to be coded in a waveform encoded manner (e.g., in waveform-retaining fashion), and the remaining components may be encoded parametrically. On the receiver side, the remaining components may be reconstructed using parametric metadata indicative of the parametric encoding. Because only a subset of the components are waveform encoded and the parametric metadata associated with the parametrically encoded components may be encoded efficiently with respect to bit rate, such a coding technique may be relatively bit rate efficient while still allowing high quality audio.
By way of example, a First Order Ambisonics (FOA) signal may have W, X, Y, and Z components, where the W component is an omnidirectional signal, and where the X, Y, and Z components are direction-dependent. Continuing with this example, with certain codecs (e.g., the Immersive Voice and Audio Services (IVAS) codec), at a lowest bit rate (e.g., 32 kbps), the FOA signal may be downmixed to one channel, where only the W component is waveform encoded, and the X, Y, and Z components may be parametrically encoded. Continuing still further with this example, at a higher bit rate (e.g., 64 kbps), the FOA signal may be downmixed to two channels, where the W component and one direction dependent component are waveform encoded, and the remaining direction dependent components are parametrically encoded. In one example, the W and Y components are waveform encoded, and the X and Z components may be parametrically encoded. In this case, because the Y component is waveform encoded, whereas the X and Z components are parametrically encoded, the encoding of the FOA signal is orientation dependent.
In instances in which a dominant sound component is not aligned with the selected direction dependent component, reconstruction of the parametrically encoded components may not be entirely satisfactory. For example, in an instance in which the W and Y components are waveform encoded and in which the X and Z components are parametrically encoded, and in which the dominant sound component is not aligned with the Y axis (e.g., in which the dominant sound component is substantially aligned with the X axis or the Z axis, or the like), it may be difficult to accurately reconstruct the X and Z components using the parametric metadata at the receiver. Moreover, because the dominant sound component is not aligned with the waveform encoded axis, the reconstructed FOA signal may have spatial distortions or other undesirable effects.
In some implementations, the techniques described herein perform a rotation of sound components to align with a directional preference of a coding scheme. For example, in an instance in which the directional preference of the coding scheme is along the Y axis (e.g., in the example given above in which W and Y components are waveform encoded), the techniques described herein may rotate the sound components of a frame such that a dominant sound component of the frame is aligned with the Y axis. The rotated sound components may then be encoded. Additionally, rotation parameters that include information that may be used by a decoder to reverse the rotation of the rotated sound components may be encoded. For example, the angles of rotation used to rotate the sound components may be provided. As another example, the location (e.g., in spherical coordinates) of the dominant sound component of the frame may be encoded. The encoded rotated sound components and the encoded rotation parameters may be multiplexed in a bit stream.
A decoder of a receiver device may de-multiplex the encoded rotated sound components and the encoded rotation parameters and perform decoding to extract the rotated sound components and the rotation parameters. The decoder may then utilize the rotation parameters to reverse the rotation of the rotated sound components such that the sound components are reconstructed to their original orientation. The techniques described herein may allow high sound quality with a reduced bit rate, while also maintaining accuracy in sound source positioning in scene-based audio, even when sound components are not positioned in alignment with a directional preference of the coding scheme.
The examples described herein generally utilize the Spatial Reconstruction (SPAR) perceptual encoding scheme. In SPAR, a FOA audio signal may be spatially processed during downmixing such that some channels are waveform encoded and some channels are parametrically encoded based on metadata determined by a SPAR encoder. SPAR is further described in D. McGrath, S. Bruhn, H. Purnhagen, M. Eckert, J. Torres, S. Brown, and D. Darcy, "Immersive Audio Coding for Virtual Reality Using a Metadata-assisted Extension of the 3GPP EVS Codec," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 730-734, which is hereby incorporated by reference in its entirety. It should be noted that although the SPAR coding scheme is sometimes utilized herein in connection with various examples, the SPAR coding scheme is merely one example of a coding scheme that utilizes a directional preference for FOA downmixing. In some implementations, the techniques described herein may be utilized with any suitable scene-based audio coding scheme.
In some implementations, the rotated sound components, along with an indication of the rotation that was performed by an encoder, may be encoded as a bit stream. For example, the encoder may encode rotational parameters that indicate that the sound components of the audio signal depicted in
In some implementations, an encoder rotates sound components of an audio signal and encodes the rotated audio components in connection with rotation parameters. In some implementations, the audio components are rotated by an angle that is determined based on: 1) the spatial direction of the dominant sound component in the audio signal; and 2) a directional preference of the coding scheme. For example, the directional preference may be based at least in part on a bit rate to be used in the coding scheme. As a more particular example, a lowest bit rate (e.g., 32 kbps) may be used to encode just the W component such that the coding scheme has no directional preference. Continuing with this more particular example, a next higher bit rate (e.g., 64 kbps) may be used to encode the W component and the Y component, such that the coding scheme has a directional preference along the Y axis. The examples described herein will generally relate to a condition in which the W component and the Y component are encoded, although other coding schemes and other directional preferences may be derived using the techniques described herein.
Process 200 can begin at 202 by determining a spatial direction of a dominant sound component in a frame of an input audio signal. In some implementations, the spatial direction may be determined as spherical coordinates (e.g., (α, β), where α indicates an azimuthal angle, and β indicates an elevational angle). In some implementations, the spatial direction of the dominant sound component may be determined using direction of arrival (DOA) analysis of the frame of the input audio signal. DOA analysis may indicate a location of an acoustic point source (e.g., positioned at a location having coordinates (α, β)) from which sound originates to yield the dominant sound component of the frame of the input audio signal. DOA analysis may be performed using, for example, the techniques described in Pulkki, V., Delikaris-Manias, S., Politis, A., Parametric Time-Frequency Domain Spatial Audio, 1st edition, 2018, which is incorporated by reference herein in its entirety. In some implementations, the spatial direction of the dominant sound component may be determined by performing principal components analysis (PCA) on the frame of the input audio signal. In some implementations, the spatial direction of the dominant sound component may be determined by performing a Karhunen-Loeve transform (KLT).
In some implementations, a metric that indicates a degree of dominance, or strength, of the dominant sound component is determined. One example of such a metric is a direct-to-total energy ratio of the frame of the FOA signal. The direct-to-total energy ratio may be within a range of 0 to 1, where lower values indicate less dominance of the dominant sound component relative to higher values. In other words, lower values may indicate a more diffuse sound with a less strong directional aspect.
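By way of non-limiting illustration, the following sketch (in Python, assuming a numpy array of shape (4, L) holding the W, X, Y, and Z channels of one frame) shows one way a dominant direction and a direct-to-total energy ratio might be estimated. The intensity-style direction estimate and the energy-ratio proxy below are assumptions chosen for brevity; DOA analysis, PCA, or a KLT may be used instead, as described above.

```python
import numpy as np

def dominant_direction_and_ratio(frame):
    """Estimate the dominant direction (azimuth alpha, elevation beta) and a
    crude direct-to-total energy ratio for one FOA frame of shape (4, L)."""
    w, x, y, z = frame
    # Intensity-style direction estimate: correlate W with each dipole channel.
    v = np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])
    norm = np.linalg.norm(v) + 1e-12
    alpha = np.arctan2(v[1], v[0])                       # azimuthal angle
    beta = np.arcsin(np.clip(v[2] / norm, -1.0, 1.0))    # elevational angle
    # Crude proxy for the direct-to-total energy ratio described above.
    ratio = np.clip(norm / (np.mean(w * w) + 1e-12), 0.0, 1.0)
    return alpha, beta, ratio
```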
It should be noted that, in some implementations, process 200 may determine that rotation parameters need not be uniquely determined based on the degree of the strength of the dominant sound component. For example, in response to determining that the direct-to-total energy ratio is below a predetermined threshold (e.g., 0.5, 0.6, 0.7, or the like), process 200 may determine that rotation parameters need not be uniquely determined for the current frame. For example, in some such implementations, process 200 may determine that the rotation parameters from the previous frame may be re-used for the current frame. In such examples, process 200 may proceed to block 208 and rotate sound components using rotation parameters determined for the previous frame. As another example, in some implementations, process 200 may determine that no rotation is to be applied, because any directionality present in the FOA signal may reflect creator intent that is to be preserved, as determined, for example, based on metadata received with the input audio signal. In such examples, process 200 may omit the remainder of process 200 and may proceed to encode downmixed sound components without rotation. As yet another example, in some implementations, process 200 may estimate or approximate rotation parameters based on other sources. For example, in an instance in which the input audio signal is associated with corresponding video content, process 200 may estimate the rotation parameters based on locations and/or orientations of various content items in the video content (e.g., the position of a speaking person). In some such examples, process 200 may proceed to block 206 and may quantize the estimated rotation parameters determined based on other sources.
At 204, process 200 may determine rotation parameters based on the determined spatial direction and a directional preference of a coding scheme used to encode the input audio signal. In some implementations, the directional preference of the coding scheme may be determined and/or dependent on a bit rate used to encode the input audio signal. For example, a number of downmix channels, and therefore, which downmix channels are used, may depend on the bit rate.
It should be noted that rotation of sound components may be performed using a two-step rotation technique in which the sound components are rotated around a first axis (e.g., the Z axis) and then around a second axis (e.g., the X axis) to align the sound components with a third axis (e.g., the Y axis). Note that the two-step rotation technique is shown in and described below in more detail in connection with
αrot=αopt−α; and βrot=βopt−β
Alternatively, in some implementations, rotation of sound components may be performed using a great circle technique in which sound components are rotated around an axis perpendicular to a plane formed by the dominant sound component and the axis corresponding to the directional preference of the coding scheme. Note that the great circle technique is shown in and described below in more detail in connection with
It should be noted that, in some implementations, smoothing may be performed on determined rotation angles (e.g., on αrot and βrot, or on Θ and N), for example, to allow for smooth rotation across frames. For example, smoothing may be performed using an autoregressive filter (e.g., of order 1, or the like). As a more particular example, given determined rotation angles for a two-step rotation technique of αrot(n) and βrot(n) for a current frame n, smoothed rotation angles αrot_smoothed(n) and βrot_smoothed(n) may be determined by:
αrot_smoothed(n)=δ*αrot_smoothed(n−1)+(1−δ)*αrot(n)
βrot_smoothed(n)=δ*βrot_smoothed(n−1)+(1−δ)*βrot(n)
In the above, δ may have a value between 0 and 1. In one example, δ is about 0.8.
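For illustration only, an order-1 autoregressive smoother consistent with the smoothing described above might look as follows; the previous smoothed value is carried from frame to frame, and δ ≈ 0.8 is taken from the example above.

```python
def smooth_rotation_angle(prev_smoothed, current, delta=0.8):
    """Order-1 autoregressive smoothing of a rotation angle across frames."""
    return delta * prev_smoothed + (1.0 - delta) * current
```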
Alternatively, in some implementations, smoothing may be performed on covariance parameters or covariance matrices that are generated in the DOA analysis, PCA analysis, and/or KLT analysis to determine the direction of the dominant sound component. The smoothed covariance matrices may then be used to determine rotation angles. It should be noted that in instances in which smoothing is applied to determined directions of the dominant sound component across successive frames, various smoothing techniques, such as an autoregressive filter or the like, may be utilized.
In some instances, the smoothing operation (on rotation angles or on covariance parameters or matrices) can advantageously be reset when a transient directional change occurs rather than allowing such a transient change to affect subsequent frames.
It should be noted that, in some implementations, process 200 may determine and/or modify rotation angles determined at block 204 subject to a rotational limit from a preceding frame to a current frame. For example, in some implementations, process 200 may limit a rate of rotation (e.g., to 15° per frame, 20° per frame, or the like). Continuing with this example, process 200 can modify rotation angles determined at block 204 subject to the rotational limit. As another example, in some implementations, process 200 may determine that the rotation is not to be performed if a change in rotation angles of the current frame from the preceding frame is smaller than a predetermined threshold. In other words, process 200 may determine that small rotational changes between successive frames are not to be implemented, thereby applying hysteresis to the rotation angles. By not performing rotations unless a change in rotation angle substantially differs from the rotation angle of a preceding frame, small jitters in direction of the dominant sound are not reflected in corresponding jitters in the rotation angle.
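A minimal sketch of the rotational limit and hysteresis described above is given below; the 15° per-frame limit and the 2° hysteresis threshold are illustrative assumptions rather than values required by this disclosure.

```python
import numpy as np

def limit_rotation_change(prev_angle_deg, new_angle_deg,
                          max_step_deg=15.0, hysteresis_deg=2.0):
    """Limit the per-frame change of a rotation angle and ignore small jitters."""
    change = new_angle_deg - prev_angle_deg
    if abs(change) < hysteresis_deg:
        # Small directional jitter: keep the rotation of the preceding frame.
        return prev_angle_deg
    # Clamp the change to the maximum allowed per-frame rotation.
    return prev_angle_deg + float(np.clip(change, -max_step_deg, max_step_deg))
```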
At 206, process 200 may quantize the rotation parameters (e.g., that indicate an amount by which the sound components are to be rotated around the relevant rotation axes). For example, referring to the two-step rotation technique, in some implementations, the rotation amount in the azimuthal direction (e.g., αrot) may be quantized to be αrot, q, and the rotation amount in the elevational direction (e.g., βrot) may be quantized to be βrot, q. As another example, referring to the great circle rotation technique, the rotation amount about the perpendicular axis N may be quantized to Θq, and the direction of the perpendicular axis N may be quantized to Nq. As yet another example, referring to the great circle rotation technique, in some implementations, the direction of the dominant sound component (e.g., α and β) may be quantized, and the decoder may determine the direction of the perpendicular axis N and the rotation angle Θ about N using a priori knowledge of the spatial preference of the coding scheme (e.g., a priori knowledge of αopt and βopt). In some implementations, each angle may be quantized linearly. For example, in an instance in which 5 bits are used to encode a rotation angle, the rotation angle may be quantized to one of 32 steps. As another example, in an instance in which 6 bits are used to encode a rotation angle, the rotation angle may be quantized to one of 64 steps. Additional techniques for quantization are shown in and described below in connection with
It should be noted that in some implementations, smoothing may be performed prior to quantization, such as described above in connection with block 204. Alternatively, in some implementations, smoothing may be performed after quantization. In instances in which smoothing is performed after quantization, the decoder may additionally have to perform smoothing of decoded rotation angles. In such instances, smoothing filters at the encoder and the decoder run in a substantially synchronized manner such that the decoder can accurately reverse a rotation performed by the encoder. For example, in some implementations, smoothing operations may be reset under pre-determined conditions readily available at encoder and decoder, such as at a fixed time grid (e.g. each nth frame after codec reset/start) or upon transients detected based on the transmitted downmix signals.
Referring back to
It should be noted that, in some implementations, process 200 may perform sample-by-sample interpolation across samples of the frame. The interpolation may be performed from rotation angles determined from a previous frame (e.g., as applied to a last sample of the previous frame) to rotation angles determined (e.g., at block 206) and as applied to the last sample of the current frame. In some implementations, interpolation across samples of a frame may ameliorate perceptual discontinuities that may arise from two successive frames being associated with substantially different rotation angles. In some implementations, the samples may be interpolated using a linear interpolation. For example, in an instance in which a two-step rotation is performed (e.g., the sound components are rotated by αrot, q around a first axis and by βrot, q around a second axis), a ramp function may be used to linearly interpolate between α′rot, q of a previous frame and αrot, q of a current frame, and similarly, between β′rot, q of a previous frame and βrot, q of a current frame. For example, for a frame n, an interpolated azimuthal rotation angle αint(n) is represented by:
αint(n)=α′rot,q*w(n)+αrot,q*(1−w(n)), n=1 … L
In the above, L indicates a length of the frame, and w(n) may be a ramp function. One example of a suitable ramp function is:
It should be noted that a similar interpolation may be performed for the elevational rotation angle, βrot, q. In instances in which rotation is performed using the great circle rotation technique where a rotation of the sound components is performed around an axis perpendicular to a plane formed by the dominant sound component and an axis corresponding to the directional preference of the coding scheme by an angle Θq (e.g., as shown in and described below in connection with
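A sketch of the per-sample linear interpolation given above follows; since the ramp function itself is not reproduced here, the ramp w(n) = (L − n)/L, which decays from approximately 1 to 0 over the frame, is an assumption consistent with the equation for αint(n).

```python
import numpy as np

def interpolate_rotation_angle(prev_angle_q, curr_angle_q, frame_len):
    """Per-sample linear interpolation from the previous frame's quantized
    angle to the current frame's quantized angle over a frame of length L."""
    n = np.arange(1, frame_len + 1)
    w = (frame_len - n) / frame_len            # ramp from ~1 down to 0
    return prev_angle_q * w + curr_angle_q * (1.0 - w)
```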
In some implementations, rather than performing a linear interpolation across samples of the frame, process 200 may perform a non-linear interpolation. For example, in some implementations, rotation angles may be interpolated such that a faster change in rotation angles occurs for samples in a beginning portion of the frame relative to samples in an end portion of the frame. Such an interpolation may be implemented by applying an interpolation function with a shortened ramp portion at the beginning of the frame. In one example, weights w(n) may be determined according to:
In the equation given above, interpolation is performed over M samples of a frame having length L samples, where M is less than or equal to L.
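One possible realization of the shortened-ramp variant is shown below purely as an assumption consistent with the description above: the interpolation completes over the first M samples and the current frame's angle is then held for the remaining samples of the frame.

```python
import numpy as np

def shortened_ramp_weights(frame_len, ramp_len):
    """Weights w(n) that ramp from ~1 to 0 over the first ramp_len samples
    (ramp_len <= frame_len) and remain 0 for the rest of the frame."""
    n = np.arange(1, frame_len + 1)
    return np.clip((ramp_len - n) / ramp_len, 0.0, None)
```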
In some implementations, rather than interpolating between rotation angles, process 200 may perform an interpolation between a direction of a dominant sound component from a previous frame and a direction of a dominant sound component of a current frame. For example, in some implementations, an interpolated sound direction may be determined for each sample of the frame. Continuing with this example, each interpolated position may then be used for rotation, using either the two-step rotation technique or the great circle technique. Interpolation of dominant sound component directions is shown in
Referring to
It should be noted that, in certain cases (e.g., in instances in which P1 and P2 are not on the equator or P1 and P2 are not on the same meridian), the set of interpolated points 902 may not be evenly spaced. When rotated samples are rendered using a uniform time scale, this may lead to perceptual effects, because, during rendering, traversal from P1 to P2 may be more rapid for some samples relative to others. An alternative in which traversal between P1 to P2 is uniform with respect to time is shown in
Referring to
It should be noted that while the great circle interpolation technique with linear interpolation ensures equidistance of the interpolation points, it may have the effect that the azimuth and elevation angles do not evolve linearly. The elevation angle may even evolve non-monotonically, such as initially increasing to some maximum elevation and then decreasing with increasing pace to the target interpolation point P2. This may in turn lead to undesirable perceptual effects. For example, the first described technique, which linearly interpolates the two spherical coordinate angles (α, β), may in some cases be advantageous as the elevation angle is strictly confined to the interval [β1, β2] with a strictly monotonic (e.g., linear) evolution of the elevation within it. Thus, the optimal interpolation method may in some cases be the technique that linearly interpolates the two spherical coordinate angles (α, β) according to
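For comparison, hedged sketches of the two direction-interpolation options discussed above are given below: linear interpolation of the spherical angles (α, β), and great-circle (slerp-style) interpolation of the corresponding unit vectors. The parameterization by a fraction t in [0, 1] is an assumption for illustration.

```python
import numpy as np

def lerp_angles(alpha1, beta1, alpha2, beta2, t):
    """Linear interpolation of azimuth/elevation between two directions."""
    return alpha1 + t * (alpha2 - alpha1), beta1 + t * (beta2 - beta1)

def slerp(p1, p2, t):
    """Great-circle interpolation between unit vectors p1 and p2, yielding
    equidistant points along the connecting great circle for uniform t."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    omega = np.arccos(np.clip(np.dot(p1, p2), -1.0, 1.0))
    if omega < 1e-6:
        return p1
    return (np.sin((1.0 - t) * omega) * p1 + np.sin(t * omega) * p2) / np.sin(omega)
```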
Referring back to
At 210, process 200 can encode the rotated sound components using the coding scheme, in connection with an indication of the rotation parameters or an indication of the spatial direction of the dominant sound component. In some implementations, the rotation parameters may include bits encoding the rotation angles that were used to rotate the sound components (e.g., αrot, q and βrot, q). In some implementations, the direction of the dominant sound component (e.g., α and β) may be encoded after being quantized, e.g., using the techniques shown in and described below in connection with
In some implementations, the rotated sound components may be encoded using the SPAR coding method. In some implementations, the encoded rotation parameters may be multiplexed with the bits representing the encoded rotated sound components, as well as parametric metadata associated with a parametric encoding of the parametrically-encoded sound components. The multiplexed bit stream may then be provided to a receiver device having a decoder configured to decode and/or reconstruct the encoded rotated sound components.
Process 300 can begin at 302 by receiving information representing rotated sound components for a frame of an input audio signal and an indication of rotation parameters (e.g., determined and/or applied by an encoder) or an indication of the direction of the dominant sound component of the frame. In some implementations, process 300 may then demultiplex the received information, e.g., to separate the bits representing the rotated sound components from the bits representing the rotation parameters. In some implementations, rotation parameters may indicate angles of rotation around particular axes (e.g., an X axis, a Z axis, an axis perpendicular to a plane formed by the dominant sound component and another axis, or the like). In instances in which process 300 receives an indication of the direction of the dominant sound component of the frame, process 300 may determine the rotation parameters (e.g., angles by which the sound components were rotated and/or axes about which the sound components were rotated) based on the direction of the dominant sound component and a priori knowledge indicating the directional preference of the coding scheme. For example, process 300 may determine the rotation parameters (e.g., rotation angles and/or axes about which rotation was performed) using similar techniques as those used by the encoder (e.g., as described above in connection with block 204).
At 304, process 300 can decode the rotated sound components. For example, process 300 can decode the bits corresponding to the rotated sound components to construct a FOA signal. Continuing with this example, the decoded rotated sound components may be represented as a FOA signal F as:
where W represents the omnidirectional signal component, and X, Y, and Z represent the decoded sound components along the X, Y, and Z axes, respectively, after rotation. In some implementations, process 300 may reconstruct the components that were parametrically encoded by the encoder (e.g., the X and Z components) using parametric metadata extracted from the bit stream.
At 306, process 300 may reverse the rotation of the sound components using the rotation parameters. For example, in an instance in which the rotation parameters include a parameterization of the rotation angles applied by the encoder, process 300 may reverse the rotation using the rotation angles. As a more particular example, in an instance in which a two-step rotation was performed (e.g., first around the Z axis, and subsequently around the X axis), the two-step rotation may be reversed, as described below in connection with
At 308, process 300 may optionally render the audio signal using the reverse-rotated sound components. For example, process 300 may cause the audio signal to be rendered using one or more speakers, one or more headphones or ear phones, or the like.
In some implementations, angles (e.g., angles of rotation and/or an angle indicating a direction of a dominant sound component, which may be used to determine angles of rotation applied by an encoder) may be quantized, e.g., prior to being encoded into a bit stream by the encoder. As described above, in some implementations, a rotation parameter may be quantized linearly, e.g., using 5 or 6 bits, which would yield 32 or 64 quantization steps, or points, respectively. However, referring to
Various techniques may be used to identify a point from the set of points to which an angle is to be quantized. For example, in some implementations, a Cartesian representation of the angle to be quantized may be projected, along with the set of points, onto a unit cube. Continuing with this example, in some implementations, a two-dimensional distance calculation may be used to identify a point of the subset of points on the face of the unit cube on which the Cartesian representation of the angle has been projected. This technique may reduce the search for the point by a factor of 6 relative to searching over the entire set of points.
As another example, in some implementations, the Cartesian representation of the angle to be quantized may be used to select a particular three-dimensional octant of the sphere. Continuing with this example, a three-dimensional distance calculation may be used to identify a point from within the selected three-dimensional octant. This technique may reduce the search for the point by a factor of 8 relative to searching over the entire set of points. As yet another example, in some implementations, the above two techniques may be combined such that the point is identified from the set of points by performing a two-dimensional distance search over the subset of points in a two-dimensional octant of the face of the cube on which the Cartesian representation of the angle to be quantized is projected. This technique may reduce the search for the point by a factor of 24 relative to searching over the entire set of points.
In some implementations, rather than quantizing an angle by identifying a point of a set of points that is closest to the angle to be quantized, the angle may be quantized by projecting a unit vector representing the Cartesian representation of the angle on the face of a unit cube, and quantizing and encoding the projection. In one example, the unit vector representing the Cartesian representation of the angle may be represented as (x, y, z). Continuing with this example, the unit vector may be projected onto the unit cube to determine a projected point (x′, y′, z′), where:
Given the above, x′, y′, and z′ may have values within a range of (−1, 1), and the values may then be quantized uniformly. For example, quantizing the values within the range of about (−0.9, 0.9), e.g., with a step size of 0.2, may allow duplicate points on the edges of the unit cube to be avoided.
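A sketch of the cube-projection quantization is given below. Because the projection formula is not reproduced above, the standard projection (dividing the unit vector by its largest absolute component) is assumed, and the step size of 0.2 over approximately (−0.9, 0.9) is taken from the example.

```python
import numpy as np

def quantize_on_unit_cube(v, step=0.2, limit=0.9):
    """Project a unit vector onto the unit cube and uniformly quantize the
    projected coordinates, avoiding duplicate points on the cube edges."""
    v = np.asarray(v, dtype=float)
    p = v / np.max(np.abs(v))                     # assumed cube projection
    return np.clip(np.round(p / step) * step, -limit, limit)
```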
In some implementations, an encoder may perform a two-step rotation of sound components to align with a directionally-preferred axis by rotating the sound components around a first axis, and then subsequently around a second axis. For example, in an instance in which the directionally-preferred axis is the Y axis, the encoder may rotate the sound components around the Z axis, and then around the X axis, such that after the two rotation steps, the dominant sound component is directionally aligned with the Y axis.
An example of such a two-step rotation is shown in and described below in connection with
The second step of the two-step rotation is depicted in
Process 600 may begin at 602 by determining an azimuthal rotation amount (e.g., αrot) and an elevational rotation amount (e.g., βrot). The azimuthal rotation amount and the elevational rotation amount may be determined based on a spatial direction of the dominant sound component in a frame of an input audio signal and a directional preference of a coding scheme to be used to encode the input audio signal. For example, in an instance in which the directional preference of the coding scheme is the Y axis, the azimuthal rotation amount may indicate a rotation amount around the Z axis and the elevational rotation amount may indicate a rotation amount around the X axis. As a more particular example, given a directional preference of αopt and βopt for a dominant sound component positioned at (α, β), an azimuthal rotation amount αrot and an elevational rotation amount βrot may be determined by:
αrot=αopt−α; and βrot=βopt−β
In some implementations, because αopt+180° may also align with the preferred direction of the coding scheme (e.g., corresponding to the negative Y axis) and because azimuthal rotation may be performed in either the clockwise or counterclockwise direction about the Z axis, the value of αrot may be constrained to within a range of [−90°, 90°]. By determining αrot within a range of [−90°, 90°] rather than constraining αrot to rotate only in one direction about the Z axis, rotation angles within the range of [90°, 270°] may not occur. Accordingly, in such implementations, an extra bit may be saved when quantizing the value of αrot (e.g., as described below in connection with block 208). In some implementations, the value of αrot can be determined within the range of [−90°, 90°] by finding the value of the integer index k for which |αopt−α+k*180°| is minimized. Then, αrot may be determined by:
αrot=αopt−α+kopt*180°
It should be noted that, in some implementations, a rotation angle may be determined as a differential value relative to a rotation that was performed on the preceding frame. By way of example, in an instance in which an azimuthal rotation of α′rot was performed on the preceding frame, a differential azimuthal rotation to be performed on the current frame may be determined by: Δαrot=αrot−α′rot. In some implementations, the total rotation angle αrot may be encoded as a rotation parameter and provided to the decoder for reverse rotation, thereby ensuring that even if the encoder and the decoder become desynchronized, the decoder can still accurately perform a reverse rotation of the sound components.
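A small sketch of the azimuth wrapping described above follows, choosing the integer k that minimizes |αopt − α + k·180°| so that αrot stays within [−90°, 90°]; angles are in degrees and the function is purely illustrative.

```python
def wrapped_azimuth_rotation(alpha_opt_deg, alpha_deg):
    """Azimuthal rotation constrained to [-90, 90] degrees, exploiting the
    180-degree ambiguity of the directionally-preferred axis."""
    diff = alpha_opt_deg - alpha_deg
    k = round(-diff / 180.0)          # integer k minimizing |diff + k * 180|
    return diff + k * 180.0
```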
It should be noted that, in some implementations, the azimuthal rotation amount and the elevational rotation amount may be quantized values (e.g., αrot, q and βrot, q), which may be quantized using one or more of the quantization techniques described above.
At 604, process 600 can rotate the sound components by rotating the sound components by the azimuthal rotation amount around a first axis and by rotating the sound components by the elevational rotation amount around a second axis. Continuing with the example given above, process 600 can rotate the sound components by αrot (or, for a quantized angle, αrot, q) around the Z axis, and by βrot (or, for a quantized angle βrot, q) around the X axis.
In some implementations, the rotation around the first axis and the second axis may be accomplished using a matrix multiplication. For example, given an azimuthal rotation amount of αrot, q and an elevational rotation amount of βrot, q, matrices Rα and Rβ are defined as:
Given a frame of an input audio signal having FOA components of:
The rotated X, Y, and Z components, represented as Xrot, Yrot, and Zrot, respectively, may be determined by:
Because the W component (e.g., representing the omnidirectional signal) is not rotated, the rotated FOA signal may then be represented as:
At the decoder, after extracting the encoded rotated components from the bit stream, the decoder can reverse the rotation of the sound components by applying rotations in the reverse angles. For example, given R−α and R−β defined as:
The encoded rotated components may be reverse rotated by applying a reverse rotation around the X axis by the elevational angle amount and around the Z axis by the azimuthal angle amount. For example, the reverse rotated FOA signal Fout may be represented as:
Xout, Yout, and Zout, representing the reverse rotated X, Y, and Z components of the FOA signal, may be determined by:
In the above, in an instance in which the Y component was waveform encoded by the encoder and in which the X and Z components were parametrically encoded by the encoder, Xrot and Zrot may correspond to reconstructed X and Z components that are still rotated, where the reconstruction was performed by the decoder using the parametric metadata.
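Because the rotation matrices Rα and Rβ and their inverses are not reproduced above, the following sketch uses the standard rotation matrices about the Z axis (azimuth) and the X axis (elevation). The exact sign conventions and the mapping of the FOA X, Y, and Z components to Cartesian axes are assumptions, so this should be read as illustrative rather than as the codec's exact matrices.

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def two_step_rotate(xyz, alpha_rot, beta_rot):
    """Encoder side: rotate the X, Y, Z components (shape (3, L)) about the
    Z axis by alpha_rot, then about the X axis by beta_rot; W is untouched."""
    return rot_x(beta_rot) @ rot_z(alpha_rot) @ xyz

def two_step_reverse(xyz_rot, alpha_rot, beta_rot):
    """Decoder side: undo the X-axis rotation first, then the Z-axis rotation,
    recovering the original orientation of the sound components."""
    return rot_z(-alpha_rot) @ rot_x(-beta_rot) @ xyz_rot
```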
In some implementations, an encoder may rotate sound components around an axis perpendicular to a plane formed by the dominant sound component and an axis corresponding to the directional preference of the coding scheme. For example, in an instance in which the dominant sound component is denoted as P, and in which the direction preference of the coding scheme is along the Y axis, the axis (generally represented herein as N) is perpendicular to the P×Y plane.
It should be noted that, in some instances, rotation of sound components about an axis perpendicular to the plane formed by the dominant sound component and the axis corresponding to the directional preference of the coding scheme may provide an advantage in providing consistent rotations for dominant sound components that are near the Z axis but in different quadrants. By way of example, using the two-step rotation process, two dominant sound components near the Z axis but in different quadrants may be rotated by substantially different rotation angles around the Z axis (e.g., αrot may be substantially different for the two points). Conversely, by rotating around an axis perpendicular to a plane formed by the dominant sound component and an axis corresponding to the directional preference of the coding scheme, rotation angles Θ may remain relatively similar for both points. Using similar rotation angles for points that are relatively close together may improve sound perception, e.g., by avoiding rotating audio signal components that would benefit from waveform encoding onto the X and/or Z axes, when the audio signal components along these axes are parametrically encoded.
The angle βN indicates an angle of elevation of axis 704 (e.g., of axis N). The angle γN indicates an angle of inclination between axis 704 (e.g., axis N) and the Z axis. It should be noted that γN is 90°−βN. The angle through which to rotate around axis N is represented as Θ. In some implementations, Θ may be determined by the angle between a vector to point P and a vector corresponding to the Y axis. For example, Θ=arccos (P·Y). Accordingly, the rotation may be performed by first rotating about the Y axis by γN to bring axis N in line with the Z axis, then rotating about the Z axis by Θ to bring the dominant sound component in line with the Y axis, and then subsequently reverse rotating the dominant sound component about the Y axis by −γN to return axis N back to its original position as perpendicular to the original P×Y plane. After rotation, the dominant sound component P is now at position 706, as illustrated in
Process 800 may begin at 802 by identifying, for a point P representing a location of a dominant sound component of a frame of an input audio signal in three-dimensional space, an inclination angle (e.g., γN) of an axis N that is perpendicular to a plane formed by P and an axis corresponding to the directional preference, and an angle (e.g., Θ) through which to rotate the point P about axis N. By way of example, in an instance in which the directional preference corresponds to the Y axis, the plane may be the P×Y plane, and the perpendicular axis may be an axis N which is perpendicular to the P×Y plane. Such an axis is depicted and described above in connection with
At 804, process 800 may perform the rotation by rotating by the inclination angle around the Y axis corresponding to the directional preference, rotating about the Z axis by the angle Θ, and reversing the rotation by the inclination angle around the Y axis. By way of example, process 800 may rotate by γN around the Y axis, by Θ around the Z axis, and then by −γN around the Y axis. After this sequence, the point P (e.g., the dominant sound component) may be aligned with the Y axis, e.g., corresponding to the directional preference.
By way of example, assuming a directional preference corresponding to the Y axis and a quantized angle of rotation about axis N of Θq, Rγ and RΘ,q may be given by:
It should be noted that, for readability, the inclination angle γN is indicated as not quantized in the equations given above; however, γN may be quantized, for example, using any of the techniques described herein.
Continuing with this example, given a FOA signal having components Win, Xin, Yin, and Zin, a rotation of the X, Y, and Z components may be performed to determine rotated components Xrot, Yrot, and Zrot, which may be determined by:
It should be noted that the W component, corresponding to the omnidirectional signal, remains the same.
At the decoder, given Xrot, Yrot, and Zrot, the rotation may be reversed by:
In the equation given above, R−Θ,q applies a rotation around the Z axis by −Θ. In other words, R−Θ,q reverses the rotation around the Z axis. It should be noted that, in an instance in which the rotated X and Z components were parametrically encoded by the encoder, Xrot and Zrot may correspond to reconstructed rotated components which have been reconstructed by the decoder using parametric metadata provided by the encoder.
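A hedged sketch of the great-circle rotation and its reversal is given below: the axis N perpendicular to the P×Y plane is computed with a cross product, and the rotation is composed as described above (about Y by γN, about Z by Θ, then back about Y by −γN). The signs and conventions are assumptions for illustration.

```python
import numpy as np

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def great_circle_params(p):
    """Axis N perpendicular to the P x Y plane, its inclination gamma_N from
    the Z axis, and the rotation angle theta = arccos(P . Y)."""
    y_axis = np.array([0.0, 1.0, 0.0])
    n = np.cross(p, y_axis)
    n = n / (np.linalg.norm(n) + 1e-12)
    gamma_n = np.arccos(np.clip(n[2], -1.0, 1.0))
    theta = np.arccos(np.clip(np.dot(p, y_axis), -1.0, 1.0))
    return n, gamma_n, theta

def great_circle_rotate(xyz, gamma_n, theta):
    """Encoder side: bring N onto Z, rotate about Z by theta, then undo the
    first rotation, leaving the dominant component aligned with the Y axis."""
    return rot_y(-gamma_n) @ rot_z(theta) @ rot_y(gamma_n) @ xyz

def great_circle_reverse(xyz_rot, gamma_n, theta):
    """Decoder side: reverse the rotation about the same axis N."""
    return rot_y(-gamma_n) @ rot_z(-theta) @ rot_y(gamma_n) @ xyz_rot
```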
In some implementations, rotation of sound components may be performed by various blocks and/or at various levels of a codec (e.g., the IVAS codec). For example, in some implementations, rotation of sound components may be performed prior to an encoder (e.g., a SPAR encoder) downmixing channels. Continuing with this example, the sound components may be reverse rotated after upmixing the channels (e.g., by a SPAR decoder).
An example system diagram for rotating sound components prior to downmixing channels is shown in
At a receiver, a waveform codec 1008 may receive the bit stream and decode the bit stream to extract the reduced channels. In some implementations, waveform codec 1008 may be an EVS decoder. In some implementations, waveform codec 1008 may additionally extract the rotation parameters. An upmix decoder 1010 may then upmix the reduced channels by reconstructing the encoded components. For example, upmix decoder 1010 may reconstruct one or more components that were parametrically encoded by downmix encoder 1004. In some implementations, upmix decoder 1010 may be a SPAR decoder. A reverse rotation decoder 1012 may then reverse the rotation, for example, utilizing the extracted rotation parameters to reconstruct the FOA signal. The reconstructed FOA signal may then be rendered.
In some implementations, rotation may be performed by a downmix encoder (e.g., by a SPAR encoder). Continuing with this example, the sound components may be reverse rotated by an upmixing decoder (e.g., by a SPAR decoder). In some instances, this implementation may be advantageous in that techniques for rotating sound components (or reverse rotating the sound components) may utilize processes that are already implemented by and/or executed by the downmix encoder or the upmix decoder. For example, a downmix encoder may perform various cross-fading techniques from one frame to a successive frame. Continuing with this example, in an instance in which the downmix encoder performs cross-fading between successive frames and in which the downmix encoder itself performs rotation of sound components, the downmix encoder may not need to interpolate between samples of frames, due to the cross-fading between frames. In other words, the smoothing advantages provided by performing cross-fading may be leveraged to reduce computational complexity by not performing additional interpolation processes. Moreover, because a downmix encoder may perform cross-fading on a frequency band by frequency band basis, utilizing the downmix encoder to perform rotation may allow rotation to be performed differently for different frequency bands rather than applying the same rotation to all frequency bands.
An example system diagram for rotating sound components by a downmix encoder is shown in
At a receiver, a waveform codec 1026 may receive the bit stream and extract the downmixed and rotated sound components. For example, in an instance in which the FOA signal has been downmixed to two channels, waveform codec 1026 may extract W and Yrot components and extract parametric metadata used to parametrically encode the X and Z components. In some implementations, waveform codec 1026 may extract the rotation parameters. In some implementations, waveform codec 1026 may be an EVS decoder. An upmix and reverse rotation decoder 1028 may take the extracted downmixed and rotated sound components and reverse the rotation of the sound components, as well as upmix the channels (e.g., by reconstructing parametrically encoded components). For example, an output of upmix and reverse rotation decoder 1028 may be a reconstructed FOA signal. The reconstructed FOA signal may then be rendered.
Turning to
It should be noted that, in some implementations, a downmix and rotation encoder (e.g., downmix and rotation encoder 1022 as shown in and described above in connection with
According to some alternative implementations the apparatus 1100 may be, or may include, a server. In some such examples, the apparatus 1100 may be, or may include, an encoder. Accordingly, in some instances the apparatus 1100 may be a device that is configured for use within an audio environment, such as a home audio environment, whereas in other instances the apparatus 1100 may be a device that is configured for use in “the cloud,” e.g., a server.
In this example, the apparatus 1100 includes an interface system 1105 and a control system 1110. The interface system 1105 may, in some implementations, be configured for communication with one or more other devices of an audio environment. The audio environment may, in some examples, be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, etc. The interface system 1105 may, in some implementations, be configured for exchanging control information and associated data with audio devices of the audio environment. The control information and associated data may, in some examples, pertain to one or more software applications that the apparatus 1100 is executing.
The interface system 1105 may, in some implementations, be configured for receiving, or for providing, a content stream. The content stream may include audio data. The audio data may include, but may not be limited to, audio signals. In some instances, the audio data may include spatial data, such as channel data and/or spatial metadata. In some examples, the content stream may include video data and audio data corresponding to the video data.
The interface system 1105 may include one or more network interfaces and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). According to some implementations, the interface system 1105 may include one or more wireless interfaces. The interface system 1105 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system and/or a gesture sensor system. In some examples, the interface system 1105 may include one or more interfaces between the control system 1110 and a memory system, such as the optional memory system 1115 shown in
The control system 1110 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some implementations, the control system 1110 may reside in more than one device. For example, in some implementations a portion of the control system 1110 may reside in a device within one of the environments depicted herein and another portion of the control system 1110 may reside in a device that is outside the environment, such as a server, a mobile device (e.g., a smartphone or a tablet computer), etc. In other examples, a portion of the control system 1110 may reside in a device within one environment and another portion of the control system 1110 may reside in one or more other devices of the environment. For example, a portion of the control system 1110 may reside in a device that is implementing a cloud-based service, such as a server, and another portion of the control system 1110 may reside in another device that is implementing the cloud-based service, such as another server, a memory device, etc. The interface system 1105 also may, in some examples, reside in more than one device.
In some implementations, the control system 1110 may be configured for performing, at least in part, the methods disclosed herein. According to some examples, the control system 1110 may be configured for implementing methods of rotating sound components, encoding rotated sound components and/or rotation parameters, decoding encoded information, reversing a rotation of sound components, rendering sound components, or the like.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. The one or more non-transitory media may, for example, reside in the optional memory system 1115 shown in
In some examples, the apparatus 1100 may include the optional microphone system 1120 shown in
According to some implementations, the apparatus 1100 may include the optional loudspeaker system 1125 shown in
Some aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer readable medium (e.g., a disc) which stores code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems can be or include a programmable general purpose processor, digital signal processor, or microprocessor, programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including an embodiment of the disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, a memory, and a processing subsystem that is programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some embodiments may be implemented as a configurable (e.g., programmable) digital signal processor (DSP) that is configured (e.g., programmed and/or otherwise configured) to perform required processing on audio signal(s), including performance of one or more examples of the disclosed methods. Alternatively, embodiments of the disclosed systems (or elements thereof) may be implemented as a general purpose processor (e.g., a personal computer (PC) or other computer system or microprocessor, which may include an input device and a memory) that is programmed with software or firmware and/or otherwise configured to perform any of a variety of operations including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system may be implemented as a general purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system may also include other elements (e.g., one or more loudspeakers and/or one or more microphones). A general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device.
Another aspect of the present disclosure is a computer readable medium (for example, a disc or other tangible storage medium) which stores code for performing (e.g., code executable to perform) one or more examples of the disclosed methods or steps thereof.
While specific embodiments of the present disclosure and applications of the disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the disclosure described and claimed herein. It should be understood that while certain forms of the disclosure have been shown and described, the disclosure is not to be limited to the specific embodiments described and shown or the specific methods described.
This application claims the benefit of priority to U.S. Provisional Patent Application No. 63/264,489, filed Nov. 23, 2021, U.S. Provisional Patent Application No. 63/171,222, filed Apr. 6, 2021, and U.S. Provisional Patent Application No. 63/120,617, filed Dec. 2, 2020, all of which are incorporated herein by reference.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2021/061549 | Dec. 2, 2021 | WO |

Number | Date | Country
---|---|---
63/264,489 | Nov. 23, 2021 | US
63/171,222 | Apr. 6, 2021 | US
63/120,617 | Dec. 2, 2020 | US