The present disclosure relates to the efficient rendering of digital audio signals for headphone playback.
Spatial audio refers to an immersive audio reproduction system that allows the audience to perceive a high degree of audio envelopment. This sense of envelopment includes the sensation of the spatial location of the audio sources, in both direction and distance, such that the audience perceives the sound scene as if they were in the natural sound environment.
There are three audio recording formats commonly used for spatial audio reproduction systems. The format depends on the recording and mixing approach used at the audio content production site. The first and most well-known format is channel-based, whereby each channel of audio signals is designated to be played back on a particular loudspeaker at the reproduction site. The second format is object-based, whereby a spatial sound scene is described by a number of virtual sources (also called objects). Each audio object can be represented by a sound waveform with associated metadata. The third format is Ambisonic-based, which can be regarded as coefficient signals that represent a spherical expansion of the sound field.
With the proliferation of personal portable devices such as mobile phones and tablets, and with emerging applications of virtual/augmented reality, rendering immersive spatial audio over headphones is becoming more and more necessary and attractive. Binauralization is the process of converting the input spatial audio signals, for example, channel-based signals, object-based signals or Ambisonic-based signals, into the headphone playback signals. In essence, the natural sound scene in a practical environment is perceived by a pair of human ears. This implies that the headphone playback signals can render the spatial sound scene as naturally as possible if these playback signals are close to the sounds perceived by a human in the natural environment.
A typical example of binaural rendering is documented in the MPEG-H 3D audio standard [see NPL 1].
[NPL 1] ISO/IEC DIS 23008-3 “Information technology—High efficiency coding and media delivery in heterogeneous environments—Part 3: 3D audio”
[NPL 2] T. Lee, H. O. Oh, J. Seo, Y. C. Park and D. H. Youn, “Scalable Multiband Binaural Renderer for MPEG-H 3D Audio,” in Journal of Selected Topics in Signal Processing, vol. 9, no. 5, pp. 907-920, August 2015.
One non-limiting and exemplary embodiment provides a method of fast binaural rendering for multiple moving audio sources. The present disclosure takes the audio source signals, which can be object-based, channel-based or a mixture of both, together with the associated metadata, user head tracking data and a binaural room impulse response (BRIR) database, to generate the headphone playback signals. One non-limiting and exemplary embodiment of the present disclosure provides high spatial resolution and low computational complexity when used in the binaural renderer.
In one general aspect, the techniques disclosed here feature a method of efficiently generating the binaural headphone playback signals given multiple audio source signals with the associated metadata and a binaural room impulse response (BRIR) database, wherein the said audio source signals can be channel-based, object-based, or a mixture of both. The method comprises the steps of: (a) computing instant head-relative positions of the audio sources with respect to the position and facing direction of the user's head, (b) grouping the source signals according to the said instant head-relative positions of the audio sources in a hierarchical manner, (c) parameterizing the BRIR to be used for rendering (or dividing the BRIR to be used for rendering into a number of blocks), (d) dividing each source signal to be rendered into a number of blocks and frames, (e) averaging the parameterized (divided) BRIR sequences identified with a hierarchical grouping result, and (f) downmixing (averaging) the divided source signals identified with the hierarchical grouping result.
The method in an embodiment of the present disclosure is useful for rendering fast-moving objects on a head-tracking-enabled head-mounted device.
It should be noted that general or specific embodiments may be implemented as a system, a method, an integrated circuit, a computer program, a storage medium, or any selective combination thereof.
Additional benefits and advantages of the disclosed embodiments will become apparent from the specification and drawings. The benefits and/or advantages may be individually obtained by the various embodiments and features of the specification and drawings, which need not all be provided in order to obtain one or more of such benefits and/or advantages.
Configurations and operations in embodiments of the present disclosure will be described below with reference to the drawings. The following embodiment is merely illustrative for the principles of various inventive steps. It is understood that variations of the details described herein will be apparent to others skilled in the art.
The authors examined a method to solve the problems faced by the binaural renderer, using the MPEG-H 3D audio standard as a practical example.
Indirect binaural rendering, in which channel-based and object-based input signals are first converted to virtual loudspeaker signals and then converted to binaural signals, is widely adopted in 3D audio systems such as the MPEG-H 3D audio standard. However, such a framework results in the spatial resolution being fixed and limited by the configuration of the virtual loudspeakers in the middle of the rendering path. When the virtual loudspeakers are set to a 5.1 or 7.1 configuration, for example, the spatial resolution is constrained by the small number of virtual loudspeakers, with the result that the user perceives sound coming from only these fixed directions.
In addition, the BRIR database used in the binaural renderer (103) is associated with the virtual loudspeaker layout in a virtual listening room. This deviates from the expected situation, in which the BRIRs should be the ones associated with the production scene if such information is available from the decoded bitstream.
Ways to improve the spatial resolution include increasing the number of loudspeakers, e.g., to a 22.2 configuration, or using an object-binaural direct rendering scheme. However, these approaches may lead to high computational complexity when BRIRs are used, because the number of input signals for binauralization increases. The computational complexity issue is explained in the following paragraph.
Because a BRIR is generally a long sequence of impulses, direct convolution between the BRIR and a signal is computationally demanding. Therefore, many binaural renderers seek a tradeoff between computational complexity and spatial quality.
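To give a sense of scale, the following minimal sketch counts the multiply-accumulate (MAC) operations needed by direct time-domain convolution. The 0.5 second filter length and 16 kHz sampling rate are taken as illustrative assumptions, the source count of 10 is hypothetical, and the function name is introduced here only for illustration.

```python
def direct_convolution_macs_per_second(filter_seconds, sample_rate_hz,
                                       num_sources, num_ears=2):
    """MACs needed per second of output for direct time-domain convolution."""
    taps = int(filter_seconds * sample_rate_hz)  # filter length in samples
    # Each output sample needs one MAC per filter tap, per source, per ear.
    return taps * sample_rate_hz * num_sources * num_ears


# Assumed figures: 0.5 s BRIR, 16 kHz sampling rate.
print(direct_convolution_macs_per_second(0.5, 16000, num_sources=1))   # 256000000
print(direct_convolution_macs_per_second(0.5, 16000, num_sources=10))  # 2560000000
```

The cost grows linearly with the number of sources, which is the scaling that the downmixing and grouping steps described later aim to limit.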
On the other hand, as the “late reverberation” part of BRIR contains less spatial information, the signals can be downmixed (202) into one channel such that the convolution needs to be performed only once with the downmixed channel in (203). Although this method reduces the computational load in the late reverberation processing (203), the computational complexity may still be very high for the direct and early part processing (201). This is because each of the source signals is processed separately in the direct and early part processing (201) and the computational complexity increases as the number of the source signals increases.
The binaural renderer (103) considers the virtual loudspeaker signals as input signals, and the binaural rendering can be performed by convolving each virtual loudspeaker signal with the corresponding pair of binaural impulse responses. The head-related impulse response (HRIR) and the binaural room impulse response (BRIR) are commonly used as the impulse responses, where the latter includes room reverberation filter coefficients, which make it much longer than the HRIR.
The convolution process implicitly assumes that the source is at a fixed position, which is true for a virtual loudspeaker. However, there are many cases where the audio sources can be moving. One example is the use of a head-mounted display (HMD) in a virtual reality (VR) application, where the positions of audio sources are expected to be invariant to any rotation of the user's head. This is achieved by rotating the positions of objects or virtual loudspeakers in the reverse direction to cancel the effect of the user's head rotation. Another example is the direct rendering of objects, where these objects can be moving with the varying positions specified in metadata.
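The reverse rotation can be sketched as follows. This is a simplified illustration that assumes positions are given as azimuth angles in degrees and that only head yaw is tracked; elevation, distance and head translation are ignored, and the function name is hypothetical.

```python
def head_relative_azimuth(source_azimuth_deg, head_yaw_deg):
    """Source azimuth as seen from the rotated head, wrapped to [-180, 180)."""
    # Rotating the source in the reverse direction of the head rotation
    # keeps the source fixed in the world while the head turns.
    relative = source_azimuth_deg - head_yaw_deg
    return (relative + 180.0) % 360.0 - 180.0


print(head_relative_azimuth(30.0, 30.0))    # 0.0: the user now faces the source
print(head_relative_azimuth(0.0, 90.0))     # -90.0
print(head_relative_azimuth(170.0, -20.0))  # -170.0 (wraps around the back)
```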
Theoretically, there is no straightforward method to render a moving source, because the rendering system is no longer a linear time-invariant (LTI) system when the source moves. However, an approximation can be made such that the source is assumed to be stationary within a short period, and within this short period the LTI assumption is valid. This holds when the HRIR is used, because the source can be assumed stationary within the filter length of the HRIR (usually a fraction of a millisecond). Source signal frames can therefore be convolved with the corresponding HRIR filters to generate the binaural feeds. However, when the BRIR is used, because the filter length is generally much longer (e.g., 0.5 second), the source can no longer be assumed to be stationary during the BRIR filter length period. The source signal frame cannot be directly convolved with the BRIR filters unless additional processing is applied on the convolution with the BRIR filters.
The present disclosure comprises the following. Firstly, it provides a means of directly rendering the object-based and channel-based signals to the binaural ends without going through the virtual loudspeakers, making it possible to solve the spatial resolution limitation problem in <Problem 1>. Secondly, it provides a means of grouping close sources into one cluster such that some parts of the processing can be applied to the downmixed version of the sources within one cluster, which addresses the computational complexity problem in <Problem 2>. Thirdly, it provides a means of splitting the BRIR into several blocks, further dividing the direct block (corresponding to the direct sound and early reflections) into several frames, and then performing binauralization filtering by a new frame-by-frame convolution scheme that selects the BRIR frame according to the instant position of the moving source, which solves the moving source problem in <Problem 3>.
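The short-time stationarity approximation for the HRIR case can be sketched as below: each signal frame is convolved with the filter that matches the source position during that frame, and the results are overlap-added. This is an illustrative sketch for a single ear; the function names and the toy filter values are hypothetical.

```python
def convolve(x, h):
    """Plain time-domain convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def render_moving_source(signal, frame_len, hrirs_per_frame):
    """Convolve each frame with its own filter and overlap-add the results.

    hrirs_per_frame[i] holds the filter taps valid while frame i is played,
    i.e. the source is treated as stationary within each frame.
    """
    max_taps = max(len(h) for h in hrirs_per_frame)
    out = [0.0] * (len(signal) + max_taps - 1)
    for i, taps in enumerate(hrirs_per_frame):
        start = i * frame_len
        frame = signal[start:start + frame_len]
        for j, value in enumerate(convolve(frame, taps)):
            out[start + j] += value
    return out


x = [1.0, 0.0, 0.0, 1.0]
# The source moves between the two frames, so the filter changes.
print(render_moving_source(x, 2, [[1.0, 0.5], [0.5, 1.0]]))
# [1.0, 0.5, 0.0, 0.5, 1.0]
```

When the same filter is used for every frame, the result equals an ordinary convolution of the whole signal, which is the LTI case.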
In addition, the inputs also include optional user head tracking data, which can be the instant user head facing direction or position, if such information is available from external applications and the rendered audio scene is required to be adapted with respect to the user head rotation/movement. The outputs of the fast binaural renderer are the left and right headphone feed signals for user listening.
To obtain the outputs, the fast binaural renderer first comprises a head-relative source position computation module (301), which computes the relative source positions with respect to the instant user head facing direction/position by taking the instant source metadata and user head tracking data. The computed head-relative source positions are then used in a hierarchical source grouping module (302) to generate the hierarchical source grouping information, and in the binaural renderer core (303) for selecting the parameterized BRIRs according to the instant source positions. The hierarchical information generated by (302) is also used in the binaural renderer core (303) for the purpose of reducing the computational complexity. The details of the hierarchical source grouping module (302) are described in Section <Source grouping>.
The proposed fast binaural renderer also comprises a BRIR parameterization module (304), which splits each BRIR filter into several blocks. It further divides the first block into frames and attaches to each frame the corresponding BRIR target position label. The details of the BRIR parameterization module (304) are described in Section <BRIR parameterization>.
Note that the proposed fast binaural renderer considers the BRIRs as the filters for rendering the audio sources. In the case where the BRIR database is not adequate or the user prefers to use a high-resolution BRIR database, the proposed fast binaural renderer supports an external BRIR interpolation module (305), which interpolates the BRIR filters for the missing target locations based on the nearby BRIR filters. However, such an external module is not specified in this document.
Finally, the proposed fast binaural renderer comprises a binaural renderer core (303), which is the core processing unit. It takes the aforementioned individual source signals, the computed head-relative source positions, the hierarchical source grouping information and the parameterized BRIR blocks/frames for generating the headphone feeds. The details of the binaural renderer core (303) are described in Section <Binaural renderer core> and Section <Source grouping based frame-by-frame binaural rendering>.
The hierarchical source grouping module (302) in
$C_o^{(p)}$ [Math. 1]
where $o$ is the cluster index and $p$ is the layer index.
The number of layers P is chosen by the user depending on the system complexity requirement and can be greater than 2. A proper hierarchy design with lower resolution on the high layers can result in a lower computational complexity. A simple way to group the sources is to divide the whole space in which the audio sources exist into a number of small areas/enclosures, as illustrated in the previous example. The sources are then grouped based on which area/enclosure they fall into. Alternatively, the audio sources can be grouped using a clustering algorithm, e.g., the k-means or fuzzy c-means algorithm. These clustering algorithms compute similarity measures between any two sources and group the sources into clusters.
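The space-division approach can be sketched as follows. As a simplifying assumption, the space is divided only along azimuth into equal sectors, with one sector width per layer; the function name, the sector widths and the source positions are hypothetical.

```python
from collections import defaultdict


def group_sources(azimuths_deg, sector_widths_deg):
    """Group source indices, layer by layer, by the azimuth sector they fall into.

    Returns one dict per layer, mapping sector index -> list of source indices.
    A wider sector gives a lower grouping resolution for that layer.
    """
    layers = []
    for width in sector_widths_deg:
        clusters = defaultdict(list)
        for k, azimuth in enumerate(azimuths_deg):
            sector = int((azimuth % 360.0) // width)
            clusters[sector].append(k)
        layers.append(dict(clusters))
    return layers


# Five sources, two layers: 30-degree sectors and 90-degree sectors.
layers = group_sources([10.0, 20.0, 100.0, 200.0, 215.0], [30.0, 90.0])
print(layers[0])  # {0: [0, 1], 3: [2], 6: [3], 7: [4]}
print(layers[1])  # {0: [0, 1], 1: [2], 2: [3, 4]}
```

In the coarser layer, sources 3 and 4 share one cluster and could therefore be processed together, whereas the finer layer keeps them apart.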
This section describes the processing procedures in BRIR parameterization module (304) in
As discussed above, the use of such a long filter results in high computational complexity if direct convolution is applied between the filter and the source signal. The computational complexity increases further as the number of audio sources increases. To save computational complexity, each BRIR filter is divided into a direct block and diffuse blocks, and a simplified processing, as described in Section <Binaural renderer core>, is applied on the diffuse blocks. The division of the BRIR filter into blocks can be determined by the energy envelope of each BRIR filter and the inter-aural coherence between the filters in a pair. As the energy and inter-aural coherence in BRIRs decrease over time, the time points for separating the blocks can be derived empirically using existing algorithms [see NPL 2].
$h_\theta^{(0)}(n)$ [Math. 2]
where $n$ denotes the sample index, the superscript $(0)$ denotes the direct block and $\theta$ denotes the target location of this BRIR filter. Similarly, the $w$th diffuse block is denoted as
$h_\theta^{(w)}(n),\ w = 1, 2, \ldots, W$ [Math. 3]
where w is the diffuse block index. Furthermore, as shown in
On the other hand, the direct block of BRIR contains important directional information and will generate the directional cues in the binaural playback signals. To cater for the scenario where the audio sources are moving fast, rendering is to be performed based on the assumption that audio source is only stationary during a short time period (i.e., time frame with length of, e.g., 1024 samples at 16 kHz sampling rate), and binauralization is processed frame by frame in a module of source grouping based frame-by-frame binauralization (701) shown in
$h_\theta^{(0),m}(n)$ [Math. 4]
where $m = 0, \ldots, M$ denotes the frame index and $M$ is the total number of frames in the direct block. The divided frames are also assigned position labels $\theta$, which correspond to the target location of this BRIR filter.
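The parameterization can be sketched as below for one ear of one BRIR. The split points are passed in as plain sample counts, which is an assumption made for illustration; the text derives them from the energy envelope and inter-aural coherence. The function name and the toy values are hypothetical.

```python
def parameterize_brir(brir, label, direct_len, frame_len, diffuse_len):
    """Split a BRIR into labelled direct-block frames and diffuse blocks.

    brir        : list of filter taps (one ear)
    label       : target location of this BRIR, attached to every frame
    direct_len  : number of taps in the direct block (assumed given)
    frame_len   : number of taps per frame of the direct block
    diffuse_len : number of taps per diffuse block (assumed uniform)
    """
    direct = brir[:direct_len]
    frames = [
        {"label": label, "taps": direct[i:i + frame_len]}
        for i in range(0, len(direct), frame_len)
    ]
    tail = brir[direct_len:]
    diffuse_blocks = [
        tail[i:i + diffuse_len] for i in range(0, len(tail), diffuse_len)
    ]
    return frames, diffuse_blocks


frames, diffuse_blocks = parameterize_brir(
    list(range(10)), label=30.0, direct_len=4, frame_len=2, diffuse_len=3
)
print(frames)          # two frames, each labelled 30.0
print(diffuse_blocks)  # [[4, 5, 6], [7, 8, 9]]
```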
This section describes the details of binaural renderer core (303) as shown in
$s_k^{(\mathrm{current})}(n)$ [Math. 5]
and the previous $w$th block is denoted as
$s_k^{(\mathrm{current}-w)}(n),\ w = 1, 2, \ldots, W$. [Math. 6]
As shown in
$y^{(\mathrm{current})} = \beta\left(s_1^{(\mathrm{current})}(n), \ldots, s_K^{(\mathrm{current})}(n), H^{(0)}\right)$ [Math. 7]
where $y^{(\mathrm{current})}$ denotes the output of (701) and the function $\beta(\cdot)$ denotes the processing function of (701), which takes hierarchical source grouping information generated from (302) in
On the other hand, the previous blocks of source signals will be downmixed in the downmixing module (702) into one channel and passed to the late reverberation processing module (703). The late reverberation processing in (703) is denoted by
where $y^{(\mathrm{current}-w)}$ denotes the output of (703), and $\gamma(\cdot)$ denotes the processing function of (703), which takes the downmixed version of the previous blocks of source signals and the diffuse blocks of BRIRs as inputs. The variable $\theta_{\mathrm{ave}}$ denotes the averaged location of all the $K$ sources at the block $\mathrm{current}-w$.
Note that this late reverberation processing can be performed in the time domain using convolution; it can also be implemented by multiplication in the frequency domain using the fast Fourier transform (FFT) with cut-off frequencies $f_w$ applied. It is also worth noting that time-domain downsampling can be applied on the diffuse blocks depending on the target system computational complexity. Such downsampling reduces the number of signal samples and thus the number of multiplications in the FFT domain, resulting in reduced computational complexity.
Given the above, the binaural playback signal is finally generated by
As shown in the above equation, for each diffuse block $w$, because a downmix is applied on the source signals, the late reverberation processing $\gamma(\cdot)$ only needs to be performed once. Compared to a typical direct convolution approach, in which such processing (filtering) has to be performed separately for each of the $K$ source signals, the present disclosure reduces the computational complexity.
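The saving can be illustrated with a minimal time-domain sketch, assuming that the downmix is a sample-by-sample average of the source blocks and that one diffuse BRIR block is shared by all sources. The function names and signal values are hypothetical, and only one ear is shown.

```python
def convolve(x, h):
    """Plain time-domain convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def downmix(source_blocks):
    """Average the K source blocks sample by sample into one channel."""
    k = len(source_blocks)
    return [sum(samples) / k for samples in zip(*source_blocks)]


def late_reverb(source_blocks, diffuse_block):
    """One convolution for all K sources instead of K separate ones."""
    return convolve(downmix(source_blocks), diffuse_block)


blocks = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # K = 2 source blocks
diffuse = [1.0, 0.5]                          # toy diffuse BRIR block
print(late_reverb(blocks, diffuse))           # [0.5, 1.25, 0.5, 0.0]
```

Because convolution is linear, this single convolution gives the same result as filtering each source with the shared diffuse block and then averaging, at roughly 1/K of the filtering cost.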
This section describes the details of the source grouping based frame-by-frame binauralization module (701) in
As shown in
contained in the collection $H^{(0)}$. This BRIR frame is selected by searching for the labelled location of the BRIR frame, $[\theta_k^{(\mathrm{current}),\mathrm{lfrm}}]$, which is closest to the instant position of the source, $\theta_k^{(\mathrm{current}),\mathrm{lfrm}}$, at the latest frame, where $[\theta_k^{(\mathrm{current}),\mathrm{lfrm}}]$ denotes finding the nearest label value in the BRIR database. Because the 0th frame of the BRIR contains the most directional information, the convolution is performed with each source signal individually to preserve the spatial cues of each source. The convolution can be performed using multiplication in the frequency domain, as illustrated in (801) in
For each of the previous frames $s_k^{(\mathrm{current}),\mathrm{lfrm}-m}(n)$ where $m \geq 1$, the convolution is supposed to be performed with the $m$th frame of the direct block of the BRIR
contained in $H^{(0)}$, where $[\theta_k^{(\mathrm{current}),\mathrm{lfrm}-m}]$ denotes the labelled position of the BRIR frame which is closest to the source position at the frame $\mathrm{lfrm}-m$. Note that as $m$ increases, the directional information contained in
reduces. Because of this, to save computational complexity and as shown in (802), the present disclosure applies a downmixing to $s_k^{(\mathrm{current}),\mathrm{lfrm}-m}(n)$, $k = 1, 2, \ldots, K$, where $m \geq 1$, according to the hierarchical source grouping decision $C_o^{(p)}$
For example, if the second-layer source grouping is applied on the signal frame $s_k^{(\mathrm{current}),\mathrm{lfrm}-2}(n)$ (i.e., $m = 2$) and sources 4 and 5 are grouped into the second cluster $C_2^{(2)}$, the signals of these two sources are averaged and the convolution is applied between this averaged signal and the BRIR frame with the averaged source location at that frame.
Note that different hierarchical layers can be applied on the frames. In essence, high-resolution grouping should be considered for the early frames of BRIRs to preserve the spatial cues, while low-resolution grouping is considered for the late frames of BRIRs to reduce the computational complexity. Finally, the frame-wise processed signals are passed to a mixer, which performs a summation to generate the output of (701), i.e., $y^{(\mathrm{current})}$.
In the foregoing embodiments, the present disclosure is configured with hardware by way of the above explained example, but the present disclosure may also be provided by software in cooperation with hardware.
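The frame-by-frame scheme described in this section can be sketched as follows for one ear. This is an illustrative reading of the text under several assumptions: positions are azimuths in degrees, each BRIR frame has the same length as a signal frame, the cluster position is a plain mean of azimuths (which ignores wrap-around at 360 degrees), and time-domain convolution stands in for the frequency-domain multiplication. All names and values are hypothetical.

```python
def angular_distance(a_deg, b_deg):
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def nearest_label(azimuth_deg, labels_deg):
    """Label in the BRIR database closest to the given source position."""
    return min(labels_deg, key=lambda lab: angular_distance(azimuth_deg, lab))


def convolve(x, h):
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def render_direct_block(history, positions, clusters_per_lag, brir_frames):
    """history[m][k]   : frame of source k, m frames before the latest one
    positions[m][k]    : azimuth of source k at that frame
    clusters_per_lag[m]: clusters (lists of source indices) used for m >= 1
    brir_frames[label] : list of direct-block frames for that label
    """
    labels = list(brir_frames)
    frame_len = len(history[0][0])
    out = [0.0] * (2 * frame_len - 1)  # the tail overlaps into the next frame

    def accumulate(signal, azimuth, m):
        taps = brir_frames[nearest_label(azimuth, labels)][m]
        for i, value in enumerate(convolve(signal, taps)):
            out[i] += value

    # Latest frame (m = 0): every source is convolved individually.
    for signal, azimuth in zip(history[0], positions[0]):
        accumulate(signal, azimuth, 0)
    # Earlier frames (m >= 1): one convolution per cluster on the averaged
    # signal, using the BRIR frame nearest to the averaged position.
    for m in range(1, len(history)):
        for cluster in clusters_per_lag[m]:
            mixed = [mean(s) for s in zip(*(history[m][k] for k in cluster))]
            azimuth = mean(positions[m][k] for k in cluster)
            accumulate(mixed, azimuth, m)
    return out


brir = {0.0: [[1.0, 2.0], [3.0, 4.0]], 90.0: [[0.0, 0.0], [0.0, 0.0]]}
history = [[[0.0, 1.0], [0.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
positions = [[10.0, 20.0], [10.0, 20.0]]
print(render_direct_block(history, positions, {1: [[0, 1]]}, brir))
# [3.0, 5.0, 2.0]
```

In this toy case the two sources share one cluster at lag 1, so the older frame costs one convolution instead of two, while the latest frame is still rendered per source.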
In addition, the functional blocks used in the descriptions of the embodiments are typically implemented as LSI devices, which are integrated circuits. The functional blocks may be formed as individual chips, or a part or all of the functional blocks may be integrated into a single chip. The term “LSI” is used herein, but the terms “IC,” “system LSI,” “super LSI” or “ultra LSI” may be used as well depending on the level of integration.
In addition, the circuit integration is not limited to LSI and may be achieved by dedicated circuitry or a general purpose processor other than an LSI. After fabrication of LSI, a field programmable gate array (FPGA), which is programmable, or a reconfigurable processor which allows reconfiguration of connections and settings of circuit cells in LSI may be used.
Should a circuit integration technology replacing LSI appear as a result of advancements in semiconductor technology or other technologies derived from the technology, the functional blocks could be integrated using such a technology. Another possibility is the application of biotechnology and/or the like.
This disclosure can be applied to a method for rendering of digital audio signals for headphone playback.
101 format converter
102 VBAP renderer
103 binaural renderer
201 direct and early part processing
202 downmix
203 late reverberation part processing
204 mixing
301 head-relative source position computation module
302 hierarchical source grouping module
303 binaural renderer core
304 BRIR parameterization module
305 external BRIR interpolation module
306 fast binaural renderer
701 frame-by-frame fast binauralization module
702 downmixing module
703 late reverberation processing module
704 summation
Number | Date | Country | Kind |
---|---|---|---|
2016-211803 | Oct 2016 | JP | national |
This is a continuation of U.S. application Ser. No. 17/097,829, filed Nov. 13, 2020, which is a continuation of U.S. application Ser. No. 16/913,034, filed Jun. 26, 2020, now U.S. Pat. No. 10,873,826 issued Dec. 22, 2020, which is a continuation of U.S. application Ser. No. 16/724,921, filed Dec. 23, 2019, now U.S. Pat. No. 10,735,886 issued Aug. 4, 2020, which is a continuation of U.S. application Ser. No. 16/341,861, filed Apr. 12, 2019, now U.S. Pat. No. 10,555,107 issued Feb. 4, 2020, which is a national stage of Int. Appl. No. PCT/JP2017/036738, filed Oct. 11, 2017, which claims the benefit of JP Appl. No. 2016-211803, filed Oct. 28, 2016. The disclosure of each of the above-mentioned documents is incorporated herein by reference in its entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 17097829 | Nov 2020 | US |
Child | 17725097 | | US |
Parent | 16913034 | Jun 2020 | US |
Child | 17097829 | | US |
Parent | 16724921 | Dec 2019 | US |
Child | 16913034 | | US |
Parent | 16341861 | Apr 2019 | US |
Child | 16724921 | | US |