In general, spatial sound recording aims at capturing a sound field with multiple microphones such that at the reproduction side, the listener perceives the sound image as it was at the recording location. In the envisioned case, the spatial sound is captured in a single physical location at the recording side (referred to as reference location), whereas at the reproduction side, the spatial sound can be rendered from arbitrary different perspectives relative to the original reference location. The different perspectives include different listening positions (referred to as virtual listening positions) and listening orientations (referred to as virtual listening orientations).
Rendering spatial sound from arbitrary different perspectives with respect to an original recording location enables different applications. For example, in 6 degrees-of-freedom (6DoF) rendering, the listener at the reproduction side can move freely in a virtual space (usually wearing a head-mounted display and headphones) and perceive the audio/video scene from different perspectives. In 3 degrees-of-freedom (3DoF) applications, where e.g. a 360° video together with spatial sound was recorded in a specific location, the video image can be rotated at the reproduction side and the projection of the video can be adjusted (e.g., from a stereographic projection [WolframProj1] towards a Gnomonic projection [WolframProj2], referred to as “little planet” projection). Clearly, when changing the video perspective in 3DoF or 6DoF applications, the reproduced spatial audio perspective should be adjusted accordingly to enable consistent audio/video reproduction.
There exist different state-of-the-art approaches that enable spatial sound recording and reproduction from different perspectives. One way would be to physically record the spatial sound in all possible listening positions and, on the reproduction side, use the recording for spatial sound reproduction that is closest to the virtual listening position.
However, this recording approach is very intrusive and would involve an unfeasibly high measurement effort. To reduce the number of physical measurement positions that may be used while still achieving spatial sound reproduction from arbitrary perspectives, non-linear parametric spatial sound recording and reproduction techniques can be used. An example is the directional audio coding (DirAC) based virtual microphone processing proposed in [VirtualMic]. Here, the spatial sound is recorded with microphone arrays located at only a small number (3-4) of physical locations. Afterwards, sound field parameters such as the direction-of-arrival and diffuseness of the sound can be estimated at each microphone array location and this information can then be used to synthesize the spatial sound at arbitrary spatial positions. While this approach offers high flexibility with a significantly reduced number of measurement locations, it still involves multiple measurement locations. Moreover, the parametric signal processing and violations of the assumed parametric signal model can introduce processing artifacts that might be unpleasant, especially in high-quality sound reproduction applications.
According to an embodiment, an apparatus for processing a sound field representation related to a defined reference point or a defined listening orientation for the sound field representation may have: a sound field processor for processing the sound field representation using a deviation of a target listening position from the defined reference point or of a target listening orientation from the defined listening orientation, to acquire a processed sound field description, wherein the processed sound field description, when rendered, provides an impression of the sound field representation at the target listening position being different from the defined reference point or for the target listening orientation being different from the defined listening orientation, or for processing the sound field representation using a spatial filter to acquire the processed sound field description, wherein the processed sound field description, when rendered, provides an impression of a spatially filtered sound field description, wherein the sound field processor is configured to process the sound field representation so that the deviation or the spatial filter is applied to the sound field representation in relation to a spatial transform domain having associated therewith a forward transform rule and a backward transform rule.
According to another embodiment, a method of processing a sound field representation related to a defined reference point or a defined listening orientation for the sound field representation may have the steps of: detecting a deviation of a target listening position from the defined reference point or of a target listening orientation from the defined listening orientation; and processing the sound field representation using the deviation to acquire a processed sound field description, wherein the processed sound field description, when rendered, provides an impression of the sound field representation at the target listening position being different from the defined reference point or for the target listening orientation being different from the defined listening orientation, or for processing the sound field representation using a spatial filter to acquire the processed sound field description, wherein the processed sound field description, when rendered, provides an impression of a spatially filtered sound field description, wherein the deviation or the spatial filter is applied to the sound field representation in relation to a spatial transform domain having associated therewith a forward transform rule and a backward transform rule.
Another embodiment may have a non-transitory digital storage medium having a computer program stored thereon to perform the method of processing a sound field representation related to a defined reference point or a defined listening orientation for the sound field representation, the method having the steps of: detecting a deviation of a target listening position from the defined reference point or of a target listening orientation from the defined listening orientation; and processing the sound field representation using the deviation to acquire a processed sound field description, wherein the processed sound field description, when rendered, provides an impression of the sound field representation at the target listening position being different from the defined reference point or for the target listening orientation being different from the defined listening orientation, or for processing the sound field representation using a spatial filter to acquire the processed sound field description, wherein the processed sound field description, when rendered, provides an impression of a spatially filtered sound field description, wherein the deviation or the spatial filter is applied to the sound field representation in relation to a spatial transform domain having associated therewith a forward transform rule and a backward transform rule, when said computer program is run by a computer.
In an apparatus or method for processing a sound field representation, a sound field processing takes place using a deviation of a target listening position from a defined reference point or a deviation of a target listening orientation from the defined listening orientation, so that a processed sound field description is obtained, wherein the processed sound field description, when rendered, provides an impression of the sound field representation at the target listening position being different from the defined reference point. Alternatively or additionally, the sound field processing is performed in such a way that the processed sound field description, when rendered, provides an impression of the sound field representation for the target listening orientation being different from the defined listening orientation. Alternatively or additionally, the sound field processing takes place using a spatial filter wherein a processed sound field description is obtained, where the processed sound field description, when rendered, provides an impression of a spatially filtered sound field description. Particularly, the sound field processing is performed in relation to a spatial transform domain. Particularly, the sound field representation comprises a plurality of audio signals in an audio signal domain, where these audio signals can be loudspeaker signals, microphone signals, Ambisonics signals or other multi-audio signal representations such as audio object signals or audio object coded signals. The sound field processor is configured to process the sound field representation so that the deviation between the defined reference point or the defined listening orientation and the target listening position or the target listening orientation is applied in a spatial transform domain having associated therewith a forward transform rule and a backward transform rule. 
Furthermore, the sound field processor is configured to generate the processed sound field description again in the audio signal domain, where the audio signal domain, once again, is a time domain or a time/frequency domain, and the processed sound field description may comprise Ambisonics signals, loudspeaker signals, binaural signals and/or audio object signals or encoded audio object signals as the case may be.
Depending on the implementation, the processing performed by the sound field processor may comprise a forward transform into the spatial transform domain, so that the signals in the spatial transform domain, i.e., the virtual audio signals for virtual speakers at virtual positions, are actually calculated. Depending on the application, these signals are either spatially filtered using a spatial filter in the transform domain and then transformed back, or are transformed back without any spatial filtering, into the audio signal domain using the backward transform rule. Thus, in this implementation, virtual speaker signals are actually calculated at the output of a forward transform processing and the audio signals representing the processed sound field representation are actually calculated as an output of a backward spatial transform using a backward transform rule.
In another implementation, however, the virtual speaker signals are not actually calculated. Instead, only the forward transform rule, an optional spatial filter and a backward transform rule are calculated and combined to obtain a transformation definition, and this transformation definition is applied, advantageously in the form of a matrix, to the input sound field representation to obtain the processed sound field representation, i.e., the individual audio signals in the audio signal domain. Hence, such a processing using a forward transform rule, an optional spatial filter and a backward transform rule results in the same processed sound field representation as if the virtual speaker signals were actually calculated. However, in such a usage of a transformation definition, the virtual speaker signals do not actually have to be calculated, but only a combination of the individual transform/filtering rules such as a matrix generated by combining the individual rules is calculated and is applied to the audio signals in the audio signal domain.
Furthermore, another embodiment relates to the usage of a memory having precomputed transformation definitions for different target listening positions and/or target orientations, for example for a discrete grid of positions and orientations. Depending on the actual target position or target orientation, the best matching pre-calculated and stored transformation definition has to be identified in the memory, retrieved from the memory and applied to the audio signals in the audio signal domain.
The usage of such pre-calculated rules or the usage of a transformation definition—be it the full transformation definition or only a partial transformation definition—is useful, since the forward spatial transform rule, the spatial filtering and the backward spatial transform rule are all linear operations and can be combined with each other and applied in a “single-shot” operation without an explicit calculation of the virtual speaker signals.
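The equivalence of the explicit and the "single-shot" processing can be illustrated with a minimal numerical sketch. The matrices F and B and the gain vector w below are random placeholders standing in for a forward transform rule, a backward transform rule and a spatial filter; their sizes and all variable names are assumptions made for illustration only and are not taken from the embodiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): M input signals, K virtual speakers, N output signals.
M, K, N = 4, 12, 4
F = rng.standard_normal((K, M))    # placeholder forward transform rule
w = rng.uniform(0.0, 1.0, K)       # placeholder spatial filter, one gain per virtual speaker
B = rng.standard_normal((N, K))    # placeholder backward transform rule

x = rng.standard_normal((M, 256))  # M audio signals, 256 samples or time/frequency bins

# Explicit path: the virtual speaker signals are actually calculated.
virtual = F @ x
y_explicit = B @ (w[:, None] * virtual)

# Single-shot path: the three linear rules are combined once into one matrix.
T = B @ np.diag(w) @ F             # N x M transformation definition
y_single = T @ x

# Both paths give the same processed signals (up to rounding).
assert np.allclose(y_explicit, y_single)
```

Because T has only N x M entries, applying it avoids handling the K virtual speaker signals at run time, which is where the saving comes from when K is much larger than M and N.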
Depending on the implementation, a partial transformation definition obtained by combining the forward transform rule and the spatial filtering on the one hand or obtained by combining the spatial filtering and the backward transform rule can be applied so that only either the forward transform or the backward transform is explicitly calculated using virtual speaker signals. Thus, the spatial filtering can be either combined with the forward transform rule or the backward transform rule and, therefore, processing operations can be saved as the case may be.
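The two partial variants can be sketched in the same way. Again, F, B and w are random placeholders with assumed sizes; the sketch only shows that folding the spatial filter into either the forward or the backward rule leaves the result unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, N = 4, 12, 4                 # illustrative sizes (assumptions)
F = rng.standard_normal((K, M))    # placeholder forward transform rule
w = rng.uniform(0.0, 1.0, K)       # placeholder spatial filter gains
B = rng.standard_normal((N, K))    # placeholder backward transform rule
x = rng.standard_normal((M, 64))   # input audio signals

# Variant 1: filter folded into the forward rule, backward transform applied explicitly.
F_filtered = w[:, None] * F        # same as diag(w) @ F
y1 = B @ (F_filtered @ x)

# Variant 2: filter folded into the backward rule, forward transform applied explicitly.
B_filtered = B * w[None, :]        # same as B @ diag(w)
y2 = B_filtered @ (F @ x)

# Reference: all three rules applied as one full transformation definition.
y_ref = (B @ np.diag(w) @ F) @ x
assert np.allclose(y1, y_ref) and np.allclose(y2, y_ref)
```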
Embodiments are advantageous in that a sound scene modification is obtained related to a virtual loudspeaker domain for a consistent spatial sound reproduction from different perspectives.
Embodiments describe a practical way where the spatial sound is recorded in or represented with respect to a single reference location while still allowing the audio perspective to be changed at will at the reproduction side. The change in the audio perspective can be e.g. rotation or translation, but also effects such as an acoustical zoom including spatial filtering. The spatial sound at the recording side can be recorded using for example a microphone array, where the array position represents the reference position (it is referred to as a single recording location even though the microphone array may consist of multiple microphones located at slightly different positions, since the extent of the microphone array is negligible compared to the size of the recording side). The spatial sound at the recording location also can be represented in terms of a (higher-order) Ambisonics signal. Moreover, the embodiments can be generalized to use loudspeaker signals as input, whereas the sweet spot of the loudspeaker setup represents the single reference location. In order to change the perspective of the recorded spatial audio relative to the reference location, the recorded spatial sound is transformed into a virtual loudspeaker domain. By changing the positions of the virtual loudspeakers and filtering the virtual loudspeaker signals depending on the virtual listening position and orientation relative to the reference position, the perspective of the spatial sound can be adjusted as desired. In contrast to the state-of-the-art parametric signal processing [VirtualMic], the presented approach is completely linear, avoiding non-linear processing artifacts. The authors in [AmbiTrans] describe a related approach where a spatial sound scene is modified in the virtual loudspeaker domain, e.g., to achieve rotation, warping, and directional loudness modification.
However, this approach does not reveal how the spatial sound scene can be modified to achieve a consistent audio rendering at an arbitrary virtual listening position relative to the reference location. Moreover, the approach in [AmbiTrans] describes the processing for Ambisonics input only, whereas embodiments relate to Ambisonics input, microphone input, and loudspeaker input.
Further implementations relate to a processing where a spatial transformation of the audio perspective is performed and optionally a corresponding spatial filtering in order to mimic different spatial transformations of corresponding video image such as a spherical video. Input and output of the processing are, in an embodiment, first-order Ambisonics (FOA) or higher-order Ambisonics (HOA) signals. As stated, the entire processing can be implemented as a single matrix multiplication.
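A minimal sketch of such a single matrix for first-order Ambisonics input and output, restricted to a rotation of the listening orientation about the vertical axis, could look as follows. The channel order (W, Y, Z, X), the SN3D-style scaling, the six virtual loudspeaker directions and the use of a pseudo-inverse as forward rule are all assumptions chosen for illustration; the embodiments do not prescribe them.

```python
import numpy as np

def foa_basis(az, el):
    """First-order real spherical harmonics; assumed order (W, Y, Z, X), SN3D-style scaling."""
    az, el = np.asarray(az, float), np.asarray(el, float)
    return np.stack([np.ones_like(az),
                     np.sin(az) * np.cos(el),
                     np.sin(el),
                     np.cos(az) * np.cos(el)], axis=-1)

# Six virtual loudspeakers on the vertices of an octahedron (assumed layout).
az = np.radians([0.0, 90.0, 180.0, 270.0, 0.0, 0.0])
el = np.radians([0.0, 0.0, 0.0, 0.0, 90.0, -90.0])
yaw = np.radians(30.0)             # listener turns 30 degrees to the left

Y_ref = foa_basis(az, el)          # K x 4, directions relative to the reference orientation
Y_tgt = foa_basis(az - yaw, el)    # K x 4, same speakers as seen from the target orientation

F = np.linalg.pinv(Y_ref.T)        # forward rule (K x 4), assumed pseudo-inverse decoder
B = Y_tgt.T                        # backward rule (4 x K), re-encoding at modified directions
T = B @ F                          # complete processing as one 4 x 4 matrix

# A plane wave from 40 degrees azimuth should appear at 10 degrees after the rotation.
a_in = foa_basis(np.radians(40.0), np.radians(10.0))
a_out = T @ a_in
expected = foa_basis(np.radians(40.0) - yaw, np.radians(10.0))
assert np.allclose(a_out, expected)
```

In this sketch the virtual loudspeaker signals never appear as variables: only the 4 x 4 matrix T is applied to the Ambisonics input.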
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
Particularly, the sound field processor is configured for processing the sound field representation using a deviation of a target listening position from the defined reference point or using a deviation of a target listening orientation from the defined listening orientation. The deviation is obtained by a detector 1100. Alternatively or additionally, the detector 1100 is implemented to detect the target listening position or the target listening orientation without actually calculating the deviation. The target listening position and/or the target listening orientation or, alternatively, the deviation between the defined reference point and the target listening position or the deviation between the defined listening orientation and the target listening orientation are forwarded to the sound field processor 1000. The sound field processor processes the sound field representation using the deviation so that a processed sound field description is obtained, wherein the processed sound field description, when rendered, provides an impression of the sound field representation at the target listening position being different from the defined reference point or for the target listening orientation being different from the defined listening orientation. Alternatively or additionally, the sound field processor is configured for processing the sound field representation using a spatial filter, so that a processed sound field description is obtained, wherein the processed sound field description, when rendered, provides an impression of a spatially filtered sound field description, i.e., a sound field description that has been filtered by the spatial filter.
Hence, irrespective of whether a spatial filtering is performed or not, the sound field processor 1000 is configured to process the sound field representation so that the deviation or the spatial filter 1030 is applied in a spatial transform domain having associated therewith a forward transform rule 1021 and a backward transform rule 1051. The forward and backward transform rules are derived using a set of virtual speakers at virtual positions, but it is not necessary to explicitly calculate the signals for the virtual speakers.
Advantageously, the sound field representation comprises a number of sound field components which is greater than or equal to two or three. Furthermore, and advantageously, the detector 1100 is provided as an explicit feature of the apparatus for processing. In another embodiment, however, the sound field processor 1000 has an input for the target listening position or target listening orientation or a corresponding deviation. Furthermore, the sound field processor 1000 outputs a processed sound field description 1201 that can be forwarded to an output interface 1200 and then output for a transmission or storage of the processed sound field description 1201. One kind of transmission is, for example, an actual rendering of the processed sound field description via (real) loudspeakers or via a headphone in relation to the binaural output. Alternatively, as, for example, in the case of an Ambisonics output, the processed sound field description 1201 output by the output interface 1200 can be forwarded/input into an Ambisonics sound processor.
Furthermore, the sound field processor 1000 is configured to process the sound field representation so that the deviation or the spatial filter is applied in a spatial transform domain having associated therewith a forward transform rule 1021 as obtained by a forward transform block 1020, and having associated a backward transform rule 1051 obtained by a backward transform block 1050. Furthermore, the sound field processor 1000 is configured to generate the processed sound field description in the audio signal domain. Thus, advantageously, the output of block 1050, i.e., the signal on line 1201 is in the same domain as the input 1001 into the forward transform block 1020.
Depending on whether an explicit calculation of virtual speaker signals is performed, the forward transform block 1020 actually performs the forward transform and the backward transform block 1050 actually performs the backward transform. In the other implementation, where only a transform domain related processing is performed without an explicit calculation of the virtual speaker signals, the forward transform block 1020 outputs the forward transform rule 1021 and the backward transform block 1050 outputs the backward transform rule 1051 for the purpose of sound field processing. Furthermore, with respect to the spatial filter implementation, the spatial filter is either applied as a spatial filter block 1030 or the spatial filter is reflected by applying a spatial filter rule 1031. Both implementations, i.e., with or without explicit calculation of the virtual speaker signals, are equivalent to each other, since the output of the sound field processing, i.e., signal 1201, when rendered, provides an impression of the sound field representation at the target listening position being different from the defined reference point or for the target listening orientation being different from the defined listening orientation. To this end, the spatial filter 1030 and the backward transform block 1050 may receive the target position or/and the target orientation.
Depending on the given virtual speakers and depending on the reference position and/or reference orientation, block 1040a generates azimuth/elevation angles for each virtual speaker related to the reference position or/and the reference orientation. This information is advantageously input into the forward transform block 1020 so that the virtual speaker signals for the virtual speakers defined at the input into block 1040a can be explicitly (or implicitly) calculated.
Depending on the implementation, other definitions for the virtual speakers different from azimuth/elevation angles can be given such as Cartesian coordinates or a Cartesian direction information such as vectors pointing into the orientation that would correspond to the orientation of a speaker directed to the corresponding original or predefined reference position on the one hand or, with respect to the backward transform, directed to the target orientation.
Block 1040b receives, as an input, the target position or the target orientation or alternatively or additionally, the deviation for the position/orientation between the defined reference point or the defined listening orientation from the target listening position or the target listening orientation. Block 1040b then calculates, from the data generated by block 1040a and the data input into block 1040b the azimuth/elevation angles for each virtual speaker related to the target position or/and the target orientation and, this information is input into the backward transform definition 1050. Thus, block 1050 can either actually apply the backward transform rule with the modified virtual speaker positions/orientations or can output the backward transform rule 1051 as indicated in
In another implementation illustrated in
In a further implementation illustrated in
Depending on the implementation, a spatial filtering using the spatial filter 1031 is performed or not. In case of using a spatial filter, and in case of not performing any position/orientation modification, the forward transform 1020 and the backward transform 1050 rely on the same virtual speaker positions. Nevertheless, the spatial filter 1031 has been applied in the spatial transform domain irrespective of whether the virtual speaker signals are explicitly calculated or not.
Furthermore, in case of not performing any spatial filtering, the modification of the listening position or the listening orientation to the target listening position and the target orientation is performed and, therefore, the virtual speaker position/orientations will be different in the inverse/backward transform on the one hand and the forward transform on the other hand.
The detector 1100 is configured to detect the target position and/or target orientation and forwards this information to a processor 1081 for finding the closest transformation definition or forward/backward/filtering rule within the memory 1080. To this end, the processor 1081 has knowledge of the discrete grid of positions and orientations, at which the corresponding transformation definitions or pre-calculated forward/backward/filtering rules are stored. As soon as the processor 1081 has identified the closest grid point matching with the target position or/and target orientation as close as possible, this information is forwarded to a memory retriever 1082 which is configured to retrieve the corresponding full or partial transformation definition or forward/backward/filtering rule for the detected target position and/or orientation. In other embodiments, it is not necessary to use the closest grid point from a mathematical point of view. Instead, it may be useful to determine a grid point being not the closest one, but a grid point being related to the target position or orientation. An example may be that the grid point being, from a mathematical point of view not the closest but the second or third closest or fourth closest is better than the closest one. A reason is that the optimization has more than one dimension and it might be better to allow a greater deviation for the azimuth but a smaller deviation from the elevation. This information is input into a corresponding (matrix) processor 1090 that receives, as an input, the sound field representation and that outputs the processed sound field representation 1201. The pre-calculated transformation definition may be a transform matrix having a dimension of N rows and M columns, wherein N and M are integers greater than 2, and the sound field representation has M audio signals, and the processed sound field representation 1201 has N audio signals. 
In a mathematically transposed formulation, the situation can be vice versa, i.e. the pre-calculated transformation definition may be a transform matrix having a dimension of M rows and N columns, or the sound field representation has N audio signals, and the processed sound field representation 1201 has M audio signals.
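The lookup of a stored transformation definition can be sketched as follows. The stored orientations, the weighted squared-distance cost and the function names are hypothetical choices for illustration, and the stored N x M matrices are random placeholders. The weights show how a larger azimuth deviation can be tolerated in favor of a smaller elevation deviation, so that the selected grid point is not necessarily the closest one in the unweighted sense.

```python
import numpy as np

def wrap_deg(a):
    """Wrap angle differences in degrees to the interval [-180, 180)."""
    return (np.asarray(a, float) + 180.0) % 360.0 - 180.0

def find_grid_index(grid, target, weights=(1.0, 1.0)):
    """Index of the stored (azimuth, elevation) point with the lowest weighted cost."""
    d_az = wrap_deg(grid[:, 0] - target[0])
    d_el = grid[:, 1] - target[1]
    cost = weights[0] * d_az**2 + weights[1] * d_el**2
    return int(np.argmin(cost))

# Hypothetical stored orientations in degrees (azimuth, elevation).
grid = np.array([[0.0, 0.0], [10.0, 20.0], [25.0, 2.0], [340.0, -10.0]])

# Placeholder memory: one N x M transform matrix per stored orientation.
N, M = 4, 4
rng = np.random.default_rng(2)
memory = rng.standard_normal((len(grid), N, M))

target = (12.0, 5.0)
idx_plain = find_grid_index(grid, target)                         # unweighted choice
idx_weighted = find_grid_index(grid, target, weights=(0.25, 1.0)) # elevation matters more

# Retrieve the selected matrix and apply it to M input signals.
x = rng.standard_normal((M, 128))
y = memory[idx_weighted] @ x       # N processed signals
```

For the target above, the unweighted cost selects the first stored point, whereas the weighted cost selects the third one, whose elevation deviation is smaller.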
In the following sections, embodiments are described and it is explained how different spatial sound representations can be transformed into the virtual loudspeaker domain and then modified to achieve a consistent spatial sound reproduction at an arbitrary virtual listening position (including arbitrary listening orientations), which is defined relative to the original reference location.
The input to embodiments are multiple (two or more) audio input signals in the time domain or time-frequency domain. Time domain input signals optionally can be transformed into the time-frequency domain using an analysis filterbank (1010). The input signals can be, e.g., loudspeaker signals, microphone signals, audio object signals, or Ambisonics components. The audio input signals represent the spatial sound field related to a defined reference position and orientation. The reference position and orientation can be, e.g., the sweet spot facing 0° azimuth and elevation (for loudspeaker input signals), the microphone array position and orientation (for microphone input signals), or the center of the coordinate system (for Ambisonics input signals).
The input signals are transformed into the virtual loudspeaker domain using a first or forward spatial transform (1020). The first spatial transform (1020) can be, e.g., beamforming (when using microphone input signals), loudspeaker signal up-mixing (when using loudspeaker input signals), or a plane wave decomposition (when using Ambisonics input signals). For audio object input signal, the first spatial transform can be an audio object renderer (e.g., a VBAP [Vbap] renderer). The first spatial transform (1020) is computed based on a set of virtual loudspeaker positions. Normally, the virtual loudspeaker positions can be defined uniformly distributed over the sphere and centered around the reference position.
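For Ambisonics input, one possible forward transform can be sketched as follows. The sketch assumes first-order input with channel order (W, Y, Z, X) and SN3D-style scaling, six virtual loudspeakers on the vertices of an octahedron, and a pseudo-inverse as the decomposition rule. These are illustrative assumptions and not the specific plane wave decomposition of the embodiments.

```python
import numpy as np

def foa_basis(az, el):
    """First-order real spherical harmonics; assumed order (W, Y, Z, X), SN3D-style scaling."""
    az, el = np.asarray(az, float), np.asarray(el, float)
    return np.stack([np.ones_like(az),
                     np.sin(az) * np.cos(el),
                     np.sin(el),
                     np.cos(az) * np.cos(el)], axis=-1)

# Virtual loudspeaker directions around the reference position (assumed layout).
az = np.radians([0.0, 90.0, 180.0, 270.0, 0.0, 0.0])
el = np.radians([0.0, 0.0, 0.0, 0.0, 90.0, -90.0])
Y = foa_basis(az, el)              # K x 4 matrix of basis functions at the speaker directions

# Forward rule: maps 4 Ambisonics components to K virtual loudspeaker signals.
F = np.linalg.pinv(Y.T)            # K x 4

# Plane wave arriving from the front (azimuth 0, elevation 0).
a = foa_basis(0.0, 0.0)
s = F @ a                          # virtual loudspeaker signals for one sample

# Re-encoding the virtual loudspeaker signals restores the input components.
assert np.allclose(Y.T @ s, a)
```

In this sketch the virtual loudspeaker at the front receives the largest signal, while the loudspeaker at the back receives a negative contribution, as is typical for a first-order decomposition.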
Optionally, the virtual loudspeaker signals can be filtered using spatial filtering (1030). The spatial filtering (1030) is used to filter the sound field representation in the virtual loudspeaker domain depending on the desired listening position or orientation. This can be used, e.g., to increase the loudness when the listening position is getting closer to the sound sources. The same is true for a specific spatial region in which e.g. such a sound object may be located.
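A direction-dependent spatial filter can be sketched as one gain per virtual loudspeaker. The raised-cosine window shape, the width of 90 degrees, the floor gain of 0.1 and the function name spatial_window are hypothetical choices for illustration only.

```python
import numpy as np

def spatial_window(speaker_dirs, look_dir, width_deg=90.0, floor=0.1):
    """Gain 1 in the look direction, tapering to `floor` at `width_deg` and beyond."""
    look = np.asarray(look_dir, float)
    look = look / np.linalg.norm(look)
    cos_angle = np.clip(speaker_dirs @ look, -1.0, 1.0)
    angle = np.degrees(np.arccos(cos_angle))            # angle to the look direction
    taper = 0.5 * (1.0 + np.cos(np.pi * np.minimum(angle / width_deg, 1.0)))
    return floor + (1.0 - floor) * taper

# Unit vectors of six virtual loudspeakers: +x, +y, -x, -y, +z, -z (assumed layout).
dirs = np.array([[1, 0, 0], [0, 1, 0], [-1, 0, 0],
                 [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)

gains = spatial_window(dirs, look_dir=[1.0, 0.0, 0.0])

# Apply the filter to placeholder virtual loudspeaker signals (K x samples).
rng = np.random.default_rng(3)
virtual = rng.standard_normal((len(dirs), 128))
filtered = gains[:, None] * virtual
```

With the look direction along +x, only the first virtual loudspeaker keeps its full level, and the others are attenuated to the floor gain.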
The virtual loudspeaker positions are modified in the position modification block (1040) depending on the desired listening position and orientation. Based on the modified virtual loudspeaker positions, the (filtered) virtual loudspeaker signals are transformed back from the virtual loudspeaker domain using a second or backward spatial transform (1050) to obtain two or more desired output audio signals. The second spatial transform (1050) can be, e.g., a spherical harmonic decomposition (when the output signals should be obtained in the Ambisonics domain), a computation of microphone signals (when the output signals should be obtained in the microphone signal domain), or a computation of loudspeaker signals (when the output signals should be obtained in the loudspeaker domain). The second spatial transform (1050) is independent of the first spatial transform (1020). The output signals in the time-frequency domain optionally can be transformed into the time domain using a synthesis filterbank (1060).
Due to the position modification (1040) of the virtual loudspeaker positions, which are then used in the second spatial transform (1050), the output signals represent the spatial sound at the desired listening position with the desired look direction, which may be different from the reference position and orientation.
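The position modification can be sketched geometrically as follows. The sketch assumes that the virtual loudspeakers sit on a sphere of finite radius around the reference position and that the orientation change is a rotation about the vertical axis. The radius, the coordinate convention (x to the front, y to the left, z up) and the function name are assumptions for illustration.

```python
import numpy as np

def modified_speaker_angles(dirs, radius, position, yaw_deg=0.0):
    """Azimuth/elevation (degrees) and distance of each virtual speaker seen from the target."""
    rel = radius * dirs - np.asarray(position, float)   # speakers relative to the listener
    yaw = np.radians(yaw_deg)
    c, s = np.cos(yaw), np.sin(yaw)
    # Inverse of the listener's yaw rotation: world coordinates -> listener coordinates.
    R_inv = np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])
    rel = rel @ R_inv.T
    dist = np.linalg.norm(rel, axis=1)
    az = np.degrees(np.arctan2(rel[:, 1], rel[:, 0]))
    el = np.degrees(np.arcsin(np.clip(rel[:, 2] / dist, -1.0, 1.0)))
    return az, el, dist

# Unit vectors of six virtual loudspeakers: +x, +y, -x, -y, +z, -z (assumed layout).
dirs = np.array([[1, 0, 0], [0, 1, 0], [-1, 0, 0],
                 [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)

# Translation: listener moves 1 m to the front, speakers assumed at 2 m radius.
az_t, el_t, dist_t = modified_speaker_angles(dirs, radius=2.0, position=[1.0, 0.0, 0.0])

# Rotation only: listener turns 90 degrees to the left at the reference position.
az_r, el_r, dist_r = modified_speaker_angles(dirs, radius=2.0,
                                             position=[0.0, 0.0, 0.0], yaw_deg=90.0)
```

After the translation, the front loudspeaker is closer (1 m instead of 2 m) and the left loudspeaker appears further towards the back; after the rotation, the left loudspeaker appears in front. The resulting angles would then be used to build the backward transform, and the distances could, for example, drive a distance-dependent gain in the spatial filtering.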
In some applications, embodiments are used together with a video application for consistent audio/video reproduction, e.g., when rendering the video of a 360° camera from different, user-defined perspectives. In this case, the reference position and orientation usually correspond to the initial position and orientation of the 360° video camera. The desired listening position and orientation, which is used to compute the modified virtual loudspeaker positions in block (1040), then corresponds to the user-defined viewing position and orientation within the 360° video. By doing so, the output signals computed in block (1050) represent the spatial sound from the perspective of the user-defined position and orientation within the 360° video. Clearly, the same principle may apply to applications that do not cover the full (360°) field of view, but only parts of it, e.g., applications that allow a user-defined viewing position and orientation within a 180° field of view.
In an embodiment, the sound field representation is associated with a three dimensional video or spherical video and the defined reference point is a center of the three dimensional video or the spherical video. The detector 1100 is configured to detect a user input indicating an actual viewing point being different from the center, the actual viewing point being identical to the target listening position, and the detector is configured to derive the detected deviation from the user input, or the detector 1100 is configured to detect a user input indicating an actual viewing orientation being different from the defined listening orientation directed to the center, the actual viewing orientation being identical to the target listening orientation, and the detector is configured to derive the detected deviation from the user input. The spherical video may be a 360 degrees video, but other (partial) spherical videos can be used as well such as spherical videos covering 180 degrees or more.
In a further embodiment, the sound field processor is configured to process the sound field representation so that the processed sound field representation represents a standard or little planet projection or a transition between the standard or the little planet projection of at least one sound object included in the sound field description with respect to a display area for the three dimensional video or the spherical video, the display area being defined by the user input and a defined viewing direction. Such a transition is, e.g., when the magnitude of h in
Embodiments can be applied to achieve an acoustic zoom, which mimics a visual zoom. In a visual zoom, when zooming in on a specific region, the region of interest (in the image center) visually appears closer whereas undesired video objects at the image side move outwards and eventually disappear from the image. Acoustically, a consistent audio rendering would mean that when zooming in, audio sources in zoom direction become louder whereas audio sources at the side move outwards and eventually become silent. Clearly, such an effect corresponds to moving the virtual listening position closer to the virtual loudspeaker that is located in zoom direction (see Embodiment 3 for more details). Moreover, the spatial window in the spatial filtering (1030) can be defined such that the signals of the virtual loudspeakers are attenuated when the corresponding virtual loudspeakers are outside the region of interest according to the zoomed video image (see Embodiment 2 for more details).
In many applications, the input signals used in block (1020) and the output signals computed in block (1050) are represented in the same spatial domain with the same number of signals. This means, for example, if Ambisonics components of a specific Ambisonics order are used as input signals, the output signals correspond to Ambisonics components of the same order. Nevertheless, it is possible that the output signals computed in block (1050) can be represented in a different spatial domain and with a different number of signals compared to the input signals. For example, it is possible to use Ambisonics components of a specific order as input signals while computing the output signals in the loudspeaker domain with a specific number of channels.
In the following, specific embodiments of the processing blocks in
In this embodiment, the input to the first spatial transform (1020) is an L-th order Ambisonics signal in the time-frequency domain. An Ambisonics signal represents a multi-channel signal where each channel (referred to as Ambisonics component or coefficient) is equivalent to the coefficient of a so-called spatial basis function. There exist different types of spatial basis functions, for example spherical harmonics [FourierAcoust] or cylindrical harmonics [FourierAcoust]. Cylindrical harmonics can be used when describing the sound field in the 2D space (for example for 2D sound reproduction) whereas spherical harmonics can be used to describe the sound field in the 2D and 3D space (for example for 2D and 3D sound reproduction). Without loss of generality, the latter case with spherical harmonics is considered in the following. In this case, the Ambisonics signal consists of (L+1)^2 separate signals (components) and is denoted by the vector
where k and n are the frequency index and time index, respectively, 0≤l≤L is the level (order), and −l≤m≤l is the mode of the Ambisonics coefficient (component) Al,m(k,n). First-order Ambisonics signals (L=1) can be measured e.g. using a SoundField microphone. Higher-order Ambisonics signals can be measured e.g. using an EigenMike.
The recording location represents the center of the coordinate system and reference position, respectively.
To convert the Ambisonics signal a(k,n) into the virtual loudspeaker domain, it is advantageous to apply a state-of-the-art plane wave decomposition (PWD) 1022, i.e., inverse spherical harmonic decomposition, on a(k,n), which can be computed as [FourierAcoust]
The term Yl,m(φj, ϑj) is the spherical harmonic [FourierAcoust] of order l and mode m evaluated at azimuth angle φj and elevation angle ϑj. The angles (φj, ϑj) represent the position of the j-th virtual loudspeaker. The signal S(φj, ϑj) can be interpreted as the signal of the j-th virtual loudspeaker.
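The plane wave decomposition described above can be sketched as a matrix product between sampled spherical harmonics and the Ambisonics vector. The sketch below is restricted to first order (L=1) and, as an assumption, uses real-valued orthonormal spherical harmonics instead of the complex ones implied by the text; the function name real_sh_first_order and the six-loudspeaker layout are hypothetical.

```python
import numpy as np

def real_sh_first_order(azimuth, elevation):
    """Real orthonormal spherical harmonics up to order 1 (assumed convention).

    Columns are ordered (l, m) = (0, 0), (1, -1), (1, 0), (1, 1).
    Returns an array of shape (J, 4) for J directions given in radians.
    """
    x = np.cos(azimuth) * np.cos(elevation)
    y = np.sin(azimuth) * np.cos(elevation)
    z = np.sin(elevation)
    c0 = np.sqrt(1.0 / (4.0 * np.pi))
    c1 = np.sqrt(3.0 / (4.0 * np.pi))
    return np.stack([c0 * np.ones_like(x), c1 * y, c1 * z, c1 * x], axis=-1)

# Hypothetical layout: six virtual loudspeakers on the coordinate axes.
phi = np.radians([0.0, 90.0, 180.0, 270.0, 0.0, 0.0])
theta = np.radians([0.0, 0.0, 0.0, 0.0, 90.0, -90.0])
C = real_sh_first_order(phi, theta)          # shape J x (L+1)^2 = 6 x 4

# Ambisonics vector of a unit plane wave arriving from azimuth 0, elevation 0.
a = real_sh_first_order(np.array([0.0]), np.array([0.0]))[0]

# Virtual loudspeaker signals for one time-frequency tile (k, n).
s = C @ a
```

The virtual loudspeaker pointing at the source receives the largest signal, and the opposite one a negative value, which is the expected first-order beam pattern.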
An example of spherical harmonics is shown in
It is advantageous to define the directions (φj, ϑj) of the virtual loudspeakers to be uniformly distributed on the sphere. Depending on the application, however, the directions may be chosen differently. The total number of virtual loudspeaker positions is denoted by J. It should be noted that a higher number J leads to a higher accuracy of the spatial processing at the cost of higher computational complexity. In practice, a reasonable number of virtual loudspeakers is given e.g. by J=250.
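One way to obtain approximately uniform directions for J=250 virtual loudspeakers is a Fibonacci (golden-angle) spiral. This is only an illustrative choice; the text does not prescribe a particular sampling scheme, and the function names below are hypothetical.

```python
import numpy as np

def fibonacci_sphere_directions(num_points):
    """Return near-uniform azimuth and elevation angles (radians) on the sphere."""
    idx = np.arange(num_points)
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    # Evenly spaced heights give bands of equal area on the sphere.
    z = 1.0 - (2.0 * idx + 1.0) / num_points
    azimuth = (golden_angle * idx) % (2.0 * np.pi)
    elevation = np.arcsin(z)
    return azimuth, elevation

def unit_vectors(azimuth, elevation):
    """Direction vectors [cos(az)cos(el), sin(az)cos(el), sin(el)]."""
    return np.stack([np.cos(azimuth) * np.cos(elevation),
                     np.sin(azimuth) * np.cos(elevation),
                     np.sin(elevation)], axis=-1)

J = 250
phi, theta = fibonacci_sphere_directions(J)
n = unit_vectors(phi, theta)   # shape (J, 3), one row per virtual loudspeaker
```

A quick sanity check for such a grid is that the mean of all direction vectors is close to zero, i.e., no direction is favoured.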
The J virtual loudspeaker signals are collected in the vector defined by
which represents the audio input signals in the virtual loudspeaker domain.
Clearly, the J virtual loudspeaker signals s(k,n) in this embodiment can be computed by applying a single matrix multiplication to the audio input signals, i.e.,
s(k,n) = C(k, φ1 . . . J, ϑ1 . . . J) a(k,n),
where the J×(L+1)^2 matrix C(k, φ1 . . . J, ϑ1 . . . J) contains the spherical harmonics for the different levels (orders), modes, and virtual loudspeaker positions, i.e.,
In this embodiment, the input to the first spatial transform (1020) consists of M loudspeaker signals. The corresponding loudspeaker setup can be arbitrary, e.g., a common 5.1, 7.1, 11.1, or 22.2 loudspeaker setup. The sweet spot of the loudspeaker setup represents the reference position. The m-th loudspeaker position (m≤M) is represented by the azimuth angle φ_m^in and elevation angle ϑ_m^in.
In this embodiment, the M input loudspeaker signals can be converted into J virtual loudspeaker signals where the virtual loudspeakers are located at the angles (φj, ϑj). If the number of loudspeakers M is smaller than the number of virtual loudspeakers J, this represents a loudspeaker up-mix problem. If the number of loudspeakers M exceeds the number of virtual loudspeakers J, it represents a down-mix problem 1023. In general, the loudspeaker format conversion can be achieved e.g. by using a state-of-the-art static (signal-independent) loudspeaker format conversion algorithm, such as the virtual or passive up-mix explained in [FormatConv]. In this approach, the virtual loudspeaker signals are computed as
where the vector
contains the M input loudspeaker signals in the time-frequency domain and k and n are the frequency index and time index, respectively. Moreover,
are the J virtual loudspeaker signals. The matrix C is the static format conversion matrix which can be computed as explained in [FormatConv] by using for example the VBAP panning scheme [Vbap]. The format conversion matrix depends on the M positions of the input loudspeakers and the J positions of the virtual loudspeakers.
Advantageously, the angles (φj, ϑj) of the virtual loudspeakers are uniformly distributed on the sphere. In practice, the number of virtual loudspeakers J can be chosen arbitrarily whereas a higher number leads to a higher accuracy of the spatial processing at the cost of higher computational complexity. In practice, a reasonable number of virtual loudspeakers is given e.g. by J=250.
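A static conversion matrix of this kind can be sketched with a simplified, horizontal-only (2D) pairwise amplitude panning in the spirit of VBAP. This is not the procedure of [FormatConv]; it is a minimal illustration that assumes all loudspeakers lie in the horizontal plane and that neighbouring virtual loudspeakers are less than 180° apart. The function name vbap_2d_matrix and both layouts are hypothetical.

```python
import numpy as np

def vbap_2d_matrix(src_az, spk_az):
    """Return a (J, M) gain matrix that pans M sources onto J ring loudspeakers."""
    src_az = np.asarray(src_az, dtype=float)
    spk_az = np.asarray(spk_az, dtype=float)
    order = np.argsort(np.mod(spk_az, 2.0 * np.pi))   # loudspeakers sorted around the ring
    J, M = len(spk_az), len(src_az)
    C = np.zeros((J, M))
    for m, az in enumerate(src_az):
        p = np.array([np.cos(az), np.sin(az)])
        for i in range(J):
            i0, i1 = order[i], order[(i + 1) % J]     # adjacent loudspeaker pair
            base = np.array([[np.cos(spk_az[i0]), np.cos(spk_az[i1])],
                             [np.sin(spk_az[i0]), np.sin(spk_az[i1])]])
            if abs(np.linalg.det(base)) < 1e-9:
                continue                              # skip degenerate pairs
            g = np.linalg.solve(base, p)
            if np.all(g >= -1e-9):                    # source lies between the pair
                g = np.clip(g, 0.0, None)
                g /= np.linalg.norm(g)                # unit-energy normalization
                C[i0, m], C[i1, m] = g
                break
    return C

# Hypothetical 5.0 input layout and a coarse ring of 8 virtual loudspeakers.
input_az = np.radians([0.0, 30.0, -30.0, 110.0, -110.0])
virtual_az = np.radians(np.arange(0.0, 360.0, 45.0))
C = vbap_2d_matrix(input_az, virtual_az)              # shape J x M = 8 x 5

a = np.ones(5, dtype=complex)                         # one time-frequency tile of input signals
s = C @ a                                             # virtual loudspeaker signals
```

Because the matrix depends only on the two layouts, it can be computed once and reused for all time-frequency tiles.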
In this embodiment, the input to the first spatial transform (1020) consists of the signals of a microphone array with M microphones. The microphones can have different directivities such as omnidirectional, cardioid, or dipole characteristics. The microphones can be arranged in different configurations, such as coincident microphone arrays (when using directional microphones), linear microphone arrays, circular microphone arrays, non-uniform planar arrays, or spherical microphone arrays. In many applications, planar or spherical microphone arrays may be used. A typical microphone array in practice is given for example by a circular microphone array with M=8 omnidirectional microphones with an array radius of 3 cm.
The M microphones are located in the positions d1 . . . M. The array center represents the reference position. The M microphone signals in the time-frequency domain are given by
where k and n are the frequency index and time index, respectively, and A1 . . . M(k,n) are the signals of the M microphones located at d1 . . . M.
To compute the virtual loudspeaker signals, it is advantageous to apply beamforming 1024 to the input signals a(k,n) and steer the beamformers towards the positions of the virtual loudspeakers. In general, the beamforming is computed as
Here, bj(k, n) are the beamformer weights to compute the signal of the j-th virtual loudspeaker, which is denoted as S(φj, ϑj). In general, the beamformer weights can be time and frequency-dependent. As in the previous embodiments, the angles (φj, ϑj) represent the position of the j-th virtual loudspeaker. Advantageously, the directions (φj, ϑj) are uniformly distributed on the sphere. The total number of virtual loudspeaker positions is denoted by J. In practice, this number can be chosen arbitrarily whereas a higher number leads to a higher accuracy of the spatial processing at the cost of higher computational complexity. In practice, a reasonable number of virtual loudspeakers is given e.g. by J=250.
An example of the beamforming is depicted in
The beamformer is directed towards the j-th loudspeaker (in this case, j=2) to create the j-th virtual loudspeaker signal.
A beamforming approach to obtain the weights bj(k,n) is to compute the so-called matched beamformer, for which the weights bj(k) are given by
The vector h(k, φj, ϑj) contains the relative transfer functions (RTFs) between the array microphones for the considered frequency band k and for the desired direction (φj, ϑj) of the j-th virtual loudspeaker position. The RTFs h(k, φj, ϑj) for example can be measured using a calibration measurement or can be simulated using sound field models such as the plane wave model [FourierAcoust].
Besides using the matched beamformer, other beamforming techniques such as MVDR, LCMV, multi-channel Wiener filter can be applied.
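The matched beamformer can be sketched for the circular array mentioned above (M=8, radius 3 cm). Several points are assumptions rather than statements from the text: the transfer functions are simulated with a far-field plane-wave model and referenced to the array center, the weights use the common normalization b = h / (h^H h), and the speed of sound, frequency, and eight horizontal virtual loudspeakers are chosen only for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed

def plane_wave_rtf(mic_pos, azimuth, elevation, freq_hz):
    """Transfer functions relative to the array center for a far-field plane wave."""
    direction = np.array([np.cos(azimuth) * np.cos(elevation),
                          np.sin(azimuth) * np.cos(elevation),
                          np.sin(elevation)])
    # Microphones closer to the source receive the wavefront earlier.
    advance = mic_pos @ direction / SPEED_OF_SOUND
    return np.exp(1j * 2.0 * np.pi * freq_hz * advance)

def matched_beamformer(h):
    """Matched-filter weights with unit response towards h (assumed normalization)."""
    return h / np.vdot(h, h).real

# Circular array in the horizontal plane.
M, radius = 8, 0.03
mic_angles = 2.0 * np.pi * np.arange(M) / M
mic_pos = np.stack([radius * np.cos(mic_angles),
                    radius * np.sin(mic_angles),
                    np.zeros(M)], axis=1)

freq = 2000.0
virt_az = np.radians(np.arange(0.0, 360.0, 45.0))
virt_el = np.zeros_like(virt_az)

# Row j holds the conjugated weights b_j^H, so that s = C a.
C = np.stack([matched_beamformer(plane_wave_rtf(mic_pos, az, el, freq)).conj()
              for az, el in zip(virt_az, virt_el)])

# Microphone signals for a unit plane wave from the direction of virtual loudspeaker 2.
a = plane_wave_rtf(mic_pos, virt_az[2], virt_el[2], freq)
s = C @ a
```

With this normalization the beamformer steered at the source returns the source signal unchanged, while all other beams return a value of at most the same magnitude. At low frequencies the beams of such a small array are broad, so neighbouring virtual loudspeaker signals are strongly correlated.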
The J virtual loudspeaker signals are collected in the vector defined by
which represents the audio input signals in the virtual loudspeaker domain.
Clearly, the J virtual loudspeaker signals s(k,n) in this embodiment can be computed by applying a single matrix multiplication to the audio input signals, i.e.,
s(k,n) = C(k) a(k,n),
where the J×M matrix C(k) contains the beamformer weights for the J virtual loudspeakers, i.e.,
In this embodiment, the input to the first spatial transform (1020) are M audio object signals together with their accompanying position metadata. Similarly as in Embodiment 1b, the J virtual loudspeaker signals can be computed for example using the VBAP panning scheme [Vbap]. The VBAP panning scheme 1025 renders the J virtual loudspeaker signals depending on the M positions of the audio object input signals and the J positions of the virtual loudspeakers. Obviously, other rendering schemes than the VBAP panning scheme may be used instead. The audio object's positional metadata may indicate static object positions or time-varying object positions.
The spatial filtering (1030) is applied by multiplying the virtual loudspeaker signals in s(k,n) with a spatial window W(φj, ϑj, p, l), i.e.,
where S′(φj, ϑj) denotes the filtered virtual loudspeaker signals. The spatial filtering (1030) can be applied for example to emphasize the spatial sound towards the look direction of the desired listening position or when the location of the desired listening position approaches the sound sources or virtual loudspeaker positions. This means that the spatial window W(φj, ϑj, p, l) typically corresponds to non-negative real-valued gain values that usually are computed based on the desired listening position (denoted by vector p) and desired listening orientation or look direction (denoted by vector l).
As an example, the spatial window W(φj, ϑj, p, l) can be computed as a common first-order spatial window directed towards the desired look direction which further is attenuated or amplified according to the distance between the desired listening position and virtual loudspeaker positions, i.e.,
Here, nj=[cos φj cos ϑj, sin φj cos ϑj, sin ϑj]T is the direction vector corresponding to the j-th virtual loudspeaker position and l=[cos ϕ cos θ, sin ϕ cos θ, sin θ]T is the direction vector corresponding to the desired listening orientation with ϕ being the azimuth angle and θ being the elevation angle of the desired listening orientation. Moreover, α is the first-order parameter that determines the shape of the spatial window. For example, a spatial window with cardioid shape for α=0.5 is obtained. A corresponding example spatial window with cardioid shape and look direction ϕ=45° is depicted in
where p=[x, y, z] is the desired listening position in Cartesian coordinates. A drawing of the considered coordinate system is depicted in
In general, the spatial window W(φj, ϑj, p, l) can be defined arbitrarily. In applications such as an acoustic zoom, the spatial window may be defined as a rectangular window centered on the zoom direction, which becomes narrower when zooming in and broader when zooming out. The window width can be defined consistently with the zoomed video image such that the window attenuates sound sources at the side when the corresponding audio object disappears from the zoomed video image.
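The directional part of a first-order spatial window can be sketched as follows. The expression α + (1 − α)·(nj · l) is an assumption chosen because it yields a cardioid for α=0.5, as stated in the text; the distance-dependent attenuation or amplification is deliberately left out because its exact form is not reproduced here. Function names are hypothetical.

```python
import numpy as np

def direction_vector(azimuth, elevation):
    """Unit vector [cos(az)cos(el), sin(az)cos(el), sin(el)]."""
    return np.stack([np.cos(azimuth) * np.cos(elevation),
                     np.sin(azimuth) * np.cos(elevation),
                     np.sin(elevation)], axis=-1)

def first_order_window(n, look, alpha=0.5):
    """Assumed first-order window: alpha + (1 - alpha) * cos(angle to look direction)."""
    return alpha + (1.0 - alpha) * (n @ look)

# Eight horizontal virtual loudspeakers and a look direction of 45 degrees azimuth.
virt_az = np.radians(np.arange(0.0, 360.0, 45.0))
n = direction_vector(virt_az, np.zeros_like(virt_az))   # shape (J, 3)
look = direction_vector(np.radians(45.0), 0.0)          # shape (3,)

w = first_order_window(n, look, alpha=0.5)              # window weights, shape (J,)

s = np.ones(len(virt_az), dtype=complex)                # one tile of virtual loudspeaker signals
s_filtered = w * s                                      # element-wise (Schur) product
```

The loudspeaker in the look direction keeps unit gain and the opposite one is fully attenuated. For α below 0.5 this expression becomes negative behind the listener, so a practical window would need clipping to keep the gains non-negative.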
Clearly, the filtered virtual loudspeaker signals in this embodiment can be computed from the virtual loudspeaker signals with a single element-wise vector multiplication, i.e.,
s′(k,n) = w(p, l) ∘ s(k,n),
where ∘ is the element-wise product (Schur product) and
are the window weights for the J virtual loudspeakers given the desired listening position and orientation. The J filtered virtual loudspeaker signals are collected in the vector
The purpose of the position modification (1040) is to compute the virtual loudspeaker positions from the point-of-view (POV) of the desired listening position with the desired listening orientation.
An example is visualized in
In
If the desired listening orientation is different from the reference orientation, an additional rotation matrix can be applied when computing the modified virtual loudspeaker positions, i.e.,
For example, if the desired listening orientation (relative to the reference orientation) corresponds to an azimuth angle ϕ, the rotation matrix can be computed as [RotMat]
The modified virtual loudspeaker positions n′j are then used in the second spatial transform (1050). The modified virtual loudspeaker positions can also be expressed in terms of modified azimuth angles φ′j and modified elevation angles ϑ′j, i.e.,
As an example, the position modification described in this embodiment can be used to achieve consistent audio/video reproduction when using different projections of a spherical video image. The different projections or viewing positions for a spherical video can be for example selected by a user via a user interface of a video player. In such an application,
As another example, the position modification in this embodiment also can be used to create an acoustic zoom effect that mimics a visual zoom. To mimic a visual zoom, one can move the virtual loudspeaker position towards the zoom direction. In this case, the virtual loudspeaker in zoom direction will get closer whereas the virtual loudspeakers at the side (relative to the zoom direction) will move outwards, similarly as the video objects would move in a zoomed video image.
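The position modification can be sketched as a translation followed by a rotation. The form n′j = R (nj − p), the restriction to a rotation about the vertical axis, and the sign convention (the scene is rotated by the negative listener azimuth) are assumptions for illustration; the function name modify_positions is hypothetical.

```python
import numpy as np

def modify_positions(n, p, yaw):
    """Express loudspeaker positions n (J x 3) relative to a listener at p turned by yaw.

    Returns the modified vectors, their azimuth and elevation (radians), and distances.
    Assumes the listener does not coincide with a loudspeaker (non-zero distances).
    """
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotation about the z-axis by -yaw: the scene turns opposite to the listener.
    R = np.array([[c, s, 0.0],
                  [-s, c, 0.0],
                  [0.0, 0.0, 1.0]])
    shifted = (n - p) @ R.T
    dist = np.linalg.norm(shifted, axis=1)
    az = np.arctan2(shifted[:, 1], shifted[:, 0])
    el = np.arcsin(np.clip(shifted[:, 2] / dist, -1.0, 1.0))
    return shifted, az, el, dist

# Four virtual loudspeakers on the unit circle at 0, 90, 180 and 270 degrees.
angles = np.radians([0.0, 90.0, 180.0, 270.0])
n = np.stack([np.cos(angles), np.sin(angles), np.zeros(4)], axis=1)

# Acoustic zoom towards the front: move the listener half-way to loudspeaker 0.
_, az_zoom, _, dist_zoom = modify_positions(n, np.array([0.5, 0.0, 0.0]), 0.0)

# Pure rotation: the listener turns 90 degrees to the left at the reference position.
_, az_rot, _, _ = modify_positions(n, np.zeros(3), np.radians(90.0))
```

In the zoom case the front loudspeaker comes closer (distance 0.5) while the loudspeaker originally at 90° appears at roughly 116.6°, i.e., it moves outwards as described above. In the rotation case the front loudspeaker appears at −90°, to the right of the listener.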
Subsequently, reference is made to
Moreover, the virtual loudspeakers are located on the surface of the sphere, i.e., along the depicted circle. This corresponds to the standard spatial sound reproduction where the listening reference position is located in the sweet spot, for example in the center of the sphere of
This embodiment describes an implementation of the second spatial transform (1050) to compute the audio output signals in the Ambisonics domain.
To compute the desired output signals, one can transform the (filtered) virtual loudspeaker signals S′(φj, ϑj) using a spherical harmonic decomposition (SHD) 1052, which is computed as the weighted sum over all J virtual loudspeaker signals according to [FourierAcoust]
Here, Y*l,m(φ′j, ϑ′j) are the conjugate-complex spherical harmonics of level (order) l and mode m. The spherical harmonics are evaluated at the modified virtual loudspeaker positions (φ′j, ϑ′j) instead of the original virtual loudspeaker positions. This assures that the audio output signals are created from the perspective of the desired listening position with the desired listening orientation. Clearly, the output signals A′l,m(k,n) can be computed up to an arbitrary user-defined level (order) L′.
The output signals in this embodiment also can be computed as a single matrix multiplication from the (filtered) virtual loudspeaker signals, i.e.,
where
contains the spherical harmonics evaluated at the modified virtual loudspeaker positions and
contains the output signals up to the desired Ambisonics level (order) L′.
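The effect of evaluating the spherical harmonics at the modified positions can be checked with a small first-order round trip. The sketch assumes real-valued orthonormal spherical harmonics (so the complex conjugate has no effect), a quadrature weight of 4π/J for the weighted sum, and six virtual loudspeakers on the coordinate axes, for which this quadrature is exact at first order. All names are hypothetical.

```python
import numpy as np

def real_sh_first_order(azimuth, elevation):
    """Real orthonormal spherical harmonics up to order 1, shape (J, 4)."""
    x = np.cos(azimuth) * np.cos(elevation)
    y = np.sin(azimuth) * np.cos(elevation)
    z = np.sin(elevation)
    c0 = np.sqrt(1.0 / (4.0 * np.pi))
    c1 = np.sqrt(3.0 / (4.0 * np.pi))
    return np.stack([c0 * np.ones_like(x), c1 * y, c1 * z, c1 * x], axis=-1)

# Original virtual loudspeaker directions (coordinate axes).
az = np.radians([0.0, 90.0, 180.0, 270.0, 0.0, 0.0])
el = np.radians([0.0, 0.0, 0.0, 0.0, 90.0, -90.0])
J = len(az)

# First spatial transform: plane wave from the front, decomposed onto the loudspeakers.
C = real_sh_first_order(az, el)
a_in = real_sh_first_order(np.array([0.0]), np.array([0.0]))[0]
s = C @ a_in

# Modified positions: the listener turned 90 degrees to the left,
# so every loudspeaker appears at (azimuth - 90 degrees).
az_mod = az - np.pi / 2.0

# Second spatial transform: weighted sum over the loudspeakers at the modified positions.
D = (4.0 * np.pi / J) * real_sh_first_order(az_mod, el).T   # shape 4 x J
a_out = D @ s

# Expected result: a plane wave arriving from -90 degrees azimuth.
a_expected = real_sh_first_order(np.array([-np.pi / 2.0]), np.array([0.0]))[0]
```

Here a_out matches a_expected, i.e., the front source is re-encoded to the right of the rotated listener.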
This embodiment describes an implementation of the second spatial transform (1050) to compute the audio output signals in the loudspeaker domain. In this case, it is advantageous to convert the J (filtered) signals S′(φj, ϑj) of the virtual loudspeakers into loudspeaker signals of the desired output loudspeaker setup by taking into account the modified virtual loudspeaker positions (φ′j, ϑ′j). In general, the desired output loudspeaker setup can be defined arbitrarily. Commonly used output loudspeaker setups are for example 2.0 (stereo), 5.1, 7.1, 11.1, or 22.2. In the following, the number of output loudspeakers is denoted by L and the positions of the output loudspeakers are given by the angles (φ_l^out, ϑ_l^out).
To convert 1053 the (filtered) virtual loudspeaker signals into the desired loudspeaker format, it is advantageous to use the same approach as in Embodiment 1b, i.e., one applies a static loudspeaker conversion matrix. In this case, the desired output loudspeaker signals are computed with
where s′(k,n) contains the (filtered) virtual loudspeaker signals, a′(k,n) contains the L output loudspeaker signals, and C is the format conversion matrix. The format conversion matrix is computed using the angles (φ_l^out, ϑ_l^out) of the output loudspeaker setup as well as the modified virtual loudspeaker positions (φ′j, ϑ′j). This assures that the audio output signals are created from the perspective of the desired listening position with the desired listening orientation. The conversion matrix C can be computed as explained in [FormatConv] by using for example the VBAP panning scheme [Vbap].
The second spatial transform (1050) can create output signals in the binaural domain for binaural sound reproduction. One way is to multiply 1054 the J (filtered) virtual loudspeaker signals S′(φj, ϑj) with a corresponding head-related transfer function (HRTF) and to sum up the resulting signals, i.e.,
Here, A′left(k, n) and A′right(k, n) are the binaural output signals for the left and right ear, respectively, and Hleft(k, φ′j, ϑ′j) and Hright(k, φ′j, ϑ′j) are the corresponding HRTFs for the j-th virtual loudspeaker. It is noted that the HRTFs for the modified virtual loudspeaker directions (φ′j, ϑ′j) are used. This assures that the binaural output signals are created from the perspective of the desired listening position with the desired listening orientation.
An alternative way to create binaural output signals is to perform a first or forward transform 1055 of the virtual loudspeaker signals into the loudspeaker domain as described in Embodiment 4b, such as an intermediate loudspeaker format. Afterwards, the loudspeaker output signals from the intermediate loudspeaker format can be binauralized by applying 1056 the HRTFs for the left and right ear corresponding to the positions of the output loudspeaker setup.
The binaural output signals also can be computed applying a matrix multiplication to the (filtered) virtual loudspeaker signals, i.e.,
where
contains the HRTFs for the J modified virtual loudspeaker positions for the left and right ear, respectively, and the vector
contains the two binaural audio signals.
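The matrix form of the binaural rendering can be sketched with a 2×J matrix whose rows hold the left-ear and right-ear transfer functions for the modified loudspeaker directions. Measured HRTFs are not available here, so the sketch substitutes a placeholder: a frequency-independent level difference that merely favours the ear facing the source. It also assumes that positive azimuth points to the listener's left. The function name toy_hrtf_pair is hypothetical.

```python
import numpy as np

def toy_hrtf_pair(azimuth, elevation):
    """Placeholder gains for the left and right ear (NOT measured HRTFs).

    Returns an array of shape (2, J): row 0 for the left ear, row 1 for the right ear.
    """
    lateral = np.sin(azimuth) * np.cos(elevation)   # +1 fully left, -1 fully right (assumed)
    h_left = 0.5 * (1.0 + lateral)
    h_right = 0.5 * (1.0 - lateral)
    return np.stack([h_left, h_right], axis=0)

# Modified directions of four virtual loudspeakers (front, left, back, right).
mod_az = np.radians([0.0, 90.0, 180.0, 270.0])
mod_el = np.zeros(4)
D = toy_hrtf_pair(mod_az, mod_el)                   # shape 2 x J

# One time-frequency tile in which only the loudspeaker at 90 degrees is active.
s_filtered = np.array([0.0, 1.0, 0.0, 0.0], dtype=complex)

a_binaural = D @ s_filtered                         # [A'_left, A'_right]
```

In a real system the rows of D would be frequency-dependent complex HRTFs looked up or interpolated for each modified direction; only the matrix structure carries over from this sketch.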
From the previous embodiments it is clear that the output signals a′(k,n) can be computed from the input signals a(k,n) by applying a single matrix multiplication, i.e.,
a′(k,n) = T(φ′1 . . . J, ϑ′1 . . . J) a(k,n),
where the transformation matrix T(φ′1 . . . J, ϑ′1 . . . J) can be computed as
T(φ′1 . . . J, ϑ′1 . . . J) = D(φ′1 . . . J, ϑ′1 . . . J) diag{w(p, l)} C(φ1 . . . J, ϑ1 . . . J).
Here, C(φ1 . . . J, ϑ1 . . . J) is the matrix for the first spatial transform that can be computed as described in the Embodiments 1(a-d), w(p, l) is the optional spatial filter described in Embodiment 2, diag{•} denotes an operator that transforms a vector into a diagonal matrix with the vector being the main diagonal, and D(φ′1 . . . J, ϑ′1 . . . J) is the matrix for the second spatial transform depending on the desired listening position and orientation, which can be computed as described in the Embodiments 4(a-c). In an embodiment, it is possible to precompute the matrix T(φ′1 . . . J, ϑ′1 . . . J) for the desired listening positions and orientations (e.g., for a discrete grid of positions and orientations) to save computational complexity. In case of audio object input with time-varying positions, only the time-invariant parts of the above calculation of T(φ′1 . . . J, ϑ′1 . . . J) may be pre-computed to save computational complexity.
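That the three processing steps collapse into one matrix can be verified numerically. The sketch below uses random stand-ins for the first transform C, the window weights w, and the second transform D; the dimensions (4 input signals, 12 virtual loudspeakers, 2 output signals) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_in, J, num_out = 4, 12, 2

C = rng.standard_normal((J, num_in))     # first spatial transform (stand-in)
w = rng.uniform(0.0, 1.0, J)             # spatial window weights (stand-in)
D = rng.standard_normal((num_out, J))    # second spatial transform (stand-in)

# Combined transformation matrix T = D diag{w} C.
T = D @ np.diag(w) @ C

# One time-frequency tile of input signals.
a = rng.standard_normal(num_in) + 1j * rng.standard_normal(num_in)

step_by_step = D @ (w * (C @ a))         # transform, window, transform
combined = T @ a                         # single matrix multiplication

# Equivalent way to form T without building the J x J diagonal matrix.
T_scaled = (D * w) @ C
```

Since T has only num_out × num_in entries, a precomputed grid of such matrices over listening positions and orientations stays small even for large J.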
Subsequently, an implementation of the sound field processing as performed by the sound field processor 1000 is illustrated. In step 901 or 1010, two or more audio input signals are received in the time domain or time-frequency domain where, in the case of a reception of the signal in the time-frequency domain, an analysis filterbank has been used in order to obtain the time-frequency representation.
In step 1020, a first spatial transform is performed to obtain a set of virtual loudspeaker signals. In step 1030, an optional spatial filtering is performed by applying a spatial filter to the virtual loudspeaker signals. In case of not applying the step 1030 in
Thus,
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier or a non-transitory storage medium.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.
The above described embodiments are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the appended patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
This application is a continuation of copending International Application No. PCT/EP2020/071120, filed Jul. 27, 2020, which is incorporated herein by reference in its entirety, and additionally claims priority from International Application No. PCT/EP2019/070373, filed Jul. 29, 2019, which is incorporated herein by reference in its entirety. The present invention relates to the field of spatial sound recording and reproduction.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/EP2020/071120 | Jul 2020 | US
Child | 17583760 | | US
Parent | PCT/EP2019/070373 | Jul 2019 | US
Child | PCT/EP2020/071120 | | US