Examples of the disclosure relate to spatial audio processing. Some relate to spatial audio processing in systems with limited microphone arrangements.
Spatial audio requires audio signals captured by multiple microphones. Some audio capture devices, such as mobile phones, have a limited number of microphones and the microphones are located in positions that are not optimized for spatial audio capture. These modest microphone arrangements can lead to quality issues in the captured spatial audio.
According to various, but not necessarily all, examples of the disclosure there is provided an apparatus comprising means for:
The processing based on parametric audio may vary more over time than the processing based on beamforming audio.
The respective outputs may comprise at least one of:
The second spatial audio processing might not be performed for at least some of the first frequency range.
The first spatial audio processing might not be performed for at least some of the second frequency range.
The combining may comprise using the first output for the first frequency range and the second output for the second frequency range.
The combining may comprise applying a weighting to the first output and the second output so that the first output has a higher weighting than the second output in a frequency range and the second output has a higher weighting than the first output in another frequency range.
The means may be for determining an orientation of the microphone arrangement and applying a mode of at least one of the first spatial audio processing or the second spatial audio processing based on the determined orientation.
At least one of the first frequency range or the second frequency range may be different for different modes of the first spatial audio processing or the second spatial audio processing.
A first mode may comprise using the first spatial audio processing for the first frequency range and the second spatial audio processing for the second frequency range, and a second mode may comprise using the first spatial audio processing for both the first frequency range and the second frequency range.
A first mode may use a first set of coefficients for the second spatial audio processing and a second mode may use a second set of coefficients for the second spatial audio processing.
The means may be for adjusting at least one of the first spatial audio processing or the second spatial audio processing based on the head orientation of a listener.
According to various, but not necessarily all, examples of the disclosure there is provided an electronic device comprising an apparatus as described herein wherein the electronic device is at least one of: smart phone, personal computer, image capturing device.
According to various, but not necessarily all, examples of the disclosure there is provided a method comprising:
According to various, but not necessarily all, examples of the disclosure there is provided a computer program comprising instructions which, when executed by an apparatus, cause the apparatus to perform at least:
While the above examples of the disclosure and optional features are described separately, it is to be understood that their provision in all possible combinations and permutations is contained within the disclosure. It is to be understood that various examples of the disclosure can comprise any or all of the features described in respect of other examples of the disclosure, and vice versa. Also, it is to be appreciated that any one or more or all of the features, in any combination, may be implemented by/comprised in/performable by an apparatus, a method, and/or computer program instructions as desired, and as appropriate.
Some examples will now be described with reference to the accompanying drawings in which:
The figures are not necessarily to scale. Certain features and views of the figures can be shown schematically or exaggerated in scale in the interest of clarity and conciseness. For example, the dimensions of some elements in the figures can be exaggerated relative to other elements to aid explication. Corresponding reference numerals are used in the figures to designate corresponding features. For clarity, not all reference numerals are necessarily displayed in all figures.
In the example of
The microphones 106 can comprise any means that can be configured to detect audio signals. The microphones 106 provide microphone audio signals 112 as an output. The microphone audio signals 112 are based on the detected audio signals.
In the example of
The electronic device 100 is configured so that the microphone audio signals 112 from the microphones 106 are provided to the processor 102 as an input. The microphone audio signals 112 can be provided to the processor 102 in any suitable format. In some examples the microphone audio signals 112 can be provided to the processor 102 in a digital format. The digital format could comprise pulse code modulation (PCM) or any other suitable type of format. In other examples the microphones 106 could comprise analog microphones 106. In such examples an analog-to-digital converter can be provided between the microphones 106 and the processor 102.
The processor 102 can be configured to process the microphone audio signals 112 to perform spatial audio processing of the microphone audio signals 112. The spatial audio processing comprises processing the microphone audio signals 112 so as to enable spatial properties of the captured audio to be rendered. This can enable a listener to perceive the spatial properties of the audio. For example, the listener can perceive different directions or positions for audio sources.
The processor 102 can be configured to access program code 116 stored in the memory 104. The processor 102 can use the program code 116 to perform spatial audio processing of the microphone audio signals 112. The processor 102 can use the program code 116 to implement examples of the disclosure. The processor 102 can be configured to use methods as shown in
The processor 102 processes the microphone audio signals 112 to provide a processed audio signal 114 as an output. Other types of processing could also be applied to the microphone audio signals 112 by the processor 102. For example, the processor 102 could also perform speech enhancement, noise reduction, and/or any other suitable procedure.
The processed audio signal 114 can be provided in any suitable format. The processed audio signal 114 can be provided in an encoded form such as an Immersive Voice and Audio Stream (IVAS), a binaural audio signal, a surround sound loudspeaker signal, Ambisonic audio signals or any other suitable type of spatial audio output. Other formats for the processed audio signal 114 could also be used.
The electronic device 100 shown in
The transceiver 108 can comprise any means that can enable data to be transmitted from the electronic device 100. This can enable the processed audio signal 114, and/or any other suitable data, to be transmitted from the electronic device 100 to an audio rendering device or any other suitable device.
The storage 110 can comprise any means for storing the processed audio signal 114 and/or any other outputs from the processor 102. The storage 110 could comprise one or more memories or any other suitable means. The processed audio signal 114 can be retrieved from the storage 110 at a later time. This can enable the electronic device 100 to render the processed audio signal 114 at a later time or can enable the processed audio signal 114 to be transmitted to another device.
In some examples the processed audio signal 114 can be associated with other information. For instance, the processed audio signal 114 can be associated with images such as video images. The images or video images could be captured by a camera of the electronic device 100. The processed audio signal 114 can be associated with the other information so that the processed audio signal 114 can be stored with this information and/or can be transmitted with this other information. The association can enable the images or other information to be provided to a user of the playback device when the processed audio signal 114 is played back.
Devices other than the electronic device 100 shown in
The example electronic device 100 of
Parametric spatial audio capture can be used with such arrangements of microphones 106; however, this is not ideal for all circumstances. For instance, if there is a single sound source then parametric spatial audio capture can be used to accurately determine its direction. This results in a stable reproduction of the sound source.
However, if there are multiple sound sources that are active at the same time the direction metadata can fluctuate which can cause perceivable instabilities in reproduction of the sound sources.
Examples of the disclosure address these issues and provide methods of spatial audio processing that can provide high quality, stable reproductions even with a limited microphone arrangement.
In examples of the disclosure the electronic device 100 can be operated in multiple orientations. The electronic device 100 is shown in a landscape orientation in
For example, in the electronic device 100 in
At block 200 the method comprises obtaining at least two audio signals based on signals from at least two microphones 106. The microphones 106 can be arranged in a modest or limited arrangement or could be in any other suitable type of arrangement.
At block 202 the method comprises performing a first spatial audio processing of the obtained audio signals for at least a first frequency range to generate a first output. At block 204 the method comprises performing a second spatial audio processing of the obtained audio signals for at least a second frequency range to generate a second output.
The first and second spatial audio processing can comprise different types of processing. The signal processing operations of the first spatial audio processing comprise processing based on parametric audio. The processing based on parametric audio can comprise obtaining parameters, such as energy and direction parameters, for multiple frequency bands, and using them together with the audio signals to generate the first output.
The signal processing operations of the second spatial audio processing comprise processing based on beamforming audio. The processing based on parametric audio varies more over time than the processing based on beamforming audio. Processing based on beamforming audio refers to combining the microphone signals, typically on a frequency-by-frequency basis, with time-variant or time-invariant processing weights so that, as a result, audio signals are obtained that have certain spatial capture characteristics.
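For illustration, a minimal sketch of such a frequency-by-frequency weighted combination is given below, assuming STFT-domain microphone signals and precomputed complex weights; the function name and the array shapes are illustrative assumptions and not part of the described processing.

```python
import numpy as np

def beamform_bins(mic_stft, weights):
    """Combine microphone STFT bins with complex per-frequency weights.

    mic_stft: shape (num_mics, num_bins) for one time frame.
    weights:  shape (num_out_channels, num_mics, num_bins).
    Returns the beam-processed frame of shape (num_out_channels, num_bins).
    """
    # Frequency-by-frequency weighted sum of the microphone signals.
    return np.einsum('omb,mb->ob', weights, mic_stft)
```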
The frequency ranges for which the respective spatial audio processes are used can comprise subsets of the available frequency ranges. In such examples the second spatial audio processing is not performed for at least some of the first frequency range and/or the first spatial audio processing is not performed for at least some of the second frequency range.
The frequencies within a frequency range do not need to be continuous. For example, it could be that the second frequency range comprises the lowest and highest frequencies and the first frequency range comprises the frequency portion in between. In such cases the lowest and highest frequencies would be processed based on beamforming audio and the frequency portion in between would be processed based on parametric audio.
The frequencies within the respective ranges can be dependent upon the arrangement of the microphones 106, the orientation of the electronic device 100 and/or any other suitable factor. The frequencies that are to be used for the respective ranges can be selected using any suitable procedure.
At block 206 the method comprises combining the first output and the second output to generate a combined output. The combined output can comprise at least some of the first output for a first frequency range and at least some of the second output for a second frequency range.
The combining can comprise any suitable process. In some examples the combining can comprise using the first output for the first frequency range and the second output for the second frequency range. This type of combining could be used if there is no overlap in the respective frequency ranges.
In some examples the combining can comprise applying a weighting to the first output and the second output. The weightings can be applied so that the first output has a higher weighting than the second output in a frequency range and the second output has a higher weighting than the first output in another frequency range. The weightings can be applied so that the first output is the main or dominant source in one frequency range and the second output is the main or dominant source in another frequency range. The frequency ranges for which the weightings apply can be the first frequency range and the second frequency range or could be different frequency ranges.
The combined output can comprise any suitable type of output. For example, the combined output can comprise at least one of binaural outputs, multi-channel outputs, stereo outputs, or any other suitable type of outputs.
In some examples the method can comprise additional blocks that are not shown in
In some examples the different modes of the respective spatial audio processes could use different ranges. That is the first frequency range and/or the second frequency range would be different depending on whether the electronic device 100 or apparatus 700 is in a landscape or portrait orientation.
In some examples a first mode could be to use one of the respective spatial audio processes and a second mode could be not to use that spatial audio process. For example, a first mode could comprise using the first spatial audio processing for the first frequency range and using the second spatial audio processing for the second frequency range, and a second mode could comprise using the first spatial audio processing for both the first frequency range and the second frequency range. In this case the first spatial audio processing would be used for all frequencies and the second spatial audio processing would not be used.
In some examples the different modes could comprise modifications to the respective spatial audio processes. For instance, in a first mode a first set of coefficients could be used for the second spatial audio processing and in a second mode a second set of coefficients could be used for the second spatial audio processing. The coefficients could be used in mixing or processing matrices used in the spatial audio processes.
In some examples one or more of the spatial audio processes could be adjusted based on a head orientation of a listener or any other suitable factor.
The processor 102 receives the microphone audio signals 112 as an input. The microphone audio signals 112 can be obtained from two or more microphones 106. The microphones 106 can be arranged in an electronic device 100 as shown in
The microphone audio signals 112 are provided as an input to a time-frequency transform block 300. The time-frequency transform block 300 is configured to convert the input microphone audio signals 112 from a time domain signal to a time-frequency domain signal. The time-frequency transform block 300 can be configured to apply any suitable time-frequency transform to convert the microphone audio signals 112 to time-frequency audio signals 302. For example, the transform used by the time-frequency transform block 300 could comprise a short-time Fourier transform (STFT) configured to take a frame of 1024 samples of the microphone audio signals 112, concatenate this frame with the previous 1024 samples, apply a square-root of a 2*1024-length Hann window to the concatenated frames, and apply a fast Fourier transform (FFT) to the result. Other time-frequency transforms and their configurations can be used. The time-frequency transform block 300 provides time-frequency audio signals 302 as an output.
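A minimal sketch of this forward transform is given below, assuming multi-channel PCM input as a NumPy array; a matching inverse windowing and overlap-add stage would also be needed in practice and is omitted here.

```python
import numpy as np

def stft_frames(pcm, frame_len=1024):
    """STFT as described: 1024-sample frames, each concatenated with the
    previous frame, a square-root Hann window of length 2*1024, then an FFT.

    pcm: array of shape (num_samples, num_channels).
    Returns an array of shape (num_frames, num_bins, num_channels).
    """
    window = np.sqrt(np.hanning(2 * frame_len))[:, None]
    prev = np.zeros((frame_len, pcm.shape[1]))
    frames = []
    for start in range(0, pcm.shape[0] - frame_len + 1, frame_len):
        frame = pcm[start:start + frame_len]
        buf = np.concatenate([prev, frame], axis=0)  # 2 * 1024 samples
        frames.append(np.fft.rfft(buf * window, axis=0))
        prev = frame
    return np.stack(frames)
```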
The time-frequency audio signals 302 are provided as an input to both a parametric-based spatial processing block 304 and a beamforming-based spatial processing block 306.
The parametric-based spatial processing block 304 can be configured to perform processing based on parametric audio. The processing based on parametric audio can comprise parametric capture and rendering. The parametric capture and rendering can use any suitable processes. The parametric capture and rendering can comprise estimating spatial parameters such as directions and direct-to-total energy ratios based on the microphone audio signals 112. These estimated parameters can then be used to synthesize a spatial audio output accordingly.
The processor 102 is configured so that the parametric-based spatial processing block 304 performs the processing based on parametric audio for a first frequency range.
The beamforming-based spatial processing block 306 can be configured to perform processing based on beamforming audio. The processing based on beamforming audio can comprise rendering based on measured or simulated responses of the electronic device 100 so that beam patterns are generated. In examples where the output processed audio signal 114 is a binaural audio signal, the beam patterns that are generated can approximate the two complex-valued beam patterns (amplitude and phase) of the head-related transfer functions related to the left and right ears of a human listener. In other examples the beam patterns that are generated can comprise beam patterns that correspond to amplitude panning patterns suitable for stereo reproduction.
The processor 102 is configured so that the beamforming-based spatial processing block 306 performs the processing based on beamforming audio for a second frequency range. The second frequency range can be different to the first frequency range. There can be some overlap between the first frequency range and the second frequency range.
The frequencies within a frequency range do not need to be continuous. For example, it could be that the second frequency range comprises the lowest and highest frequencies and the first frequency range comprises the frequency portion in between. In such cases the lowest and highest frequencies would be processed based on beamforming audio and the frequency portion in between would be processed based on parametric audio.
The frequencies within the respective ranges can be dependent upon the arrangement of the microphones 106, the orientation of the electronic device 100 and/or any other suitable factor.
The parametric-based spatial processing block 304 provides a first output. The output comprises a first frequency portion signal 308. The first frequency portion signal 308 comprises a parametric-based processed time-frequency audio signal. The beamforming-based spatial processing block 306 provides a second output. The second output comprises a second frequency portion signal 310. The second frequency portion signal 310 comprises a beamforming-based processed time-frequency audio signal.
Both the first frequency portion signal 308 and the second frequency portion signal 310 are provided to combiner 312. The combiner 312 is configured to combine the first frequency portion signal 308 and the second frequency portion signal 310. The combiner 312 therefore combines a first output that has been processed based on parametric audio and a second output that has been processed based on beamforming audio.
In some examples the first frequency range and the second frequency range might not overlap. In such examples the combiner 312 can generate a processed time-frequency audio signal 314 by making a frequency-dependent selection of the respective inputs. In some examples there could be some overlap between the first frequency range and the second frequency range. In such examples the combiner 312 can generate a processed time-frequency audio signal 314 by performing smooth transitioning between the respective signals. For some frequencies the combined output can be based on both inputs.
The combiner 312 provides a processed time-frequency audio signal 314 as an output. The processed time-frequency audio signal 314 is provided to an inverse time-frequency transform block 316.
The inverse time-frequency transform block 316 is configured to convert the processed time-frequency audio signal 314 to a time domain signal. The inverse time-frequency transform block 316 can be configured to apply any suitable inverse time-frequency transform to convert the processed time-frequency audio signal 314 to a time domain signal. The transform used by the inverse time-frequency transform block 316 can be the inverse of the transform used by the time-frequency transform block 300. If the time-frequency transform block 300 uses an STFT then the inverse time-frequency transform block 316 can use an inverse STFT.
The inverse time-frequency transform block 316 provides a processed PCM audio signal 318 as an output. In the example of
The encoding block 320 provides processed audio signals 114 as an output. The processed audio signals 114 can be the output of the processor 102 as shown in
In some examples the encoding block 320 could be omitted. In such cases the processed PCM audio signal 318 could be the output of the processor 102.
In the offline steps target patterns 400 and device response 404 are provided as an input to a beam matrices design block 402.
The target patterns 400 could comprise head-related transfer functions (HRTFs) or any other suitable type of patterns. The HRTFs comprise a set of complex value pairs for different frequencies and different directions. The respective HRTF pairs comprise the amplitude and phase responses corresponding to sound arriving to human ears from a defined direction, at a defined frequency.
The HRTFs or other target patterns 400 can be based on measurements (real person or dummy head) or simulation, or a mixture of measurements and simulation. For example, target patterns 400 for the lowest frequencies can be obtained from simulations, but the target patterns 400 for the higher frequencies can be from measurements.
The target patterns 400 could comprise loudspeaker-based beam patterns. Such target patterns 400 could be used when the use case is to capture multi-channel signals (such as stereo) instead of a binaural signal. Details of examples of these loudspeaker-patterns are described further below.
The device responses 404 are also provided as inputs to the beam matrices design block 402. Similarly to the target patterns 400, the device responses 404 can comprise a set of complex weights per direction and frequency and indicate the phase and amplitude of sounds arriving at the microphones 106 of an electronic device 100 from different directions and at different frequencies. The device responses 404 can also be based on measurements or simulation, or a mixture of measurements and simulation.
The beam matrices design block 402 is configured to use the input target patterns 400 and device responses 404 to generate a set of coefficients for beam matrices for use by the beamforming-based spatial processing block 306.
The beam matrices design block 402 can use any suitable procedure. As an example, the target patterns 400 can be denoted as t(b,DOA), which is a column vector comprising two (or more) entries, where b is the frequency bin index, and DOA is the direction of the arriving sound which may indicate azimuth and elevation. In the present example, the target patterns 400 are for binaural or stereo output. Therefore, the column vector t(b,DOA) has two rows. A first row is for the left output channel and a second row is for the right output channel.
For binaural output, the vector t(b, DOA) comprises the complex-valued HRTFs for bin b and the direction DOA.
For stereo loudspeaker output, the target patterns 400 could be defined by first converting the direction DOA to a unit vector pointing towards the corresponding direction, finding y(DOA) that is the y-dimension value of that unit vector, and then converting it to a cone-of-confusion azimuth value θ_cc(DOA) = sin⁻¹(y(DOA)). Then, the target pattern is determined so that
and when −30° < θ_cc(DOA) < 30° then the t(b,DOA) values are the amplitude panning gains according to a suitable panning scheme, such as the tangent panning law assuming loudspeakers at ±30°. In this example of stereo loudspeaker output, the target patterns 400 do not depend on bin b. In some examples, the target patterns 400 for loudspeaker playback could also comprise a phase in dependence of bin b. For example, the phase of the target value could be the phase of the left and right microphones of the electronic device 100 for the same frequency and direction.
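The following sketch illustrates one possible reading of this stereo target pattern, assuming the tangent panning law with loudspeakers at ±30°; the clamping of directions outside the ±30° arc and the energy normalization are assumptions, since those details are not reproduced above.

```python
import numpy as np

def stereo_target_pattern(doa_unit_vec, spread_deg=30.0):
    """Stereo target pattern: cone-of-confusion azimuth plus tangent-law
    amplitude panning for loudspeakers at +/-30 degrees.

    doa_unit_vec: unit vector (x, y, z) towards the source; y is assumed
                  to point to the left.
    Returns (g_left, g_right) panning gains.
    """
    theta_cc = np.arcsin(np.clip(doa_unit_vec[1], -1.0, 1.0))  # cone-of-confusion azimuth
    theta0 = np.radians(spread_deg)
    # Assumption: clamp to the nearer loudspeaker outside the +/-30 deg arc.
    theta_cc = np.clip(theta_cc, -theta0, theta0)
    ratio = np.tan(theta_cc) / np.tan(theta0)   # (gL - gR) / (gL + gR)
    g_left, g_right = (1.0 + ratio) / 2.0, (1.0 - ratio) / 2.0
    norm = np.sqrt(g_left ** 2 + g_right ** 2)  # energy-preserving normalization
    return g_left / norm, g_right / norm
```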
The device responses 404 are the same regardless of the processing output. The device responses 404 can be denoted as a complex-valued column vector r(b,DOA) comprising as many rows as there are microphones 106. In some cases, the signals from some microphones 106 are omitted. For example, an electronic device 100 could comprise three microphones 106 in the same plane and one further microphone 106 in a different plane and above one of the three microphones 106. In such examples this further microphone 106 can be omitted in the design of the beam matrices.
The beam matrices design block 402 is configured to design a beam matrix B(b) that aims to optimize the relation t(b,DOA) ≈ B(b) r(b,DOA).
For any response r(b,DOA) from any DOA, the matrix B(b) maps the response to the target response t(b,DOA) at that same direction. Therefore, ideally the matrix B(b) is designed so that a sound that arrives from any direction is captured as if the capture was according to the target patterns 400.
The matrix B(b) only approximates this. Such a matrix can only feasibly be obtained when the wavelength is not too short with respect to the spacing of the microphones 106 (due to spatial aliasing), and not too long with respect to the spacing of the microphones 106 (due to diminishing inter-microphone difference signals and array noise).
For an electronic device 100 in landscape mode as shown in
The matrix B(b) can be solved by first determining a suitable set of directions DOA at which the operation of the matrix is optimized. What happens in between these directions is not separately optimized, and as such the set of directions should not be too sparse. A practical choice for the optimization is to obtain r(b,DOA) and t(b,DOA) in the horizontal plane with suitable spacing (for example, 5-degree intervals denoted DOA_1, DOA_2, ..., DOA_72), and then stack the data into matrices T(b) = [t(b,DOA_1), ..., t(b,DOA_72)] and R(b) = [r(b,DOA_1), ..., r(b,DOA_72)].
The solution then is B(b) = T(b) inverse(R(b)),
where the operator inverse( ) could be the Moore-Penrose pseudoinverse; however, at least some regularization of the inverse is preferred. Omitting the bin index in the notation, one example is to first use a singular value decomposition R = U S V^H, and then perform regularized inverse processing for the diagonal matrix S. However, S is not necessarily a square matrix, depending on the number of microphones 106. The regularized inverse processing can be performed by transposing S, and performing, for all the diagonal elements, the operation
where s_j,j is the element at the j:th row and j:th column of S; s_max is the maximum s_j,j over all j; and α_s is a regularization factor, which could be 0.01. The values ŝ_j,j then replace the values s_j,j of the transposed S, and the result is denoted Ŝ⁻¹. The regularized inverse is then inverse(R) = V Ŝ⁻¹ U^H.
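A sketch of the beam matrix design for one frequency bin is given below. Because the exact per-element regularization operation is not reproduced above, a Tikhonov-style regularization of the singular values is used here as a placeholder assumption.

```python
import numpy as np

def design_beam_matrix(T, R, alpha_s=0.01):
    """Design a beam matrix for one frequency bin.

    T: target patterns stacked column-wise, shape (num_out, num_doa)
    R: device responses stacked column-wise, shape (num_mics, num_doa)
    Returns B of shape (num_out, num_mics) so that B @ R approximates T.
    """
    U, s, Vh = np.linalg.svd(R, full_matrices=False)
    s_max = s.max()
    # Assumption: Tikhonov-style regularization of the singular values,
    # used as a placeholder for the elided per-element operation.
    s_reg = s / (s ** 2 + (alpha_s * s_max) ** 2)
    R_pinv = Vh.conj().T @ np.diag(s_reg) @ U.conj().T  # regularized inverse(R)
    return T @ R_pinv

# Online use (see the matrix multiplication block 408): per frequency bin,
# the beam-processed frame is s_beam = B @ s_mics.
```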
The matrices B(b) are the beam matrices 406 that are provided as the output of the beam matrices design block 402. The beam matrices 406 are provided as an input to a matrix multiplication block 408. The matrix multiplication block 408 can be performed as part of the online processing.
The matrix multiplication block 408 also receives the time-frequency audio signals 302 as an input. The time-frequency audio signals 302 can be denoted as a column vector s(b,n) comprising as many rows as there are channels, and where n is the current temporal frame index (of the time-frequency transform).
The matrix multiplication block 408 performs the processing by s_b(b,n) = B(b) s(b,n),
where the subscript b denotes that the output is the beam-processed output.
The matrix multiplication block 408 provides the resulting signal s_b(b,n) as its output signal. The output signal is the second frequency portion 310 as shown in
The beam matrices 406 can be dependent upon the orientation of the electronic device 100. For example, if the electronic device 100 is in the landscape orientation as shown in
The respective orientations of the electronic device 100 can be associated with corresponding beam matrices 406 unless the microphone arrangement has enough symmetry that two or more different orientations acoustically match each other. In such situations these orientations of the electronic device 100 can be associated with the same beam matrices 406. In some of these cases the microphone channels can be mutually switched if the left and right directions are different for the different orientations.
In the method described above the regularization factor α_s used by the beam matrices design block 402 was set to 0.01. The regularization factor can be set to any value between 0 and 1, where the value 0 means no regularization and 1 means maximum regularization. The value of the regularization factor depends on the use case. Smaller values mean a more accurate approximation of the beam design but also a higher amplification of potential incoherent noises such as wind noise or microphone noise. For example, an electronic device 100 for outdoor use could have a regularization factor α_s = 0.1 and an electronic device 100 for indoor use could have a regularization factor α_s = 0.01. The same electronic device 100 can have more than one mode depending on where it is used. The best mode can be detected automatically, for example based on wind noise detection, camera input or any other suitable input. In examples where smaller values of the regularization factor are used, for example α_s = 0.01, the processing can comprise means to avoid excessive noise amplification. For example, the input and output energies of the beam matrix processing can be monitored, and a limit can be applied as to how much louder the output can be. For example, a limiter could be configured so that the mean output energy of the beam matrix processing would not be more than twice the mean input energy. This limiter would operate per frequency bin and would remove any overshoots due to noise, for example amplification of microphone noise.
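A sketch of such a per-bin limiter is given below, assuming STFT-domain frames; the limit of twice the mean input energy follows the description, while the single-frame energy estimate and the small epsilon are simplifications (in practice the energies would typically be smoothed over time).

```python
import numpy as np

def limit_beam_output(s_in, s_out, max_gain=2.0):
    """Per-bin energy limiter for the beam-processed output.

    s_in:  microphone STFT frame, shape (num_mics, num_bins)
    s_out: beam-processed frame, shape (num_out, num_bins)
    Caps the mean output energy at max_gain times the mean input energy.
    """
    eps = 1e-12
    e_in = np.mean(np.abs(s_in) ** 2, axis=0)    # mean input energy per bin
    e_out = np.mean(np.abs(s_out) ** 2, axis=0)  # mean output energy per bin
    # Scale down bins where the output exceeds the allowed energy.
    scale = np.sqrt(np.minimum(1.0, max_gain * e_in / (e_out + eps)))
    return s_out * scale[None, :]
```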
The parametric-based spatial processing block 304 comprises a parametric spatial analysis block 500 and a parametric spatial synthesis block 504.
The parametric spatial analysis block 500 receives the time-frequency audio signals 302 and uses them to determine spatial metadata 502. The spatial metadata 502 can comprise a direction parameter and a direct-to-total energy ratio parameter for frequency bands k. One frequency band k can contain multiple frequency bins b, and has a lowest frequency bin b_low(k) and a highest bin b_high(k). The frequency resolution of these bands could be one that follows the Bark frequency scale, or any other suitable frequency resolution. A typical practical design principle is that higher bands k contain a higher number of bins b.
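For illustration, a sketch of mapping band-edge frequencies to the bin indices b_low(k) and b_high(k) is given below; the band-edge frequencies themselves (for example a Bark-like scale) are left as an input and the values are hypothetical.

```python
import numpy as np

def band_bins(num_bins, sample_rate, band_limits_hz):
    """Map ascending band-edge frequencies (Hz) to rfft bin ranges.

    Returns a list of (b_low(k), b_high(k)) tuples, one per band k.
    """
    bin_spacing = (sample_rate / 2.0) / (num_bins - 1)      # rfft bin spacing in Hz
    edges = np.clip(np.round(np.asarray(band_limits_hz) / bin_spacing).astype(int),
                    0, num_bins - 1)
    # Higher bands span a larger number of bins.
    return [(int(edges[k]), int(edges[k + 1]) - 1) for k in range(len(edges) - 1)]
```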
Any suitable process can be used to determine the spatial metadata 502. In some examples the spatial metadata 502 can be determined as follows. In this example the spatial metadata 502 is determined using only a subset of the available microphone audio signals 112. For example, only the microphone audio signals 112 from the microphones 106 at the edge of the electronic device might be used.
s(b,n,1) and s(b,n,2) can denote the first and second row entries of s(b,n). Cross correlation can be formulated by
Then, a real-valued normalized correlation measure over the range from bin b_0 to bin b_1 for a delay d can be defined by
The delay search range can be selected based on the spacing of the microphones 106. The distance between a first microphone 106A and a second microphone 106B can be denoted d_mic. This distance could be 0.16 meters. Then, the search could be performed by determining 41 uniformly spaced azimuth search directions θ_d ranging from −90° to 90°, and then converting them to delay values by
where v_s is the speed of sound.
For band k, for each d = d(θ_d), the following value is evaluated
Then, for each band k, the delay d = d(θ_d) that yields the maximum value of c_norm(k, d, n) is found. This value of d is the delay estimate for that band. Accordingly, for that band, the determined direction parameter θ(k,n) is the corresponding θ_d and the direct-to-total ratio parameter r(k,n) is the value of c_norm(k, d, n) for that same d. The procedure is repeated for each band k.
In this example the direction parameter is only azimuth, and elevation analysis is not performed. The elevation angle can be set to zero. In other examples the elevation angle can also be estimated. Any suitable method can be used to estimate the elevation angle.
The determined direction parameter θ(k,n) and the ratio r(k,n) provide the spatial metadata 502. The spatial metadata 502 is provided as the output from the parametric spatial analysis block 500.
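A sketch of this delay-search analysis for one band and one frame is given below. The candidate-delay conversion uses the microphone spacing d_mic and the speed of sound v_s as described, but because the correlation equations are not reproduced above, the normalized correlation used here is an assumption; the clipping of the ratio to [0, 1] is also an assumption.

```python
import numpy as np

def analyze_band_direction(s1, s2, bin_freqs_hz, d_mic=0.16, v_s=343.0,
                           num_dirs=41):
    """Delay-search analysis for one band and one frame (illustrative).

    s1, s2:       complex STFT bins of the two edge microphones in the band
    bin_freqs_hz: centre frequencies of those bins
    Returns (azimuth_deg, ratio).
    """
    thetas = np.radians(np.linspace(-90.0, 90.0, num_dirs))
    delays = d_mic * np.sin(thetas) / v_s                   # candidate delays (s)
    cross = s1 * np.conj(s2)                                # per-bin cross spectrum
    norm = np.sqrt(np.sum(np.abs(s1) ** 2) * np.sum(np.abs(s2) ** 2)) + 1e-12
    # Real-valued normalized correlation for each candidate delay (assumed form).
    c = np.array([np.real(np.sum(cross * np.exp(2j * np.pi * bin_freqs_hz * d)))
                  for d in delays]) / norm
    best = int(np.argmax(c))
    return np.degrees(thetas[best]), float(np.clip(c[best], 0.0, 1.0))
```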
In the described example the processing of the parametric spatial analysis block 500 provided the azimuth direction parameter θ(k,n) based on the delay estimations between the two microphones 106 at the edge of the electronic device 100. The angles were assumed to be at the front horizontal plane. This means that any sounds arriving from the rear directions are mirrored to the front. In some examples the parametric capture can resolve whether the sound is arriving from the front or the rear. In case it is determined that the sound arrives from the rear, the azimuth direction parameter θ(k,n) is mirrored to the rear (for example, a sound at 30 degrees is mirrored to 150 degrees).
Any suitable process can be used to determine if an azimuth direction parameter θ(k,n) should be at the front or the rear. In the example electronic device 100 of
In the example of
The parametric spatial synthesis block 504 receives the time-frequency audio signals 302, and the target patterns 400, and the spatial metadata 502 as inputs. The parametric spatial synthesis block 504 is configured to render a spatial audio output based on the respective inputs. Any suitable process can be used to render the spatial audio output.
As an illustrative example using the electronic device 100 of
For each frequency band k, the input energies are formulated for channel i=1,2 by
Then, the target energy is formulated using the spatial metadata 502 and the target patterns 400. The target patterns t(k,DOA) are available in frequency bands k. The target patterns 400 used by the parametric spatial synthesis block 504 can be the same as the target patterns 400 used by the beamforming-based spatial processing block 306, or could be derived from the same data set used for the target patterns used by the beamforming-based spatial processing block 306, or could be derived from a different data set.
From the target patterns t(k,DOA) the pattern with azimuth θ(k,n) and zero elevation is selected. If the target patterns 400 do not have exactly that data point available, the nearest DOA may be selected or interpolation can be performed. The selected target pattern vector is denoted t(k,θ(k,n)). The target energy for band k is then defined by
where t(k,θ(k,n),i) is the i:th row value of t(k,θ(k,n)). In this example, it was assumed that in the case of HRTF reproduction the target patterns t(k,θ(k,n)) are diffuse field equalized so that their mean value over surrounding spatial directions is
Then, the parametrically processed spatial audio output signals are obtained by
The division of the energies for obtaining g(k,n,i) can be regularized by adding a small epsilon to the divisor and by limiting the maximum value of g(k,n,i), for example, to a value of 4.
The signal s_p(k,n,i) is denoted in a vector form as s_p(b,n). This signal provides the first frequency portion 308 as shown in
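A sketch of the per-band synthesis is given below. The exact target-energy formula is not reproduced above, so the split between a directional term following the target pattern and an equally divided diffuse term, and their relative scaling, are assumptions; the epsilon regularization and the gain limit of 4 follow the description.

```python
import numpy as np

def synthesize_band(s_band, t_target, ratio, g_max=4.0, eps=1e-12):
    """Per-band parametric synthesis for one frame (illustrative).

    s_band:   STFT bins of the band, shape (num_bins_in_band, 2)
    t_target: selected target pattern t(k, theta(k, n)), shape (2,), complex
    ratio:    direct-to-total energy ratio r(k, n)
    """
    e_in = np.sum(np.abs(s_band) ** 2, axis=0)          # input energy per channel
    e_total = np.sum(e_in)
    # Assumed target energy: the directional part is distributed between the
    # channels according to the target pattern, the diffuse part split equally.
    t_energy = np.abs(t_target) ** 2
    e_target = e_total * (ratio * t_energy / (np.sum(t_energy) + eps)
                          + (1.0 - ratio) * 0.5)
    # Regularized, limited energy-correction gains (limit of 4 as described).
    gains = np.minimum(np.sqrt(e_target / (e_in + eps)), g_max)
    return s_band * gains[None, :]
```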
In the above described examples binaural rendering could be used. Other types of spatial audio outputs could be used. For stereo rendering the above described examples could be modified as appropriate or a different method could be used.
The signal s_p(b,n) provides the first frequency portion 308 that has been processed based on parametric audio and the signal s_b(b,n) provides the second frequency portion 310 that has been processed based on beamforming audio. These respective signals can be combined by the combiner 312. The combiner 312 can use any suitable process. In some examples the combiner 312 can select the bins from the different inputs so that the combined output is s_b(b,n) for bins where processing based on beamforming audio is feasible and the combined output is s_p(b,n) for bins where processing based on beamforming audio is not feasible. In some examples the combined output can be interpolated at some frequencies so that, for at least some frequencies, the combined output comprises a mixture s_b(b,n)·m(b) + s_p(b,n)·(1−m(b)), where m(b) is a value that transitions from 0 to 1, or from 1 to 0, over an interpolation frequency interval.
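A sketch of this combining is given below. Purely for illustration it assumes a single interpolation interval with the beamforming output used below it and the parametric output above it; other arrangements described above (for example, beamforming at the lowest and highest frequencies) would use a different m(b), and the actual feasible ranges depend on the device and its orientation.

```python
import numpy as np

def combine_outputs(s_b, s_p, fade_start, fade_end):
    """Combine the beam-processed and parametric outputs for one frame.

    s_b, s_p: outputs of the two processing paths, shape (num_bins, channels)
    fade_start, fade_end: bin indices bounding the interpolation interval
    (hypothetical values).
    """
    num_bins = s_b.shape[0]
    m = np.ones(num_bins)                 # m(b) = 1: use the beamforming output
    m[fade_end:] = 0.0                    # m(b) = 0: use the parametric output
    m[fade_start:fade_end] = np.linspace(1.0, 0.0, fade_end - fade_start)
    m = m[:, None]
    return s_b * m + s_p * (1.0 - m)
```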
In some examples of the disclosure the parametric-based spatial processing block 304 can perform the processing based on parametric audio for just the first frequency range and the beamforming-based spatial processing block 306 can perform the processing based on beamforming audio for just the second frequency range. In some examples one or both of the processing based on parametric audio or the processing based on beamforming audio can also be performed outside of the allocated frequency range.
For instance, in some examples the processing based on parametric audio can be performed for the entire frequency range. In such cases the processed audio signals 114 could be encoded with the spatial metadata 502 for the entire frequency range. This can enable the spatial metadata 502 to be used at a later stage and/or at a remote decoder to render the processed audio signals 114 to other formats.
In some cases, the examples of the disclosure could be implemented with head tracking. In such cases the beamforming-based spatial processing block 306 would determine the beam matrices 406 for different head orientations. The different head orientations could comprise orientations with 5-degree spacing for the yaw rotation. The beamforming-based spatial processing block 306 could use the methods described above, or any other suitable methods, but the target pattern 400 would be rotated based on the orientation of the head.
In such examples the beam matrices 406 would be used by the matrix multiplication block 408 as part of the online processing. To accommodate the head tracking the correct beam matrices 406 would be determined for the current head orientation. The correct beam matrices 406 can be selected by selecting the closest beam matrices 406 from the determined matrices, or via interpolation.
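A sketch of selecting the closest precomputed beam matrices for a tracked head yaw is given below; the 5-degree spacing follows the example above, while the indexing scheme is an assumption, and interpolation between neighboring matrices is omitted.

```python
def select_beam_matrices(beam_matrices_by_yaw, head_yaw_deg, spacing_deg=5.0):
    """Pick the precomputed beam matrices 406 closest to the tracked head yaw.

    beam_matrices_by_yaw: sequence indexed by round(yaw / spacing_deg),
    assumed to cover 0..355 degrees in 5-degree steps.
    """
    idx = int(round((head_yaw_deg % 360.0) / spacing_deg)) % len(beam_matrices_by_yaw)
    return beam_matrices_by_yaw[idx]
```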
The parametric-based spatial processing block 304 can take head orientation into account by rotating the determined direction parameters θ(k,n) before applying them in the rendering. In some examples the parametric-based spatial processing block 304 could also perform some additional processing of the audio signals based on the head orientation before the rendering.
Therefore, examples of the disclosure can be used with head tracking. The head-tracked binaural rendering can be performed by processing the stored or retrieved microphone audio signals 112 in real-time based on the tracked head orientation of the listener.
In examples that use head tracking the frequency ranges for the respective types of spatial processing can change over time based on the head orientation of the listener and the positions of the microphones 106 in the electronic device 100. For example, the crossover frequency between the first frequency range, for which processing based on parametric audio is used, and the second frequency range, for which processing based on beamforming audio is used, can be changed. The change in the frequency ranges can take into account that the suitable frequency range for the processing based on beamforming audio can be different for different head orientations.
Although the examples described above have been described for binaural and stereo outputs, other types of output could be used in other examples. The examples of the disclosure could be used to perform the processing for any multi-channel output, such as a 5.1 multi-channel loudspeaker setup. The processing steps could be as described above, with the processing based on beamforming audio being used for a frequency range where suitable beam patterns can be achieved with the microphones 106 in use and the processing based on parametric audio being used at other frequencies.
In examples where the output is a multi-channel output, such as a 5.1 multi-channel output, there are more output channels than would be used for a binaural output or a stereo output. The beam patterns used for the multi-channel output are narrower than the beam patterns used for binaural outputs or stereo outputs. This means that more microphones 106 might be needed to obtain the device responses 404 for such loudspeaker setups. Therefore, to implement examples of the disclosure for a multi-channel output the electronic device 100 might need more than three microphones 106. For example, the electronic device 100 might need five to ten microphones 106.
Examples of the disclosure can also be implemented with other types of signals. For instance, examples of the disclosure can be applied with cross-talk cancelled stereo reproduction. The cross-talk cancelled stereo can be used in electronic devices such as mobile phones that have stereo loudspeakers. The cross-talk cancelled stereo reproduction is applied in order to produce a perception of a stereo image wider than the physical locations of the loudspeakers. With such a reproduction system, the processing would be as above, but the target patterns 400 for normal stereo reproduction would be replaced by the corresponding target patterns 400 for the cross-talk cancelled stereo reproduction in the processing based on beamforming audio.
Examples of the disclosure can also be implemented with a soundbar reproduction system. In such examples the device responses 404 from the soundbar would be utilized for processing based on beamforming audio. The processing based on parametric audio could be performed using any suitable process.
In
The frequencies that are to be used for the respective ranges can be selected by inspecting the obtained beam patterns against the target patterns as shown in
In the examples of
However,
In the example of
In the example of
As illustrated in
The processor 102 is configured to read from and write to the memory 104. The processor 102 can also comprise an output interface via which data and/or commands are output by the processor 102 and an input interface via which data and/or commands are input to the processor 102.
The memory 104 is configured to store a computer program 116 comprising computer program instructions (computer program code 704) that controls the operation of the apparatus 700 when loaded into the processor 102. The computer program instructions, of the computer program 116, provide the logic and routines that enables the apparatus 700 to perform the methods described herein. The processor 102 by reading the memory 104 is able to load and execute the computer program 116.
The apparatus 700 therefore comprises: at least one processor 102; and at least one memory 104 storing instructions that, when executed by the at least one processor 102, cause an apparatus 700 at least to perform:
As illustrated in
The computer program 116 can comprise computer program instructions for causing an apparatus 700 to perform at least the following or for performing at least the following:
The computer program instructions can be comprised in a computer program 116, a non-transitory computer readable medium, a computer program product, a machine readable medium. In some but not necessarily all examples, the computer program instructions can be distributed over more than one computer program 116.
Although the memory 104 is illustrated as a single component/circuitry it can be implemented as one or more separate components/circuitry some or all of which can be integrated/removable and/or can provide permanent/semi-permanent/dynamic/cached storage.
Although the processor 102 is illustrated as a single component/circuitry it can be implemented as one or more separate components/circuitry some or all of which can be integrated/removable. The processor 102 can be a single core or multi-core processor.
References to “computer-readable storage medium”, “computer program product”, “tangibly embodied computer program” etc. or a “controller”, “computer”, “processor” etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.
As used in this application, the term “circuitry” can refer to one or more or all of the following:
This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
The blocks illustrated in the Figs. can represent steps in a method and/or sections of code in the computer program 116. The illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks and the order and arrangement of the blocks can be varied. Furthermore, it can be possible for some blocks to be omitted.
In the example of
The term ‘comprise’ is used in this document with an inclusive not an exclusive meaning. That is any reference to X comprising Y indicates that X may comprise only one Y or may comprise more than one Y. If it is intended to use ‘comprise’ with an exclusive meaning then it will be made clear in the context by referring to “comprising only one . . . ” or by using “consisting”.
In this description, the wording ‘connect’, ‘couple’ and ‘communication’ and their derivatives mean operationally connected/coupled/in communication. It should be appreciated that any number or combination of intervening components can exist (including no intervening components), i.e., so as to provide direct or indirect connection/coupling/communication. Any such intervening components can include hardware and/or software components.
As used herein, the term “determine/determining” (and grammatical variants thereof) can include, not least: calculating, computing, processing, deriving, measuring, investigating, identifying, looking up (for example, looking up in a table, a database or another data structure), ascertaining and the like. Also, “determining” can include receiving (for example, receiving information), accessing (for example, accessing data in a memory), obtaining and the like. Also, “determine/determining” can include resolving, selecting, choosing, establishing, and the like.
In this description, reference has been made to various examples. The description of features or functions in relation to an example indicates that those features or functions are present in that example. The use of the term ‘example’ or ‘for example’ or ‘can’ or ‘may’ in the text denotes, whether explicitly stated or not, that such features or functions are present in at least the described example, whether described as an example or not, and that they can be, but are not necessarily, present in some of or all other examples. Thus ‘example’, ‘for example’, ‘can’ or ‘may’ refers to a particular instance in a class of examples. A property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.
Although examples have been described in the preceding paragraphs with reference to various examples, it should be appreciated that modifications to the examples given can be made without departing from the scope of the claims.
Features described in the preceding description may be used in combinations other than the combinations explicitly described above.
Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.
Although features have been described with reference to certain examples, those features may also be present in other examples whether described or not.
The term ‘a’, ‘an’ or ‘the’ is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising a/an/the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use ‘a’, ‘an’ or ‘the’ with an exclusive meaning then it will be made clear in the context. In some circumstances the use of ‘at least one’ or ‘one or more’ may be used to emphasize an inclusive meaning but the absence of these terms should not be taken to infer any exclusive meaning.
The presence of a feature (or combination of features) in a claim is a reference to that feature or (combination of features) itself and also to features that achieve substantially the same technical effect (equivalent features). The equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way. The equivalent features include, for example, features that perform substantially the same function, in substantially the same way to achieve substantially the same result.
In this description, reference has been made to various examples using adjectives or adjectival phrases to describe characteristics of the examples. Such a description of a characteristic in relation to an example indicates that the characteristic is present in some examples exactly as described and is present in other examples substantially as described.
The above description describes some examples of the present disclosure however those of ordinary skill in the art will be aware of possible alternative structures and method features which offer equivalent functionality to the specific examples of such structures and features described herein above and which for the sake of brevity and clarity have been omitted from the above description. Nonetheless, the above description should be read as implicitly including reference to such alternative structures and method features which provide equivalent functionality unless such alternative structures or method features are explicitly excluded in the above description of the examples of the present disclosure.
Whilst endeavoring in the foregoing specification to draw attention to those features believed to be of importance it should be understood that the Applicant may seek protection via the claims in respect of any patentable feature or combination of features hereinbefore referred to and/or shown in the drawings whether or not emphasis has been placed thereon.
Priority application: 2319500.1, Dec 2023, GB (national).