Spatial Audio Processing

Abstract
Examples of the disclosure relate to spatial audio processing. An apparatus obtains at least two audio signals based on signals from at least two microphones. The apparatus performs a first spatial audio processing of the obtained audio signals for at least a first frequency range to generate a first output and performs a second spatial audio processing of the obtained audio signals for at least a second frequency range to generate a second output. The signal processing operations of the first spatial audio processing include processing based on parametric audio and the signal processing operations of the second spatial audio processing include processing based on beamforming audio. The apparatus combines the first output and the second output to generate a combined output.
Description
TECHNOLOGICAL FIELD

Examples of the disclosure relate to spatial audio processing. Some relate to spatial audio processing in systems with limited microphone arrangements.


BACKGROUND

Spatial audio requires audio signals captured by multiple microphones. Some audio capture devices, such as mobile phones, have a limited number of microphones and the microphones are located in positions that are not optimized for spatial audio capture. These modest microphone arrangements can lead to quality issues in the captured spatial audio.


BRIEF SUMMARY

According to various, but not necessarily all, examples of the disclosure there is provided an apparatus comprising means for:

    • obtaining at least two audio signals based on signals from at least two microphones;
    • performing a first spatial audio processing of the obtained audio signals for at least a first frequency range to generate a first output;
    • performing a second spatial audio processing of the obtained audio signals for at least a second frequency range to generate a second output, wherein the signal processing operations of the first spatial audio processing comprise processing based on parametric audio and the signal processing operations of the second spatial audio processing comprise processing based on beamforming audio; and
    • combining the first output and the second output to generate a combined output.


The processing based on parametric audio may vary more over time than the processing based on beamforming audio.


The respective outputs may comprise at least one of:

    • binaural outputs;
    • multi-channel outputs; and
    • stereo outputs.


The second spatial audio processing might not be performed for at least some of the first frequency range.


The first spatial audio processing might not be performed for at least some of the second frequency range.


The combining may comprise using the first output for the first frequency range and the second output for the second frequency range.


The combining may comprise applying a weighting to the first output and the second output so that the first output has a higher weighting than the second output in one frequency range and the second output has a higher weighting than the first output in another frequency range.


The means may be for determining an orientation of the microphone arrangement and applying a mode of at least one of the first spatial audio processing or the second spatial audio processing based on the determined orientation.


At least one of the first frequency range or the second frequency range may be different for different modes of the first spatial audio processing or the second spatial audio processing.


A first mode may comprise using the first spatial audio processing for the first frequency range and the second spatial audio processing for the second frequency range, and a second mode may comprise using the first spatial audio processing for both the first frequency range and the second frequency range.


A first mode may use a first set of coefficients for the second spatial audio processing and a second mode may use a second set of coefficients for the second spatial audio processing.


The means may be for adjusting at least one of the first spatial audio processing or the second spatial audio processing based on the head orientation of a listener.


According to various, but not necessarily all, examples of the disclosure there is provided an electronic device comprising an apparatus as described herein wherein the electronic device is at least one of: a smart phone, a personal computer, or an image capturing device.


According to various, but not necessarily all, examples of the disclosure there is provided a method comprising:

    • obtaining at least two audio signals based on signals from at least two microphones;
    • performing a first spatial audio processing of the obtained audio signals for at least a first frequency range to generate a first output;
    • performing a second spatial audio processing of the obtained audio signals for at least a second frequency range to generate a second output, wherein the signal processing operations of the first spatial audio processing comprise processing based on parametric audio and the signal processing operations of the second spatial audio processing comprise processing based on beamforming audio; and
    • combining the first output and the second output to generate a combined output.


According to various, but not necessarily all, examples of the disclosure there is provided a computer program comprising instructions which, when executed by an apparatus, cause the apparatus to perform at least:

    • obtaining at least two audio signals based on signals from at least two microphones;
    • performing a first spatial audio processing of the obtained audio signals for at least a first frequency range to generate a first output;
    • performing a second spatial audio processing of the obtained audio signals for at least a second frequency range to generate a second output, wherein the signal processing operations of the first spatial audio processing comprise processing based on parametric audio and the signal processing operations of the second spatial audio processing comprise processing based on beamforming audio; and
    • combining the first output and the second output to generate a combined output.


While the above examples of the disclosure and optional features are described separately, it is to be understood that their provision in all possible combinations and permutations is contained within the disclosure. It is to be understood that various examples of the disclosure can comprise any or all of the features described in respect of other examples of the disclosure, and vice versa. Also, it is to be appreciated that any one or more or all of the features, in any combination, may be implemented by/comprised in/performable by an apparatus, a method, and/or computer program instructions as desired, and as appropriate.





BRIEF DESCRIPTION

Some examples will now be described with reference to the accompanying drawings in which:



FIG. 1 shows an example electronic device;



FIG. 2 shows an example method;



FIG. 3 shows an example processor;



FIG. 4 shows an example beamforming-based spatial processing block;



FIG. 5 shows an example parametric-based spatial processing block;



FIGS. 6A and 6B show example beam pattern designs; and



FIG. 7 shows an example apparatus.





The figures are not necessarily to scale. Certain features and views of the figures can be shown schematically or exaggerated in scale in the interest of clarity and conciseness. For example, the dimensions of some elements in the figures can be exaggerated relative to other elements to aid explication. Corresponding reference numerals are used in the figures to designate corresponding features. For clarity, all reference numerals are not necessarily displayed in all figures.


DETAILED DESCRIPTION


FIG. 1 shows an example electronic device 100. The example electronic device 100 can comprise an audio capture device that can be configured to capture audio for spatial audio signals. The electronic device 100 can be a smart phone, a personal computer, an image capturing device, a teleconference device, a capture device integrated in a vehicle, or any other suitable type of device comprising at least two microphones.


In the example of FIG. 1 the electronic device 100 comprises a processor 102, a memory 104, multiple microphones 106, a transceiver 108 and storage 110. Only components of the electronic device 100 that are referred to below are shown in FIG. 1. The electronic device 100 can comprise additional components that are not shown.


The microphones 106 can comprise any means that can be configured to detect audio signals. The microphones 106 provide microphone audio signals 112 as an output. The microphone audio signals 112 are based on the detected audio signals.


In the example of FIG. 1 the electronic device 100 comprises three microphones 106. The microphones 106 can be positioned within the electronic device 100 so as to enable spatial audio capture. In the example of FIG. 1 a first microphone 106A is provided at a first end of the electronic device 100, a second microphone 106B is provided at a second end of the electronic device 100, and a third microphone 106C is provided on the rear of the electronic device 100. In this example, the microphones 106A-C are positioned on, or close to, a single axis. Other numbers and arrangements of microphones 106 could be used in other examples of the disclosure.


The electronic device 100 is configured so that the microphone audio signals 112 from the microphones 106 are provided to the processor 102 as an input. The microphone audio signals 112 can be provided to the processor 102 in any suitable format. In some examples the microphone audio signals 112 can be provided to the processor 102 in a digital format. The digital format could comprise pulse code modulation (PCM) or any other suitable type of format. In other examples the microphones 106 could comprise analog microphones 106. In such examples an analog-to-digital converter can be provided between the microphones 106 and the processor 102.


The processor 102 can be configured to process the microphone audio signals 112 to perform spatial audio processing of the microphone audio signals 112. The spatial audio processing comprises processing the microphone audio signals 112 so as to enable spatial properties of the captured audio to be rendered. This can enable a listener to perceive the spatial properties of the audio. For example, the listener can perceive different directions or positions for audio sources.


The processor 102 can be configured to access program code 116 stored in the memory 104. The processor 102 can use the program code 116 to perform spatial audio processing of the microphone audio signals 112. The processor 102 can use the program code 116 to implement examples of the disclosure. The processor 102 can be configured to use methods as shown in FIGS. 2 to 5 and/or as described herein, to perform the spatial audio processing.


The processor 102 processes the microphone audio signals 112 to provide a processed audio signal 114 as an output. Other types of processing could also be applied to the microphone audio signals 112 by the processor 102. For example, the processor 102 could also perform speech enhancement, noise reduction, and/or any other suitable procedure.


The processed audio signal 114 can be provided in any suitable format. The processed audio signal 114 can be provided in an encoded form such as an Immersive Voice and Audio Services (IVAS) stream, a binaural audio signal, a surround sound loudspeaker signal, Ambisonic audio signals, or any other suitable type of spatial audio output. Other formats for the processed audio signal 114 could also be used.


The electronic device 100 shown in FIG. 1 comprises a transceiver 108 and storage 110. The processor 102 is coupled to the transceiver 108 and/or storage 110 so that the processed audio signal 114, and/or any other outputs, from the processor 102 can be provided to the transceiver 108 and/or the storage 110.


The transceiver 108 can comprise any means that can enable data to be transmitted from the electronic device 100. This can enable the processed audio signal 114, and/or any other suitable data, to be transmitted from the electronic device 100 to an audio rendering device or any other suitable device.


The storage 110 can comprise any means for storing the processed audio signal 114 and/or any other outputs from the processor 102. The storage 110 could comprise one or more memories or any other suitable means. The processed audio signal 114 can be retrieved from the storage 110 at a later time. This can enable the electronic device 100 to render the processed audio signal 114 at a later time or can enable the processed audio signal 114 to be transmitted to another device.


In some examples the processed audio signal 114 can be associated with other information. For instance, the processed audio signal 114 can be associated with images such as video images. The images or video images could be captured by a camera of the electronic device 100. The processed audio signal 114 can be associated with the other information so that the processed audio signal 114 can be stored with this information and/or can be transmitted with this other information. The association can enable the images or other information to be provided to a user of the playback device when the processed audio signal 114 is played back.


Devices other than the electronic device 100 shown in FIG. 1 can be used to implement examples of the disclosure. For instance, in the example of FIG. 1 the processor 102 obtains the audio signals 112 from the microphones 106 of the device 100. In other examples the processor 102 could receive the audio signals 112 from a different source, such as the transceiver 108 or the storage 110.


The example electronic device 100 of FIG. 1 comprises a modest arrangement of microphones 106 that can be used to capture spatial audio. The limited number of microphones 106 and their positions are not ideal for spatial audio capture. For example, there will be a large portion of frequencies for which the arrangement of the microphones 106 is too sparse. In some example devices 100 the arrangement of the microphones 106 could be too small and/or too sparse.


Parametric spatial audio capture can be used with such arrangements of microphones 106; however, this is not ideal in all circumstances. For instance, if there is a single sound source then parametric spatial audio capture can be used to accurately determine the direction. This results in a stable reproduction of the sound source.


However, if there are multiple sound sources that are active at the same time, the direction metadata can fluctuate, which can cause perceivable instabilities in the reproduction of the sound sources.


Examples of the disclosure address these issues and provide methods of spatial audio processing that can provide high quality, stable reproductions even with a limited microphone arrangement.


In examples of the disclosure the electronic device 100 can be operated in multiple orientations. The electronic device 100 is shown in a landscape orientation in FIG. 1. The electronic device 100 could also be used in other orientations such as a portrait orientation, a main camera orientation or a selfie camera orientation. Other types of electronic devices 100 could be arranged in different orientations. The methods and processes described herein can be initialized differently for different orientations of the electronic device 100. In some examples, the methods described herein could be used if the electronic device 100 is in a first orientation and not used if the electronic device 100 is in a second orientation.


For example, in the electronic device 100 in FIG. 1 the microphones 106 are located in approximately the horizontal plane when the electronic device 100 is in the landscape orientation. This arrangement of microphones 106 does not support horizontal spatial audio capture very well when the electronic device 100 is arranged in a portrait orientation. Other example electronic devices 100 could comprise two microphones 106 at one edge of the electronic device 100 (for example, the right edge of an electronic device 100 as shown in FIG. 1) with suitable spacing so that the processing according to the present disclosure can also be performed when the electronic device 100 is in the portrait orientation. However, even in this case the algorithms would initialize differently because the microphone 106 arrangement and spacing would be different depending on the orientation.



FIG. 2 shows an example method. The example method could be implemented by an electronic device 100 as shown in FIG. 1, by an apparatus 700 as shown in FIG. 7 or by any other suitable means.


At block 200 the method comprises obtaining at least two audio signals based on signals from at least two microphones 106. The microphones 106 can be arranged in a modest or limited arrangement or could be in any other suitable type of arrangement.


At block 202 the method comprises performing a first spatial audio processing of the obtained audio signals for at least a first frequency range to generate a first output. At block 204 the method comprises performing a second spatial audio processing of the obtained audio signals for at least a second frequency range to generate a second output.


The first and second spatial audio processing can comprise different types of processing. The signal processing operations of the first spatial audio processing comprise processing based on parametric audio. The processing based on parametric audio can comprise obtaining parameters such as energy and direction parameters for multiple frequency bands, and using them and the audio signals to generate the first output.


The signal processing operations of the second spatial audio processing comprise processing based on beamforming audio. The processing based on parametric audio varies more over time than the processing based on beamforming audio. Processing based on beamforming audio refers to combining the microphone signals, typically on a frequency-by-frequency basis, with time-variant or time-invariant processing weights so that, as a result, audio signals are obtained that have certain spatial capture characteristics.


The frequency ranges for which the respective spatial audio processes are used can comprise subsets of the available frequency ranges. In such examples the second spatial audio processing is not performed for at least some of the first frequency range and/or the first spatial audio processing is not performed for at least some of the second frequency range.


The frequencies within a frequency range do not need to be continuous. For example, it could be that the second frequency range comprises the lowest and highest frequencies and the first frequency range comprises the frequency portion in between. In such cases the lowest and highest frequencies would be processed based on beamforming audio and the frequency portion in between would be processed based on parametric audio.


The frequencies within the respective ranges can be dependent upon the arrangement of the microphones 106, the orientation of the electronic device 100 and/or any other suitable factor. The frequencies that are to be used for the respective ranges can be selected using any suitable procedure.


At block 206 the method comprises combining the first output and the second output to generate a combined output. The combined output can comprise at least some of the first output for a first frequency range and at least some of the second output for a second frequency range.


The combining can comprise any suitable process. In some examples the combining can comprise using the first output for the first frequency range and the second output for the second frequency range. This type of combining could be used if there is no overlap in the respective frequency ranges.


In some examples the combining can comprise applying a weighting to the first output and the second output. The weightings can be applied so that the first output has a higher weighting than the second output in a frequency range and the second output has a higher weighting than the first output in another frequency range. The weightings can be applied so that the first output is the main or dominant source in one frequency range and the second output is the main or dominant source in another frequency range. The frequency ranges for which the weightings apply can be the first frequency range and the second frequency range or could be different frequency ranges.


The combined output can comprise any suitable type of output. For example, the combined output can comprise at least one of binaural outputs, multi-channel outputs, stereo outputs, or any other suitable type of outputs.


In some examples the method can comprise additional blocks that are not shown in FIG. 2. For example, the spatial audio processes could be dependent upon the orientation of the electronic device 100 or apparatus 700. In such examples the method can comprise determining an orientation of the microphone arrangement and applying a mode of at least one of the first spatial audio processing or the second spatial audio processing based on the determined orientation.


In some examples the different modes of the respective spatial audio processes could use different ranges. That is the first frequency range and/or the second frequency range would be different depending on whether the electronic device 100 or apparatus 700 is in a landscape or portrait orientation.


In some examples a first mode could be to use one of the respective spatial audio processes and a second mode could be not to use the respective spatial audio processes. For example, a first mode could comprise using the first spatial audio processing for the first frequency range and using the second spatial audio processing for the second frequency range, and a second mode could comprise using the first spatial audio processing for both the first frequency range and the second frequency range. In this case the first spatial audio processing would be used for all frequencies and the second spatial audio processing would not be used.


In some examples the different modes could comprise modifications to the respective spatial audio processes. For instance, in a first mode a first set of coefficients could be used for the second spatial audio processing and in a second mode a second set of coefficients could be used for the second spatial audio processing. The coefficients could be used in mixing or processing matrices used in the spatial audio processes.


In some examples one or more of the spatial audio processes could be adjusted based on a head orientation of a listener or any other suitable factor.



FIG. 3 schematically shows the operation of an example processor 102 according to examples of the disclosure. The processor 102 can be comprised in an electronic device 100 as shown in FIG. 1, or in any other suitable type of electronic device 100.


The processor 102 receives the microphone audio signals 112 as an input. The microphone audio signals 112 can be obtained from two or more microphones 106. The microphones 106 can be arranged in an electronic device 100 as shown in FIG. 1 or could be arranged in any other suitable arrangement. In some examples the microphone audio signals 112 can be retrieved from a memory 104 or could be received from another device.


The microphone audio signals 112 are provided as an input to a time-frequency transform block 300. The time-frequency transform block 300 is configured to convert the input microphone audio signals 112 from a time domain signal to a time-frequency domain signal. The time-frequency transform block 300 can be configured to apply any suitable time-frequency transform to convert the microphone audio signals 112 to time-frequency audio signals 302. For example, the transform used by the time-frequency transform block 300 could comprise a short-time Fourier transform (STFT) configured to take a frame of 1024 samples of the microphone audio signals 112, concatenate this frame with the previous 1024 samples, apply the square root of a 2*1024-length Hann window to the concatenated frames, and apply a fast Fourier transform (FFT) to the result. Other time-frequency transforms and configurations can be used. The time-frequency transform block 300 provides time-frequency audio signals 302 as an output.
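A minimal sketch of this windowing and transform is given below. The frame length, the square-root Hann window and the FFT follow the example above; the function name, the exact frame ordering and the single-channel handling are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def stft_frames(x, frame_len=1024):
    """Illustrative STFT: each new 1024-sample frame is concatenated with the
    previous 1024 samples, weighted by the square root of a 2*1024-length Hann
    window, and transformed with an FFT (a sketch, not the exact implementation)."""
    window = np.sqrt(np.hanning(2 * frame_len))
    prev = np.zeros(frame_len)
    spectra = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        buffer = np.concatenate([prev, frame]) * window  # previous + current frame
        spectra.append(np.fft.rfft(buffer))              # frequency bins for this frame
        prev = frame
    return np.array(spectra)                             # shape: (num_frames, num_bins)
```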


The time-frequency audio signals 302 are provided as an input to both a parametric-based spatial processing block 304 and a beamforming-based spatial processing block 306.


The parametric-based spatial processing block 304 can be configured to perform processing based on parametric audio. The processing based on parametric audio can comprise parametric capture and rendering. The parametric capture and rendering can use any suitable processes. The parametric capture and rendering can comprise estimating spatial parameters such as directions and direct-to-total energy ratios based on the microphone audio signals 112. These estimated parameters can then be used to synthesize a spatial audio output accordingly. FIG. 5 shows an example of a parametric-based spatial processing block 304 in more detail.


The processor 102 is configured so that the parametric-based spatial processing block 304 performs the processing based on parametric audio for a first frequency range.


The beamforming-based spatial processing block 306 can be configured to perform processing based on beamforming audio. The processing based on beamforming audio can comprise rendering based on measured or simulated responses of the electronic device 100 so that beam patterns are generated. In examples where the output processed audio signal 114 is a binaural audio signal the beam patterns that are generated can approximate the two complex-valued beam patterns (amplitude and phase) of the head-related transfer functions related to the left and right ears of a human listener. In other examples the beam patterns that are generated can comprise beam patterns that correspond to amplitude panning patterns suitable for stereo reproduction. FIG. 4 shows an example of a beamforming-based spatial processing block 306 in more detail.


The processor 102 is configured so that the beamforming-based spatial processing block 306 performs the processing based on beamforming audio for a second frequency range. The second frequency range can be different to the first frequency range. There can be some overlap between the first frequency range and the second frequency range.


The frequencies within a frequency range do not need to be continuous. For example, it could be that the second frequency range comprises the lowest and highest frequencies and the first frequency range comprises the frequency portion in between. In such cases the lowest and highest frequencies would be processed based on beamforming audio and the frequency portion in between would be processed based on parametric audio.


The frequencies within the respective ranges can be dependent upon the arrangement of the microphones 106, the orientation of the electronic device 100 and/or any other suitable factor.


The parametric-based spatial processing block 304 provides a first output. The first output comprises a first frequency portion signal 308. The first frequency portion signal 308 comprises a parametric-based processed time-frequency audio signal. The beamforming-based spatial processing block 306 provides a second output. The second output comprises a second frequency portion signal 310. The second frequency portion signal 310 comprises a beamforming-based processed time-frequency audio signal.


Both the first frequency portion signal 308 and the second frequency portion signal 310 are provided to combiner 312. The combiner 312 is configured to combine the first frequency portion signal 308 and the second frequency portion signal 310. The combiner 312 therefore combines a first output that has been processed based on parametric audio and a second output that has been processed based on beamforming audio.


In some examples the first frequency range and the second frequency range might not overlap. In such examples the combiner 312 can generate a processed time-frequency audio signal 314 by making a frequency-dependent selection of the respective inputs. In some examples there could be some overlap between the first frequency range and the second frequency range. In such examples the combiner 312 can generate a processed time-frequency audio signal 314 by performing smooth transitioning between the respective signals. For some frequencies the combined output can be based on both inputs.


The combiner 312 provides a processed time-frequency audio signal 314 as an output. The processed time-frequency audio signal 314 is provided to an inverse time-frequency transform block 316.


The inverse time-frequency transform block 316 is configured to convert the processed time-frequency audio signal 314 to a time domain signal. The inverse time-frequency transform block 316 can be configured to apply any suitable inverse time-frequency transform to convert the processed time-frequency audio signal 314 to a time domain signal. The transform used by the inverse time-frequency transform block 316 can be the inverse of the transform used by the time-frequency transform block 300. If the time-frequency transform block 300 uses an STFT then the inverse time-frequency transform block 316 can use an inverse STFT.


The inverse time-frequency transform block 316 provides a processed PCM audio signal 318 as an output. In the example of FIG. 3 the processed PCM audio signal 318 is provided to an encoding block 320. The encoding block 320 is configured to encode the processed PCM audio signal 318. Any suitable procedure can be used for the encoding. For example, Advanced Audio Coding (AAC) or Immersive Voice and Audio Services (IVAS) could be used.


The encoding block 320 provides processed audio signals 114 as an output. The processed audio signals 114 can be the output of the processor 102 as shown in FIG. 1. The processed audio signals 114 can be directly reproduced (using headphones, loudspeakers, or any other suitable means), and/or they could be stored and/or transmitted to another device.


In some examples the encoding block 320 could be omitted. In such cases the processed PCM audio signal 318 could be the output of the processor 102.



FIG. 4 shows an example beamforming-based spatial processing block 306 in more detail. The processing of the beamforming-based spatial processing block 306 comprises some offline steps and some online steps. The offline steps only need to be performed once. The offline steps can be performed at an initialization stage of software that performs the processing according to the present disclosure. The offline steps can be performed even before compiling the processing software, so that the program code 116 only contains the resulting beam matrices or other coefficients that are to be used at the online processing stage. The offline steps could be performed at other times in other examples.


In the offline steps target patterns 400 and device response 404 are provided as an input to a beam matrices design block 402.


The target patterns 400 could comprise head-related transfer functions (HRTFs) or any other suitable type of patterns. The HRTFs comprise a set of complex value pairs for different frequencies and different directions. The respective HRTF pairs comprise the amplitude and phase responses corresponding to sound arriving to human ears from a defined direction, at a defined frequency.


The HRTFs or other target patterns 400 can be based on measurements (real person or dummy head) or simulation, or a mixture of measurements and simulation. For example, target patterns 400 for the lowest frequencies can be obtained from simulations, but the target patterns 400 for the higher frequencies can be from measurements.


The target patterns 400 could comprise loudspeaker-based beam patterns. Such target patterns 400 could be used when the use case is to capture multi-channel signals (such as stereo) instead of a binaural signal. Details of examples of these loudspeaker-patterns are described further below.


The device responses 404 are also provided as inputs to the beam matrices design block 402. Similar to the target patterns 400, the device responses 404 can comprise a set of complex weights per direction and frequency and indicate the phase and amplitude of sounds arriving at the microphones 106 of an electronic device 100 from different directions and at different frequencies. The device responses 404 can also be based on measurements or simulation, or a mixture of measurements and simulation.


The beam matrices design block 402 is configured to use the input target patterns 400 and device responses 404 to generate a set of coefficients for beam matrices for use by the beamforming-based spatial processing block 306.


The beam matrices design block 402 can use any suitable procedure. As an example, the target patterns 400 can be denoted as t(b,DOA), which is a column vector comprising two (or more) entries, where b is the frequency bin index and DOA is the direction of the arriving sound, which may indicate azimuth and elevation. In the present example, the target patterns 400 are for binaural or stereo output. Therefore, the column vector t(b,DOA) has two rows. A first row is for the left output channel and a second row is for the right output channel.


For binaural output, the vector t(b, DOA) comprises the complex-valued HRTFs for bin b and the direction DOA.


For stereo loudspeaker output, the target patterns 400 could be defined by first converting the direction DOA to a unit vector pointing towards the corresponding direction, finding y(DOA), the y-dimension value of that unit vector, and then converting it to a cone-of-confusion azimuth value θcc(DOA)=sin−1 y(DOA). Then, the target pattern is determined so that












$$t(b,\mathrm{DOA}) = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \quad \text{if } \theta_{cc}(\mathrm{DOA}) \ge 30^{\circ}; \ \text{and}$$

$$t(b,\mathrm{DOA}) = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \quad \text{if } \theta_{cc}(\mathrm{DOA}) \le -30^{\circ};$$




and when −30°<θcc(DOA)<30° then t(b, DOA) values are the amplitude panning gains according to a suitable panning scheme, such as the tangent panning law assuming loudspeakers at ±30°. In this example of stereo loudspeaker output, the target patterns 400 do not depend on bin b. In some examples, the target patterns 400 for loudspeaker playback could also comprise a phase in dependence of bin b. For example, the phase of the target value could be the phase of the left and right microphones of the electronic device 100 for the same frequency and direction.
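A non-authoritative sketch of this stereo target pattern construction is shown below. The loudspeakers at ±30° and the cone-of-confusion conversion follow the description above; the coordinate convention, the energy normalization of the panning gains, and the function name are assumptions.

```python
import numpy as np

def stereo_target_pattern(azimuth_deg, elevation_deg=0.0, spk_angle_deg=30.0):
    """Sketch: convert a DOA to a cone-of-confusion azimuth via the y component of
    its unit vector, then derive left/right gains with the tangent panning law
    (names and coordinate convention are illustrative assumptions)."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    y = np.cos(el) * np.sin(az)               # assumed y component of the DOA unit vector
    theta_cc = np.degrees(np.arcsin(y))       # cone-of-confusion azimuth
    if theta_cc >= spk_angle_deg:
        return np.array([1.0, 0.0])           # fully left
    if theta_cc <= -spk_angle_deg:
        return np.array([0.0, 1.0])           # fully right
    # Tangent panning law: tan(theta)/tan(theta0) = (gL - gR) / (gL + gR)
    ratio = np.tan(np.radians(theta_cc)) / np.tan(np.radians(spk_angle_deg))
    g_left, g_right = 1.0 + ratio, 1.0 - ratio
    norm = np.sqrt(g_left**2 + g_right**2)    # energy normalization (assumption)
    return np.array([g_left, g_right]) / norm
```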


The device responses 404 are the same regardless of the processing output. The device responses 404 can be denoted as a complex-valued column vector r(b,DOA) comprising as many rows as there are microphones 106. In some cases, the signals of some microphones 106 are omitted. For example, an electronic device 100 could comprise three microphones 106 in the same plane and one further microphone 106 in a different plane above one of the three microphones 106. In such examples this further microphone 106 can be omitted in the design of the beam matrices.


The beam matrices design block 402 is configured to design a beam matrix B(b) that aims to optimize the relation







$$t(b,\mathrm{DOA}) = B(b)\, r(b,\mathrm{DOA})$$






For any response r(b,DOA) from any DOA, the matrix B(b) maps the response to the target response t(b,DOA) at that same direction. Therefore, ideally the matrix B(b) is designed so that a sound that arrives from any direction is captured as if the capture was according to the target patterns 400.


The matrix B(b) only approximates this. Such a matrix can only feasibly be obtained when the wavelength is not too short with respect to the spacing of the microphones 106 (due to spatial aliasing), and not too long with respect to the spacing of the microphones 106 (due to diminishing inter-microphone difference signals and array noise).


For an electronic device 100 in landscape mode as shown in FIG. 1 the suitable frequency range to obtain the matrix B(b) is from around 0 kHz to around 1 kHz or above. For smaller microphone array sizes and correspondingly smaller spacing of microphones 106, the suitable frequency range could start from a frequency higher than zero, but could also range to higher frequencies. When the output is stereo (not binaural), the suitable frequency range for processing based on beamforming audio is typically smaller because the target patterns 400 that approximate amplitude panning are significantly more spatially selective at the lowest frequencies when compared to HRTF-based target patterns 400. At the lowest frequencies the audio wavelength is long with respect to the size of the microphone array and achieving such spatially selective beams is no longer feasible. Therefore, for stereo rendering the lowest frequencies can be rendered using the processing based on parametric audio.


The matrix B(b) can be solved by first determining a suitable set of directions DOA at which the operation of the matrix is optimized. What happens in between these directions is not separately optimized, and as such the set of directions should not be too sparse. A practical choice for the optimization is to obtain the r(b,DOA) and t(b,DOA) in the horizontal plane with suitable spacing (for example, at 5-degree intervals, denoted DOA1, DOA2, . . . , DOA72), and then stack the data into matrices







$$R(b) = \begin{bmatrix} r(b,\mathrm{DOA}_1) & r(b,\mathrm{DOA}_2) & \cdots & r(b,\mathrm{DOA}_{72}) \end{bmatrix}$$

$$T(b) = \begin{bmatrix} t(b,\mathrm{DOA}_1) & t(b,\mathrm{DOA}_2) & \cdots & t(b,\mathrm{DOA}_{72}) \end{bmatrix}$$





The solution then is







$$B(b) = T(b)\,\mathrm{inverse}\big(R(b)\big)$$






where the operator inverse( ) could be the Moore-Penrose pseudoinverse; however, at least some regularization of the inverse is preferred. Omitting the bin index in the notation, one example is to first use a singular value decomposition R=USVH, and then perform inverse processing for the diagonal matrix S. However, S is not necessarily a square matrix, depending on the number of microphones 106. The regularized inverse processing can be performed by transposing S, and performing, for all the diagonal elements, the operation









$$\hat{s}_{i,i} = \frac{1}{\max(\alpha_s\, s_{\max},\ s_{j,j})},$$




where sj,j is the j:th row, j:th column element of S; smax is the maximum sj,j over all j; and αs is a regularization factor, which could be 0.01. The values ŝi,i then replace the corresponding diagonal values of the transposed S, and the result is denoted Ŝ−1. The regularized inverse is then inverse(R)=VŜ−1UH.
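The offline design can be sketched numerically as follows, assuming the per-bin matrices R(b) and T(b) have already been stacked as described above. The function name and array layout are illustrative; the regularization follows the formulas above.

```python
import numpy as np

def design_beam_matrix(R, T, alpha_s=0.01):
    """Sketch of the beam matrix design: B(b) = T(b) inverse(R(b)), where inverse()
    is a regularized pseudoinverse computed via an SVD of R.
    R: (num_mics, num_directions) device responses for one frequency bin.
    T: (num_outputs, num_directions) target patterns for the same directions.
    alpha_s: regularization factor (0.01 in the example above)."""
    U, s, Vh = np.linalg.svd(R, full_matrices=False)
    s_max = np.max(s)
    s_reg = 1.0 / np.maximum(alpha_s * s_max, s)      # regularized inverse singular values
    # inverse(R) = V * diag(s_reg) * U^H
    R_inv = Vh.conj().T @ np.diag(s_reg) @ U.conj().T
    return T @ R_inv                                   # beam matrix B(b)
```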


The matrices B(b) are the beam matrices 406 that are provided as the output of the beam matrices design block 402. The beam matrices 406 are provided as an input to a matrix multiplication block 408. The matrix multiplication block 408 can be performed as part of the online processing.


The matrix multiplication block 408 also receives the time-frequency audio signals 302 as an input. The time-frequency audio signals 302 can be denoted as a column vector s(b,n) comprising as many rows as there are channels, and where n is the current temporal frame index (of the time-frequency transform).


The matrix multiplication block 408 performs the processing by








$$s_b(b,n) = B(b)\, s(b,n)$$






where the subscript b denotes that the output is the beam-processed output.


The matrix multiplication block 408 provides the resulting signal sb(b,n) as an output signal. The output signal is the second frequency portion signal 310 as shown in FIG. 3.
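A minimal sketch of this online step, assuming the beam matrices are stored per bin (the names and array shapes are illustrative assumptions):

```python
import numpy as np

def apply_beam_matrices(B, s):
    """Sketch of the online step s_b(b, n) = B(b) s(b, n) applied to every bin.
    B: (num_bins, num_outputs, num_mics) beam matrices from the offline design.
    s: (num_bins, num_mics) time-frequency audio signals for one frame n."""
    return np.einsum('bom,bm->bo', B, s)   # (num_bins, num_outputs)
```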


The beam matrices 406 can be dependent upon the orientation of the electronic device 100. For example, if the electronic device 100 is in the landscape orientation as shown in FIG. 1 then a first set of beam matrices 406 can be determined and used for that orientation. If the microphone array also supports processing in a portrait orientation, then a second set of beam matrices 406 can be determined and used for that orientation. Similarly, different beam matrices 406 can be determined based on whether a main or selfie camera of the electronic device 100 is in use.


The respective orientations of the electronic device 100 can be associated with corresponding beam matrices 406 unless the microphone arrangement has enough symmetry that two or more different orientations acoustically match each other. In such situations these orientations of the electronic device 100 can be associated with the same beam matrices 406. In some of these cases the microphone channels can be mutually switched if the left and right directions are different for the different orientations.


In the method described above the regularization factor αs used by the beam matrices design block 402 was set to 0.01. The regularization factor can be set to any value between 0 and 1, where the value 0 means no regularization and 1 means maximum regularization. The value of the regularization factor depends on the use case. Smaller values mean a more accurate approximation of a beam design but also higher amplification of potential incoherent noises such as wind noise or microphone noise. For example, an electronic device 100 for outdoor use could have a regularization factor αs=0.1 and an electronic device 100 for indoor use could have a regularization factor αs=0.01. The same electronic device 100 can have more than one mode depending on where it is used. The best mode can be detected automatically (for example, based on wind noise detection, camera input, or any other suitable input). In examples where smaller values of the regularization factor are used, for example if αs=0.01, the processing can comprise means to avoid excessive noise amplification. For example, the input and output energies of the beam matrix processing can be monitored, and a limit can be applied as to how much louder the output can be. For example, a limiter could be configured so that the mean output energy of the beam matrix processing would not be more than twice the mean input energy. This limiter would operate per frequency bin, and would remove any overshoots due to noise, for example amplification of microphone noise.
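A simplified per-bin limiter along these lines might look as follows. The factor of two follows the example above; operating per frame rather than on longer-term mean energies, and the names and shapes, are assumptions.

```python
import numpy as np

def limit_beam_output(s_in, s_out, max_energy_ratio=2.0):
    """Sketch of a per-bin limiter: if the beamformed output energy exceeds
    max_energy_ratio times the input energy at a bin, the output is scaled down
    so that the ratio is not exceeded (e.g. to suppress amplified microphone or
    wind noise). s_in: (num_bins, num_mics); s_out: (num_bins, num_outputs)."""
    e_in = np.mean(np.abs(s_in) ** 2, axis=-1)           # mean input energy per bin
    e_out = np.mean(np.abs(s_out) ** 2, axis=-1)         # mean output energy per bin
    gain = np.ones_like(e_in)
    mask = e_out > max_energy_ratio * e_in
    gain[mask] = np.sqrt(max_energy_ratio * e_in[mask] / (e_out[mask] + 1e-12))
    return s_out * gain[:, None]
```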



FIG. 5 shows an example parametric-based spatial processing block 304 in more detail. The parametric processing used in the following examples could, in some examples, be replaced with other parametric processes.


The parametric-based spatial processing block 304 comprises a parametric spatial analysis block 500 and a parametric spatial synthesis block 504.


The parametric spatial analysis block 500 receives the time-frequency audio signals 302 and uses the time-frequency audio signals 302 to determine spatial metadata 502. The spatial metadata 502 can comprise a direction parameter and a direct-to-total energy ratio parameter for frequency bands k. One frequency band k can contain multiple frequency bins b, and has the lowest frequency bin blow(k) and the highest bin bhigh(k). The frequency resolution of these bands could be one that follows the Bark frequency scale, or any other suitable frequency resolution. The typical practical design principle is that for higher bands k there is a higher number of bins b included within the bands.


Any suitable process can be used to determine the spatial metadata 502. In some examples the spatial metadata 502 can be determined as follows. In this example the spatial metadata 502 is determined using only a subset of the available microphone audio signals 112. For example, only the microphone audio signals 112 from the microphones 106 at the edges of the electronic device 100 might be used.


s(b,n,1) and s(b,n,2) can denote the first and second row entries of s(b,n). Cross correlation can be formulated by







$$c(b,n) = s(b,n,1)\, s^{*}(b,n,2)$$






Then, a real-valued normalized correlation measure in the range from bin b0 to b1 for delay d can be defined by








$$c_{\mathrm{norm}}(b_0, b_1, d, n) = \frac{\operatorname{Real}\!\left( \displaystyle\sum_{b=b_0}^{b_1} e^{-i 2\pi f(b) d}\, c(b,n) \right)}{\displaystyle\sum_{b=b_0}^{b_1} \left| c(b,n) \right|}$$








The delay search range can be selected based on the spacing of the microphones 106. The distance between a first microphone 106A and a second microphone 106B can be denoted dmic. This distance could be 0.16 meters. Then, the search could be performed by determining 41 uniformly spaced azimuth search directions θd ranging from −90° to 90°, and then converting them to delay values by








d

(

θ
d

)

=



d

m

i

c



v
s




sin

-
1




θ
d



,




where vs is the speed of sound.


For band k, for each d=d(θd), the following value is evaluated








$$c_{\mathrm{norm},k}(k, d, n) = c_{\mathrm{norm}}\big(b_{\mathrm{low}}(k),\ b_{\mathrm{high}}(k),\ d,\ n\big)$$





Then, for each band k, the value of d=d(θd) that gives the maximum value of cnorm,k(k, d, n) is determined. This value of d is the obtained delay estimate for that band. Accordingly, for that band, the determined direction parameter θ(k,n) is the corresponding θd and the direct-to-total ratio parameter r(k,n) is the value of cnorm,k(k, d, n) for that same d. The procedure is repeated for each band k.
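A compact sketch of this analysis for one band and one frame is given below. The microphone spacing and the number of search directions follow the example above; the sign of the exponent, the clamping of the ratio to non-negative values, and the names are assumptions.

```python
import numpy as np

def estimate_direction_and_ratio(s, band_bins, freqs, d_mic=0.16, v_s=343.0,
                                 n_dirs=41):
    """Sketch of the delay-search analysis for one frame.
    s: (num_bins, 2) time-frequency signals of the two edge microphones.
    band_bins: (b_low, b_high) bin indices of band k (inclusive).
    freqs: (num_bins,) bin centre frequencies f(b) in Hz.
    Returns the azimuth estimate theta(k, n) in degrees and the ratio r(k, n)."""
    b0, b1 = band_bins
    c = s[b0:b1 + 1, 0] * np.conj(s[b0:b1 + 1, 1])        # cross correlation c(b, n)
    f = freqs[b0:b1 + 1]
    denom = np.sum(np.abs(c)) + 1e-12
    thetas = np.linspace(-90.0, 90.0, n_dirs)             # azimuth search directions
    best_theta, best_corr = 0.0, -np.inf
    for theta in thetas:
        d = (d_mic / v_s) * np.sin(np.radians(theta))     # candidate delay d(theta)
        corr = np.real(np.sum(np.exp(-1j * 2 * np.pi * f * d) * c)) / denom
        if corr > best_corr:
            best_theta, best_corr = theta, corr
    return best_theta, max(best_corr, 0.0)                # direction and ratio (clamped)
```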


In this example the direction parameter is only azimuth, and elevation analysis is not performed. The elevation angle can be set to zero. In other examples the elevation angle can also be estimated. Any suitable method can be used to estimate the elevation angle.


The determined direction parameter θ(k,n) and the ratio r(k,n) provide the spatial metadata 502. The spatial metadata 502 is provided as the output from the parametric spatial analysis block 500.


In the described example the processing of the parametric spatial analysis block 500 provided the azimuth direction parameter θ(k,n) based on the delay estimations between the two microphones 106 at the edges of the electronic device 100. The angles were assumed to be in the front horizontal plane. This means that any sounds arriving from the rear directions are mirrored to the front. In some examples the parametric capture can resolve whether the sound is arriving from the front or the rear. In case it is determined that the sound arrives from the rear, the azimuth direction parameter θ(k,n) is mirrored to the back (for example, sound at 30 degrees is mirrored to 150 degrees).


Any suitable process can be used to determine if an azimuth direction parameter θ(k,n) should be at the front or the rear. In the example electronic device 100 of FIG. 1 the process could comprise using delay analysis between the second microphone 106B and the third microphone 106C to determine if the sound at the time-frequency interval (k,n) is at front or back. However, it is not critically important to perform such front-back determination for binaural rendering because the binaural cues are similar for the front and back directions within the same cone of confusion. Also, for stereo playback the rear directions are mirrored to the front.


In the example of FIG. 5 the parametric-based spatial processing block 304 determines the spatial metadata. In some examples the spatial metadata 502 could have been previously formulated so that the parametric-based spatial processing block 304 receives the spatial metadata 502 as an input. The previously formulated spatial metadata 502 could be formulated using any suitable procedure. In such examples the parametric-based spatial processing block 304 does not need to comprise a parametric spatial analysis block 500.


The parametric spatial synthesis block 504 receives the time-frequency audio signals 302, the target patterns 400, and the spatial metadata 502 as inputs. The parametric spatial synthesis block 504 is configured to render a spatial audio output based on the respective inputs. Any suitable process can be used to render the spatial audio output.


As an illustrative example using the electronic device 100 of FIG. 1 for binaural rendering, the parametric spatial synthesis block 504 can perform only level manipulation. The time-frequency audio signals 302 that are used can comprise only the signals from the left microphone 106A and the right microphone 106B. The energies or levels of these signals are modified based on the spatial metadata 502. In this example configuration the spacing of the microphones 106A, 106B is similar to that of the human ears (and therefore, the inter-channel time differences are correct, or close to correct), but the head-shadowing effects are not correct. Therefore, the parametric spatial synthesis block 504 will modify the spectral cues.


For each frequency band k, the input energies are formulated for channel i=1,2 by








$$E_{\mathrm{in}}(k, n, i) = \sum_{b=b_{\mathrm{low}}(k)}^{b_{\mathrm{high}}(k)} \left| s(b, n, i) \right|^2$$






Then, the target energy is formulated using the spatial metadata 502 and the target patterns 400. The target patterns t(k,DOA) are available in frequency bands k. The target patterns 400 used by the parametric spatial synthesis block 504 can be the same as the target patterns 400 used by the beamforming-based spatial processing block 306, or could be derived from the same data set as the target patterns used by the beamforming-based spatial processing block 306, or could be derived from a different data set.


From the target patterns t(k,DOA) the pattern with azimuth θ(k,n) and zero elevation is selected. If the target patterns 400 do not have exactly that data point available, the nearest DOA may be selected or interpolation can be performed. The selected target pattern vector is denoted t(k,θ(k,n)). The target energy for band k is then defined by








$$E_{\mathrm{out}}(k,n,i) = \Big( t\big(k, \theta(k,n), i\big)^2\, r(k,n) + 0.5\,\big(1 - r(k,n)\big) \Big)\Big( E_{\mathrm{in}}(k,n,1) + E_{\mathrm{in}}(k,n,2) \Big)$$






where t(k,θ(k,n),i) is the i:th row value of t(k,θ(k,n)). In this example, it was assumed that in the case of HRTF reproduction the target patterns t(k,θ(k,n)) are diffuse-field equalized so that their mean value over the surrounding spatial directions is








$$\begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}.$$





Then, the parametrically processed spatial audio output signals are obtained by








$$s_p(k, n, i) = g(k, n, i)\, s(k, n, i)$$







where







$$g(k, n, i) = \sqrt{\frac{E_{\mathrm{out}}(k, n, i)}{E_{\mathrm{in}}(k, n, i)}}$$







The division of the energies for obtaining g(k,n,i) can be regularized by adding a small epsilon to the divisor and by limiting the maximum value of g(k,n,i), for example, to a value of 4.
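The band-wise gain computation can be sketched as follows. The limit of 4 follows the example above; using magnitude-squared target values for complex HRTF targets, the epsilon value, and the names are assumptions.

```python
import numpy as np

def parametric_band_gain(s_band, target_pattern, ratio, max_gain=4.0, eps=1e-12):
    """Sketch of the energy-based gain for one band k and frame n.
    s_band: (num_bins_in_band, 2) left/right microphone signals for the band.
    target_pattern: (2,) target pattern t(k, theta(k, n)) for the analysed direction.
    ratio: direct-to-total energy ratio r(k, n).
    Returns per-channel gains g(k, n, i) applied to all bins of the band."""
    e_in = np.sum(np.abs(s_band) ** 2, axis=0)                    # E_in(k, n, i)
    e_total = np.sum(e_in)
    e_out = (np.abs(target_pattern) ** 2 * ratio
             + 0.5 * (1.0 - ratio)) * e_total                     # E_out(k, n, i)
    g = np.sqrt(e_out / (e_in + eps))                             # regularized division
    return np.minimum(g, max_gain)                                # limit, e.g. to 4
```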


The signal sp(k,n,i) is denoted in vector form as sp(b,n). This signal provides the first frequency portion 308 as shown in FIG. 3 and can be provided as the output of the parametric-based spatial processing block 304. In examples of the disclosure only a subset of the bins of sp(b,n) needs to be formulated, because a portion of the frequencies can be formulated by the beamforming-based spatial processing block 306.


In the above described examples binaural rendering could be used. Other types of spatial audio outputs could be used. For stereo rendering the above described examples could be modified as appropriate or a different method could be used.


The signal sp(b,n) provides the first frequency portion 308 that has been processed based on parametric audio and the signal sb(b,n) provides the second frequency portion 310 that has been processed based on beamforming audio. These respective signals can be combined by the combiner 312. The combiner 312 can use any suitable process. In some examples the combiner 312 can select the bins from the different inputs so that the combined output is sb(b,n) for bins where processing based on beamforming audio is feasible and the combined output is sp(b,n) for bins where processing based on beamforming audio is not feasible. In some examples the combined output can be interpolated at some frequencies so that, for at least some frequencies, the combined output can comprise a mixture sb(b,n)m(b)+sp(b,n)(1−m(b)), where m(b) is a value that transitions from 0 to 1 or from 1 to 0 during an interpolation frequency interval.
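A minimal sketch of such a combiner and an interpolation mask m(b) is shown below. The crossfade width, the linear fade shape, and the names are assumptions.

```python
import numpy as np

def make_crossfade_mask(num_bins, beam_bins, fade_bins=8):
    """Build m(b): 1 for bins where the beamforming-based output is used,
    0 where the parametric output is used, with a short linear transition."""
    m = np.zeros(num_bins)
    m[:beam_bins] = 1.0
    end = min(beam_bins + fade_bins, num_bins)
    if end > beam_bins:
        m[beam_bins:end] = np.linspace(1.0, 0.0, end - beam_bins + 2)[1:-1]
    return m

def combine_outputs(s_p, s_b, m):
    """Combined output: s_b(b, n) m(b) + s_p(b, n) (1 - m(b)) per bin.
    s_p, s_b: (num_bins, num_channels); m: (num_bins,)."""
    return s_b * m[:, None] + s_p * (1.0 - m[:, None])
```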


In some examples of the disclosure the parametric-based spatial processing block 304 can perform the processing based on parametric audio for just the first frequency range and the beamforming-based spatial processing block 306 can perform the processing based on beamforming audio for just the second frequency range. In some examples one or both of the processing based on parametric audio or the processing based on beamforming audio can also be performed outside of the allocated frequency range.


For instance, in some examples the processing based on parametric audio can be performed for the entire frequency range. In such cases the processed audio signals 114 could be encoded with the spatial metadata 502 for the entire frequency range. This can enable the spatial metadata 502 to be used at a later stage and/or at a remote decoder to render the processed audio signals 114 to other formats.


In some cases, the examples of the disclosure could be implemented with head tracking. In such cases the beamforming-based spatial processing block 306 would determine the beam matrices 406 for different head orientations. The different head orientations could comprise orientations with 5-degree spacing for the yaw rotation. The beamforming-based spatial processing block 306 could use the methods described above, or any other suitable methods, but the target pattern 400 would be rotated based on the orientation of the head.


In such examples the beam matrices 406 would be used by the matrix multiplication block 408 as part of the online processing. To accommodate the head tracking, the correct beam matrices 406 would be determined for the current head orientation. The correct beam matrices 406 can be selected by choosing the closest beam matrices 406 from the determined matrices, or via interpolation.
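Selecting the pre-computed beam matrices for a tracked yaw angle could be sketched as follows. The 5-degree spacing follows the example above; the wrap-around handling, array layout, and names are assumptions.

```python
import numpy as np

def select_beam_matrices(beam_matrices_by_yaw, yaw_deg, spacing_deg=5.0):
    """Sketch: pick the pre-designed beam matrices whose yaw grid point is
    nearest to the tracked head yaw.
    beam_matrices_by_yaw: (num_orientations, num_bins, num_outputs, num_mics)."""
    num_orientations = beam_matrices_by_yaw.shape[0]
    index = int(round((yaw_deg % 360.0) / spacing_deg)) % num_orientations
    return beam_matrices_by_yaw[index]
```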


The parametric-based spatial processing block 304 can take head orientation into account by rotating the determined direction parameters θ(k,n) before applying them in the rendering. In some examples the parametric-based spatial processing block 304 could also perform some additional processing of the audio signals based on the head orientation before the rendering.


Therefore, examples of the disclosure can be used with head tracking. The head-tracked binaural rendering can be performed by processing the stored or retrieved microphone audio signals 112 in real-time based on the tracked head orientation of the listener.


In examples which use head tracking the frequency ranges for the respective types of spatial processing can change over time based on the head orientation of the listener and the positions of the microphones 106 in the electronic device 100. For example, the crossover frequency between the first frequency range, for which processing based on parametric audio is used, and the second frequency range, for which processing based on beamforming audio is used, can be changed. The change in the frequency ranges can take into account that the suitable frequency range for the processing based on beamforming audio can be different for different head orientations.


Although the examples described above have been described for binaural and stereo outputs other types of output could be used in other examples. The examples of the disclosure could be used to perform the processing for any multi-channel output, such as a 5.1 multi-channel loudspeaker setup. The processing steps could be as described above with the processing based on beamforming audio being used for a frequency range where suitable beam patterns can be achieved with the microphones 106 in use and the processing based on parametric audio is used at other frequencies.


In examples where the output is a multi-channel output, such as a 5.1 multi-channel output, there are more output channels than would be used for a binaural output or a stereo output. The beam patterns used for the multi-channel output are narrower than the beam patterns used for binaural outputs or stereo outputs. This means that more microphones 106 might be needed to obtain the device responses 404 for such loudspeaker setups. Therefore, to implement examples of the disclosure for a multi-channel output the electronic device 100 might need more than three microphones 106. For example, the electronic device 100 might need five to ten microphones 106.


Examples of the disclosure can also be implemented with other types of signals. For instance, examples of the disclosure can be applied with cross-talk cancelled stereo reproduction. The cross-talk cancelled stereo can be used in electronic devices, such as mobile phones, that have stereo loudspeakers. The cross-talk cancelled stereo reproduction is applied in order to produce a perception of a stereo image wider than the physical locations of the loudspeakers. With such a reproduction system, the processing would be as above, but the target patterns 400 for normal stereo reproduction would be replaced by the corresponding target patterns 400 for the cross-talk cancelled stereo reproduction in the processing based on beamforming audio.


Examples of the disclosure can also be implemented with a soundbar reproduction system. In such examples the device responses 404 from the soundbar would be utilized for processing based on beamforming audio. The processing based on parametric audio could be performed using any suitable process.



FIGS. 6A and 6B show example beam pattern designs for approximating HRTF responses. The electronic device 100 used to obtain the beam patterns can be a mobile phone or a similarly shaped device. The electronic device 100 can be arranged in a landscape mode.



FIG. 6A shows example beam pattern designs for an electronic device 100 with two microphones 106. The two microphones 106 can be located at the left and right edges of the electronic device 100. FIG. 6B shows example beam pattern designs for an electronic device 100 with six microphones 106. The arrangement of six microphones can comprise three microphones 106 located at a left edge of the electronic device 100 and three microphones 106 located at a right edge of the electronic device 100; the additional microphones 106 can be located on the front and back surfaces near to the edges.


In FIGS. 6A and 6B the solid lines represent beam-based HRTF capture patterns obtained with the respective microphone arrangements and the dashed lines represent actual HRTF capture patterns in the horizontal plane. The beam matrices can be designed as described above.



FIGS. 6A and 6B show the respective patterns for left and right channels at multiple different frequencies. The frequencies shown are 211 Hz, 563 Hz, 1852 Hz, and 7008 Hz.



FIGS. 6A and 6B show that there is not a single frequency at which the processing based on beamforming audio stops approximating the target patterns well. Instead, the patterns gradually turn into the aliased patterns. This can affect the frequencies that are chosen for the respective types of processing. In some examples one suitable choice for the first frequency range (for processing based on parametric audio) and the second frequency range (for processing based on beamforming audio) would be to have the first range as frequencies above 1 kHz and the second range as frequencies below 1 kHz.


The frequencies that are to be used for the respective ranges can be selected by inspecting the obtained beam patterns against the target patterns as shown in FIGS. 6A and 6B. The frequencies where the processing based on beamforming audio sufficiently approximates the target patterns can be determined from this inspection. These frequencies can then be used as the second frequency range (for processing based on beamforming audio) and the other frequencies can be used as the first frequency range (for processing based on parametric audio). This method of selecting the frequency ranges can be used for any arrangement of microphones, orientation of electronic devices 100, and target pattern.
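As an illustrative sketch of this selection, the obtained beam patterns could be compared with the target patterns band by band, with a band assigned to the second frequency range when the approximation error stays below a threshold. The array layout, the error metric and the threshold value below are assumptions made for illustration.

```python
import numpy as np

# Sketch (assumed data layout): beam_patterns and target_patterns hold the
# magnitude responses over a grid of directions for each frequency band,
# shape (num_bands, num_directions). A band is assigned to the beamforming
# range when its normalized error against the target pattern is small enough.
def split_frequency_ranges(beam_patterns, target_patterns, threshold=0.2):
    error = np.linalg.norm(beam_patterns - target_patterns, axis=1)
    scale = np.linalg.norm(target_patterns, axis=1)
    normalized_error = error / np.maximum(scale, 1e-12)
    beamforming_bands = np.where(normalized_error <= threshold)[0]  # second range
    parametric_bands = np.where(normalized_error > threshold)[0]    # first range
    return parametric_bands, beamforming_bands
```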


In the examples of FIGS. 6A and 6B the obtained beam capture patterns can approximate the HRTF patterns at low frequencies. At 563 Hz the microphone arrangement with six microphones 106 performs slightly better than the microphone arrangement with two microphones 106. At higher frequencies both arrays suffer significantly from spatial aliasing. Therefore, at high frequencies the processing based on beamforming audio is not feasible for these form factors and microphone arrangements.


However, FIGS. 6A and 6B show that the processing based on beamforming audio is feasible at low frequencies. This means that the processing based on parametric audio can be replaced at low frequencies with the processing based on beamforming audio. This provides the benefit that, for the selected frequency range, no matter how complex the sound scene is, the processing based on beamforming audio represents the spatial sound scene well and does not have the spatial capture artefacts (such as source instability) that can arise in processing based on parametric audio.



FIG. 7 schematically illustrates an apparatus 700 that could be used to implement examples of the disclosure. The apparatus 700 comprises at least one processor 102 and at least one memory 104. The processors 102 and the memory 104 can be as shown in FIG. 1. The apparatus 700 could comprise additional components that are not shown in FIG. 7. The apparatus 700 could be provided within an electronic device 100 such as a smart phone, a personal computer, an image capturing device, a teleconference device, a capture device integrated in a vehicle or any other suitable type of device comprising at least two microphones.


In the example of FIG. 7 the apparatus 700 comprises a processing apparatus. The apparatus 700 can be configured to process audio data. The apparatus 700 can be configured to process audio data for use in communication systems or for any other suitable purpose.


In the example of FIG. 7 the implementation of the apparatus 700 can be as processing circuitry. In some examples the apparatus 700 can be implemented in hardware alone, can have certain aspects in software (including firmware) alone, or can be a combination of hardware and software (including firmware).


As illustrated in FIG. 7 the apparatus 700 can be implemented using instructions that enable hardware functionality, for example, by using executable instructions of a computer program 116 in a general-purpose or special-purpose processor 102 that can be stored on a computer readable storage medium (disk, memory etc.) to be executed by such a processor 102.


The processor 102 is configured to read from and write to the memory 104. The processor 102 can also comprise an output interface via which data and/or commands are output by the processor 102 and an input interface via which data and/or commands are input to the processor 102.


The memory 104 is configured to store a computer program 116 comprising computer program instructions (computer program code 704) that controls the operation of the apparatus 700 when loaded into the processor 102. The computer program instructions, of the computer program 116, provide the logic and routines that enable the apparatus 700 to perform the methods described herein. The processor 102, by reading the memory 104, is able to load and execute the computer program 116.


The apparatus 700 therefore comprises: at least one processor 102; and at least one memory 104 storing instructions that, when executed by the at least one processor 102, cause an apparatus 700 at least to perform:

    • obtaining 200 at least two audio signals based on signals from at least two microphones;
    • performing 202 a first spatial audio processing of the obtained audio signals for at least a first frequency range to generate a first output;
    • performing 204 a second spatial audio processing of the obtained audio signals for at least a second frequency range to generate a second output, wherein the signal processing operations of the first spatial audio processing comprises processing based on parametric audio and the signal processing operations of the second spatial audio processing comprises processing based on beamforming audio; and
    • combining 206 the first output and the second output to generate a combined output.


As illustrated in FIG. 7 the computer program 116 can arrive at the apparatus 700 via any suitable delivery mechanism 702. The delivery mechanism 702 can be, for example, a machine-readable medium, a computer-readable medium, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a Compact Disc Read-Only Memory (CD-ROM) or a Digital Versatile Disc (DVD) or a solid-state memory, or an article of manufacture that comprises or tangibly embodies the computer program 116. The delivery mechanism can be a signal configured to reliably transfer the computer program 116. The apparatus 700 can propagate or transmit the computer program 116 as a computer data signal. In some examples the computer program 116 can be transmitted to the apparatus 700 using a wireless protocol such as Bluetooth, Bluetooth Low Energy, Bluetooth Smart, 6LoWPAN (IPv6 over low power personal area networks), ZigBee, ANT+, near field communication (NFC), radio frequency identification, wireless local area network (wireless LAN) or any other suitable protocol.


The computer program 116 can comprise computer program instructions for causing an apparatus 700 to perform at least the following or for performing at least the following:

    • obtaining 200 at least two audio signals based on signals from at least two microphones;
    • performing 202 a first spatial audio processing of the obtained audio signals for at least a first frequency range to generate a first output;
    • performing 204 a second spatial audio processing of the obtained audio signals for at least a second frequency range to generate a second output, wherein the signal processing operations of the first spatial audio processing comprises processing based on parametric audio and the signal processing operations of the second spatial audio processing comprises processing based on beamforming audio; and
    • combining 206 the first output and the second output to generate a combined output.


The computer program instructions can be comprised in a computer program 116, a non-transitory computer readable medium, a computer program product, or a machine readable medium. In some but not necessarily all examples, the computer program instructions can be distributed over more than one computer program 116.


Although the memory 104 is illustrated as a single component/circuitry it can be implemented as one or more separate components/circuitry some or all of which can be integrated/removable and/or can provide permanent/semi-permanent/dynamic/cached storage.


Although the processor 102 is illustrated as a single component/circuitry it can be implemented as one or more separate components/circuitry some or all of which can be integrated/removable. The processor 102 can be a single core or multi-core processor.


References to “computer-readable storage medium”, “computer program product”, “tangibly embodied computer program” etc. or a “controller”, “computer”, “processor” etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.


As used in this application, the term “circuitry” can refer to one or more or all of the following:

    • (a) hardware-only circuitry implementations (such as implementations in only analog and/or digital circuitry) and
    • (b) combinations of hardware circuits and software, such as (as applicable):
      • (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and
      • (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions and
    • (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g. firmware) for operation, but the software might not be present when it is not needed for operation.


This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.


The blocks illustrated in the Figs. can represent steps in a method and/or sections of code in the computer program 116. The illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks and the order and arrangement of the blocks can be varied. Furthermore, it can be possible for some blocks to be omitted.


In the example of FIG. 7 the apparatus 700 is shown as a single entity. In other examples the apparatus 700 could be provided as a plurality of different entities that could be distributed within a cloud or other suitable network.


The term ‘comprise’ is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising Y indicates that X may comprise only one Y or may comprise more than one Y. If it is intended to use ‘comprise’ with an exclusive meaning then it will be made clear in the context by referring to “comprising only one . . . ” or by using “consisting”.


In this description, the wording ‘connect’, ‘couple’ and ‘communication’ and their derivatives mean operationally connected/coupled/in communication. It should be appreciated that any number or combination of intervening components can exist (including no intervening components), i.e., so as to provide direct or indirect connection/coupling/communication. Any such intervening components can include hardware and/or software components.


As used herein, the term “determine/determining” (and grammatical variants thereof) can include, not least: calculating, computing, processing, deriving, measuring, investigating, identifying, looking up (for example, looking up in a table, a database or another data structure), ascertaining and the like. Also, “determining” can include receiving (for example, receiving information), accessing (for example, accessing data in a memory), obtaining and the like. Also, “determine/determining” can include resolving, selecting, choosing, establishing, and the like.


In this description, reference has been made to various examples. The description of features or functions in relation to an example indicates that those features or functions are present in that example. The use of the term ‘example’ or ‘for example’ or ‘can’ or ‘may’ in the text denotes, whether explicitly stated or not, that such features or functions are present in at least the described example, whether described as an example or not, and that they can be, but are not necessarily, present in some of or all other examples. Thus ‘example’, ‘for example’, ‘can’ or ‘may’ refers to a particular instance in a class of examples. A property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.


Although examples have been described in the preceding paragraphs with reference to various examples, it should be appreciated that modifications to the examples given can be made without departing from the scope of the claims.


Features described in the preceding description may be used in combinations other than the combinations explicitly described above.


Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.


Although features have been described with reference to certain examples, those features may also be present in other examples whether described or not.


The term ‘a’, ‘an’ or ‘the’ is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising a/an/the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use ‘a’, ‘an’ or ‘the’ with an exclusive meaning then it will be made clear in the context. In some circumstances the use of ‘at least one’ or ‘one or more’ may be used to emphasize an inclusive meaning but the absence of these terms should not be taken to imply any exclusive meaning.


The presence of a feature (or combination of features) in a claim is a reference to that feature or (combination of features) itself and also to features that achieve substantially the same technical effect (equivalent features). The equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way. The equivalent features include, for example, features that perform substantially the same function, in substantially the same way to achieve substantially the same result.


In this description, reference has been made to various examples using adjectives or adjectival phrases to describe characteristics of the examples. Such a description of a characteristic in relation to an example indicates that the characteristic is present in some examples exactly as described and is present in other examples substantially as described.


The above description describes some examples of the present disclosure however those of ordinary skill in the art will be aware of possible alternative structures and method features which offer equivalent functionality to the specific examples of such structures and features described herein above and which for the sake of brevity and clarity have been omitted from the above description. Nonetheless, the above description should be read as implicitly including reference to such alternative structures and method features which provide equivalent functionality unless such alternative structures or method features are explicitly excluded in the above description of the examples of the present disclosure.


Whilst endeavoring in the foregoing specification to draw attention to those features believed to be of importance it should be understood that the Applicant may seek protection via the claims in respect of any patentable feature or combination of features hereinbefore referred to and/or shown in the drawings whether or not emphasis has been placed thereon.

Claims
  • 1. An apparatus, comprising: at least one processor; and at least one memory storing instructions that, when executed with the at least one processor, cause the apparatus at least to: obtain at least two audio signals based on signals from at least two microphones; perform a first spatial audio processing of the obtained audio signals for at least a first frequency range to generate a first output; perform a second spatial audio processing of the obtained audio signals for at least a second frequency range to generate a second output, wherein the signal processing operations of the first spatial audio processing comprise processing based on parametric audio and the signal processing operations of the second spatial audio processing comprise processing based on beamforming audio; and combine the first output and the second output to generate a combined output.
  • 2. The apparatus as claimed in claim 1, wherein the instructions, when executed with the at least one processor, cause the apparatus to perform the processing based on parametric audio that varies more over time than the processing based on beamforming audio.
  • 3. The apparatus as claimed in claim 1, wherein the respective outputs comprise at least one of: binaural outputs; multi-channel outputs; or stereo outputs.
  • 4. The apparatus as claimed in claim 1, wherein the second spatial audio processing is not performed for at least some of the first frequency range.
  • 5. The apparatus as claimed in claim 1, wherein the first spatial audio processing is not performed for at least some of the second frequency range.
  • 6. The apparatus as claimed in claim 1, wherein the instructions, when executed with the at least one processor, cause the apparatus to at least one of: use the first output for the first frequency range and the second output for the second frequency range, or apply a weighting to the first output and the second output so that the first output has a higher weighting than the second output in a frequency range and the second output has a higher weighting than the first output in another frequency range.
  • 7. (canceled)
  • 8. A method, comprising: obtaining at least two audio signals based on signals from at least two microphones; performing a first spatial audio processing of the obtained audio signals for at least a first frequency range to generate a first output; performing a second spatial audio processing of the obtained audio signals for at least a second frequency range to generate a second output, wherein the signal processing operations of the first spatial audio processing comprise processing based on parametric audio and the signal processing operations of the second spatial audio processing comprise processing based on beamforming audio; combining the first output and the second output to generate a combined output; and determining an orientation of the microphone arrangement and applying a mode of at least one of the first spatial audio processing or the second spatial audio processing based on the determined orientation.
  • 9. The method as claimed in claim 8, wherein at least one of the first frequency range or the second frequency range are different for different modes of the first spatial audio processing or the second spatial audio processing.
  • 10. The method as claimed in claim 8, wherein a first mode comprises using the first spatial audio processing for the first frequency range and the second spatial audio processing for the second frequency range and a second mode comprises using the first spatial audio processing for both the first frequency range and the second frequency range.
  • 11. The method as claimed in claim 8, wherein a first mode uses a first set of coefficients for the second spatial audio processing and a second mode uses a second set of coefficients for the second spatial audio processing.
  • 12. The method as claimed in claim 8, further comprising adjusting at least one of the first spatial audio processing or the second spatial audio processing based on a head orientation of a listener.
  • 13. (canceled)
  • 14. A method, comprising: obtaining at least two audio signals based on signals from at least two microphones; performing a first spatial audio processing of the obtained audio signals for at least a first frequency range to generate a first output; performing a second spatial audio processing of the obtained audio signals for at least a second frequency range to generate a second output, wherein the signal processing operations of the first spatial audio processing comprise processing based on parametric audio and the signal processing operations of the second spatial audio processing comprise processing based on beamforming audio; and combining the first output and the second output to generate a combined output.
  • 15. (canceled)
  • 16. The apparatus as claimed in claim 1, wherein the instructions, when executed with the at least one processor, cause the apparatus to determine an orientation of the microphone arrangement and apply a mode of at least one of the first spatial audio processing or the second spatial audio processing based on the determined orientation.
  • 17. The apparatus as claimed in claim 1, wherein at least one of the first frequency range or the second frequency range are different for different modes of the first spatial audio processing or the second spatial audio processing.
  • 18. The apparatus as claimed in claim 1, wherein a first mode causes the apparatus to use the first spatial audio processing for the first frequency range and the second spatial audio processing for the second frequency range and a second mode causes the apparatus to use the first spatial audio processing for both the first frequency range and the second frequency range.
  • 19. The apparatus as claimed in claim 1, wherein a first mode causes the apparatus to use a first set of coefficients for the second spatial audio processing and a second mode causes the apparatus to use a second set of coefficients for the second spatial audio processing.
  • 20. The apparatus as claimed in claim 1, wherein the instructions, when executed with the at least one processor, cause the apparatus to adjust at least one of the first spatial audio processing or the second spatial audio processing based on a head orientation of a listener.
  • 21. The method as claimed in claim 12, wherein the processing based on parametric audio varies more over time than the processing based on beamforming audio.
  • 22. The method as claimed in claim 12, further comprising at least one of: the second spatial audio processing is not performed for at least some of the first frequency range; or the first spatial audio processing is not performed for at least some of the second frequency range.
  • 23. The method as claimed in claim 12, wherein the combining comprises at least one of: using the first output for the first frequency range and the second output for the second frequency range; or applying a weighting to the first output and the second output so that the first output has a higher weighting than the second output in a frequency range and the second output has a higher weighting than the first output in another frequency range.
Priority Claims (1)
Number Date Country Kind
2319500.1 Dec 2023 GB national