This disclosure relates to an apparatus and method for weighting audio signals so as to achieve a desired audio effect when those audio signals are heard by a user.
Stereo sound playback is commonly used in entertainment systems. It reproduces sound using two or more independent audio channels to create an impression of sound heard from various directions, as with natural hearing. Stereo sound is preferably played through a pair of stereo speakers that are located symmetrically with respect to the user. However, asymmetrical or unbalanced stereo speakers are inevitably encountered in practice. Examples include the stereophonic configuration in cars relative to the driver position and the unbalanced speaker setup on small-scale mobile devices. Asymmetric loudspeaker setups do not create good spatial effects, because the stereo image collapses if the listener is outside the sweet spot. As a result, many sound images are localized at the position of the closest loudspeaker. This results in a narrow sound field distribution and poor spatial effects.
One common example of an asymmetric speaker arrangement occurs in mobile devices such as smartphones. It is increasingly common to equip mobile devices with stereo speakers. However, it is difficult to embed a pair of symmetrical speakers due to hardware constraints (e.g. size, battery), especially in smartphones. One solution is to use the embedded ear-piece receiver as a speaker unit. However, the frequency responses of the receiver and speaker are inevitably different (e.g. due to different baffle sizes), which leads to poor stereo effects and an unbalanced stereo sound image. Equalization of the receiver/speaker responses can address the unbalanced stereo sound image, but it does not achieve sound stage widening.
One option for creating a widened sound stage is to implement virtual source rendering with crosstalk cancellation. Previous research explores the possibility of virtual source rendering using an ‘irregular’ loudspeaker arrangement (see e.g. “360 localisation via 4.x RACE processing” by Glasgal, 123rd AES Convention and “Experiments on the synthesis of virtual acoustic sources in automotive interiors” by Kahana et al, 16th International Conference, Spatial Sound Reproduction). This research is limited to the rendering of a single virtual source. Optimisation for a balanced stereo stage is not considered. Additionally, both methods only consider cases with geometrical asymmetry; they fail to mitigate discrepancies that are due to other asymmetries, such as differences in the natural frequency responses of the two speakers. These methods are thus incapable of optimising the asymmetrical speaker setup on smartphones. They also suffer from poor playback quality (including significant pre-echoes in filter design), and the robustness of the sound field widening effect is limited, especially in difficult car environments.
It is an object of the disclosure to provide concepts for improving the playback of audio signals through unbalanced speaker setups.
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect, a signal generator is provided. The signal generator has a filter bank that is configured to receive at least two audio signals, to apply weights to the audio signals and to provide the weighted versions of the audio signals to at least two speakers. The filter bank may weight the signals such that, when the weighted signals are output by the speakers, the output simulates the effect of the speakers being a different distance apart than they actually are. The filter bank in the signal generator is configured to apply weights that were derived by identifying a first constraint that limits a weight that can be applied to an audio signal to be provided to a first speaker. A characteristic of a second speaker that affects how a user will perceive audio signals output by that speaker relative to audio signals output by the first speaker was also determined. A second constraint was determined based on the determined characteristic and the first constraint. The weights were then determined so as to minimize a difference between an actual balance of each signal that is expected to be heard by a user when the weighted signals are output by the speakers and a target balance. The weights to be applied to audio signals that will be provided to the first speaker were further determined in dependence on the first constraint. The weights to be applied to audio signals to be provided to the second speaker were further determined in dependence on the second constraint. The signal generator can achieve sweet spot correction and sound stage widening simultaneously. It also achieves a balanced sound stage, by applying weights that were determined based on the constraints that affect real-life speakers. The balanced sound stage is further reinforced by taking into account how the constraints of individual speakers affect the user's perception of the audio signals that they output, particularly when those speakers have some form of asymmetric arrangement.
That asymmetry may be due to the physical arrangement of the speakers (e.g., one speaker may be more distant from the user than the other, such as in a car) or due to the speakers having different impulse responses (which is often the case in mobile devices).
In a first implementation form of the first aspect, the weights applied by the filter bank may have been derived by determining an attenuation factor for stereo balancing in dependence on the characteristic of the second speaker and determining the first constraint in dependence on that attenuation factor. The attenuation factor captures the effect that an asymmetric speaker arrangement has on how the constraints of those respective speakers are perceived by a user. Deriving the filter weights in dependence on the attenuation factor thus improves the balance of the resulting sound stage.
In a second implementation form of the first aspect, the weights applied by the filter bank in any of the above mentioned implementation forms may have been derived by, when the first and second speakers are different distances away from a user, determining the characteristic to be a relative distance of the second speaker from the user compared with the first speaker from the user. This addresses one of the common asymmetries in stereo speaker arrangements: an asymmetry in the physical arrangement of the speakers relative to the user that means audio signals from one speaker have to travel further than audio signals from another speaker to reach the user.
In a third implementation form of the first aspect, the weights of the second implementation form that are applied by the filter bank may have been derived by determining the relative distance to be:
where d1 is the distance between the second speaker and the user and d2 is the distance between the first speaker and the user, wherein k is a frequency index. This captures the effect that having the speakers different distances away from the user can have on how a constraint will be perceived by the user listening to the audio signals, enabling that effect to be compensated.
In a fourth implementation form of the first aspect, the weights of any of the above mentioned implementation forms applied by the filter bank may have been derived by, when the first and second speakers have different frequency responses, determining the characteristic to be a relative frequency response of the second speaker compared with the first speaker. This addresses another common asymmetry in stereo speaker arrangements: an asymmetry in the frequency responses of the speakers that means that a particular frequency band of the audio signal might be amplified differently by each speaker.
In a fifth implementation form of the first aspect, the weights of the fourth implementation form applied by the filter bank may have been derived by determining the relative frequency response to be:
where t1(k) is the impulse response of the second speaker and t2(k) is the impulse response of the first speaker, wherein k is a frequency index. This captures the effect that having speakers with different frequency responses can have on how a constraint will be perceived by the user listening to the audio signals, enabling that effect to be compensated.
In a sixth implementation form of the first aspect, the weights of any of the above mentioned implementation forms applied by the filter bank may have been derived by determining the first constraint to be a maximum gain associated with two or more speakers. This limits the weights so that playback of the resulting audio signals by the speakers is practically realisable.
In a seventh implementation form of the first aspect, for the case of the signal generator being used for providing the audio signals to at least two speakers in a car, the first constraint of the sixth implementation form may be a maximum gain associated with the more distant speaker to the user. This accounts for the fact that audio signals from the more distant speaker have to travel further to reach the user, and thus will typically have to be amplified more at playback if they are to be perceived by the user as having the same volume as audio signals from the other speaker.
In an eighth implementation form of the first aspect, the weights of any of the above mentioned implementation forms applied by the filter bank may have been derived by determining the weights such that a sum of the squares of the weights to be applied to the audio signals to be provided to one of the speakers does not exceed the constraint for that speaker. This helps to ensure that the derived weights do not exceed what is practically realisable in a real-world speaker arrangement.
In a ninth implementation form of the first aspect, the weights of any of the above mentioned implementation forms applied by the filter bank may have been derived by determining the target balance in dependence on a physical arrangement of the two or more speakers relative to a user. This enables the filter weights to compensate for asymmetry in the physical arrangements of the speakers.
In a tenth implementation form of the first aspect, the weights of any of the above mentioned implementation forms applied by the filter bank may have been derived by determining the target balance so as to simulate speakers that are symmetrically arranged with respect to the user. The user may be represented by a user head model, and the target balance may aim to reproduce a virtual speaker arrangement that is symmetric around that head model. This enables the weights to create the effect of a balanced sound stage at the user.
In an eleventh implementation form of the first aspect, the weights of any of the above mentioned implementation forms applied by the filter bank may have been derived by determining the target balance so as to simulate speakers that are further apart than the two or more speakers. This has the effect of widening the sound stage.
According to a second aspect, a method is provided that comprises receiving at least two audio signals, applying weights to the audio signals and providing the weighted versions of the audio signals to at least two speakers. The weights applied to the audio signals were derived by identifying a first constraint that limits a weight that can be applied to an audio signal to be provided to a first speaker. A characteristic of a second speaker that affects how a user will perceive audio signals output by that speaker relative to audio signals output by the first speaker was also determined. A second constraint was determined based on the determined characteristic and the first constraint. The weights were then determined so as to minimize a difference between an actual balance of each signal that is expected to be heard by a user when the weighted signals are output by the speakers and a target balance. The weights to be applied to audio signals that will be provided to the first speaker were further determined in dependence on the first constraint. The weights to be applied to audio signals to be provided to the second speaker were further determined in dependence on the second constraint.
According to a third aspect, a non-transitory machine readable storage medium having stored thereon processor executable instructions is provided for controlling a computer to implement a method that comprises receiving at least two audio signals, applying weights to the audio signals and providing the weighted versions of the audio signals to at least two speakers. The weights applied to the audio signals were derived by identifying a first constraint that limits a weight that can be applied to an audio signal to be provided to a first speaker. A characteristic of a second speaker that affects how a user will perceive audio signals output by that speaker relative to audio signals output by the first speaker was also determined. A second constraint was determined based on the determined characteristic and the first constraint. The weights were then determined so as to minimize a difference between an actual balance of each signal that is expected to be heard by a user when the weighted signals are output by the speakers and a target balance. The weights to be applied to audio signals that will be provided to the first speaker were further determined in dependence on the first constraint. The weights to be applied to audio signals to be provided to the second speaker were further determined in dependence on the second constraint.
The present disclosure will now be described by way of example with reference to the accompanying drawings. In the drawings:
An example of a signal generator is shown in
The precalculated weights are preferably derived using a multi-constraint optimisation technique that is described in more detail below. This technique is adapted to derive weights that can achieve sound stage balancing for asymmetric speaker arrangements. A speaker arrangement might be asymmetric due to one speaker being more distant from the user than another speaker (e.g. in a car). A speaker arrangement might also be asymmetric due to one speaker having a different impulse response from another speaker (e.g. in a smartphone scenario). The signal generator (100) is configured to achieve sound stage widening and sweet spot correction simultaneously.
In some embodiments, the signal generator may include a data store 105 for storing a plurality of different sets of filter weights. Each filter set might be applicable to a different scenario. The filter bank may be configured to use a set of filter weights in dependence on user input and/or internally or externally generated observations that suggest a particular scenario is applicable. For example, where the signal generator is providing audio signals to a stereo system in a car, the user might usually want to optimise the sound stage for the driver but the sound stage could also be optimised for one of the passengers. This might be an option that a user could select via a user interface associated with the car stereo system. In another example, the appropriate weights to achieve sound stage optimisation might depend on how a mobile device such as a smartphone is being used. For example, different weights might be appropriate if the device's sensors indicate that it is positioned horizontally on a flat surface than if sensor outputs indicate that the device is positioned vertically and possibly near the user's face.
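The scenario-dependent selection described above can be sketched as a simple lookup. This is only an illustration: the scenario names, the `select_scenario` helper and the placeholder weight values are all hypothetical and are not taken from the disclosure.

```python
import numpy as np

NUM_BINS = 4  # hypothetical number of frequency bins


def placeholder_weights(scale):
    """Placeholder weight set: one scaled 2x2 identity matrix per bin."""
    return scale * np.tile(np.eye(2, dtype=complex), (NUM_BINS, 1, 1))


# Stand-in for data store 105: one precalculated weight set per scenario.
# The keys and values are illustrative assumptions only.
weight_store = {
    "car_driver": placeholder_weights(1.0),
    "car_passenger": placeholder_weights(0.5),
    "phone_flat": placeholder_weights(1.0),
    "phone_upright": placeholder_weights(0.5),
}


def select_scenario(device, user_choice=None, lying_flat=False):
    """Map user input and sensor observations to a stored scenario key."""
    if device == "car":
        # Optimise for the driver unless the user selects a passenger.
        return "car_passenger" if user_choice == "passenger" else "car_driver"
    if device == "phone":
        # Orientation would come from the device's sensors.
        return "phone_flat" if lying_flat else "phone_upright"
    raise ValueError(f"unknown device type: {device}")


weights = weight_store[select_scenario("car", user_choice="passenger")]
```

Keeping the selection logic separate from the stored weights means new scenarios can be added to the store without changing the filter bank itself.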
In many implementations the signal generator is likely to form part of a larger device. That device could be, for example, a mobile phone, smart phone, tablet, laptop, stereo system or any generic user equipment, particularly user equipment with audio playback capability.
The structures shown in
One common example of an asymmetric speaker arrangement occurs in cars. This is a scenario in which sound stage widening can be particularly beneficial.
An example of a system structure for determining filter weights that can be used to address the type of unbalanced speaker arrangement illustrated in
The system structure has, as its inputs 301, the original left and right stereo sound signals. These are audio signals for being output by a loudspeaker. The system structure is described below with specific reference to an example that involves two audio signals: one for a left-hand speaker and one for a right-hand speaker, but the techniques described below can be readily extended to more than two audio channels.
Functional blocks 302 to 305 are largely configured to mimic what happens as the input audio signals 301 are output by a loudspeaker and travel through the air to be heard by a listener. Very low and high frequencies are expected to be bypassed, which is represented in the system structure of
The frequency-dependent transfer functions hml(k) for sound propagation from the loudspeakers to a listener's ears are determined by the positions of the loudspeakers and the positions of the listener's ears. This is illustrated in
h11(k), h12(k), h21(k), h22(k) can be determined using the spherical head model, based on the respective loudspeaker and listener positions.
In the system of
For each frequency bin k, it is possible to formulate an optimization with two (and possibly more than two) constraints. This formulation starts by denoting a loudspeaker weights matrix, of dimension 2×2:
The diagonal elements of W(k) represent the ipsilateral filter gains for the left stereo channel and for the right stereo channel. The off-diagonal elements represent the contralateral filter gains for the two channels. The gains are specific to frequency bins, so the matrix is in the frequency domain.
The short-time Fourier transform (STFT) coefficients for the stereo sound signals can be denoted sn(k) (n∈{1,2}) where n is the channel index. The STFT coefficients can be computed by dividing the audio signal into short segments of equal length and then computing an FFT separately on each short segment. The STFT coefficients thus have an amplitude and a time extension. The left channel has n=1, the right channel has n=2. The playback signal which drives the l-th speaker can therefore be written as:
where l∈{1,2}. This represents an audio signal that is bandpass filtered into separate frequency bins, with each frequency bin being separately weighted before playback.
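As a minimal sketch of the per-bin weighting described above, the following assumes that the playback signal takes the form x_l(k) = Σ_n w_ln(k) s_n(k), which follows from the index definitions in the text. The function names, the non-overlapping frames and the absence of a window function are simplifying assumptions made here for illustration.

```python
import numpy as np


def stft(signal, frame_len):
    """Split a 1-D signal into equal-length frames and FFT each frame.

    Simplification: non-overlapping frames and no window function.
    Returns an array of shape (num_frames, frame_len // 2 + 1).
    """
    num_frames = len(signal) // frame_len
    frames = np.reshape(signal[:num_frames * frame_len], (num_frames, frame_len))
    return np.fft.rfft(frames, axis=1)


def apply_weights(s, w):
    """Compute x_l(k) = sum_n w_ln(k) * s_n(k) for every frame.

    s: STFT coefficients, shape (2, num_frames, num_bins), channel n first.
    w: weights, shape (num_bins, 2, 2), indexed [k, l, n].
    Returns x with shape (2, num_frames, num_bins), speaker l first.
    """
    return np.einsum("kln,ntk->ltk", w, s)


rng = np.random.default_rng(0)
left, right = rng.standard_normal((2, 1024))
s = np.stack([stft(left, 256), stft(right, 256)])

# Identity weights pass each channel straight to its own speaker.
identity = np.tile(np.eye(2, dtype=complex), (s.shape[-1], 1, 1))
x = apply_weights(s, identity)
```

With identity weights the speaker signals equal the input channels. Non-zero off-diagonal entries mix the opposite channel into each speaker, which is where the contralateral filter gains enter.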
Referring to the physical arrangement of the two speakers relative to the user that is illustrated in
where m∈{1,2}.
The weights applied to the audio signals by the loudspeakers thus combine with the transfer functions determined using the spherical head model to form response coefficients bmn(k):
The response coefficients transform the left and right channel signals s1(k) and s2(k) into the signals ym(k) (m∈{1,2}) that are perceived by the listener. The weights wln(k) can, in principle, be freely chosen. The transfer functions hml(k) are fixed by the geometry of the system.
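For reference, the signal chain described above can be written out from the index definitions in the text (l for the speaker, n for the channel, m for the ear). The block below is inferred from the surrounding prose and may differ in layout from the numbered equations of the original. The matrix name B(k) is introduced here purely as shorthand.

```latex
\begin{align*}
W(k) &= \begin{bmatrix} w_{11}(k) & w_{12}(k) \\ w_{21}(k) & w_{22}(k) \end{bmatrix},
\qquad
H(k) = \begin{bmatrix} h_{11}(k) & h_{12}(k) \\ h_{21}(k) & h_{22}(k) \end{bmatrix} \\
x_l(k) &= \sum_{n=1}^{2} w_{ln}(k)\, s_n(k), \qquad l \in \{1,2\} \\
y_m(k) &= \sum_{l=1}^{2} h_{ml}(k)\, x_l(k)
        = \sum_{n=1}^{2} b_{mn}(k)\, s_n(k), \qquad m \in \{1,2\} \\
b_{mn}(k) &= \sum_{l=1}^{2} h_{ml}(k)\, w_{ln}(k),
\qquad \text{i.e. } B(k) = H(k)\, W(k)
\end{align*}
```

In matrix form, the response coefficients are the product of the transfer functions and the weights. That product is the quantity compared with the target responses of the virtual setup.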
The aim is to choose weights wln(k) for the actual setup such that the resulting response coefficients bmn(k) are identical or at least close to the response coefficients of a desired virtual setup:
The (2×2)-matrix b̂(k)=[b̂mn(k)] associated with the virtual setup represents a desired frequency response observed at the listener's ears. The target matrix b̂(k) is preferably selected such that the resulting filters show minimal pre-echoes, which leads to good quality playback and better sound widening perception.
The desired virtual setup is an imaginary setup in which the two loudspeakers are positioned more favourably than in the actual setup, in terms of both sound stage widening and good playback quality. An example of a desired virtual set-up is shown in
For car scenarios, in which two loudspeakers are usually asymmetrically positioned with respect to the driver, it is often desirable to physically widen at least one of the speakers. Referring to the physical arrangement of the two speakers relative to the user that is illustrated in
For smartphone scenarios, the two loudspeakers are usually symmetrically positioned with respect to the user. In this scenario the first and second columns of the b̂(k) matrix may represent the frequency responses of a symmetrical pair of left and right virtual speakers, with those virtual sources having a wider spatial interval than the physical speakers. The asymmetry in the smartphone scenario is linked to the frequency responses of the speakers rather than their physical arrangement. The two physical speakers are likely to have different frequency responses.
Returning to the system structure of
One option would be for the system to determine the filter weights directly as soon as the plant matrix and the set of desirable response coefficients have been determined (e.g. by means of equation (6)). This is not optimal, however, as it does not account for one or more constraints that are inherent in the physical speaker arrangement, and that can affect how the user will perceive the audio signals output by the different speakers. In particular, there may be physical constraints that limit a weight that can be applied to audio signals before they are supplied to a physical loudspeaker. One such constraint is associated with the upper gain limit for a particular loudspeaker. This constraint may be denoted N.
In the system structure of
$\lVert w_{(1,:)}(k) \rVert^2 \le N_1$, that is, $\sum_{n=1}^{2} |w_{1,n}(k)|^2 \le N_1$, and
$\lVert w_{(2,:)}(k) \rVert^2 \le N_2$, that is, $\sum_{n=1}^{2} |w_{2,n}(k)|^2 \le N_2$ (7)
So the sum of the squares of the weights for each speaker should not exceed the constraint for that speaker.
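The constraint check described above can be sketched as follows. The array layout (`w[k, l, n]`) and the function names are assumptions made for illustration. The limits may be passed as scalars or as one value per frequency bin.

```python
import numpy as np


def row_gain(w):
    """Sum of squared weight magnitudes per speaker and per frequency bin.

    w: weights of shape (num_bins, 2, 2), indexed [k, l, n].
    Returns shape (num_bins, 2): one value per bin k and speaker l.
    """
    return np.sum(np.abs(w) ** 2, axis=2)


def satisfies_constraints(w, n1, n2, tol=1e-12):
    """Return a boolean per bin: True where both speakers respect their limit."""
    gains = row_gain(w)
    return (gains[:, 0] <= n1 + tol) & (gains[:, 1] <= n2 + tol)


# One frequency bin: speaker 1 uses weights (1, 1j), speaker 2 uses (0.5, 0.5).
w = np.array([[[1.0, 1.0j], [0.5, 0.5]]])
ok = satisfies_constraints(w, n1=2.0, n2=1.0)
```

Here speaker 1 has a squared-magnitude sum of 2 and speaker 2 a sum of 0.5, so the bin passes with limits of 2 and 1 but would fail if the first limit were reduced below 2.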
The constraint derivation unit may determine that one of the constraints is set by a maximum gain associated with both speakers. This sets an upper limit on the filter gain for either speaker. For example, if the two loudspeakers have different gain limits, the upper limit for the speaker pair may be the lower of those gain limits. The upper limit might also be affected by the loudspeakers' respective positions with respect to the user and/or their respective frequency responses. For example, if the two loudspeakers are asymmetrically positioned with respect to the user, the upper limit may be determined by the loudspeaker that is the further away of the two. This is particularly expected to apply to the case where the audio signals are provided to speakers in a car. For mobile devices, it will usually be the case that either speaker can provide the upper gain limit. This is described in more detail below with respect to the scenario illustrated in
The constraint derivation unit 307 may be configured to use a preset upper gain limit (6 dB might be a suitable example) and assign this to whichever speaker the limit is considered more appropriate for. For example, in
Often, the same constraint will not be applicable to all speakers. This can be because of inherent differences between the speakers themselves and/or because of differences in the way those speakers are physically arranged with respect to the user. The constraint derivation unit (307) is preferably configured to address this by determining a characteristic of one speaker that affects how the user will perceive audio signals output by that speaker relative to audio signals output by another speaker (step S604). The aim is to create a balanced sound stage, in which the user perceives the stereo signals as being output equally by the virtual speakers.
In one example, the constraint derivation unit 307 is configured to quantify this characteristic of the other loudspeaker through determining an attenuation factor for stereo balancing. The attenuation factor is denoted τ(k), and the constraint for the other speaker can be determined as:
$N_1 = \tau(k)\, N_2$ (8)
For a typical car scenario, the constraint derivation unit 307 may assume that the speakers are essentially the same (so they have the same frequency response and the same gain limit), meaning that the characteristic that determines how the user will perceive audio signals is dependent on the relative distances between each respective speaker and the user. In this scenario, τ(k) can be derived using distance-based amplitude panning (DBAP):
In
For a typical smartphone scenario, the constraint derivation unit 307 may assume that the speakers are the same distance from the user but have different frequency responses. In this scenario, τ(k) can be derived from the measured impulse responses of the left and right speaker/receiver:
where tl(k) and tr(k) are the frequency responses of the left-hand and right-hand speakers at frequency k, respectively.
The constraint derivation unit may be provided with the appropriate frequency responses 309. Frequency responses of virtual sources can be determined, for example, based on online CIPIC HRTF databases available from the University of California Davis.
Having determined the characteristic of the second speaker that will affect how the user perceives audio signals output by that speaker compared with audio signals output by the first speaker, the constraint derivation unit is able to determine the constraint for the second speaker in dependence on the constraint for the first speaker and the determined characteristic, e.g. by applying equation 8 (step S605).
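A minimal sketch of this step, assuming the attenuation factors τ(k) have already been obtained and that the preset limit is expressed as a linear bound on the sum of squared weights. The numerical values and the function name are hypothetical.

```python
import numpy as np


def derive_constraint(tau, n2):
    """Apply N1 = tau(k) * N2 (equation 8) for each frequency bin k."""
    tau = np.asarray(tau, dtype=float)
    if np.any(tau < 0):
        raise ValueError("attenuation factors must be non-negative")
    return tau * n2


# Hypothetical attenuation factors for three frequency bins.
tau = np.array([1.0, 0.8, 0.5])
n1 = derive_constraint(tau, n2=4.0)
```

Because τ(k) may vary with frequency, the derived constraint is in general a vector with one entry per bin, even when the preset limit is a single number.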
In the system structure of
subject to:
$\lVert w_{(1,:)}(k) \rVert^2 \le N_1$, that is, $\sum_{n=1}^{2} |w_{1,n}(k)|^2 \le N_1$, and
$\lVert w_{(2,:)}(k) \rVert^2 \le N_2$, that is, $\sum_{n=1}^{2} |w_{2,n}(k)|^2 \le N_2$
where H(k)W(k) represents the actual balance of each audio signal that is expected to be heard by the user and b̂(k) represents the target balance. N1 and N2 limit the weight gain in the complex dimension.
As described above, the target balance may aim to simulate a symmetric speaker arrangement, i.e. a physical speaker arrangement in which the speakers are symmetrically arranged with respect to the user (which is achieved by representing the user via a user head model around which the simulated speakers are symmetrically arranged) and/or a speaker arrangement in which both speakers show the same frequency response. The target balance may also aim to simulate speakers that are further apart than the speakers are in reality.
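The disclosure does not state which numerical solver the optimisation unit uses. As one possible illustration, the sketch below applies projected gradient descent, a generic technique substituted here, to a single frequency bin. It assumes the objective is the squared Frobenius norm of H(k)W(k) − b̂(k) and that each row of W(k) is bounded as in the constraints above. The function names, iteration count and example matrices are hypothetical.

```python
import numpy as np


def project_rows(w, limits):
    """Scale each row of w so its sum of squared magnitudes is <= its limit."""
    out = w.copy()
    for l, limit in enumerate(limits):
        power = np.sum(np.abs(out[l]) ** 2)
        if power > limit:
            out[l] *= np.sqrt(limit / power)
    return out


def solve_bin(h, b_target, limits, iters=2000):
    """Minimise ||h @ w - b_target||_F^2 subject to per-row power limits.

    h, b_target: complex 2x2 arrays for one frequency bin.
    limits: (N1, N2), the bounds on sum_n |w_ln|^2 for speakers l = 1, 2.
    """
    # Step size 1/L, where L is the squared largest singular value of h.
    step = 1.0 / np.linalg.norm(h, 2) ** 2
    w = np.zeros_like(b_target, dtype=complex)
    for _ in range(iters):
        grad = h.conj().T @ (h @ w - b_target)
        w = project_rows(w - step * grad, limits)
    return w


# Hypothetical plant matrix and target for one bin.
h = np.array([[1.0, 0.3], [0.2, 1.0]], dtype=complex)
b_hat = np.eye(2, dtype=complex)

w_free = solve_bin(h, b_hat, limits=(100.0, 100.0))  # limits inactive
w_tight = solve_bin(h, b_hat, limits=(0.5, 0.5))     # limits active
```

The problem is convex, so with this step size the iteration converges to the constrained minimiser. When the limits are loose the result coincides with the direct inverse solution, and when they are tight the reproduction error grows while each speaker stays within its gain limit.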
The optimisation unit 308 is thus capable of generating weights that accurately render the desired virtual source while also satisfying the attenuation constraints of the left channel speaker compared with the right channel speaker. If the optimisation unit applies equation 8, it will find the globally optimal solution in the MMSE (minimum mean square error) sense that minimizes the reproduction error compared with the desired virtual source responses in the complex frequency domain, while also being effectively constrained by the specified filter gain attenuation.
The system structure shown in
The structures shown in
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present disclosure may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the disclosure.
This application is a continuation of International Application No. PCT/EP2016/077376, filed on Nov. 11, 2016, the disclosure of which is hereby incorporated by reference in its entirety.
 | Number | Date | Country
---|---|---|---
Parent | PCT/EP2016/077376 | Nov 2016 | US
Child | 16409368 | | US