The example and non-limiting embodiments of the present invention relate to processing of audio signals. In particular, various embodiments of the present invention relate to modification of a spatial image represented by a multi-channel audio signal, such as a two-channel stereo signal.
Many portable handheld devices such as mobile phones, portable media player devices, tablet computers, laptop computers, etc. have a pair of loudspeakers that enable playback of stereophonic sound. Typically, the two loudspeakers are positioned at opposite ends or sides of the device to maximize the distance therebetween and thereby facilitate reproduction of stereophonic audio. However, due to small sizes of such devices the two loudspeakers are typically still relatively close to each other, thereby resulting in a narrow spatial audio image in the reproduced stereophonic audio. Consequently, the perceived spatial audio image may be quite different from that perceivable by playing back the same stereophonic audio signal e.g. via loudspeakers of a home stereo system, where the two loudspeakers can be arranged in suitable positions with respect to each other (e.g. sufficiently far from each other) to ensure reproduction of spatial audio image in its full width.
So-called stereo widening is a technique known in the art for enhancing the perceivable spatial audio image of a stereophonic audio signal when reproduced via loudspeakers of a portable handheld device. Such a technique aims at processing a stereophonic audio signal such that reproduced sound is not only perceived as originating from directions that are localized between the loudspeakers but at least part of the sound field is perceived as if it originated from directions that are not localized between the loudspeakers, thereby widening the perceivable width of spatial audio image from that conveyed in the stereophonic audio signal. Herein, we refer to such spatial audio image as a widened or enlarged spatial audio image. An example of processing that provides stereo widening is described in O. Kirkeby, P. A. Nelson, H. Hamada and F. Orduna-Bustamante, “Fast deconvolution of multichannel systems using regularization,” IEEE Transactions on Speech and Audio Processing, vol. 6.
While outlined above via references to a two-channel stereophonic audio signal, stereo widening may be applied to multi-channel audio signals that have more than two channels, such as 5.1-channel or 7.1-channel surround sound for playback via a pair of loudspeakers (of a portable handheld device). In some contexts, the term virtual surround is applied to refer to a processed audio signal that conveys a spatial audio image originally conveyed in a multi-channel surround audio signal. Hence, even though the term stereo widening is predominantly applied throughout this disclosure, this term should be construed broadly, encompassing a technique for processing the spatial audio image conveyed in a multi-channel audio signal (i.e. a two-channel stereophonic audio signal or a surround sound of more than two channels) to provide audio playback at widened spatial audio image.
For brevity and clarity of description, in this disclosure we use the term multi-channel audio signal to refer to audio signals that have two or more channels. Moreover, the term stereo signal is used to refer to a stereophonic audio signal and the term surround signal is used to refer to a multi-channel audio signal having more than two channels.
When applied to a stereo signal, stereo widening techniques known in the art typically involve adding a processed (e.g. filtered) version of a contralateral channel signal to each of the left and right channel signals of the stereo signal in order to derive an output stereo signal having a widened spatial audio image (referred to in the following as a widened stereo signal). In other words, a processed version of the right channel signal of the stereo signal is added to the left channel signal of the stereo signal to create the left channel of a widened stereo signal and a processed version of the left channel signal of the stereo signal is added to the right channel signal of the stereo signal to create the right channel of the widened stereo signal. Moreover, the procedure of deriving the widened stereo signal may further involve pre-filtering (or otherwise processing) each of the left and right channel signals of the stereo signal prior to adding the respective processed contralateral signals thereto in order to preserve desired frequency response in the widened stereo signal.
Along the lines described above, stereo widening readily generalizes into widening the spatial audio image of a multi-channel input audio signal, thereby deriving an output multi-channel audio signal having a widened spatial audio image (referred to in the following as a widened multi-channel signal). In this regard, the processing involves creating the left channel of the widened multi-channel audio signal as a sum of (first) filtered versions of channels of the multi-channel input audio signal and creating the right channel of the widened multi-channel audio signal as a sum of (second) filtered versions of channels of the multi-channel input audio signal. Herein, a dedicated predefined filter may be provided for each pair of an input channel (channels of the multi-channel input signal) and an output channel (left and right). As an example in this regard, the left and right channel signals of the widened multi-channel signal Sout,left and Sout,right, respectively, may be defined on basis of channels of a multi-channel audio signal S according to the equation (1):
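\[
\begin{aligned}
S_{\mathrm{out,left}}(b,n) &= \sum_{i} S(i,b,n)\,H_{\mathrm{left}}(i,b),\\
S_{\mathrm{out,right}}(b,n) &= \sum_{i} S(i,b,n)\,H_{\mathrm{right}}(i,b)
\end{aligned}
\qquad (1)
\]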
where S(i,b,n) denotes frequency bin b in time frame n of channel i of the multi-channel signal S, Hleft(i,b) denotes a filter for filtering frequency bin b of channel i of the multi-channel signal S to create a respective channel component for creation of the left channel signal Sout,left(b,n), and Hright(i,b) denotes a filter for filtering frequency bin b of channel i of the multi-channel signal S to create a respective channel component for creation of the right channel signal Sout,right(b,n).
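As a non-limiting illustration, the summation of the equation (1) may be sketched for a single time frame as follows. The function name widen_frame, the nested-list data layout and the filter gains in the usage example are assumptions made for this sketch only and are not part of the described technique.

```python
def widen_frame(S, H_left, H_right):
    """Apply the per-bin filter-and-sum of equation (1) to one time frame.

    S[i][b]       : complex value of frequency bin b of input channel i
    H_left[i][b]  : filter gain for channel i, bin b, towards the left output
    H_right[i][b] : filter gain for channel i, bin b, towards the right output
    """
    num_channels = len(S)
    num_bins = len(S[0])
    # Each output bin is the sum over input channels of the filtered input bin.
    out_left = [
        sum(S[i][b] * H_left[i][b] for i in range(num_channels))
        for b in range(num_bins)
    ]
    out_right = [
        sum(S[i][b] * H_right[i][b] for i in range(num_channels))
        for b in range(num_bins)
    ]
    return out_left, out_right


# Hypothetical two-channel input with two bins and made-up filter gains.
S = [[1 + 0j, 2j], [0.5 + 0j, 1 + 0j]]
H_left = [[1.0, 1.0], [0.5, 0.5]]
H_right = [[0.5, 0.5], [1.0, 1.0]]
left, right = widen_frame(S, H_left, H_right)
```

In this sketch each input channel contributes to both output channels, which is the property that gives rise to the contralateral signal components discussed in the foregoing.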
In practice, summing the processed contralateral signals to the (processed) left and right channel signals of the multi-channel signal results in reduction of the available dynamic range for driving the loudspeakers applied for playback. Moreover, in many portable handheld devices that are small in size the loudspeakers are likewise small and hence typically prone to distortion already at relatively low signal levels, and introduction of the signal component arising from the (processed) contralateral signals in the played back signal may result in a situation where the distortion occurs already at lower perceivable signal levels than without the stereo widening. Therefore, in order to ensure undistorted sound, the audio playback level of a widened stereo signal typically needs to be lower than that of the unprocessed stereo signal. Consequently, the widened stereo signal is typically perceived as softer and/or more distorted than its unwidened counterpart.
An additional challenge involved in stereo widening is degraded engagement and timbre in the central part of the spatial audio image (the concept of “engagement” is discussed, for example, in D. Griesinger, “Phase Coherence as a Measure of Acoustic Quality, part two: Perceiving Engagement”, available at the time of filing of the present patent application e.g. at http://www.akutek.info/Papers/DG_Perceiving_Engagement.pdf). In many real-life stereo signals the central part of the spatial audio image includes perceptually important audio content, e.g. in case of music the voice of the vocalist is typically rendered in the center of the spatial audio image. A sound component that is in the center of the spatial audio image is rendered by reproducing the same signal in both channels of the stereo signal and hence via both loudspeakers of a device. When stereo widening is applied to such an input stereo signal (e.g. according to the equation (1) above), each channel of the resulting widened stereo signal involves the outcome of two filtering operations carried out for the channels of the input stereo signal. This may result in a comb filtering effect, which may cause differences in the perceived timbre, which may be referred to as ‘coloration’ of the sound. Moreover, the comb filtering effect may further result in degradation of the engagement of the sound source.
According to an example embodiment, a method for processing an input audio signal comprising a multi-channel audio signal is provided, the method comprising: deriving, based on the input audio signal, a first signal component comprising a multi-channel audio signal that represents a focus portion of a spatial audio image conveyed by the input audio signal and a second signal component comprising a multi-channel audio signal that represents a non-focus portion of the spatial audio image; processing the second signal component into a modified second signal component wherein the width of the spatial audio image is extended from that of the second signal component; and combining the first signal component and the modified second signal component into an output audio signal comprising a multi-channel audio signal that represents partially extended spatial audio image.
According to another example embodiment, an apparatus for processing an input audio signal comprising a multi-channel audio signal is provided, the apparatus comprising: a signal decomposer for deriving, based on the input audio signal, a first signal component comprising a multi-channel audio signal that represents a focus portion of a spatial audio image conveyed by the input audio signal and a second signal component comprising a multi-channel audio signal that represents a non-focus portion of the spatial audio image; a stereo widening processor for processing the second signal component into a modified second signal component wherein the width of the spatial audio image is extended from that of the second signal component; and a signal combiner for combining the first signal component and the modified second signal component into an output audio signal comprising a multi-channel audio signal that represents partially extended spatial audio image.
According to another example embodiment, an apparatus for processing an input audio signal comprising a multi-channel audio signal is provided, the apparatus configured to: derive, based on the input audio signal, a first signal component comprising a multi-channel audio signal that represents a focus portion of a spatial audio image conveyed by the input audio signal and a second signal component comprising a multi-channel audio signal that represents a non-focus portion of the spatial audio image; process the second signal component into a modified second signal component wherein the width of the spatial audio image is extended from that of the second signal component; and combine the first signal component and the modified second signal component into an output audio signal comprising a multi-channel audio signal that represents partially extended spatial audio image.
According to another example embodiment, an apparatus for processing an input audio signal comprising a multi-channel audio signal is provided, the apparatus comprising: a means for deriving, based on the input audio signal, a first signal component comprising a multi-channel audio signal that represents a focus portion of a spatial audio image conveyed by the input audio signal and a second signal component comprising a multi-channel audio signal that represents a non-focus portion of the spatial audio image; a means for processing the second signal component into a modified second signal component wherein the width of the spatial audio image is extended from that of the second signal component; and a means for combining the first signal component and the modified second signal component into an output audio signal comprising a multi-channel audio signal that represents partially extended spatial audio image.
According to another example embodiment, an apparatus for processing an input audio signal comprising a multi-channel audio signal is provided, wherein the apparatus comprises at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: derive, based on the input audio signal, a first signal component comprising a multi-channel audio signal that represents a focus portion of a spatial audio image conveyed by the input audio signal and a second signal component comprising a multi-channel audio signal that represents a non-focus portion of the spatial audio image; process the second signal component into a modified second signal component wherein the width of the spatial audio image is extended from that of the second signal component; and combine the first signal component and the modified second signal component into an output audio signal comprising a multi-channel audio signal that represents partially extended spatial audio image.
According to another example embodiment, a computer program is provided, the computer program comprising computer readable program code configured to cause performing at least a method according to the example embodiment described in the foregoing when said program code is executed on a computing apparatus.
The computer program according to an example embodiment may be embodied on a volatile or a non-volatile computer-readable record medium, for example as a computer program product comprising at least one computer readable non-transitory medium having program code stored thereon, the program code, when executed by an apparatus, causing the apparatus at least to perform the operations described hereinbefore for the computer program according to an example embodiment of the invention.
The exemplifying embodiments of the invention presented in this patent application are not to be interpreted to pose limitations to the applicability of the appended claims. The verb “to comprise” and its derivatives are used in this patent application as an open limitation that does not exclude the existence of also unrecited features. The features described hereinafter are mutually freely combinable unless explicitly stated otherwise.
Some features of the invention are set forth in the appended claims. Aspects of the invention, however, both as to its construction and its method of operation, together with additional objects and advantages thereof, will be best understood from the following description of some example embodiments when read in connection with the accompanying drawings.
The embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, where
Nevertheless, the audio processing system 100 readily generalizes into one that enables processing of a spatial audio signal (i.e. a multi-channel audio signal with more than two channels, such as a 5.1-channel spatial audio signal or a 7.1-channel spatial audio signal), some aspects of which are also described in the examples provided in the following.
The audio processing system 100 may further receive two control inputs: a first control input that indicates a target loudspeaker configuration applied in the stereo signal 101 and a second control input that indicates output loudspeaker configuration in a device intended for playback of the widened stereo signal 115.
The audio processing system 100 according to the example illustrated in
In the following, the audio processing technique described in the present disclosure is predominantly described via examples that pertain to the audio processing system 100 according to the example of
The audio processing system 100, 100′ may be implemented by one or more computing devices and the resulting widened stereo signal 115 may be provided for playback via loudspeakers of one of these devices. Typically, the audio processing system 100, 100′ is implemented in a portable handheld device such as a mobile phone, a media player device, a tablet computer, a laptop computer, etc. that is also applied to play back the widened stereo signal 115 via a pair of loudspeakers provided in the device. In another example, the audio processing system 100, 100′ is provided in a first device, whereas the playback of the widened stereo signal 115 is provided in a second device. In a further example, a first part of the audio processing system 100, 100′ is provided in a first device, whereas a second part of the audio processing system 100, 100′ and the playback of the widened stereo signal 115 are provided in a second device. In these two latter examples, the second device may comprise a portable handheld device such as a mobile phone, a media player device, a tablet computer, a laptop computer, etc. while the first device may comprise a computing device of any type, e.g. a portable handheld device, a desktop computer, a server device, etc.
Still referring to
The stereo signal 101 may be received at the audio processing system 100, 100′ e.g. by reading the stereo signal from a memory or from a mass storage device in the device 50. In another example, the stereo signal is obtained via a communication interface (such as a network interface) from another device that stores the stereo signal in a memory or in a mass storage device provided therein. The widened stereo signal 115 may be provided for rendering by the audio playback system of the device 50. Additionally or alternatively, the widened stereo signal may be stored in the memory or the mass storage device in the device 50 and/or provided via a communication interface to another device for storage therein.
As described in the foregoing, the audio processing system 100, 100′ may receive the first control input that conveys information defining the target loudspeaker configuration applied in the stereo signal 101. The target loudspeaker configuration may also be referred to as channel configuration (of the stereo signal 101). This information may be obtained, for example, from metadata that accompanies the stereo signal 101, e.g. metadata included in an audio container within which the stereo signal 101 is stored. In another example, the information defining the target loudspeaker configuration applied in the stereo signal 101 may be received (as user input) via a user interface of the device 50. The target loudspeaker configuration may be defined by indicating, for each channel of the stereo signal 101, a respective target loudspeaker position with respect to an assumed listening point. As an example, a target position for a loudspeaker may comprise a target direction, which may be defined as an angle with respect to a reference direction (e.g. a front direction). Hence, for example in case of a two-channel stereo signal the target loudspeaker configuration may be defined as respective target angles αin (1) and αin (2) with respect to the front direction for the left and right loudspeakers. The target angles αin (i) with respect to the front direction may be, alternatively, indicated by a single target angle αin, which defines the absolute value of the target angles with respect to the front direction e.g. such that αin (1)=αin and αin (2)=−αin.
In a further example, no first control input is received in the audio processing system 100, 100′ and the elements of the audio processing system 100, 100′ that make use of the information that defines the target loudspeaker configuration applied in the stereo signal 101 (the signal decomposer 104, the re-panner 106) apply predefined information in this regard instead. An example in this regard involves applying a fixed predefined target loudspeaker configuration. Another example involves selecting one of a plurality of predefined target loudspeaker configurations in dependence of the number of audio channels in the received stereo signal 101. Non-limiting examples in this regard include selecting, in response to a two-channel signal 101 (which is hence assumed as a two-channel stereophonic audio signal), a target loudspeaker configuration where the channels are positioned ±30 degrees with respect to the front direction and/or selecting, in response to a six-channel signal (that is hence assumed to represent a 5.1-channel surround signal), a target loudspeaker configuration where the channels are positioned at target angles αin (i) of 0 degrees, ±30 degrees and ±110 degrees with respect to the front direction and complemented with a low frequency effects (LFE) channel.
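As a non-limiting illustration, the selection of a predefined target loudspeaker configuration in dependence of the number of channels may be sketched as follows. The function name, the returned data structure, the ordering of the channels and the convention that positive angles point to the left of the front direction are assumptions made for this sketch only.

```python
def default_target_configuration(num_channels):
    """Pick a predefined target loudspeaker configuration by channel count.

    Returns a dict with the target angles in degrees (positive to the left
    of the front direction, an assumed convention) and a flag telling
    whether an LFE channel complements the listed directions.
    """
    if num_channels == 2:
        # Two channels: assumed to be a two-channel stereophonic signal.
        return {"angles_deg": [30.0, -30.0], "has_lfe": False}
    if num_channels == 6:
        # Six channels: assumed to be a 5.1-channel surround signal.
        return {
            "angles_deg": [0.0, 30.0, -30.0, 110.0, -110.0],
            "has_lfe": True,
        }
    raise ValueError(f"no predefined configuration for {num_channels} channels")


stereo_cfg = default_target_configuration(2)
surround_cfg = default_target_configuration(6)
```

Other channel counts are rejected in this sketch, whereas an implementation could equally fall back to a fixed predefined configuration.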
As described in the foregoing, the audio processing system 100, 100′ may receive the second control input that conveys information defining the output loudspeaker configuration in the device 50. Therein, the output loudspeaker configuration may define a respective output loudspeaker position with respect to a listening position, which may indicate an assumed listening position or the actual position of the listener. The output loudspeaker configuration may define, for example, a respective output loudspeaker direction with respect to a reference direction (e.g. the front direction) for each of the output loudspeakers. In this regard, an output loudspeaker direction may be defined as a respective output loudspeaker angle αout (i) with respect to the reference direction for each of the output loudspeakers. The output loudspeaker angles αout (i) with respect to the reference direction may be, alternatively, indicated by a single output loudspeaker angle αout, which e.g. in case of two loudspeakers defines the absolute value of the output loudspeaker angles αout (i) with respect to the reference direction e.g. such that αout (1)=αout and αout (2)=−αout.
The output loudspeaker angles αout (i) may be directly indicated in the second control input or the second control input may define the output loudspeaker positions as distances with respect to one or more predefined reference positions and/or reference directions, e.g. such that a first output loudspeaker is positioned y1 meters forward along a (conceptual) line that defines the front direction with respect to the listener (or with respect to the assumed listening position) and x1 meters left from the front direction, and a second output loudspeaker is positioned y2 meters forward along a (conceptual) line that defines the front direction with respect to the listener (or with respect to the assumed listening position) and x2 meters left from the front direction. Consequently, the output loudspeaker angles αout (1) and αout (2) for the first and second output loudspeakers, respectively, may be computed as
\[
\alpha_{\mathrm{out}}(1)=\tan^{-1}\frac{y_1}{x_1},\qquad
\alpha_{\mathrm{out}}(2)=\tan^{-1}\frac{y_2}{x_2}
\qquad (2)
\]
The second control input may convey information that defines static or dynamic output loudspeaker positions: in a scenario that applies static output loudspeaker positions, the output loudspeaker positions may be obtained and/or defined based on assumed average distance and position of a listener with respect to each of the loudspeakers of the device 50, whereas in a scenario that applies dynamic output loudspeaker positions, the output loudspeaker positions with respect to the listener may be defined and updated (e.g. at predefined time intervals) on basis of a sensor signal (e.g. a video signal from a camera).
The information that defines the output loudspeaker positions with respect to the listener's position may be applied to enable controlling the stereo widening processing such that the spatial audio image is widened beyond a range of directions spanned by the loudspeakers of the device 50 while at the same time ensuring that the focus portion of the spatial audio image (that commonly includes perceptually important audio content) is positioned in the spatial audio image in a direction that is between the loudspeakers of the device 50.
The audio processing system 100, 100′ may be arranged to process the stereo signal 101 arranged into a sequence of input frames, each input frame including a respective segment of digital audio signal for each of the channels, provided as a respective time series of input samples at a predefined sampling frequency. In a typical example, the audio processing system 100, 100′ employs a fixed predefined frame length. In other examples, the frame length may be a selectable frame length that may be selected from a plurality of predefined frame lengths, or the frame length may be an adjustable frame length that may be selected from a predefined range of frame lengths. A frame length may be defined as the number of samples L included in the frame for each channel of the stereo signal 101, which at the predefined sampling frequency maps to a corresponding duration in time. As an example in this regard, the audio processing system 100, 100′ may employ a fixed frame length of 20 milliseconds (ms), which at a sampling frequency of 8, 16, 32 or 48 kHz results in a frame of L=160, L=320, L=640 or L=960 samples per channel, respectively. The frames may be non-overlapping or they may be partially overlapping. These values, however, serve as non-limiting examples and frame lengths and/or sampling frequencies different from these examples may be employed instead, depending e.g. on the desired audio bandwidth, on desired framing delay and/or on available processing capacity.
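As a non-limiting illustration, the mapping from a frame duration to the number of samples L per channel may be sketched as follows. The function name and its parameters are assumptions made for this sketch only.

```python
def frame_length_samples(frame_ms, sampling_frequency_hz):
    """Number of samples L per channel in a frame of frame_ms milliseconds."""
    # Integer arithmetic avoids rounding surprises for the usual sampling rates.
    return (frame_ms * sampling_frequency_hz) // 1000


# The 20 ms example frame at the sampling frequencies named in the text.
lengths = {fs: frame_length_samples(20, fs) for fs in (8000, 16000, 32000, 48000)}
# lengths == {8000: 160, 16000: 320, 32000: 640, 48000: 960}
```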
Referring back to
The transform entity 102 may further divide each of the channels into a plurality of frequency sub-bands, thereby resulting in the transform-domain stereo signal 103 that provides a respective time-frequency representation for each channel of the stereo signal 101. A given frequency band in a given frame may be referred to as a time-frequency tile. The number of frequency sub-bands and respective bandwidths of the frequency sub-bands may be selected e.g. in accordance with the desired frequency resolution and/or available computing power. In an example, the sub-band structure involves 24 frequency sub-bands according to the Bark scale, an equivalent rectangular band (ERB) scale or 3rd octave band scale known in the art. In other examples, a different number of frequency sub-bands that have the same or different bandwidths may be employed. A specific example in this regard is a single frequency sub-band that covers the input spectrum in its entirety or a continuous subset thereof.
A time-frequency tile that represents frequency bin b in time frame n of channel i of the transform-domain stereo signal 103 may be denoted as S(i,b,n). The transform-domain stereo signal 103, e.g. the time-frequency tiles S(i,b,n), are passed to the signal decomposer 104 for decomposition into the first signal component 105-1 and the second signal component 105-2 therein. As described in the foregoing, a plurality of consecutive frequency bins may be grouped into a frequency sub-band, thereby providing a plurality of frequency sub-bands k=0, . . . , K−1. For each frequency sub-band k, the lowest bin (i.e. a frequency bin that represents the lowest frequency in that frequency sub-band) may be denoted as bk,low and the highest bin (i.e. a frequency bin that represents the highest frequency in that frequency sub-band) may be denoted as bk,high.
Referring back to
The signal decomposer 104 may derive, on basis of the transform-domain stereo signal 103, the first signal component 105-1 that represents those coherent sounds of the spatial audio image that are within a predefined focus range, such sounds hence constituting the focus portion of the spatial audio image. In contrast, the signal decomposer 104 may derive, on basis of the transform-domain stereo signal 103, the second signal component 105-2 that represents coherent sound sources or sound components of the spatial audio image that are outside the predefined focus range and all non-coherent sound sources of the spatial audio image, such sound sources or components hence constituting the non-focus portion of the spatial audio image. Hence, the signal decomposer 104 decomposes the sound field represented by the stereo signal 101 into the first signal component 105-1 that is excluded from subsequent stereo widening processing and into the second signal component 105-2 that is subsequently subjected to the stereo widening processing.
The signal decomposer 104 may comprise a coherence analyzer 116 for estimating, on basis of the transform-domain stereo signal 103, coherence values 117 that are descriptive of coherence between the channels of the transform-domain stereo signal 103. The coherence values 117 are provided for a decomposition coefficient determiner 124 for further processing therein.
Computation of the coherence values 117 may involve deriving a respective coherence value γ(k,n) for a plurality of frequency sub-bands k in a plurality of time frames n based on the time-frequency tiles S(i,b,n) that represent the transform domain stereo signal 103. As an example, the coherence values 117 may be computed e.g. according to the equation (3):
where Re denotes the real part operator and * denotes the complex conjugate.
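As a non-limiting illustration, an inter-channel coherence estimate for one frequency sub-band of one time frame may be sketched as follows. The sketch assumes one commonly used form, namely the real part of the cross-spectrum between the two channels normalized by the geometric mean of the channel energies, which may differ from the exact form of the equation (3). The function name and the data layout are likewise assumptions.

```python
import math


def subband_coherence(S1, S2, b_low, b_high):
    """Coherence estimate between two channels over bins b_low..b_high.

    S1[b], S2[b] : complex frequency bins of the two channels in one frame.
    Returns a value in [-1, 1]; 0.0 is returned for an all-zero sub-band.
    """
    bins = range(b_low, b_high + 1)
    # Cross-spectrum summed over the sub-band: conj(S1) * S2.
    cross = sum(S1[b].conjugate() * S2[b] for b in bins)
    energy_1 = sum(S1[b].real ** 2 + S1[b].imag ** 2 for b in bins)
    energy_2 = sum(S2[b].real ** 2 + S2[b].imag ** 2 for b in bins)
    denominator = math.sqrt(energy_1 * energy_2)
    if denominator == 0.0:
        return 0.0
    return cross.real / denominator


# Identical channels are fully coherent under this assumed definition.
channel = [1 + 1j, 2 - 1j]
gamma = subband_coherence(channel, channel, 0, 1)
```

Under this assumed definition, a channel pair in phase quadrature yields a value of zero and a polarity-inverted pair yields a value of minus one.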
Still referring to
Computation of the energy values 119 may involve deriving a respective energy value E(i,k,n) for a plurality of frequency sub-bands k in plurality of audio channels i in a plurality of time frames n based on the time-frequency tiles S(i,b,n). As an example, the energy values E(i,k,n) may be computed e.g. according to the equation (4):
\[
E(i,k,n)=\sum_{b=b_{k,\mathrm{low}}}^{b_{k,\mathrm{high}}}\lvert S(i,b,n)\rvert^{2}
\qquad (4)
\]
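As a non-limiting illustration, the energy of one channel in one frequency sub-band of one time frame may be sketched as follows, assuming that the energy is obtained as the sum of the squared magnitudes of the frequency bins from the lowest bin to the highest bin of the sub-band. The function name and the data layout are assumptions made for this sketch only.

```python
def subband_energy(S_channel, b_low, b_high):
    """Energy of one channel over frequency bins b_low..b_high (inclusive).

    S_channel[b] : complex frequency bin b of the channel in one time frame.
    """
    return sum(
        S_channel[b].real ** 2 + S_channel[b].imag ** 2
        for b in range(b_low, b_high + 1)
    )


# Hypothetical frame with three bins; the sub-band covers bins 0 and 1.
energy = subband_energy([1 + 1j, 2 + 0j, 3j], 0, 1)
```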
Still referring to
The direction estimation may involve deriving a respective direction angle θ(k,n) for a plurality of frequency sub-bands k in a plurality of time frames n based on the estimated energies E(i,k,n) and the target loudspeaker positions αin (i), the direction angles θ(k,n) thereby indicating the estimated perceived arrival direction of the sound in frequency sub-bands of input frames. The direction estimation may be carried out, for example, using the tangent law according to the equations (5) and (6), where an underlying assumption is that sound sources in the sound field represented by the stereo signal 101 are arranged (to a significant extent) in their desired spatial positions using amplitude panning:
where αin denotes the absolute value of the target angles αin (1) and αin (2) that define, respectively, the target positions of the left and right loudspeakers with respect to the front direction, which in this example are positioned symmetrically with respect to the front direction. In other examples, the target positions of the left and right loudspeakers may be positioned non-symmetrically with respect to the front direction (e.g. such that |αin (1)|≠|αin (2)|). Modification of the equation (5) such that it accounts for this aspect is a straightforward task for a person skilled in the art.
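As a non-limiting illustration, a tangent-law direction estimate for a symmetric two-loudspeaker target configuration may be sketched as follows. The sketch assumes a commonly used form of the stereophonic tangent law, in which the amplitude gains are taken as square roots of the sub-band energies and positive angles point towards the left loudspeaker. These choices, as well as the function name and its parameters, are assumptions that may differ from the exact form of the equations (5) and (6).

```python
import math


def estimate_direction_deg(energy_left, energy_right, target_angle_deg=30.0):
    """Tangent-law direction estimate for one sub-band of one time frame.

    energy_left, energy_right : sub-band energies of the left and right channel
    target_angle_deg          : absolute target angle of both loudspeakers
    Returns the estimated direction in degrees (positive towards the left).
    """
    gain_left = math.sqrt(energy_left)
    gain_right = math.sqrt(energy_right)
    total = gain_left + gain_right
    if total == 0.0:
        # Silent sub-band: no direction information, report the front direction.
        return 0.0
    ratio = (gain_left - gain_right) / total
    target = math.radians(target_angle_deg)
    # Tangent law: tan(theta) / tan(target) = (g_left - g_right) / (g_left + g_right)
    return math.degrees(math.atan(math.tan(target) * ratio))


theta_center = estimate_direction_deg(1.0, 1.0)  # equal energies: front direction
theta_left = estimate_direction_deg(4.0, 0.0)    # left only: at the left loudspeaker
```

With equal energies the estimate coincides with the front direction, and with energy in a single channel only the estimate coincides with the target direction of the respective loudspeaker.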
Still referring to
The focus coefficients 123 may be derived based at least in part on the direction angles 121. The focus estimator 122 may optionally further receive the indication of the target loudspeaker configuration applied in the stereo signal 101 and/or the indication of the output loudspeaker positions in the device 50, and compute the focus coefficients 123 further in view of one or both of these pieces of information. The focus coefficients 123 are provided for the decomposition coefficient determiner 124 for further processing therein.
Typically, the one or more angular ranges define a set of arrival directions that cover a predefined portion around the center of the spatial audio image, thereby rendering the focus estimation as a ‘frontness’ estimation. The focus estimation may involve deriving a respective focus coefficient χ(k,n) for a plurality of frequency sub-bands k in a plurality of time frames n based on the direction angles θ(k,n), e.g. according to the equation (7):
In the equation (7), the first threshold value θTh1 and the second threshold value θTh2, where θTh1<θTh2, serve to define a primary (center) angular range (between angles −θTh1 to θTh1 around the front direction), a secondary angular range (from −θTh2 to −θTh1 and from θTh1 to θTh2 with respect to the front direction) and a non-focus range (outside −θTh2 and θTh2 with respect to the front direction). As a non-limiting example, the first and second threshold values may be set to θTh1=5° and θTh2=15°, whereas in other examples different threshold values θTh1 and θTh2 may be applied instead. Focus estimation according to the equation (7) hence applies a focus range that includes two angular ranges (i.e. the primary angular range and the secondary angular range) and sets the focus coefficient χ(k,n) to unity in response to a sound source direction residing within the primary angular range and sets the focus coefficient χ(k,n) to zero in response to the sound source direction residing outside the focus range, whereas a predefined function of sound source direction is applied to set the focus coefficient χ(k,n) to a value between unity and zero in response to the sound source direction residing within the secondary angular range. In general, the focus coefficient χ(k,n) is set to a non-zero value in response to the sound source direction residing within the focus range and the focus coefficient χ(k,n) is set to zero value in response to the sound source direction residing outside the focus range. In an example, the equation (7) may be modified such that no secondary angular range is applied and hence only a single threshold may be applied to define the limit(s) between the focus range and the non-focus range.
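The two-threshold focus estimation described above may be sketched, for illustration, as follows. The equation (7) is not reproduced in this text and the predefined function applied within the secondary angular range is not specified, so the sketch assumes a linear ramp from unity at θTh1 to zero at θTh2; the function name and the treatment of the exact threshold values are likewise assumptions.

```python
def focus_coefficient(theta_deg, th1_deg=5.0, th2_deg=15.0):
    """Focus coefficient for one sub-band k and frame n.

    theta_deg: direction angle relative to the front direction.
    th1_deg, th2_deg: first and second threshold values, th1_deg < th2_deg.
    """
    if not 0.0 <= th1_deg < th2_deg:
        raise ValueError("thresholds must satisfy 0 <= th1 < th2")
    a = abs(theta_deg)
    if a <= th1_deg:
        return 1.0  # primary (center) angular range
    if a >= th2_deg:
        return 0.0  # non-focus range
    # Secondary angular range: linear ramp (assumed shape of the function).
    return (th2_deg - a) / (th2_deg - th1_deg)
```

For the example threshold values of 5 and 15 degrees, a direction angle of 10 degrees on either side of the front direction gives a focus coefficient of 0.5 under this assumed ramp.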
Along the lines described in the foregoing, the focus range may be defined as one or more angular ranges. As an example, the focus range may include a single predefined angular range or two or more predefined angular ranges. According to another example, at least one of the angular ranges is selectable or adaptive, e.g. such that an angular range may be selected or adjusted (e.g. via selection or adjustment of one or more threshold values that define the respective angular range) in dependence of the target loudspeaker configuration applied in the stereo signal 101 and/or in dependence of the output loudspeaker positions in the device 50.
Still referring to
The decomposition coefficient determination aims at providing a high value for a decomposition coefficient β(k,n) for a frequency sub-band k and frame n that exhibits relatively high coherence between the channels of the stereo signal 101 and that conveys a directional sound component that is within the focus portion of the spatial audio image (see description of the focus estimator 122 in the foregoing). In this regard, the decomposition coefficient determination may involve deriving a respective decomposition coefficient β(k,n) for a plurality of frequency sub-bands k in a plurality of time frames n based on the respective coherence value γ(k,n) and the respective focus coefficient χ(k,n) e.g. according to the equation (8):
β(k,n)=γ(k,n)χ(k,n). (8)
In an example, the decomposition coefficients β(k,n) may be applied as such as the decomposition coefficients 125 that are provided for the signal divider 126 for decomposition of the transform-domain stereo signal 103 therein. In another example, energy-based temporal smoothing is applied to the decomposition coefficient β(k,n) obtained from the equation (8) in order to derive smoothed decomposition coefficients β′(k,n), which may be provided for the signal divider 126 to be applied for decomposition of the transform-domain stereo signal 103 therein. Smoothing of the decomposition coefficients results in slower variations over time in sub-portions of the spatial audio image assigned to the first signal component 105-1 and the second signal component 105-2, which may enable improved perceivable quality in the resulting widened stereo signal 115 via avoidance of small-scale fluctuations in the spatial audio image therein. A weighting that provides the energy-based temporal smoothing may be provided, for example, according to the equation (9a):
where E(k,n) denotes the total energy of the transform-domain stereo signal 103 for a frequency sub-band k in time frame n (derivable e.g. based on the energies E(i,k,n) derived using the equation (4)) and a and b (where, preferably, a+b=1) denote predefined weighting factors. As a non-limiting example, values a=0.2 and b=0.8 may be applied, whereas in other examples other values in the range from 0 to 1 may be applied instead.
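A minimal sketch of the decomposition coefficient determination with energy-based temporal smoothing is given below. The product of the equation (8) is taken from the text, whereas the equation (9a) is not reproduced here; the sketch therefore assumes one plausible recursive form, in which an energy-weighted running numerator and denominator are updated with the weighting factors a and b. The class name, the state variables and the fallback for zero accumulated energy are hypothetical.

```python
class SmoothedDecomposition:
    """Energy-weighted recursive smoothing of beta(k,n) (assumed form)."""

    def __init__(self, num_bands, a=0.2, b=0.8):
        self.a = a
        self.b = b
        self.num = [0.0] * num_bands  # running energy-weighted sum of beta
        self.den = [0.0] * num_bands  # running energy sum

    def update(self, coherence, focus, energy):
        """Process one frame n; each argument holds one value per sub-band k."""
        smoothed = []
        for k, (g, x, e) in enumerate(zip(coherence, focus, energy)):
            beta = g * x  # equation (8)
            self.num[k] = self.a * e * beta + self.b * self.num[k]
            self.den[k] = self.a * e + self.b * self.den[k]
            if self.den[k] > 0.0:
                smoothed.append(self.num[k] / self.den[k])
            else:
                smoothed.append(beta)  # no energy seen yet (assumption)
        return smoothed
```

In this assumed form, frames that carry little energy have little influence on the smoothed coefficient, which is the behaviour the energy-based weighting is intended to provide.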
Still referring to
where Sdr(i,b,n) denotes frequency bin b in time frame n of channel i of the first signal component 105-1, Ssw(i,b,n) denotes frequency bin b in time frame n of channel i of the second signal component 105-2, and p denotes a predefined constant parameter (e.g. p=0.5). In the general case, the scaling coefficient β(b,n)^p in the equation (10a) may be replaced with another scaling coefficient that increases with increasing value of the decomposition coefficient β(b,n) (and decreases with decreasing value of the decomposition coefficient β(b,n)) and the scaling coefficient (1−β(b,n))^p in the equation (10a) may be replaced with another scaling coefficient that decreases with increasing value of the decomposition coefficient β(b,n) (and increases with decreasing value of the decomposition coefficient β(b,n)).
In another example, the signal decomposition may be carried out for a plurality of frequency sub-bands k in a plurality of channels i in a plurality of time frames n based on the time-frequency tiles S(i,b,n), according to the equation (10b):
wherein βTh denotes a predefined threshold value that has a value in the range from 0 to 1, e.g. βTh=0.5. When applying the equation (10b), the temporal smoothing of the decomposition coefficients 125 described in the foregoing and/or temporal smoothing of the resulting signal components Ssw(i,b,n) and Sdr(i,b,n) may be advantageous for improved perceivable quality of the resulting widened stereo signal 115.
The decomposition coefficients β(k,n) according to the equation (8) are derived on time-frequency tile basis, whereas the equations (10a) and (10b) apply the decomposition coefficients β(b,n) on frequency bin basis. In this regard, the decomposition coefficients β(k,n) derived for a frequency sub-band k may be applied for each frequency bin b within the frequency sub-band k.
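The signal decomposition may be sketched as follows for one channel i and one time frame n. The equations (10a) and (10b) are not reproduced in this text; the sketch assumes that the first signal component is obtained by scaling with β(b,n)^p and the second by scaling with (1−β(b,n))^p, and that the threshold-based variant assigns a bin wholly to the first signal component when β(b,n) exceeds βTh. The function names, the band edge representation and the strict inequality are assumptions.

```python
def expand_to_bins(beta_bands, band_edges):
    """Repeat each sub-band coefficient for every bin of that sub-band.

    band_edges[k] is the first bin of sub-band k and band_edges[-1] is the
    total number of bins (hypothetical representation of the sub-bands).
    """
    beta_bins = []
    for k, beta in enumerate(beta_bands):
        beta_bins.extend([beta] * (band_edges[k + 1] - band_edges[k]))
    return beta_bins

def decompose_soft(tiles, beta_bins, p=0.5):
    """Soft split of the bins S(i,b,n) into (S_dr, S_sw), assumed form of (10a)."""
    s_dr = [s * beta ** p for s, beta in zip(tiles, beta_bins)]
    s_sw = [s * (1.0 - beta) ** p for s, beta in zip(tiles, beta_bins)]
    return s_dr, s_sw

def decompose_hard(tiles, beta_bins, beta_th=0.5):
    """Threshold-based split into (S_dr, S_sw), assumed form of (10b)."""
    s_dr = [s if beta > beta_th else 0j for s, beta in zip(tiles, beta_bins)]
    s_sw = [0j if beta > beta_th else s for s, beta in zip(tiles, beta_bins)]
    return s_dr, s_sw
```

With p=0.5 the soft split preserves the energy of each bin, since the squared scaling coefficients β(b,n) and 1−β(b,n) sum to unity.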
Consequently, the transform-domain stereo signal 103 is divided, in each time-frequency tile, into the first signal component 105-1 that represents sound components positioned in the focus portion of the spatial audio image represented by the stereo signal 101 and into the second signal component 105-2 that represents sound components positioned outside the focus portion of the spatial audio image represented by the stereo signal 101. The first signal component 105-1 is subsequently provided for playback without applying stereo widening thereto, whereas the second signal component 105-2 is subsequently provided for playback after being subjected to stereo widening.
Referring back to
The re-panner 106 may comprise an energy estimator 128 for estimating energy of the first signal component 105-1. The energy values 129 are provided for a direction estimator 130 and for a re-panning gain determiner 136 for further processing therein. The energy value computation may involve deriving a respective energy value Edr(i,k,n) for a plurality of frequency sub-bands k in a plurality of audio channels i in a plurality of time frames n based on the time-frequency tiles Sdr(i,b,n). As an example, the energy values Edr(i,k,n) may be computed e.g. according to the equation (11), where the sum is taken over the frequency bins b within the frequency sub-band k:
Edr(i,k,n)=Σb|Sdr(i,b,n)|². (11)
In another example, the energy values 119 computed in the energy estimator 118 (e.g. according to the equation (4)) may be re-used in the re-panner 106, thereby dispensing with a dedicated energy estimator 128 in the re-panner 106. Even though the energy estimator 118 of the signal decomposer 104 estimates the energy values 119 based on the transform-domain stereo signal 103 instead of the first signal component 105-1, the energy values 119 enable correct operation of the direction estimator 130 and the re-panning gain determiner 136.
Still referring to
The direction estimation may involve deriving a respective direction angle θdr(k,n) for a plurality of frequency sub-bands k in a plurality of time frames n based on the estimated energies Edr(i,k,n) and the target loudspeaker positions ∝in (i), the direction angles θdr(k,n) thereby indicating the estimated perceived arrival direction of the sound in frequency sub-bands of the first signal component 105-1. The direction estimation may be carried out, for example, according to the equations (12) and (13):
In another example, the direction angles 121 computed in the direction estimator 120 (e.g. according to the equations (5) and (6)) may be re-used in the re-panner 106, thereby dispensing with a dedicated direction estimator 130 in the re-panner 106. Even though the direction estimator 120 of the signal decomposer 104 estimates the direction angles 121 based on the energy values 119 derived from the transform-domain stereo signal 103 instead of the first signal component 105-1, the sound source positions are the same or substantially the same and hence the direction angles 121 enable correct operation of the direction adjuster 132.
Still referring to
The direction adjustment may comprise mapping the direction angles 131 into respective modified direction angles 133 that represent adjusted perceivable arrival direction of the sound in view of the output loudspeaker positions of the device 50. The target loudspeaker configuration may be indicated by the target angles ∝in (i) and the output loudspeaker positions of the device 50 may be indicated by the respective output loudspeaker angles ∝out (i). According to a non-limiting example, assuming symmetrical target positions for the channels of the stereo signal 101 with respect to the front direction (i.e. target angles ∝in) and symmetrical output loudspeaker positions of the device 50 with respect to the front direction (i.e. output loudspeaker angles ∝out), the mapping between the direction angles 131 and the modified direction angles 133 may be provided by determining a mapping coefficient μ according to the equation (14):
μ=∝in/∝out, (14)
which may be applied for deriving a respective modified direction angle θ′(k,n) for a plurality of frequency sub-bands k in a plurality of time frames n e.g. according to the equation (15):
θ′(k,n)=μθ(k,n). (15)
The example above assumes that both the target angles ∝in (i) and the output loudspeaker angles ∝out (i) are positioned symmetrically with respect to the front direction. According to another non-limiting example, the mapping between the direction angles 131 and the modified direction angles 133 may be provided according to the equations (16) and (17):
where ∝out,c denotes an angle that defines the center position (i.e. direction) between the left and right output loudspeakers, ∝out,hr denotes an angle that defines a half range position (i.e. direction) for the left and right output loudspeakers, and ∝in,hr denotes an angle that defines a half range position (i.e. direction) for the left and right target loudspeaker positions. The approach according to the equations (16) and (17) applies to a general case where the left and right target loudspeaker positions ∝in (i) are arranged symmetrically with respect to the front direction (or another reference direction) and the left and right output loudspeaker positions ∝out (i) are arranged either symmetrically or asymmetrically with respect to the front direction (or another reference direction).
The determination of the mapping coefficient μ and derivation of the modified direction angles θ′(k,n) according to the equations (14) and (15) serves as a non-limiting example and a different procedure for deriving the modified direction angles 133 may be applied instead.
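For the symmetric case, the mapping of the equations (14) and (15) amounts to a single scaling of the direction angle, as the following sketch illustrates. The function name and the check against a zero output angle are assumptions; the asymmetric mapping of the equations (16) and (17) is not reproduced in this text and is therefore not covered.

```python
def map_direction(theta_deg, alpha_in_deg, alpha_out_deg):
    """Map a direction angle to a modified direction angle (symmetric case).

    alpha_in_deg: absolute target loudspeaker angle.
    alpha_out_deg: absolute output loudspeaker angle.
    """
    if alpha_out_deg == 0.0:
        raise ValueError("output loudspeaker angle must be non-zero")
    mu = alpha_in_deg / alpha_out_deg  # equation (14)
    return mu * theta_deg              # equation (15)
```

For example, with a target angle of 30 degrees and an output loudspeaker angle of 10 degrees, the mapping coefficient is 3 and a direction angle of 5 degrees maps to a modified direction angle of 15 degrees.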
Still referring to
Still referring to
The re-panning gain determination procedure may comprise computing a respective total energy Es(k,n) for a plurality of frequency sub-bands k in a plurality of time frames n e.g. according to the equation (18):
Es(k,n)=ΣiEdr(i,k,n). (18)
The re-panning gain determination may further comprise computing a respective target energy Et(i,k,n) for a plurality of frequency sub-bands k in a plurality of audio channels i in a plurality of time frames n based on the total energies Es(k,n) and the panning gains g′(i,k,n), e.g. according to the equation (19):
Et(i,k,n)=g′(i,k,n)²Es(k,n). (19)
The target energies Et(i,k,n) may be applied with the energy values Edr(i,k,n) to derive a respective re-panning gain gr(i,k,n) for a plurality of frequency sub-bands k in a plurality of audio channels i in a plurality of time frames n, e.g. according to the equation (20):
gr(i,k,n)=√(Et(i,k,n)/Edr(i,k,n)). (20)
In an example, the re-panning gains gr(i,k,n) obtained from the equation (20) may be applied as such as the re-panning gains 137 that are provided for the re-panning processor 138 for derivation of the modified first signal component 107 therein. In another example, energy-based temporal smoothing is applied to the re-panning gains gr(i,k,n) obtained from the equation (20) in order to derive smoothed re-panning gains g′r(i,k,n), which may be provided for the re-panning processor 138 to be applied for re-panning therein. Smoothing of the re-panning gains gr(i,k,n) results in slower variations over time within the sub-portion of the spatial audio image assigned to the first signal component 105-1, which may enable improved perceivable quality in the resulting widened stereo signal 115 via avoidance of small-scale fluctuations in the respective portion of the widened spatial audio image therein.
Still referring to
The procedure for deriving the modified first signal component 107 may comprise deriving a respective time-frequency tile Sdr,rp(i,b,n) for a plurality of frequency bins b in a plurality of audio channels i in a plurality of time frames n based on a corresponding time-frequency tile Sdr(i,b,n) of the first signal component 105-1 in dependence of the re-panning gains gr(i,b,n), e.g. according to the equation (21):
Sdr,rp(i,b,n)=gr(i,b,n)Sdr(i,b,n). (21)
The re-panning gains gr(i,k,n) according to the equation (20) are derived on time-frequency tile basis, whereas the equation (21) applies the re-panning gains gr(i,b,n) on frequency bin basis. In this regard, the re-panning gain gr(i,k,n) derived for a frequency sub-band k may be applied for each frequency bin b within the frequency sub-band k.
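The re-panning gain computation of the equations (18) to (21) may be sketched as follows for one time frame n. The panning gains g′(i,k,n) are taken as given inputs, since their derivation is not reproduced in this text. The function names, the nested-list data layout, the band edge representation and the small constant eps that guards against division by zero are assumptions made for this sketch.

```python
import math

def repanning_gains(e_dr, g_pan, eps=1e-12):
    """Re-panning gains g_r(i,k,n) for one frame.

    e_dr[i][k]: energy values E_dr(i,k,n); g_pan[i][k]: panning gains g'(i,k,n).
    """
    num_ch = len(e_dr)
    num_bands = len(e_dr[0])
    gains = [[0.0] * num_bands for _ in range(num_ch)]
    for k in range(num_bands):
        e_s = sum(e_dr[i][k] for i in range(num_ch))              # equation (18)
        for i in range(num_ch):
            e_t = g_pan[i][k] ** 2 * e_s                          # equation (19)
            gains[i][k] = math.sqrt(e_t / max(e_dr[i][k], eps))   # equation (20)
    return gains

def apply_gains(tiles, gains_band, band_edges):
    """Apply per-sub-band gains of one channel to its frequency bins.

    band_edges[k] is the first bin of sub-band k and band_edges[-1] is the
    total number of bins (hypothetical representation of the sub-bands).
    """
    out = list(tiles)
    for k, g in enumerate(gains_band):
        for b in range(band_edges[k], band_edges[k + 1]):
            out[b] = g * tiles[b]                                 # equation (21)
    return out
```

When the squared panning gains of the channels sum to unity, the total energy of a sub-band is preserved while its distribution between the channels follows the panning gains.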
Referring back to
Referring back to
Referring back to
In an example, the stereo widening processor 112 may be provided with a dedicated set of filters HLL, HRL, HLR and HRR that is designed to produce a desired extent of stereo widening for a predefined pair of the target loudspeaker configuration and output loudspeaker positions in the device 50. In another example, the stereo widening processor 112 may be provided with a plurality of sets of filters HLL, HRL, HLR and HRR, each set designed to produce a desired extent of stereo widening for a respective pair of the target loudspeaker configuration and output loudspeaker positions in the device 50. In the latter example, the set of filters is selected in dependence of the indicated target loudspeaker configuration and the output loudspeaker positions in the device 50. In a scenario with a plurality of sets of filters, the stereo widening processor 112 may dynamically switch between sets of filters e.g. in response to a change in the indicated output loudspeaker positions (e.g. a change in the user's position with respect to the output loudspeakers 60). There are various ways for designing a set of filters HLL, HRL, HLR and HRR. In this regard, further information is available for example in O. Kirkeby, P. A. Nelson, H. Hamada and F. Orduna-Bustamante, “Fast deconvolution of multichannel systems using regularization,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 189-194, 1998 and in S. Bharitkar and C. Kyriakakis, “Immersive Audio Signal Processing”, ch. 4, Springer, 2006.
Referring back to
Referring back to
Referring back to
Referring back to
sout(i,m)=s′sw(i,m)+s′dr(i,m), (22)
where sout(i,m) denotes the widened stereo signal 115.
Referring back to
Each of the exemplifying audio processing systems 100, 100′ described in the foregoing via a number of examples may be further varied in a number of ways. In the following, non-limiting examples in this regard are described.
In the foregoing, the description of elements of the audio processing systems 100, 100′ refers to processing of relevant audio signals in a plurality of frequency sub-bands k. In an example, the processing of the audio signal in each element of the audio processing systems 100, 100′ is carried out across (all) frequency sub-bands k. In other examples, in at least some elements of the audio processing systems 100, 100′ the processing of the audio signal is carried out in a limited number of frequency sub-bands k. As examples in this regard, the processing in a certain element of the audio processing system 100, 100′ may be carried out for a predefined number of lowest frequency sub-bands k, for a predefined number of highest frequency sub-bands k, or for a predefined subset of frequency sub-bands k in the middle of the frequency range such that a first predefined number of lowest frequency sub-bands k and a second predefined number of highest frequency sub-bands k are excluded from the processing. The frequency sub-bands k excluded from the processing (e.g. ones at the lower end of the frequency range and/or ones at the higher end of the frequency range) may be passed unmodified from an input to an output of the respective element. A non-limiting example of elements of the audio processing systems 100, 100′ where the processing may be carried out only for a limited subset of frequency sub-bands k involves one or both of the re-panner 106 and the stereo widening processor 112, 112′, which may only process the respective input signal in a respective desired sub-range of frequencies, e.g. in a predefined number of lowest frequency sub-bands k or in a predefined subset of frequency sub-bands k in the middle of the frequency range.
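Restricting an element to a subset of frequency sub-bands while passing the remaining sub-bands through unmodified may be sketched as follows. The function name, the inclusive band indices and the representation of the per-band processing as a callable are assumptions made for illustration.

```python
def process_band_subset(bands, process, first_band, last_band):
    """Apply `process` to sub-bands first_band..last_band (inclusive).

    Sub-bands outside the range are passed from input to output unmodified.
    """
    return [
        process(x) if first_band <= k <= last_band else x
        for k, x in enumerate(bands)
    ]
```

For instance, processing only the two middle sub-bands of a four-band signal leaves the lowest and the highest sub-band untouched.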
In another example, as already described in the foregoing, the input audio signal 101 may comprise a multi-channel signal different from a two-channel stereophonic audio signal, e.g. a surround signal. For example, in case the input audio signal 101 comprises a 5.1-channel surround signal, the audio processing technique(s) described in the foregoing with references to the left and right channels of the stereo signal 101 may be applied to the front left and front right channels of the 5.1-channel surround signal to derive the left and right channels of the output audio signal 115. The other channels of the 5.1-channel surround signal may be processed e.g. such that the center channel of the 5.1-channel surround signal scaled by a predefined gain factor (e.g. by one having value √0.5) is added to the left and right channels of the output audio signal 115 obtained from the audio processing system 100, 100′, whereas the rear left and right channels of the 5.1-channel surround signal may be processed using a conventional stereo widening technique that makes use of target response(s) that correspond(s) to respective target positions of the left and right rear loudspeakers (e.g. ±110 degrees with respect to the front direction). The LFE channel of the 5.1-channel surround signal may be added to the center channel of the 5.1-channel surround signal prior to adding the scaled version thereof to the left and right channels of the output audio signal 115.
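The handling of the center and LFE channels described above may be sketched as follows. The sketch covers only the step of adding the LFE channel to the center channel and mixing the scaled sum into the left and right output channels; the processing of the rear channels is omitted. The function and parameter names are hypothetical, and the signals are assumed to be equal-length sequences of time-domain samples.

```python
import math

def fold_center_and_lfe(out_left, out_right, center, lfe, gain=math.sqrt(0.5)):
    """Mix the (center + LFE) signal, scaled by `gain`, into both output channels.

    out_left, out_right: channels of the output audio signal before the mix.
    """
    mixed_left = [l + gain * (c + f) for l, c, f in zip(out_left, center, lfe)]
    mixed_right = [r + gain * (c + f) for r, c, f in zip(out_right, center, lfe)]
    return mixed_left, mixed_right
```

The default gain of √0.5 corresponds to the example gain factor mentioned above; other gain values may be passed instead.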
In another example, additionally or alternatively, the audio processing system 100, 100′ may enable adjusting balance between the contribution from the first signal component 105-1 and the second signal component 105-2 in the resulting widened stereo signal 115. This may be provided, for example, by applying respective different scaling gains to the first signal component 105-1 (or a derivative thereof) and to the second signal component 105-2 (or a derivative thereof). In this regard, respective scaling gains may be applied e.g. in the signal combiner 114, 114′ to scale the signal components derived from the first and second signal components 105-1, 105-2 accordingly, or in the signal divider 126 to scale the first and second signal components 105-1, 105-2 accordingly. A single respective scaling gain may be defined for scaling the first and second signal components 105-1, 105-2 (or a respective derivative thereof) across all frequency sub-bands or in a predefined sub-set of frequency sub-bands. Alternatively or additionally, different scaling gains may be applied across the frequency sub-bands, thereby enabling adjustment of the balance between the contribution from the first and second signal components 105-1, 105-2 only on some of the frequency sub-bands and/or adjusting the balance differently at different frequency sub-bands.
In a further example, alternatively or additionally, the audio processing system 100, 100′ may enable scaling of one or both of the first signal component 105-1 and the second signal component 105-2 (or respective derivatives thereof) independently of each other, thereby enabling equalization (across frequency sub-bands) for one or both of the first and second signal components. This may be provided, for example, by applying respective equalization gains to the first signal component 105-1 (or a derivative thereof) and to the second signal component 105-2 (or a derivative thereof). A dedicated equalization gain may be defined for one or more frequency sub-bands for the first signal component 105-1 and/or for the second signal component 105-2. In this regard, for each of the first and second signal components 105-1, 105-2, a respective equalization gain may be applied e.g. in the signal divider 126 or in the signal combiner 114, 114′ to scale a respective frequency sub-band of the respective one of the first and second signal components 105-1, 105-2 (or a respective derivative thereof). For a certain frequency sub-band, the equalization gain may be the same for both the first and second signal components 105-1, 105-2 or different equalization gains may be applied for the first and second signal components 105-1, 105-2.
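Both the balance adjustment and the equalization described above reduce to applying per-sub-band gains to the two signal components before they are summed. A minimal sketch, assuming that the gains are applied in a combiner operating on sub-band values of one channel and one time frame, is given below; the function and parameter names are hypothetical.

```python
def combine_with_gains(s_dr, s_sw, gain_dr, gain_sw):
    """Weighted per-sub-band sum of the two signal components.

    s_dr, s_sw: sub-band values derived from the first and second component.
    gain_dr, gain_sw: one scaling or equalization gain per sub-band.
    """
    return [
        gd * d + gs * w
        for d, w, gd, gs in zip(s_dr, s_sw, gain_dr, gain_sw)
    ]
```

Using the same gain value in every sub-band adjusts the overall balance between the components, whereas gains that vary across sub-bands provide equalization of the respective component.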
In a further example, additionally or alternatively, the audio processing system 100, 100′ may receive a sensor signal that enables deriving information that is indicative of the distance between the output loudspeakers and the listener's ears, which distance may be applied to derive or adjust the information that is indicative of the output loudspeaker configuration (e.g. the second control input) accordingly. As an example, the sensor signal may originate from a camera serving as the sensor 64, whereas the loudspeaker configuration entity 62 may derive, accordingly, the second control input that indicates output loudspeaker configuration with respect to the listening position based on the sensor signal from the camera and possibly further based on information on the positions of the loudspeakers 60 in the device 50 with respect to the position of the camera. With this information the loudspeaker configuration entity 62 may derive whether the user is holding the device 50 close to his/her face (e.g. closer than 30 cm), at a normal or typical distance (e.g. from 30 to 40 cm) or further away (e.g. farther away than 40 cm). In response to detecting the device to be close to the user's face, the loudspeaker configuration entity 62 may adjust the output loudspeaker positions, e.g. the output loudspeaker angles ∝out (i), accordingly to indicate a larger-than-normal angle between the output loudspeakers due to the user being closer to the device 50, whereas in response to detecting the device to be further away from the user's face, the loudspeaker configuration entity 62 may adjust the output loudspeaker positions, e.g. the output loudspeaker angles ∝out (i), accordingly to indicate a smaller-than-normal angle between the output loudspeakers due to the user being further away from the device 50. The updated output loudspeaker configuration may affect e.g. the operation of the signal decomposer 104 and/or the re-panner 106.
Operation of the audio processing system 100, 100′ described in the foregoing via multiple examples enables adaptively decomposing the stereo signal 101 into the first signal component 105-1 that represents the focus portion of the spatial audio image and that is provided for playback without application of stereo widening thereto and into the second signal component 105-2 that represents the peripheral (non-focus) portion of the spatial audio image that is subjected to the stereo widening processing. In particular, since the decomposition is carried out on the basis of audio content conveyed by the stereo signal 101 on a frame-by-frame basis, the audio processing system 100, 100′ enables both adaptation to relatively static spatial audio images of different characteristics and adaptation to changes in the spatial audio image over time.
The disclosed stereo widening technique that relies on excluding coherent sound sources within the focus portion of the spatial audio image from the stereo widening processing and applies the stereo widening processing predominantly to coherent sounds that are outside the focus portion and to non-coherent sounds (such as ambience) enables improved timbre and engagement and reduced ‘coloration’ of sounds that are within the focus portion while still providing a large extent of perceivable stereo widening. Moreover, the disclosed stereo widening technique that excludes the coherent sounds within the focus portion from the stereo widening processing allows for a higher dynamic range of the widened stereo signal 115 and hence enables driving the loudspeakers 60 at higher perceivable signal levels without audible distortion in comparison to a widened stereo signal produced by the stereo widening techniques known in the art.
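The distance-dependent adjustment of the output loudspeaker angles may be sketched as follows. The distance classes use the example limits of 30 cm and 40 cm given above. The geometric relation between the distance and the angle is an assumption made for this sketch (two loudspeakers placed symmetrically on a line perpendicular to the listening direction), as are the function names and the loudspeaker spacing parameter.

```python
import math

def classify_distance(distance_cm):
    """Classify the listening distance using the example limits."""
    if distance_cm < 30.0:
        return "close"
    if distance_cm <= 40.0:
        return "normal"
    return "far"

def output_half_angle_deg(distance_cm, speaker_spacing_cm):
    """Absolute output loudspeaker angle for a centered listener (assumed geometry)."""
    return math.degrees(math.atan2(speaker_spacing_cm / 2.0, distance_cm))
```

Under this assumed geometry a shorter distance yields a larger angle between the output loudspeakers and a longer distance a smaller one, in line with the adjustment described above.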
Components of the audio processing system 100, 100′ may be arranged to operate, for example, in accordance with a method 200 illustrated by a flowchart depicted in
The method 200 comprises deriving, based on the input audio signal 101, a first signal component 105-1 comprising a multi-channel audio signal that represents a focus portion of the spatial audio image and a second signal component 105-2 comprising a multi-channel audio signal that represents a non-focus portion of the spatial audio image, as indicated in block 202. The method 200 further comprises processing the second signal component 105-2 into a modified second signal component 113 wherein the width of the spatial audio image is extended from that of the second signal component 105-2, as indicated in block 204. The method 200 further comprises combining the first signal component 105-1 and the modified second signal component 113 into an output audio signal 115 comprising a multi-channel audio signal that represents a partially extended spatial audio image, as indicated in block 206. The method 200 may be varied in a number of ways, for example in view of the examples concerning operation of the audio processing system 100 and/or the audio processing system 100′ described in the foregoing.
The apparatus 300 comprises a processor 316 and a memory 315 for storing data and computer program code 317. The memory 315 and a portion of the computer program code 317 stored therein may be further arranged, with the processor 316, to implement at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing system 100, 100′.
The apparatus 300 comprises a communication portion 312 for communication with other devices. The communication portion 312 comprises at least one communication apparatus that enables wired or wireless communication with other apparatuses. A communication apparatus of the communication portion 312 may also be referred to as a respective communication means.
The apparatus 300 may further comprise user I/O (input/output) components 318 that may be arranged, possibly together with the processor 316 and a portion of the computer program code 317, to provide a user interface for receiving input from a user of the apparatus 300 and/or providing output to the user of the apparatus 300 to control at least some aspects of operation of the audio processing system 100, 100′ implemented by the apparatus 300. The user I/O components 318 may comprise hardware components such as a display, a touchscreen, a touchpad, a mouse, a keyboard, and/or an arrangement of one or more keys or buttons, etc. The user I/O components 318 may be also referred to as peripherals. The processor 316 may be arranged to control operation of the apparatus 300 e.g. in accordance with a portion of the computer program code 317 and possibly further in accordance with the user input received via the user I/O components 318 and/or in accordance with information received via the communication portion 312.
Although the processor 316 is depicted as a single component, it may be implemented as one or more separate processing components. Similarly, although the memory 315 is depicted as a single component, it may be implemented as one or more separate components, some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage.
The computer program code 317 stored in the memory 315 may comprise computer-executable instructions that control one or more aspects of operation of the apparatus 300 when loaded into the processor 316. As an example, the computer-executable instructions may be provided as one or more sequences of one or more instructions. The processor 316 is able to load and execute the computer program code 317 by reading the one or more sequences of one or more instructions included therein from the memory 315. The one or more sequences of one or more instructions may be configured to, when executed by the processor 316, cause the apparatus 300 to carry out at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing system 100, 100′.
Hence, the apparatus 300 may comprise at least one processor 316 and at least one memory 315 including the computer program code 317 for one or more programs, the at least one memory 315 and the computer program code 317 configured to, with the at least one processor 316, cause the apparatus 300 to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing system 100, 100′.
The computer program(s) stored in the memory 315 may be provided e.g. as a respective computer program product comprising at least one computer-readable non-transitory medium having the computer program code 317 stored thereon, which computer program code 317, when executed by the apparatus 300, causes the apparatus 300 to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing system 100, 100′. The computer-readable non-transitory medium may comprise a memory device or a record medium such as a CD-ROM, a DVD, a Blu-ray disc or another article of manufacture that tangibly embodies the computer program. As another example, the computer program may be provided as a signal configured to reliably transfer the computer program.
Reference(s) to a processor should not be understood to encompass only programmable processors, but also dedicated circuits such as field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processors, etc. Features described in the preceding description may be used in combinations other than the combinations explicitly described.
Although in the foregoing some functions have been described with reference to certain features and/or elements, those functions may be performable by other features and/or elements whether described or not. Although features have been described with reference to certain embodiments, those features may also be present in other embodiments whether described or not.
| Number | Date | Country | Kind |
|---|---|---|---|
| 1818690.8 | Nov 2018 | GB | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/FI2019/050795 | 11/8/2019 | WO | 00 |