The present invention relates to an apparatus and method for multichannel direct-ambient decomposition for audio signal processing.
Audio signal processing is becoming increasingly important. In this field, the separation of sound signals into direct and ambient sound signals plays an important role.
In general, acoustic sounds consist of a mixture of direct sounds and ambient (or diffuse) sounds. Direct sounds are emitted by sound sources, e.g. a musical instrument, a vocalist or a loudspeaker, and arrive on the shortest possible path at the receiver, e.g. the listener's ear entrance or microphone.
When listening to a direct sound, it is perceived as coming from the direction of the sound source. The relevant auditory cues for the localization and for other spatial sound properties are interaural level difference, interaural time difference and interaural coherence. Direct sound waves evoking identical interaural level difference and interaural time difference are perceived as coming from the same direction. In the absence of diffuse sound, the signals reaching the left and the right ear or any other multitude of sensors are coherent.
Ambient sounds, in contrast, are emitted by many spaced sound sources or sound reflecting boundaries contributing to the same ambient sound. When a sound wave reaches a wall in a room, a portion of it is reflected, and the superposition of all reflections in a room, the reverberation, is a prominent example of ambient sound. Other examples are audience sounds (e.g. applause), environmental sounds (e.g. rain), and other background sounds (e.g. babble noise). Ambient sounds are perceived as being diffuse and not locatable, and evoke an impression of envelopment (of being “immersed in sound”) in the listener. When capturing an ambient sound field using a multitude of spaced sensors, the recorded signals are at least partially incoherent.
Various applications of sound post-production and reproduction benefit from a decomposition of audio signals into direct signal components and ambient signal components. The main challenge for such signal processing is to achieve high separation while maintaining high sound quality for an arbitrary number of input channel signals and for all possible input signal characteristics. Direct-ambient decomposition (DAD), i.e. the decomposition of audio signals into direct signal components and ambient signal components, enables the separate reproduction or modification of the signal components, which is for example desired for the upmixing of audio signals.
The term upmixing refers to the process of creating a signal with P channels given an input signal with N channels where P>N. Its main application is the reproduction of audio signals using surround sound setups having more channels than available in the input signal. Reproducing the content by using advanced signal processing algorithms enables the listener to use all available channels of the multichannel sound reproduction setup. Such processing may decompose the input signal into meaningful signal components (e.g. based on their perceived position in the stereo image, direct sounds versus ambient sounds, single instruments) or into signals where these signal components are attenuated or boosted.
Two concepts of upmixing are widely known.
Advanced upmixing methods can be further categorized with respect to the positioning of direct and ambient signals. A distinction is made between the “direct/ambient-approach” and the “In-the-band”-approach. The core component of direct/ambience-based techniques is the extraction of an ambient signal which is fed e.g. into the rear channels or the height channels of a multi-channel surround sound setup. The reproduction of ambience using the rear or height channels evokes an impression of envelopment (being “immersed in sound”) in the listener. Additionally, the direct sound sources can be distributed among the front channels according to their perceived position in the stereo panorama. In contrast, the “In-the-band”-approach aims at positioning all sounds (direct sound as well as ambient sounds) around the listener using all available loudspeakers.
Decomposing an audio signal into direct and ambient signals also enables the separate modification of the ambient sounds or direct sounds, e.g. by scaling or filtering them. One use case is the processing of a recording of a musical performance which has been captured with an excessive amount of ambient sound. Another use case is audio production (e.g. for movie sound or music), where audio signals captured at different locations and therefore having different ambient sound characteristics are combined.
In any case, the requirement for such signal processing is to achieve high separation while maintaining high sound quality for an arbitrary number of input channel signals and for all possible input signal characteristics.
Various approaches in the conventional technology for DAD or for attenuating or boosting either the direct signal components or the ambient signal components have been provided, and are briefly reviewed in the following.
Known concepts relate to the processing of speech signals with the aim of removing undesired background noise from microphone recordings.
A method for attenuating the reverberation from speech recordings having two input channels is described in [1]. The reverberation signal components are reduced by attenuating the uncorrelated (or diffuse) signal components in the input signal. The processing is implemented in the time-frequency domain such that subband signals are processed by means of a spectral weighting method. The real-valued weighting factors are computed using the power spectral densities (PSD)
$$\phi_{xx}(m,k)=E\{X(m,k)\,X^{*}(m,k)\} \tag{1}$$
$$\phi_{yy}(m,k)=E\{Y(m,k)\,Y^{*}(m,k)\} \tag{2}$$
$$\phi_{xy}(m,k)=E\{X(m,k)\,Y^{*}(m,k)\} \tag{3}$$
where X(m,k) and Y(m,k) denote time-frequency domain representations of the time-domain input signals xt[n] and yt[n], E{⋅} is the expectation operation and X* is the complex conjugate of X.
The original authors point out that different spectral weighting functions are feasible when proportional to ϕxy(m,k), e.g. when using weights equal to the normalized cross-correlation function (or coherence function)
Following a similar rationale, the method described in [2] extracts an ambient signal using spectral weighting with weights derived from the normalized cross-correlation function computed in frequency bands, see Formula (4) (or, in the words of the original authors, the “interchannel short time coherence function”). The difference compared to [1] is that instead of attenuating the diffuse signal components, the direct signal components are attenuated using spectral weights which are a monotonic steady function of (1−ρ(m, k)).
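As an illustration of the spectral weighting reviewed above, the following Python sketch approximates the expectations of Formulae (1)-(3) by recursive averaging over time frames and derives ambience weights from a normalized cross-correlation. The function names, the smoothing constant alpha, the magnitude-based normalization and the linear mapping 1−ρ are assumptions made for illustration only; they are not taken from [1] or [2].

```python
import numpy as np

def recursive_psd(X, Y, alpha=0.9):
    """Approximate E{X Y*} per subband by recursive averaging over frames.

    X, Y: complex arrays of shape (frames, subbands).
    alpha: hypothetical smoothing constant (assumption, not from the text).
    """
    inst = X * np.conj(Y)
    phi = np.empty_like(inst)
    acc = inst[0]
    for m in range(inst.shape[0]):
        acc = alpha * acc + (1.0 - alpha) * inst[m]
        phi[m] = acc
    return phi

def coherence(X, Y, alpha=0.9, eps=1e-12):
    """Magnitude of the normalized cross-correlation (assumed normalization)."""
    phi_xx = recursive_psd(X, X, alpha).real
    phi_yy = recursive_psd(Y, Y, alpha).real
    phi_xy = recursive_psd(X, Y, alpha)
    return np.abs(phi_xy) / np.sqrt(phi_xx * phi_yy + eps)

def ambient_weights(rho):
    """One possible monotonic function of (1 - rho); the shape is an assumption."""
    return np.clip(1.0 - rho, 0.0, 1.0)
```

With such weights, fully coherent channel pairs (ρ close to 1) are attenuated strongly, while incoherent time-frequency bins pass largely unchanged.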
The decomposition for the application of upmixing of input signals having two channels using multichannel Wiener filtering has been described in [3]. The processing is done in the time-frequency domain. The input signal is modelled as mixture of the ambient signal and one active direct source (per frequency band), where the direct signal in one channel is restricted to be a scaled copy of the direct signal component in the second channel, i.e. amplitude panning. The panning coefficient and the powers of direct signal and ambient signal are estimated using the normalized cross-correlation and the input signal powers in both channels. The direct output signal and the ambient output signals are derived from linear combinations of the input signals, with real-valued weighting coefficients. Additional postscaling is applied such that the power of the output signals equals the estimated quantities.
The method described in [4] extracts an ambience signal using spectral weighting, based on an estimate of the ambience power. The ambience power is estimated based on the assumptions that the direct signal components in both channels are fully correlated, that the ambient channel signals are uncorrelated with each other and with the direct signals, and that the ambience powers in both channels are equal.
A method for upmixing of stereo signals based on Directional Audio Coding (DirAC) is described in [5]. DirAC aims at analyzing and reproducing the direction of arrival, the diffuseness and the spectrum of a sound field. For upmixing of stereo input signals, anechoic B-format recordings of the input signals are simulated.
A method for extracting the uncorrelated reverberation from a stereo audio signal using an adaptive filter algorithm which aims at predicting the direct signal component in one channel signal using the other channel signal by means of a Least Mean Square (LMS) algorithm is described in [6]. Subsequently the ambient signals are derived by subtracting the estimated direct signals from the input signals. The rationale of this approach is that the prediction only works for correlated signals and the prediction error resembles the uncorrelated signal. Various adaptive filter algorithms based on the LMS principle exist and are feasible, e.g. the LMS or the Normalized LMS (NLMS) algorithm.
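The prediction-based rationale can be illustrated by a minimal NLMS sketch in Python. It predicts one channel from the other and treats the prediction error as the ambience estimate. The function name, the filter order, the step size mu and the regularization eps are hypothetical choices for illustration and do not reproduce the specific algorithm of [6].

```python
import numpy as np

def nlms_ambience(x, y, order=8, mu=0.5, eps=1e-8):
    """Predict channel y from channel x with NLMS.

    Returns (direct_est, ambient_est) for channel y, where ambient_est is
    the prediction error. All parameter values are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(order)        # adaptive filter coefficients
    buf = np.zeros(order)      # most recent input samples, newest first
    direct_est = np.zeros(len(y))
    for n in range(len(y)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]
        direct_est[n] = w @ buf
        err = y[n] - direct_est[n]
        # normalized LMS update
        w += mu * err * buf / (buf @ buf + eps)
    ambient_est = y - direct_est
    return direct_est, ambient_est
```

For two channels that carry only a scaled copy of the same signal, the prediction error decays towards zero after convergence, which matches the statement that the prediction works for correlated signals.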
For the decomposition of input signals with more than two channels, a method is described in [7] where the multichannel signals are firstly downmixed to obtain a 2-channel stereo signal and subsequently a method for processing stereo input signals presented in [3] is applied.
For the processing of mono signals, the method described in [8] extracts an ambience signal using spectral weighting where the spectral weights are computed using feature extraction and supervised learning.
Another method for extracting an ambience signal from mono recordings for the application of upmixing obtains the time-frequency domain representation from the difference of the time-frequency domain representation of the input signal and a compressed version of it, advantageously computed using non-negative matrix factorization [9].
A method for extracting and changing the reverberant signal components in an audio signal based on the estimation of the magnitude transfer function of the reverberant system which has generated the reverberant signal is described in [10]. An estimate of the magnitudes of the frequency domain representation of the signal components is derived by means of recursive filtering and can be modified.
According to an embodiment, an apparatus for generating one or more audio output channel signals depending on two or more audio input channel signals, wherein each of the two or more audio input channel signals includes direct signal portions and ambient signal portions, may have: a filter determination unit for determining a filter by estimating first power spectral density information and by estimating second power spectral density information, and a signal processor for generating the one or more audio output channel signals by applying the filter on the two or more audio input channel signals, wherein the first power spectral density information indicates power spectral density information on the two or more audio input channel signals, and the second power spectral density information indicates power spectral density information on the ambient signal portions of the two or more audio input channel signals, or wherein the first power spectral density information indicates the power spectral density information on the two or more audio input channel signals, and the second power spectral density information indicates power spectral density information on the direct signal portions of the two or more audio input channel signals, or wherein the first power spectral density information indicates the power spectral density information on the direct signal portions of the two or more audio input channel signals, and the second power spectral density information indicates the power spectral density information on the ambient signal portions of the two or more audio input channel signals.
According to another embodiment, a method for generating one or more audio output channel signals depending on two or more audio input channel signals, wherein each of the two or more audio input channel signals includes direct signal portions and ambient signal portions, may have the steps of: determining a filter by estimating first power spectral density information and by estimating second power spectral density information, and generating the one or more audio output channel signals by applying the filter on the two or more audio input channel signals, wherein the first power spectral density information indicates power spectral density information on the two or more audio input channel signals, and the second power spectral density information indicates power spectral density information on the ambient signal portions of the two or more audio input channel signals, or wherein the first power spectral density information indicates the power spectral density information on the two or more audio input channel signals, and the second power spectral density information indicates power spectral density information on the direct signal portions of the two or more audio input channel signals, or wherein the first power spectral density information indicates the power spectral density information on the direct signal portions of the two or more audio input channel signals, and the second power spectral density information indicates the power spectral density information on the ambient signal portions of the two or more audio input channel signals.
Another embodiment may have a computer program for implementing the inventive method when being executed on a computer or processor.
An apparatus for generating one or more audio output channel signals depending on two or more audio input channel signals is provided. Each of the two or more audio input channel signals comprises direct signal portions and ambient signal portions. The apparatus comprises a filter determination unit for determining a filter by estimating first power spectral density information and by estimating second power spectral density information. Moreover, the apparatus comprises a signal processor for generating the one or more audio output channel signals by applying the filter on the two or more audio input channel signals. The first power spectral density information indicates power spectral density information on the two or more audio input channel signals, and the second power spectral density information indicates power spectral density information on the ambient signal portions of the two or more audio input channel signals. Or, the first power spectral density information indicates the power spectral density information on the two or more audio input channel signals, and the second power spectral density information indicates power spectral density information on the direct signal portions of the two or more audio input channel signals. Or, the first power spectral density information indicates the power spectral density information on the direct signal portions of the two or more audio input channel signals, and the second power spectral density information indicates the power spectral density information on the ambient signal portions of the two or more audio input channel signals.
Embodiments provide concepts for decomposing audio input signals into direct signal components and ambient signal components, which can be applied for sound post-production and reproduction. The main challenge for such signal processing is to achieve high separation while maintaining high sound quality for an arbitrary number of input channel signals and for all possible input signal characteristics. The provided concepts are based on multichannel signal processing in the time-frequency domain which leads to a constrained optimal solution in the mean squared error sense, e.g. subject to constraints on the distortion of the estimated desired signals or on the reduction of the residual interference.
Embodiments for decomposing audio input signals into direct signal components and ambient signal components are provided. Furthermore, a derivation of filters for computing the ambient signal components will be provided, and moreover, embodiments for the applications of the filters are described.
Some embodiments relate to the unguided upmix following the direct/ambient-approach with input signals having more than one channel.
For the envisaged applications of the described decomposition, one is interested in computing output signals having the same number of channels as the input signal. For this application, embodiments provide very good results in terms of separation and sound quality, because they can cope with input signals where the direct signals are time delayed between the input channels. In contrast to other concepts, e.g. the concepts provided in [3], embodiments do not assume that the direct sounds in the input signals are panned by scaling only (amplitude panning), but also allow for time differences between the direct signals in each channel.
Furthermore, embodiments are able to operate on input signals having an arbitrary number of channels, in contrast to all other concepts in the conventional technology (see above) which can only process input signals having one or two channels.
Other advantages of embodiments are the use of the control parameters, the estimation of the ambient PSD matrix and further modifications of the filter as described below.
Some embodiments provide consistent ambient sounds for all input sound objects. When the input signals are decomposed into direct and ambient sounds, some embodiments adapt the ambient sound characteristics by means of appropriate audio signal processing, and other embodiments replace the ambient signal components by means of artificial reverberation and other artificial ambient sounds.
According to an embodiment, the apparatus may further comprise an analysis filterbank being configured to transform the two or more audio input channel signals from a time domain to a time-frequency domain. The filter determination unit may be configured to determine the filter by estimating the first power spectral density information and the second power spectral density information depending on the audio input channel signals, being represented in the time-frequency domain. The signal processor may be configured to generate the one or more audio output channel signals, being represented in a time-frequency domain, by applying the filter on the two or more audio input channel signals, being represented in the time-frequency domain. Moreover, the apparatus may further comprise a synthesis filterbank being configured to transform the one or more audio output channel signals, being represented in a time-frequency domain, from the time-frequency domain to the time domain.
Moreover, a method for generating one or more audio output channel signals depending on two or more audio input channel signals is provided. Each of the two or more audio input channel signals comprises direct signal portions and ambient signal portions. The method comprises:
The first power spectral density information indicates power spectral density information on the two or more audio input channel signals, and the second power spectral density information indicates power spectral density information on the ambient signal portions of the two or more audio input channel signals. Or, the first power spectral density information indicates the power spectral density information on the two or more audio input channel signals, and the second power spectral density information indicates power spectral density information on the direct signal portions of the two or more audio input channel signals. Or, the first power spectral density information indicates the power spectral density information on the direct signal portions of the two or more audio input channel signals, and the second power spectral density information indicates the power spectral density information on the ambient signal portions of the two or more audio input channel signals.
Moreover, a computer program for implementing the above-described method when being executed on a computer or signal processor is provided.
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
The apparatus comprises a filter determination unit 110 for determining a filter by estimating first power spectral density information and by estimating second power spectral density information.
Moreover, the apparatus comprises a signal processor 120 for generating the one or more audio output channel signals by applying the filter on the two or more audio input channel signals.
The first power spectral density information indicates power spectral density information on the two or more audio input channel signals, and the second power spectral density information indicates power spectral density information on the ambient signal portions of the two or more audio input channel signals.
Or, the first power spectral density information indicates the power spectral density information on the two or more audio input channel signals, and the second power spectral density information indicates power spectral density information on the direct signal portions of the two or more audio input channel signals.
Or, the first power spectral density information indicates the power spectral density information on the direct signal portions of the two or more audio input channel signals, and the second power spectral density information indicates the power spectral density information on the ambient signal portions of the two or more audio input channel signals.
Embodiments provide concepts for decomposing audio input signals into direct signal components and ambient signal components, which can be applied for sound post-production and reproduction. The main challenge for such signal processing is to achieve high separation while maintaining high sound quality for an arbitrary number of input channel signals and for all possible input signal characteristics. The provided embodiments are based on multichannel signal processing in the time-frequency domain and provide an optimal solution in the mean squared error sense subject to constraints on the distortion of the estimated desired signals or on the reduction of the residual interference.
At first, inventive concepts are described, on which embodiments of the present invention are based.
It is assumed that N input channel signals yt[n] are received:
$$y_t[n]=[\,y_1[n]\;\ldots\;y_N[n]\,]^{T}. \tag{5}$$
For example, N≥2. The aim of the provided concepts is to decompose the input channel signals y1[n] . . . yN[n] (=[yt[n]]T) into N direct signal components denoted by dt[n]=[d1[n] . . . dN[n]]T and/or N ambient signal components denoted by at[n]=[a1[n] . . . aN[n]]T. The processing can be applied for all input channels, or the input signal channels are divided into subsets of channels which are processed separately.
According to embodiments, one or more of the direct signal components d1[n], . . . , dN[n] and/or one or more of the ambient signal components a1[n], . . . , aN[n] shall be estimated from the two or more input channel signals y1[n], . . . , yN[n] to obtain one or more estimations (d̂1[n], . . . , d̂N[n], â1[n], . . . , âN[n]) of the direct signal components d1[n], . . . , dN[n] and/or of the ambient signal components a1[n], . . . , aN[n] as the one or more output channel signals.
An example for the provided outputs of some embodiments is depicted in
According to embodiments, the processing may, for example, be performed in the time-frequency domain. A time-frequency domain representation of the input audio signal may, for example, be obtained by means of a filterbank (the analysis filterbank), e.g. the Short-time Fourier transform (STFT).
According to an embodiment illustrated by
In the embodiment of
A time-frequency domain representation comprises a certain number of subband signals which evolve over time. Adjacent subbands can optionally be linearly combined into broader subband signals in order to reduce computational complexity. Each subband of the input signals is separately processed, as described in detail in the following. Time domain output signals are obtained by applying the inverse processing of the filterbank, i.e. the synthesis filterbank. All signals are assumed to have zero mean; the time-frequency domain signals can be modeled as complex random variables.
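The analysis and synthesis filterbank pair may, for example, be sketched with an STFT as follows. The square-root Hann window, the 50 % overlap, the frame length of 512 samples and the function names are assumptions chosen so that analysis followed by synthesis reconstructs the input; the text does not prescribe these values.

```python
import numpy as np

def analysis_filterbank(x, n_fft=512):
    """STFT of a single channel; returns an array of shape (frames, subbands)."""
    hop = n_fft // 2
    win = np.sqrt(np.hanning(n_fft + 1)[:-1])   # periodic sqrt-Hann window
    # zero-pad so that every input sample is covered by two overlapping frames
    pad = np.concatenate([np.zeros(hop), x, np.zeros(n_fft)])
    n_frames = (len(pad) - n_fft) // hop + 1
    frames = np.stack([pad[m * hop:m * hop + n_fft] * win
                       for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def synthesis_filterbank(Y, length, n_fft=512):
    """Inverse STFT by windowed overlap-add; returns `length` samples."""
    hop = n_fft // 2
    win = np.sqrt(np.hanning(n_fft + 1)[:-1])
    frames = np.fft.irfft(Y, n=n_fft, axis=1) * win
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for m, frame in enumerate(frames):
        out[m * hop:m * hop + n_fft] += frame
    return out[hop:hop + length]
```

In a multichannel setting, the analysis would be applied to each input channel separately, and the per-subband processing would operate on the resulting values Y1(m,k), . . . , YN(m,k).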
In the following, definitions and assumptions are provided.
The following definitions are used throughout the description of the devised method: The time-frequency domain representation of a multichannel input signal with N channels is given by
$$y(m,k)=[\,Y_1(m,k)\;\;Y_2(m,k)\;\;\ldots\;\;Y_N(m,k)\,]^{T}, \tag{6}$$
with time index m and subband index k, k=1 . . . K and is assumed to be an additive mixture of the direct signal component d(m, k) and the ambient signal component a(m, k), i.e.
$$y(m,k)=d(m,k)+a(m,k), \tag{7}$$
with
$$d(m,k)=[\,D_1(m,k)\;\;D_2(m,k)\;\;\ldots\;\;D_N(m,k)\,]^{T} \tag{8}$$
$$a(m,k)=[\,A_1(m,k)\;\;A_2(m,k)\;\;\ldots\;\;A_N(m,k)\,]^{T}, \tag{9}$$
where Di(m,k) denotes the direct component and Ai(m,k) the ambient component in the i-th channel.
The objective of the direct-ambient decomposition is to estimate d(m,k) and a(m,k). The output signals are computed using the filter matrices HD(m,k) or HA(m,k) or both. The filter matrices are of size N×N and are complex-valued, or may, in some embodiments, e.g., be real-valued. An estimate of the N-channel signals of direct signal components and ambient signal components is obtained from
$$\hat{d}(m,k)=H_D^{H}(m,k)\,y(m,k) \tag{10}$$
$$\hat{a}(m,k)=H_A^{H}(m,k)\,y(m,k), \tag{11}$$
Alternatively, only one filter matrix can be used, and the subtraction illustrated in
$$\hat{d}(m,k)=H_D^{H}(m,k)\,y(m,k) \tag{12}$$
$$\hat{a}(m,k)=[\,I-H_D(m,k)\,]^{H}\,y(m,k), \tag{13}$$
where I is the identity matrix of size N×N, or, as shown in
$$\hat{a}(m,k)=H_A^{H}(m,k)\,y(m,k) \tag{14}$$
$$\hat{d}(m,k)=[\,I-H_A(m,k)\,]^{H}\,y(m,k), \tag{15}$$
respectively. Here, superscript H denotes the conjugate transpose of a matrix or a vector. The filter matrix HD(m,k) is used for computing estimates for the direct signals d̂(m,k). The filter matrix HA(m,k) is used for computing estimates for the ambient signals â(m,k).
In the above Formulae (10)-(15), y(m,k) indicates the two or more audio input channel signals. â(m,k) indicates an estimation of the ambient signal portions and d̂(m,k) indicates an estimation of the direct signal portions of the audio input channel signals, respectively. â(m,k) and/or d̂(m,k) or one or more vector components of â(m,k) and/or d̂(m,k) may be the one or more audio output channel signals.
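The application of the filter matrices to a single time-frequency bin can be sketched in Python as follows. The function names are hypothetical; the sketch merely evaluates Formulae (12)-(15) for a given filter matrix and input vector.

```python
import numpy as np

def apply_filter(H, y):
    """Return H^H y for one time-frequency bin (H: N x N, y: length N)."""
    return H.conj().T @ y

def decompose_with_direct_filter(H_D, y):
    """Formulae (12) and (13): only the direct filter matrix is available."""
    d_hat = apply_filter(H_D, y)
    a_hat = apply_filter(np.eye(len(y)) - H_D, y)
    return d_hat, a_hat

def decompose_with_ambient_filter(H_A, y):
    """Formulae (14) and (15): only the ambient filter matrix is available."""
    a_hat = apply_filter(H_A, y)
    d_hat = apply_filter(np.eye(len(y)) - H_A, y)
    return d_hat, a_hat
```

In both single-filter variants the two estimates add up to the input vector y(m,k), since H^H y + (I − H)^H y = y; the second estimate is therefore equivalent to subtracting the first estimate from the input.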
One, some or all of the Formulae (10), (11), (12), (13), (14) and (15) may be employed by the signal processor 120 of
The filtering matrices are computed from estimates of the signal statistics as described below. In particular, the filter determination unit 110 is configured to determine the filter by estimating first power spectral density (PSD) information and second PSD information.
Define:
$$\phi_{x_i x_j}(m,k)=E\{X_i(m,k)\,X_j^{*}(m,k)\}, \tag{16}$$
where E{⋅} is the expectation operator and X* denotes the complex conjugate of X. For i=j the PSDs and for i≠j the cross-PSDs are obtained.
The covariance matrices for y(m, k), d(m,k) and a(m,k) are
$$\Phi_y(m,k)=E\{y(m,k)\,y^{H}(m,k)\} \tag{17}$$
$$\Phi_d(m,k)=E\{d(m,k)\,d^{H}(m,k)\} \tag{18}$$
$$\Phi_a(m,k)=E\{a(m,k)\,a^{H}(m,k)\}. \tag{19}$$
The covariance matrices Φy(m,k), Φd(m,k) and Φa(m,k) comprise estimates of the PSD for all channels on the main diagonal, while the off-diagonal elements are estimates of the cross-PSD of the respective channel signals. Thus, each of the matrices Φy(m,k), Φd(m,k) and Φa(m,k) represents an estimation of power spectral density information.
In Formulae (17)-(19), Φy(m,k) indicates power spectral density information on the two or more audio input channel signals. Φd(m,k) indicates power spectral density information on the direct signal components of the two or more audio input channel signals. Φa(m,k) indicates power spectral density information on the ambient signal components of the two or more audio input channel signals.
Each of the matrices Φy(m,k), Φd(m,k) and Φa(m,k) of Formulae (17), (18) and (19) can be considered as power spectral density information. However, it should be noted that in other embodiments, the first and the second power spectral density information is not a matrix, but may be represented in any other kind of suitable format. For example, according to embodiments, the first and/or the second power spectral density information may be represented as one or more vectors. In further embodiments, the first and/or the second power spectral density information may be represented as a plurality of coefficients.
It is assumed that
As a consequence it holds that
$$\Phi_y(m,k)=\Phi_d(m,k)+\Phi_a(m,k), \tag{20}$$
$$\Phi_a(m,k)=\phi_A(m,k)\,I_{N\times N}, \tag{21}$$
As a consequence of Formula (20) it follows that when two matrices of the matrices Φy(m,k), Φd(m,k) and Φa(m,k) are determined, then the third one of the matrices is immediately available. As a further consequence, it follows that it is enough to determine only:
because the third power spectral density information (that has not been estimated) becomes immediately apparent from the relationship of the three kinds of power spectral density information (e.g., by Formula (20), or by any other reformulation of that relationship when the three kinds of PSD information (PSD of the complete input signal, PSD of the ambience components and PSD of the direct components) are not represented as matrices but are available in another suitable representation, e.g., as one or more vectors or as a plurality of coefficients).
For assessing the performance of the devised method, the following signals are defined:
In the following, the derivation of the filter matrices is described according to
At first, embodiments for the estimation of the direct signal components are described.
The rationale of the devised method is to compute the filters such that the residual ambient signal ra is minimized while constraining the direct signal distortion qd. This leads to the constrained optimization problem
where σd,max2 is the maximum allowable direct signal distortion. The solution is given by
$$H_D(\beta_i)=[\,\Phi_d+\beta_i\Phi_a\,]^{-1}\,\Phi_d. \tag{23}$$
The filter for computing the direct output signal of the i-th channel equals
$$h_{D,i}(\beta_i)=[\,\Phi_d+\beta_i\Phi_a\,]^{-1}\,\Phi_d\,u_i, \tag{24}$$
where ui is a vector of length N whose elements are zero except for a 1 at the i-th position. The parameter βi enables a trade-off between residual ambient signal reduction and direct signal distortion. For the system depicted in
It is noted that a similar solution can be obtained by formulating the constrained optimization problem as
When Φd is of rank one, the relation between σd,max2 and βi for the i-th channel signal is derived as
where ϕD
where the trace of a square matrix A equals the sum of the elements on the main diagonal,
It should be noted that the statement, that Φd is of rank one, is only an assumption. No matter whether in reality this assumption is true or not, embodiments of the present invention employ the above Formulae (26), (27) and (28), even in situations, where, in reality, the exact result of Φd is so that Φd is not of rank one. In such situations, embodiments of the present invention also provide good results, even when the assumption, that Φd is of rank one, is, in reality, not true.
In the following, an estimation of the ambient signal components is described.
The rationale of the devised method is to compute the filters such that the residual direct signal rd is minimized while constraining the ambient signal distortion qa. This leads to the constrained optimization problem
where σa,max2 is the maximum allowable ambient signal distortion. The solution is given by
$$H_A(\beta_i)=[\,\beta_i\Phi_d+\Phi_a\,]^{-1}\,\Phi_a. \tag{30}$$
The filter for computing the ambient output signal of the i-th channel equals
$$h_{A,i}(\beta_i) = [\beta_i \Phi_d + \Phi_a]^{-1} \Phi_a u_i. \tag{31}$$
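As a minimal sketch of Formulae (23), (24), (30) and (31), the filter matrices and their i-th columns can be computed with a linear solve instead of an explicit inverse. The PSD matrices, the gain vector `v` and the function names below are illustrative assumptions, not values or identifiers from the embodiments.

```python
import numpy as np

def direct_filter(phi_d, phi_a, beta):
    """Formula (23): H_D(beta) = [Phi_d + beta * Phi_a]^{-1} Phi_d."""
    return np.linalg.solve(phi_d + beta * phi_a, phi_d)

def ambient_filter(phi_d, phi_a, beta):
    """Formula (30): H_A(beta) = [beta * Phi_d + Phi_a]^{-1} Phi_a."""
    return np.linalg.solve(beta * phi_d + phi_a, phi_a)

# Hypothetical two-channel subband: one direct source plus uncorrelated ambience.
v = np.array([1.0, 0.5 + 0.5j])        # assumed per-channel gains of the direct source
phi_d = 2.0 * np.outer(v, v.conj())    # direct PSD matrix (rank one in this example)
phi_a = 0.3 * np.eye(2)                # ambient PSD matrix, equal power in each channel

H_D = direct_filter(phi_d, phi_a, beta=1.0)
H_A = ambient_filter(phi_d, phi_a, beta=1.0)

i = 0
u_i = np.eye(2)[:, i]                  # 1 at position i, zeros elsewhere
h_D_i = H_D @ u_i                      # Formula (24): i-th column of H_D
h_A_i = H_A @ u_i                      # Formula (31): i-th column of H_A
```

For β = 1, both filters share the inverse of Φd + Φa, so H_D and H_A add up to the identity matrix in this example.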
In the following, embodiments are provided in detail which realize concepts of the present invention.
To determine power spectral density information, the PSD matrix Φy of the audio input channel signals may, for example, be estimated directly using short-time moving averaging or recursive averaging. The ambient PSD matrix Φa may, for example, be estimated as described below. The direct PSD matrix Φd may then, for example, be obtained using Formula (20).
In the following, it is again assumed that not more than one direct sound source is active at a time in each subband (single direct source), and that consequently Φd is of rank one.
It should be noted that the statements that not more than one direct sound source is active and that Φd is of rank one are only assumptions. Embodiments of the present invention employ the formulae below, in particular Formulae (32) and (33), regardless of whether these assumptions hold in reality, i.e., even when more than one direct sound source is active and even when the actual Φd is not of rank one. Embodiments of the present invention also provide good results in such situations.
Thus, assuming that not more than one direct sound source is active, and that Φd is of rank one, Formula (23) can be written as
Formula (33) provides a solution for the constrained optimization problem of Formula (22).
In the above Formulae (32) and (33), Φa−1 is the inverse matrix of Φa. It is apparent that Φa−1 also indicates power spectral density information on the ambient signal portions of the two or more audio input channel signals.
To determine HD(βi), Φa−1 and Φd have to be determined. When Φa is available, Φa−1 can immediately be determined. λ is defined according to Formulae (27) and (28), and its value is available when Φa−1 and Φd are available. Besides determining Φa−1, Φd and λ, a suitable value for βi has to be chosen.
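The following sketch checks numerically how a rank-one Φd simplifies Formula (23). It assumes that λ is the trace of Φa−1Φd and that the simplified filter is Φa−1Φd/(βi+λ). This closed form follows from the Sherman-Morrison identity, but whether it matches Formulae (27), (28) and (33) term by term is an assumption, since those formulae are not reproduced in this text. All matrices are randomly generated placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4

# Rank-one direct PSD matrix: a single source with per-channel gains v.
v = rng.standard_normal(N) + 1j * rng.standard_normal(N)
phi_d = 1.5 * np.outer(v, v.conj())

# Hermitian positive-definite ambient PSD matrix (placeholder values).
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
phi_a = A @ A.conj().T + N * np.eye(N)

beta = 0.7
phi_a_inv = np.linalg.inv(phi_a)

# Assumed definition of lambda: trace of Phi_a^{-1} Phi_d (real for Hermitian PSD matrices).
lam = np.trace(phi_a_inv @ phi_d).real

H_general = np.linalg.solve(phi_d + beta * phi_a, phi_d)   # Formula (23)
H_rank_one = phi_a_inv @ phi_d / (beta + lam)              # assumed closed form
```

Under these assumptions the two matrices agree to numerical precision, and changing βi only rescales the denominator without requiring a new matrix inversion.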
Moreover, Formula (33) can be reformulated (see Formula (20)), so that:
and, thus, so that only the PSD information Φy on the audio input channel signals and the PSD information Φd on the direct signal portions of the audio input channel signals have to be determined.
Moreover, Formula (33) can be reformulated (see Formula (20)), so that:
and, thus, so that only the PSD information Φa−1 on the ambient signal portions of the audio input channel signals and the PSD information Φd on the direct signal portions of the audio input channel signals have to be determined.
Furthermore, Formula (33) can be reformulated, so that:
and, thus, so that HA(βi) is determined.
Formula (33c) provides a solution for the constrained optimization problem of Formula (29).
Similarly, Formulae (33a) and (33b) can be reformulated to:
It should be noted that by determining HD(βi), the filter HA(βi) is immediately available as: HA(βi)=IN×N−HD(βi).
Furthermore, it should be noted that by determining HA(βi), the filter HD(βi) is immediately available as: HD(βi)=IN×N−HA(βi).
As stated above, to determine HD(βi), e.g., according to Formula (33), Φy and Φa may be determined:
The PSD matrix of the audio signals Φy(m,k) can be estimated directly, for example, by using recursive averaging
$$\Phi_y(m,k) = (1-\alpha)\, y(m,k)\, y^H(m,k) + \alpha\, \Phi_y(m-1,k), \tag{34a}$$
where α is a filter coefficient which determines the integration time, or
for example, by using short-time moving weighted averaging
$$\Phi_y(m,k) = b_0\, y(m,k)\, y^H(m,k) + b_1\, y(m-1,k)\, y^H(m-1,k) + b_2\, y(m-2,k)\, y^H(m-2,k) + \dots + b_L\, y(m-L,k)\, y^H(m-L,k), \tag{34b}$$
where L is, e.g., the number of past values used for the computation of the PSD, and b0 . . . bL are the filter coefficients which are, for example, in the range [0, 1] (e.g., 0 ≤ filter coefficient ≤ 1), or
for example, by using short-time moving averaging, according to Equation (34b) but with
for all i=0 . . . L.
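The two estimators of Formulae (34a) and (34b) can be sketched as follows for a single subband k. The zero initial state, the treatment of frames before m = 0 as zero, and the uniform weights 1/(L+1) used for the plain moving average are assumptions made for this example, as are the function names.

```python
import numpy as np

def recursive_psd(frames, alpha):
    """Formula (34a) for one subband; frames has shape (M, N): M frames, N channels."""
    M, N = frames.shape
    phi = np.zeros((N, N), dtype=complex)      # assumed initial state
    out = np.empty((M, N, N), dtype=complex)
    for m in range(M):
        y = frames[m]
        phi = (1.0 - alpha) * np.outer(y, y.conj()) + alpha * phi
        out[m] = phi
    return out

def moving_weighted_psd(frames, b):
    """Formula (34b) for one subband with weights b[0..L]; frames before m = 0 count as zero."""
    M, N = frames.shape
    out = np.zeros((M, N, N), dtype=complex)
    for m in range(M):
        for l, b_l in enumerate(b):
            if m - l >= 0:
                y = frames[m - l]
                out[m] += b_l * np.outer(y, y.conj())
    return out

rng = np.random.default_rng(1)
frames = rng.standard_normal((50, 2)) + 1j * rng.standard_normal((50, 2))

phi_rec = recursive_psd(frames, alpha=0.9)
L = 7
phi_avg = moving_weighted_psd(frames, b=np.full(L + 1, 1.0 / (L + 1)))  # assumed uniform weights
```

A value of α closer to 1 gives a longer integration time in the recursive estimate, while L sets the window length of the moving average.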
Now, estimating the ambient PSD matrix Φa according to embodiments is described.
The ambient PSD matrix Φa is given by
$$\Phi_a = \hat{\phi}_A\, I_{N\times N}, \tag{35}$$
where IN×N is the identity matrix of size N×N. ϕ̂A is, e.g., a number.
One solution according to an embodiment is, for example, obtained by using a constant value, by using Formula (21) and setting ϕ̂A to a real-positive constant ε. The advantage of this approach is that the computational complexity is negligible.
In embodiments, the filter determination unit 110 is configured to determine ϕ̂A depending on the two or more audio input channel signals.
An option with very low computational complexity is, according to an embodiment, to use a fraction of the input power and to set ϕ̂A to the mean value or the minimum value of the input PSD or a fraction of it, e.g.
where the parameter g controls the amount of ambience power, and 0<g<1.
According to a further embodiment, an estimation is conducted based on the arithmetic mean. Given the assumptions that lead to Formula (20) and Formula (21), it can be shown that the PSD ϕ̂A can be computed using
While tr{Φy} can be directly computed using e.g. the recursive integration of Formula (34a), or, e.g., the short-time moving weighted averaging of Formula (34b), tr{Φd} is estimated as
Alternatively, the PSD ϕ̂A(m,k) can be computed for N>2 by choosing two input channel signals and estimating ϕ̂A(m,k) only for one pair of signal channels. More accurate results are obtained when applying this procedure to more than one pair of input channel signals and combining the results, e.g. by averaging over all estimates. The subsets can be chosen by taking advantage of a-priori knowledge about channels having similar ambient power, e.g. by estimating the ambient power separately in all rear channels and all front channels of a 5.1 recording.
Moreover, it should be noted that from Formulae (20) and (35), it follows that
$$\Phi_d = \Phi_y - \hat{\phi}_A\, I_{N\times N}. \tag{35a}$$
According to some embodiments, Φd is determined by determining ϕ̂A (e.g., according to Formula (35), or Formula (36) or according to Formulae (37)-(40)) and by employing Formula (35a) to obtain the power spectral density information on the direct signal portions of the audio input channel signals. Then, HD(βi) may be determined, for example, by employing Formula (33a).
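The relation between Φy, Φd and the scalar ambient PSD can be illustrated with a small numerical sketch. It builds Φy from known placeholder values via Formulae (20) and (35) and then recovers the ambient power from the traces, using tr{Φy} = tr{Φd} + N·ϕ̂A, which follows from those two formulae. The fraction-of-mean estimate with g is an assumed form of the low-complexity option, and the true Φd is used here only to check the algebra; in practice tr{Φd} has to be estimated.

```python
import numpy as np

N = 3
v = np.array([1.0, 0.8, 0.4])
phi_d_true = 2.0 * np.outer(v, v)              # hypothetical rank-one direct PSD matrix
phi_A_true = 0.25                              # hypothetical ambient power per channel
phi_y = phi_d_true + phi_A_true * np.eye(N)    # Formulae (20) and (35)

# Low-complexity option: a fraction g of the mean input PSD (assumed form, 0 < g < 1).
g = 0.1
phi_A_fraction = g * np.trace(phi_y).real / N

# Trace relation implied by Formulae (20) and (35).
phi_A_trace = (np.trace(phi_y) - np.trace(phi_d_true)).real / N

# Formula (35a): direct PSD matrix from the input PSD matrix and the ambient estimate.
phi_d_est = phi_y - phi_A_trace * np.eye(N)
```

With an exact ambient estimate, Formula (35a) returns the direct PSD matrix unchanged; an overestimated ϕ̂A would remove too much power from the main diagonal.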
In the following, the choice for the parameter βi is considered.
βi is a trade-off parameter. The trade-off parameter βi is a number.
In some embodiments, only one trade-off parameter βi is determined which is valid for all of the audio input channel signals, and this trade-off parameter is then considered as the trade-off information of the audio input channel signals.
In other embodiments, one trade-off parameter βi is determined for each of the two or more audio input channel signals, and these two or more trade-off parameters of the audio input channel signals then form together the trade-off information.
In further embodiments, the trade-off information may not be represented as a parameter but may be represented in a different kind of suitable format.
As noted above, the parameter βi enables a trade-off between ambient signal reduction and direct signal distortion. It can either be chosen to be constant, or signal-dependent, as shown in
A plurality of K beta determination units 1111, . . . , 11K1 (“compute Beta”) determine the parameters βi. Moreover, a plurality of K subfilter computation units 1112, . . . , 11K2 determine subfilters HDH(m,1), . . . , HDH(m,K). The plurality of the beta determination units 1111, . . . , 11K1 and the plurality of the subfilter computation units 1112, . . . , 11K2 together form the filter determination unit 110 of
In the following, different use cases for controlling the parameter βi by means of signal analysis are described.
At first, transient signals are considered.
According to an embodiment, the filter determination unit 110 is configured to determine the trade-off information (βi, βj) depending on whether a transient is present in at least one of the two or more audio input channel signals.
The estimation of the input PSD matrix works best for stationary signals. On the other hand, the decomposition of transient input signals can result in leakage of the transient signal component into the ambient output signal. Controlling βi by means of a signal analysis with respect to the degree of non-stationarity or the transient presence probability, such that βi is smaller when the signal comprises transients and larger in sustained portions, leads to more consistent output signals when applying the filters HD(βi). Conversely, controlling βi such that it is larger when the signal comprises transients and smaller in sustained portions leads to more consistent output signals when applying the filters HA(βi).
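One simple way to realize such a signal-dependent control for the filters HD(βi) is a linear interpolation driven by a transient presence probability. The mapping, the function name and the two end values below are purely illustrative assumptions; the text does not prescribe a particular rule or a particular transient detector.

```python
import numpy as np

def beta_for_direct_filter(p_transient, beta_sustained=2.0, beta_transient=0.5):
    """Interpolate beta between a sustained value and a smaller transient value.

    p_transient is an assumed transient presence probability in [0, 1].
    """
    p = np.clip(p_transient, 0.0, 1.0)
    return (1.0 - p) * beta_sustained + p * beta_transient

p = np.array([0.0, 0.25, 1.0])
betas = beta_for_direct_filter(p)
```

For the filters HA(βi), the same helper could be called with the two end values swapped, so that βi grows when transients are likely.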
Now, undesired ambient signals are considered.
In an embodiment, the filter determination unit 110 is configured to determine the trade-off information (βi, βj) depending on a presence of additive noise in at least one signal channel through which one of the two or more audio input channel signals is transmitted.
The proposed method decomposes the input signals regardless of the nature of the ambient signal components. When the input signals have been transmitted over noisy signal channels, it is advantageous to estimate the probability of undesired additive noise presence and to control βi such that the output DAR (direct-to-ambient ratio) is increased.
Now, controlling the levels of the output signals is described.
In order to control the levels of output signals, βi can be set separately for the i-th channel. The filters for computing the ambient output signal of the i-th channel are given by Formula (31).
For any two channels, βj can be computed given βi such that the PSDs of the residual ambient signals ra,i and ra,j at the i-th and j-th output channel are equal, i.e.,
$$h_{A,i}^H(\beta_i)\, \Phi_a\, h_{A,i}(\beta_i) = h_{A,j}^H(\beta_j)\, \Phi_a\, h_{A,j}(\beta_j). \tag{41}$$
or
$$(u_i - h_{D,i}(\beta_i))^H\, \Phi_a\, (u_i - h_{D,i}(\beta_i)) = (u_j - h_{D,j}(\beta_j))^H\, \Phi_a\, (u_j - h_{D,j}(\beta_j)). \tag{42}$$
Alternatively, βi can be computed such that the PSDs of the output ambient signals âi and âj are equal for all pairs i and j.
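Formula (41) does not give βj in closed form, so one possible approach is a numerical search. The sketch below uses bisection on a logarithmic scale and assumes that the quantity on each side of Formula (41) decreases monotonically with β and that a solution exists in the search interval. The function names, the search bounds and the example PSD matrices are hypothetical.

```python
import numpy as np

def channel_psd(phi_d, phi_a, beta, ch):
    """One side of Formula (41): h_A^H Phi_a h_A for channel ch."""
    h = np.linalg.solve(beta * phi_d + phi_a, phi_a)[:, ch]   # Formula (31)
    return (h.conj() @ phi_a @ h).real

def match_beta(phi_d, phi_a, beta_i, i, j, lo=1e-6, hi=1e6, iters=200):
    """Find beta_j by bisection on log(beta) so that Formula (41) holds.

    Assumes the PSD decreases monotonically with beta; raises ValueError
    when the target is not bracketed by the values at lo and hi.
    """
    target = channel_psd(phi_d, phi_a, beta_i, i)
    f_lo = channel_psd(phi_d, phi_a, lo, j)
    f_hi = channel_psd(phi_d, phi_a, hi, j)
    if not (f_hi <= target <= f_lo):
        raise ValueError("no beta_j in [lo, hi] matches the target PSD")
    for _ in range(iters):
        mid = np.sqrt(lo * hi)              # geometric midpoint
        if channel_psd(phi_d, phi_a, mid, j) > target:
            lo = mid                        # PSD still too large: increase beta_j
        else:
            hi = mid
    return np.sqrt(lo * hi)

v = np.array([1.0, 0.5])                    # assumed direct-source gains per channel
phi_d = 2.0 * np.outer(v, v)
phi_a = 0.3 * np.eye(2)

beta_i = 1.0
beta_j = match_beta(phi_d, phi_a, beta_i, i=1, j=0)
```

In this example the direct source is stronger in channel j, so a smaller βj than βi is sufficient to reach the same PSD.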
Now, using panning information is considered.
For the case of two input channels, panning information quantifies level differences between both channels per subband. The panning information can be applied for controlling βi in order to control the perceived width of the output signals.
In the following, equalizing output ambient channel signals is considered.
The described processing does not ensure that all output ambient channel signals have equal subband powers. To ensure that all output ambient channel signals have equal subband powers, the filters are modified as described in the following for the embodiment using filters HD as described above. The covariance matrix of the ambient output signal (comprising the auto-PSDs of each channel on the main diagonal) can be obtained as
$$\Phi_{\hat{a}} = (I - H_D)^H\, \Phi_y\, (I - H_D). \tag{43}$$
In order to ensure that the PSDs of all output ambient channels are equal, the filters HD are replaced by H̃D:
$$\tilde{H}_D = I - G(I - H_D) = I - G + G H_D \tag{44}$$
where G is a diagonal matrix whose elements on the main diagonal are
For the embodiment using filters HA as described above, the covariance matrix of the ambient output signal (comprising the auto-PSDs of each channel on the main diagonal) can be obtained as
$$\Phi_{\hat{a}} = H_A^H\, \Phi_y\, H_A. \tag{46}$$
In order to ensure that the PSDs of all output ambient channels are equal, the filters HA are replaced by H̃A:
$$\tilde{H}_A = G H_A \tag{47}$$
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
The inventive decomposed signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
Some embodiments according to the invention comprise a non-transitory data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
While this invention has been described in terms of several advantageous embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
This application is a continuation of copending International Application No. PCT/EP2013/072170, filed Oct. 23, 2013, which claims priority from U.S. Provisional Application No. 61/772,708, filed Mar. 5, 2013, each of which is incorporated herein in its entirety by this reference thereto.
Number | Date | Country | |
---|---|---|---|
20150380002 A1 | Dec 2015 | US |
Number | Date | Country | |
---|---|---|---|
61772708 | Mar 2013 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/EP2013/072170 | Oct 2013 | US |
Child | 14846660 | US |