The present application relates to parametric coding of spatial audio or stereo signals.
Spatial or 3D audio is a generic formulation which denotes various kinds of multi-channel audio signals. Depending on the capturing and rendering methods, the audio scene is represented by a spatial audio format. Typical spatial audio formats defined by the capturing method (microphones) are for example denoted as stereo, binaural, ambisonics, etc. Spatial audio rendering systems (headphones or loudspeakers) are able to render spatial audio scenes with stereo (left and right channels 2.0) or more advanced multichannel audio signals (2.1, 5.1, 7.1, etc.).
Recent technologies for the transmission and manipulation of such audio signals allow the end user to have an enhanced audio experience with higher spatial quality, often resulting in better intelligibility as well as an augmented reality. Spatial audio coding techniques, such as MPEG Surround or MPEG-H 3D Audio, generate a compact representation of spatial audio signals which is compatible with data-rate-constrained applications such as streaming over the internet. The transmission of spatial audio signals is however limited when the data rate constraint is strong, and therefore post-processing of the decoded audio channels is also used to enhance the spatial audio playback. Commonly used techniques are for example able to blindly up-mix decoded mono or stereo signals into multi-channel audio (5.1 channels or more).
In order to efficiently render spatial audio scenes, the spatial audio coding and processing technologies make use of the spatial characteristics of the multi-channel audio signal. In particular, the time and level differences between the channels of the spatial audio capture are used to approximate the inter-aural cues which characterize our perception of directional sounds in space. Since the inter-channel time and level differences are only an approximation of what the auditory system is able to detect (i.e. the inter-aural time and level differences at the ear entrances), it is of high importance that the inter-channel time difference is relevant from a perceptual aspect. The inter-channel time and level differences are commonly used to model the directional components of multi-channel audio signals, while the inter-channel cross-correlation—that models the inter-aural cross-correlation (IACC)—is used to characterize the width of the audio image. Especially for lower frequencies the stereo image may as well be modeled with inter-channel phase differences (ICPD).
It should be noted that the binaural cues relevant for spatial auditory perception are called inter-aural level difference (ILD), inter-aural time difference (ITD) and inter-aural coherence or correlation (IC or IACC). When considering general multichannel signals, the corresponding cues related to the channels are inter-channel level difference (ICLD), inter-channel time difference (ICTD) and inter-channel coherence or correlation (ICC). In the following description the terms “inter-channel cross-correlation”, “inter-channel correlation” and “inter-channel coherence” are used interchangeably. Since the spatial audio processing mostly operates on the captured audio channels, the “C” is sometimes left out and the terms ITD, ILD and IC are often used also when referring to audio channels.
In
Since the encoded parameters are used to render spatial audio for the human auditory system, it is important that the inter-channel parameters are extracted and encoded with perceptual considerations for maximized perceived quality.
Stereo and multi-channel audio signals are complex signals that are difficult to model, especially when the environment is noisy or reverberant, or when various audio components of the mixture overlap in time and frequency, e.g. noisy speech, speech over music or simultaneous talkers.
When the ICTD parameter estimation becomes unreliable, the parametric representation of the audio scene becomes unstable and gives poor spatial rendering quality. Also, since the ICTD compensation is often carried out as a part of the down-mix stage, an unstable estimate will give a challenging and complex down-mix signal to be encoded.
The object of the embodiments is to increase the stability of the ICTD parameter, thereby improving both the down-mix signal that is encoded by the mono codec and the perceived stability in the spatial audio rendering in the decoder.
According to an aspect, a method is provided for increasing stability of an inter-channel time difference (ICTD) parameter in parametric audio coding, wherein a multi-channel audio input signal comprising at least two channels is received. The method comprises obtaining an ICTD estimate, ICTDest(m), for an audio frame m and a stability estimate of said ICTD estimate, and determining whether the obtained ICTD estimate, ICTDest(m), is valid. If the ICTDest(m) is not found valid, and a determined sufficient number of valid ICTD estimates have been found in preceding frames, a hang-over time is determined using the stability estimate. A previously obtained valid ICTD parameter, ICTD(m−1), is selected as an output parameter, ICTD(m), during the hang-over time. The output parameter, ICTD(m), is set to zero if valid ICTDest(m) is not found during the hang-over time.
According to another aspect, an apparatus is provided for parametric audio coding. The apparatus is configured to receive a multi-channel audio input signal comprising at least two channels, and to obtain an ICTD estimate, ICTDest(m), for an audio frame m. The apparatus is configured to determine whether the obtained ICTD estimate, ICTDest(m), is valid and to obtain a stability estimate of said ICTD estimate. The apparatus is further configured to determine a hang-over time using the stability estimate if the ICTDest(m) is not found valid and a determined sufficient number of valid ICTD estimates have been found in preceding frames, and to select a previously obtained valid ICTD parameter, ICTD(m−1), as an output parameter, ICTD(m), during the hang-over time, and to set the output parameter, ICTD(m), to zero if valid ICTDest(m) is not found during the hang-over time.
According to another aspect, a computer program is provided. The computer program comprises instructions which, when executed on at least one processor, cause the at least one processor to obtain an ICTD estimate, ICTDest(m), for an audio frame m and a stability estimate of said ICTD estimate, and to determine whether the obtained ICTD estimate, ICTDest(m), is valid. The instructions further cause the at least one processor, if the ICTDest(m) is not found valid and a determined sufficient number of valid ICTD estimates have been found in preceding frames, to determine a hang-over time using the stability estimate, to select a previously obtained valid ICTD parameter, ICTD(m−1), as an output parameter, ICTD(m), during the hang-over time, and to set the output parameter, ICTD(m), to zero if valid ICTDest(m) is not found during the hang-over time.
According to another aspect, a method comprises obtaining a long term estimate of the stability of the ICTD parameter by averaging an ICC measure, and when reliable ICTD estimates cannot be obtained, using this stability estimate to determine a hysteresis period, or hang-over time, when a previously obtained reliable ICTD estimate is used. If reliable ICTD estimates are not obtained within the hysteresis period, the ICTD is set to zero.
For a more complete understanding of example embodiments of the present invention, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
An example embodiment of the present invention and its potential advantages are understood by referring to
The conventional parametric approach of estimating the ICTD relies on the cross-correlation function (CCF) rxy which is a measure of similarity between two waveforms x[n] and y[n], and is generally defined in the time domain as:
$$r_{xy}[n,\tau] = E\big[x[n]\,y[n+\tau]\big], \tag{1}$$
where τ is the time-lag parameter and E[·] the expectation operator. For a signal frame of length N the cross-correlation is typically estimated as:
$$r_{xy}[\tau] = \sum_{n=0}^{N-1} x[n]\,y[n+\tau]. \tag{2}$$
The ICC is conventionally obtained as the maximum of the CCF, normalized by the signal energies, as follows:
$$ICC = \max_{\tau}\left(\frac{r_{xy}[\tau]}{\sqrt{r_{xx}[0]\,r_{yy}[0]}}\right). \tag{3}$$
The time lag τ corresponding to the ICC is determined as the ICTD between the channels x and y. By assuming x[n] and y[n] are zero outside the signal frame, the cross-correlation function can equivalently be expressed as a function of the cross-spectrum of the frequency spectra X[k] and Y[k] (with discrete frequency index k) as:
$$r_{xy}[\tau] = \mathrm{DFT}^{-1}\big(X[k]\,Y^{*}[k]\big) \tag{4}$$
where X[k] is the discrete Fourier transform (DFT) of the time domain signal x[n], i.e.
$$X[k] = \sum_{n=0}^{N-1} x[n]\,e^{-j 2\pi k n / N}, \quad k = 0, \ldots, N-1, \tag{5}$$
and DFT−1(⋅) or IDFT(⋅) denotes the inverse discrete Fourier transform. Y*[k] is the complex conjugate of the DFT of y[n].
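As a minimal sketch of the conventional time-domain estimate, the following Python code evaluates the sum in equation (2) over a limited lag range, normalizes the peak by the frame energies and returns the lag of the peak. The function names, the lag range and the synthetic test frame are assumptions made for illustration only.

```python
import math
import random


def cross_correlation(x, y, max_lag):
    """Return {tau: r_xy[tau]} with r_xy[tau] = sum_n x[n] * y[n + tau].

    Samples outside the frame are treated as zero.
    """
    n_samples = len(x)
    r = {}
    for tau in range(-max_lag, max_lag + 1):
        acc = 0.0
        for n in range(n_samples):
            if 0 <= n + tau < n_samples:
                acc += x[n] * y[n + tau]
        r[tau] = acc
    return r


def icc_and_ictd(x, y, max_lag):
    """Return (energy-normalized peak, lag of the peak) for one frame."""
    r = cross_correlation(x, y, max_lag)
    norm = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    if norm == 0.0:
        return 0.0, 0  # silent frame: no meaningful estimate
    best_lag = max(r, key=r.get)
    return r[best_lag] / norm, best_lag


# Hypothetical test frame: noise in x, and y equal to x delayed by 3 samples.
rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(64)]
delay = 3
y = [0.0] * delay + x[:-delay]

icc, ictd = icc_and_ictd(x, y, max_lag=10)
print(icc, ictd)
```

With this sign convention a positive lag means that y lags x. The normalized peak stays slightly below one here because the delayed frame loses its last samples.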
For the case when y[n] is purely a delayed version of x[n], the cross-correlation function is given by:
$$r_{xy}[\tau] = r_{xx}[\tau] * \delta(\tau - \tau_0), \tag{6}$$
where * denotes convolution and δ(τ-τ0) is the Kronecker delta function, i.e. it is equal to one at τ0 and zero otherwise. This means that the cross-correlation function between x and y is the delta function spread by the convolution with the autocorrelation function for x[n]. For signal frames with several delay components, e.g. several talkers, there will be peaks at each delay present between the signals, and the cross-correlation becomes:
$$r_{xy}[\tau] = r_{xx}[\tau] * \sum_{i} \delta(\tau - \tau_i). \tag{7}$$
The delta functions might then be spread into each other and make it difficult to identify the several delays within the signal frame. There are however generalized cross-correlation (GCC) functions that do not have this spreading. The GCC is generally defined as:
$$r_{xy}^{GCC}[\tau] = \mathrm{DFT}^{-1}\big(\psi[k]\,X[k]\,Y^{*}[k]\big) \tag{8}$$
where ψ[k] is a frequency weighting. Especially for spatial audio, the phase transform (PHAT) has been utilized due to its robustness to reverberation in low noise environments. The phase transform normalizes each frequency coefficient of the cross-spectrum by its absolute value, i.e.
$$\psi[k] = \frac{1}{\left|X[k]\,Y^{*}[k]\right|}. \tag{9}$$
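This weighting thereby whitens the cross-spectrum such that the power of each component becomes equal. With pure delay and uncorrelated noise in the signals x[n] and y[n], the phase-transformed GCC (GCC-PHAT) becomes just the Kronecker delta function δ(τ−τ0), i.e.:
$$r_{xy}^{PHAT}[\tau] = \mathrm{DFT}^{-1}\left(\frac{X[k]\,Y^{*}[k]}{\left|X[k]\,Y^{*}[k]\right|}\right) = \delta(\tau - \tau_0). \tag{10}$$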
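The pure-delay property described above can be illustrated with a small, dependency-free Python sketch. It is not the codec implementation: the naive DFT, the regularization constant eps and the circular delay are assumptions chosen so that the result is an exact delta. The conjugate is placed on X so that the lag sign agrees with the time-domain sum in equation (2); swapping the conjugate to Y mirrors the lag axis.

```python
import cmath
import random


def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def gcc_phat(x, y, eps=1e-12):
    """Circular GCC-PHAT; index tau (mod N) holds the value at lag tau."""
    cross = [a.conjugate() * b for a, b in zip(dft(x), dft(y))]
    # PHAT weighting: keep only the phase of each cross-spectrum coefficient.
    whitened = [c / (abs(c) + eps) for c in cross]
    return [v.real for v in idft(whitened)]


def peak_lag(r, max_lag):
    """Search lags in [-max_lag, max_lag]; negative lags wrap around."""
    n = len(r)
    best = max(range(-max_lag, max_lag + 1), key=lambda tau: r[tau % n])
    return best, r[best % n]


# Hypothetical frame: y is x circularly delayed by tau0 samples.
rng = random.Random(0)
n_samples, tau0 = 64, 5
x = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
y = [x[(n - tau0) % n_samples] for n in range(n_samples)]

lag, peak = peak_lag(gcc_phat(x, y), max_lag=20)
print(lag, peak)
```

In a practical system the frames would be windowed and zero-padded and an FFT would replace the naive transform; the sketch only shows that the whitened cross-spectrum collapses to a single peak at the delay.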
The present method is based on an adaptive hang-over time, also called a hang-over period, that depends on the long-term estimate of the ICC. In an embodiment of the method a long term estimate of the stability of the ICTD parameter is obtained by averaging an ICC measure. When reliable estimates cannot be obtained, the stability estimate is used to determine a hysteresis period, or hang-over time, when a previously obtained reliable estimate is used. If reliable estimates are not obtained within the hysteresis period, the ICTD is set to zero.
Consider a system designed to obtain spatial representation parameters for an audio input consisting of two or more audio channels. Each channel is segmented into time frames m. For a multichannel approach, the spatial parameters are typically obtained for channel pairs, and for a stereo setup this pair is simply the left and right channel. The following description focuses on the spatial parameters for a single channel pair x[n,m] and y[n,m], where n denotes sample number and m denotes frame number.
A cross-correlation measure and an ICTD estimate are obtained for each frame m. After the ICC(m) and ICTDest(m) for the current frame have been obtained, a decision is made whether ICTDest(m) is valid, i.e. relevant/useful/reliable, or not.
If the ICTD is found valid, the ICC is filtered to obtain an estimate of the peak envelope of the ICC. The output ICTD parameter ICTD(m) is set to the valid estimate ICTDest(m). In the following, the terms “ICTD measure”, “ICTD parameter” and “ICTD value” are used interchangeably for ICTD(m). Further, the hang-over counter NHO is set to zero to indicate no hang-over state.
If the ICTD is not found valid, it is determined whether a sufficient number of valid ICTD measurements have been found in the preceding frames, i.e. whether ICTD_count=ICTD_maxcount. If so, a hysteresis period, or hang-over time, is calculated. If ICTD_count<ICTD_maxcount, either an insufficient number of consecutive valid ICTD estimates have been registered in the past frames or the current state is a hang-over state. It is then determined whether the current state is a hang-over state. If the current state is not a hang-over state, then ICTD(m) is set to 0. If the current state is a hang-over state, the previous ICTD value is selected, i.e. ICTD(m)=ICTD(m−1).
The general steps of the ICTD/ICC processing are illustrated in
As illustrated in
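Other measures, such as the peak of the normalized cross-correlation function, may also be used, i.e.
$$ICC(m) = \max_{\tau}\left(\frac{r_{xy}[\tau, m]}{\sqrt{r_{xx}[0, m]\,r_{yy}[0, m]}}\right). \tag{12}$$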
Further, in block 405, an ICTD estimate, ICTDest(m), is obtained. Preferably, the estimates for ICC and ICTD are obtained using the same cross-correlation method to consume the least amount of computational power. The τ that maximizes the cross-correlation may be selected as the ICTD estimate. Here, the GCC-PHAT is used.
Typically the search range for τ would be limited to the range of ICTDs that need to be represented, but it is also limited by the length of the audio frame and/or the length of the DFT used for the correlation computation (see N in equation (5)). This means that the audio frame length and DFT analysis windows need to be long enough to accommodate the longest time difference τmax that needs to be represented, i.e. N>2τmax. As an example, for the ability to represent a distance between a pair of microphones of 1.5 meters, assuming the speed of sound is 340 m/s and using a sample rate of 32000 samples/second, the search range would be [−τmax,τmax] where:
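The arithmetic of this example can be checked with a short Python sketch. The function name is hypothetical, and rounding the lag up to a whole number of samples is an assumption of the sketch rather than something stated in the text.

```python
import math


def max_lag_samples(mic_distance_m, speed_of_sound_mps, sample_rate_hz):
    """Largest inter-channel delay in samples for a given microphone spacing.

    Rounding up to a whole sample is an assumption of this sketch.
    """
    return math.ceil(mic_distance_m / speed_of_sound_mps * sample_rate_hz)


tau_max = max_lag_samples(1.5, 340.0, 32000)

# Smallest frame / DFT length satisfying N > 2 * tau_max.
min_frame_length = 2 * tau_max + 1
print(tau_max, min_frame_length)
```

The unrounded delay is about 141.2 samples, so the search range covers roughly ±4.4 ms at this sample rate.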
After the ICC(m) and ICTDest(m) for the current frame have been obtained, a decision in block 407 is made whether ICTDest(m) is valid or not. This may be done by comparing the relative peak magnitude of a cross-correlation function to a threshold ICCthres(m) based on the cross-correlation function, e.g. rxyPHAT [τ, m] or rxy[τ, m], such that ICC(m)>ICCthres(m) means the ICTD is valid.
$$\mathrm{Valid}(ICTD_{est}(m)) = ICC(m) > ICC_{thres}(m) \tag{15}$$
Such a threshold can for instance be formed by a constant Cthres multiplied by the standard deviation estimate of the cross-correlation function, where a suitable value may be Cthres=5.
Another method is to sort the cross-correlation values in the search range and use the value at e.g. the 95th percentile multiplied by a constant.
where sort( ) is a function that sorts the input vector in ascending order.
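The two threshold variants described above can be sketched in Python as follows. The exact threshold equations are not reproduced in this text, so the details are assumptions: the standard deviation is taken over the whole search range with the mean removed, magnitudes are sorted in the percentile variant, and the constant c_sort=3.0 is purely illustrative. Only Cthres=5 and the 95th percentile come from the description.

```python
import math


def std_threshold(r, c_thres=5.0):
    """Threshold = c_thres times the standard deviation of the values in r."""
    mean = sum(r) / len(r)
    variance = sum((v - mean) ** 2 for v in r) / len(r)
    return c_thres * math.sqrt(variance)


def percentile_threshold(r, c_sort, percentile=0.95):
    """Threshold = c_sort times the sorted magnitude at the given percentile."""
    ordered = sorted(abs(v) for v in r)  # ascending order
    index = min(len(ordered) - 1, int(percentile * len(ordered)))
    return c_sort * ordered[index]


def ictd_is_valid(icc, icc_thres):
    """The ICTD estimate is accepted when ICC(m) exceeds the threshold."""
    return icc > icc_thres


# Hypothetical cross-correlation vectors over a 101-lag search range.
floor_noise = [0.02 if i % 2 else -0.02 for i in range(101)]
peaked = list(floor_noise)
peaked[50] = 1.0  # one dominant peak

for name, r in (("peaked", peaked), ("flat", floor_noise)):
    icc = max(r)
    print(name,
          ictd_is_valid(icc, std_threshold(r)),
          ictd_is_valid(icc, percentile_threshold(r, c_sort=3.0)))
```

Both variants scale the threshold with the level of the correlation floor, so a peak only counts as valid when it stands clearly above the rest of the search range.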
If the ICTD is found valid, the steps of block 409, outlined in
If α1 ∈ [0,1] is set relatively high (e.g. α1=0.9) and α2 ∈ [0,1] is set relatively low (e.g. α2=0.1), the filtering operation will tend to follow the peak values of the ICC, forming an envelope of the signal. The motivation is to have an estimate of the last highest ICCs when coming to a situation where the ICC has dropped to a low level (and not just indicate the last few values in the transition to a low ICC). The counter ICTD_count is incremented to keep track of the number of consecutive valid ICTDs. Then, in block 425, the ICTD_count is set to ICTD_maxcount if it is determined in block 423 that the ICTD_maxcount is exceeded or if the system is currently in an ICTD hang-over state and NHO>0. The former criterion is there to prevent the counter from wrapping around in a limited precision integer number. The latter criterion captures the event that a valid ICTD is found during a hang-over period. Setting the ICTD_count to ICTD_maxcount will trigger a new hang-over period, which may be desirable in this case. Finally, in block 427, the output ICTD measure ICTD(m) is set to the valid estimate ICTDest(m). The hang-over counter NHO is also set to zero to indicate that a current state is not a hang-over state.
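The valid-frame update can be sketched in Python as below. The filter equation itself is not reproduced in this text, so the one-pole form used here, with α1 applied when the ICC rises above the envelope and α2 when it falls, is an assumption that matches the described peak-following behaviour. The function names and the default ICTD_maxcount of 2 are illustrative.

```python
def update_icc_envelope(icc, icc_lp_prev, alpha1=0.9, alpha2=0.1):
    """Peak-envelope tracker: fast rise (alpha1), slow decay (alpha2)."""
    alpha = alpha1 if icc > icc_lp_prev else alpha2
    return alpha * icc + (1.0 - alpha) * icc_lp_prev


def update_valid_count(ictd_count, n_ho, ictd_maxcount=2):
    """Count consecutive valid ICTDs, saturating at ictd_maxcount.

    A valid ICTD found during a hang-over (n_ho > 0) also sets the counter
    to ictd_maxcount so that a new hang-over period can be triggered.
    """
    ictd_count += 1
    if ictd_count > ictd_maxcount or n_ho > 0:
        ictd_count = ictd_maxcount
    return ictd_count


# Hypothetical ICC sequence: two high values followed by a drop.
icc_lp = 0.0
envelope = []
for icc in [0.8, 0.8, 0.2, 0.2]:
    icc_lp = update_icc_envelope(icc, icc_lp)
    envelope.append(icc_lp)
print(envelope)
```

After the drop the envelope decays only slowly from its peak, which is what lets it serve as a memory of the most recent high ICC values.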
If the ICTD is not found valid, the steps of block 411, outlined in
The hang-over time NHO is adaptive and depends on the ICC such that if the recent ICC estimates have been low (corresponding to low ICCLP(m)), the hang-over time should be long, and vice versa. That is, ICCLP(m)=ICCLP(m−1) and
$$N_{HO} = g(ICC_{LP}(m)) \tag{22}$$
$$g(ICC_{LP}(m)) = \max\big(0, \min\big(N_{HO\,max}, \lfloor c + d \cdot ICC_{LP}(m) \rfloor\big)\big) \tag{23}$$
where the constants NHOmax, c and d may be set to e.g.
and ⌊·⌋ denotes the floor function, which rounds down to the nearest integer. The max( ) and min( ) functions both take two arguments and return the largest and smallest argument, respectively. An illustration of this function can be seen in
In general, any parameter indicating the correlation, i.e. coherence or similarity, between the channels may be used as a control parameter ICC(m), but the mapping function described in equation (22) has to be adapted to give a suitable number of hang-over frames for the low/high correlation cases. Experimentally, a low correlation situation should give around 3-8 frames of hang-over, while a high correlation case should give 0 frames of hang-over.
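The mapping from the filtered correlation to a number of hang-over frames can be sketched as a clamped linear function, following equation (23). The constants below are hypothetical values chosen only so that the output falls in the experimentally suggested range (several frames for low correlation, zero frames for high correlation); they are not the constants of the embodiment.

```python
import math


def hangover_frames(icc_lp, n_ho_max=6, c=8.0, d=-8.0):
    """N_HO = max(0, min(n_ho_max, floor(c + d * icc_lp))).

    n_ho_max, c and d are illustrative placeholders, not values from the text.
    A negative slope d makes the hang-over longer when the correlation is low.
    """
    return max(0, min(n_ho_max, math.floor(c + d * icc_lp)))


for icc_lp in (0.0, 0.5, 1.0):
    print(icc_lp, hangover_frames(icc_lp))
```

The clamp to n_ho_max bounds how long an outdated ICTD can be held, and the clamp to zero disables the hang-over entirely when the channels are highly correlated.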
If ICTD_count<ICTD_maxcount, this means either that an insufficient number of consecutive valid ICTD estimates have been registered in the past frames, or that the current state is a hang-over state. In block 435 it is determined whether NHO>0. If NHO=0, then ICTD(m) is set to 0 in block 439. If, on the other hand, NHO>0, the current state is a hang-over state and the previous ICTD value will be selected, i.e. ICTD(m)=ICTD(m−1), in block 437. In this case the hang-over counter is also decremented, NHO:=NHO−1. (The assignment operator ‘:=’ is used to indicate that the old value of NHO is overwritten with the new one.) Finally, in block 440, ICTD_count and ICCLP(m) are set to zero.
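Putting the valid and invalid branches together, the per-frame logic can be sketched as a small state machine in Python. This is an illustrative reading of the description, not the reference implementation. The names, the dictionary-based state, the filter form and the mapping constants are assumptions. In particular, the sketch assumes that the frame in which the hang-over time is computed already counts as a hang-over frame.

```python
import math

ICTD_MAXCOUNT = 2  # the text mentions 2 as one possible "sufficient number"


def hangover_frames(icc_lp, n_ho_max=6, c=8.0, d=-8.0):
    """Clamped linear mapping with hypothetical constants."""
    return max(0, min(n_ho_max, math.floor(c + d * icc_lp)))


def process_frame(state, ictd_est, icc, valid):
    """Update state in place for frame m and return the output ICTD(m)."""
    if valid:
        # Peak-envelope tracking (assumed one-pole form, alpha1=0.9, alpha2=0.1).
        alpha = 0.9 if icc > state["icc_lp"] else 0.1
        state["icc_lp"] = alpha * icc + (1.0 - alpha) * state["icc_lp"]
        state["ictd_count"] += 1
        if state["ictd_count"] > ICTD_MAXCOUNT or state["n_ho"] > 0:
            state["ictd_count"] = ICTD_MAXCOUNT
        state["n_ho"] = 0
        state["ictd"] = ictd_est
        return state["ictd"]

    if state["ictd_count"] == ICTD_MAXCOUNT:
        # Enough valid frames were seen: start a hang-over period.
        state["n_ho"] = hangover_frames(state["icc_lp"])
    # Assumption: the frame that starts the hang-over is itself a hang-over
    # frame, so both paths share the check below.
    if state["n_ho"] > 0:
        state["n_ho"] -= 1   # keep ICTD(m) = ICTD(m-1)
    else:
        state["ictd"] = 0    # no hang-over left: fall back to zero
    state["ictd_count"] = 0
    state["icc_lp"] = 0.0
    return state["ictd"]


# Hypothetical input: three valid frames (ICTD 10, ICC 0.3), then seven invalid.
state = {"icc_lp": 0.0, "ictd_count": 0, "n_ho": 0, "ictd": 0}
frames = [(10, 0.3, True)] * 3 + [(0, 0.0, False)] * 7
outputs = [process_frame(state, est, icc, ok) for est, icc, ok in frames]
print(outputs)
```

With these placeholder constants the low ICC envelope maps to five hang-over frames, so the last valid ICTD is held for five invalid frames before the output falls back to zero.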
The method described here may be implemented in a microprocessor or on a computer. It may also be implemented in hardware in a parameter hysteresis/hang-over logic unit as shown in
By way of example, the software or computer program 930 may be realized as a computer program product, which is normally carried or stored on a computer-readable medium, preferably a non-volatile computer-readable storage medium. The computer-readable medium may include one or more removable or non-removable memory devices including, but not limited to, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc (CD), a Digital Versatile Disc (DVD), a Blu-ray disc, a Universal Serial Bus (USB) memory, a Hard Disk Drive (HDD) storage device, a flash memory, a magnetic tape, or any other conventional memory device.
In an embodiment, a device comprises obtaining units for obtaining a cross-correlation measure and an ICTD estimate, and a decision unit for deciding whether ICTDest(m) is valid or not. The device further comprises an obtaining unit for obtaining an estimate of the peak envelope of the ICC, and determining units for determining whether a sufficient number of valid ICTD measurements have been found in the preceding frames and for determining whether a current state is a hang-over state. The device further comprises an output unit for outputting the ICTD measure.
According to embodiments of the present invention, the method for increasing stability of an inter-channel time difference (ICTD) parameter in parametric audio coding comprises receiving a multi-channel audio input signal comprising at least two channels, obtaining an ICTD estimate, ICTDest(m), for an audio frame m, determining whether the obtained ICTD estimate, ICTDest(m), is valid, and obtaining a stability estimate of said ICTD estimate. If the ICTDest(m) is not found valid, and a determined sufficient number of valid ICTD estimates have been found in preceding frames, the method further comprises determining a hang-over time using the stability estimate, selecting a previously obtained valid ICTD parameter, ICTD(m−1), as an output parameter, ICTD(m), during the hang-over time, and setting the output parameter, ICTD(m), to zero if valid ICTDest(m) is not found during the hang-over time.
In an embodiment the stability estimate is an inter channel correlation (ICC) measure between a channel pair for an audio frame m.
In an embodiment the stability estimate is a low-pass filtered inter-channel correlation, ICCLP(m).
In an embodiment the stability estimate is calculated by averaging the ICC measure, ICC(m).
In an embodiment the hang-over time is adaptive. For instance, the hang-over is applied with an increasing number of frames for decreasing ICCLP(m).
In an embodiment a Generalized Cross Correlation with Phase Transform is used for obtaining the ICC measure for the frame m.
In an embodiment ICTDest(m) is determined to be valid if the inter-channel correlation measure, ICC(m), is larger than a threshold ICCthres(m).
For instance, the validity of the obtained ICTD estimate, ICTDest(m), is determined by comparing a relative peak magnitude of a cross-correlation function to a threshold, ICCthres (m), based on the cross correlation function. ICCthres (m) may be formed by a constant multiplied by a value of the cross-correlation at a predetermined position in an ordered set of cross correlation values for frame m.
In an embodiment the sufficient number of valid ICTD estimates is 2.
Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on a memory, a microprocessor or a central processing unit. If desired, part of the software, application logic and/or hardware may reside on a host device or on a memory, a microprocessor or a central processing unit of the host. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media.
This application is a continuation of U.S. application Ser. No. 16/082,137, having a 371c date of Sep. 4, 2018, which is a 35 U.S.C. § 371 National Stage of International Patent Application No. PCT/EP2017/055430, filed Mar. 8, 2017, designating the United States and claiming priority to U.S. provisional application No. 62/305,683, filed on Mar. 9, 2016. The above identified applications are incorporated by reference.
Number | Date | Country | |
---|---|---|---|
20210027793 A1 | Jan 2021 | US |
Number | Date | Country | |
---|---|---|---|
62305683 | Mar 2016 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 16082137 | US | |
Child | 17066541 | US |