The present application is related to stereo processing or, generally, multi-channel processing, where a multi-channel signal has two channels such as a left channel and a right channel in the case of a stereo signal or more than two channels, such as three, four, five or any other number of channels.
Stereo speech and particularly conversational stereo speech has received much less scientific attention than storage and broadcasting of stereophonic music. Indeed, monophonic transmission is still predominantly used in speech communications. However, with the increase of network bandwidth and capacity, it is envisioned that communications based on stereophonic technologies will become more popular and bring a better listening experience.
Efficient coding of stereophonic audio material has long been studied in perceptual audio coding of music for efficient storage or broadcasting. At high bitrates, where waveform preservation is crucial, sum-difference stereo, known as mid/side (M/S) stereo, has been employed for a long time. For low bit-rates, intensity stereo and, more recently, parametric stereo coding have been introduced. The latter technique was adopted in different standards such as HE-AACv2 and MPEG USAC. It generates a down-mix of the two-channel signal and associates compact spatial side information.
Joint stereo coding is usually built over a time-frequency transformation of the signal with a high frequency resolution, i.e., a low time resolution, and is therefore not compatible with the low-delay and time-domain processing performed in most speech coders. Moreover, the resulting bit-rate is usually high.
On the other hand, parametric stereo employs an extra filter-bank positioned in the front-end of the encoder as a pre-processor and in the back-end of the decoder as a post-processor. Therefore, parametric stereo can be used with conventional speech coders like ACELP, as is done in MPEG USAC. Moreover, the parameterization of the auditory scene can be achieved with a minimum amount of side information, which is suitable for low bit-rates. However, parametric stereo, as for example in MPEG USAC, is not specifically designed for low delay and does not deliver consistent quality for different conversational scenarios. In a conventional parametric representation of the spatial scene, the width of the stereo image is artificially reproduced by a decorrelator applied to the two synthesized channels and controlled by Inter-channel Coherence (ICs) parameters computed and transmitted by the encoder. For most stereo speech, this way of widening the stereo image is not appropriate for recreating the natural ambience of speech, which is a rather direct sound since it is produced by a single source located at a specific position in space (sometimes with some reverberation from the room). By contrast, musical instruments have much more natural width than speech, which can be better imitated by decorrelating the channels.
Problems also occur when speech is recorded with non-coincident microphones, as in an A-B configuration when the microphones are distant from each other, or for binaural recording or rendering. Those scenarios can be envisioned for capturing speech in teleconferences or for creating a virtual auditory scene with distant speakers in the multipoint control unit (MCU). The time of arrival of the signal is then different from one channel to the other, unlike recordings done on coincident microphones like X-Y (intensity recording) or M-S (Mid-Side recording). The coherence of two such non-time-aligned channels can then be wrongly estimated, which makes the artificial ambience synthesis fail.
Known references related to stereo processing are U.S. Pat. Nos. 5,434,948, 8,811,621.
Document WO 2006/089570 A1 discloses a near-transparent or transparent multi-channel encoder/decoder scheme. A multi-channel encoder/decoder scheme additionally generates a waveform-type residual signal. This residual signal is transmitted together with one or more multi-channel parameters to a decoder. In contrast to a purely parametric multi-channel decoder, the enhanced decoder generates a multi-channel output signal having an improved output quality because of the additional residual signal. On the encoder-side, a left channel and a right channel are both filtered by an analysis filterbank. Then, for each subband signal, an alignment value and a gain value are calculated for a subband. Such an alignment is then performed before further processing. On the decoder-side, a de-alignment and a gain processing is performed and the corresponding signals are then synthesized by a synthesis filterbank in order to generate a decoded left signal and a decoded right signal.
It has been found that such known procedures do not provide an optimum for audio signals and, specifically, for speech signals where there is more than one speaker, i.e., in a conference scenario or a conversational speech scene.
According to an embodiment, an apparatus for encoding a multi-channel signal including at least two channels may have: a parameter determiner for determining a broadband alignment parameter and a plurality of narrowband alignment parameters from the multichannel signal; a signal aligner for aligning the at least two channels using the broadband alignment parameter and the plurality of narrowband alignment parameters to acquire aligned channels; a signal processor for calculating a mid-signal and a side signal using the aligned channels; a signal encoder for encoding the mid-signal to acquire an encoded mid-signal and for encoding the side signal to acquire an encoded side signal; and an output interface for generating an encoded multi-channel signal including the encoded mid-signal, the encoded side signal, information on the broadband alignment parameter and information on the plurality of narrowband alignment parameters.
According to another embodiment, a method for encoding a multi-channel signal including at least two channels, may have the steps of: determining a broadband alignment parameter and a plurality of narrowband alignment parameters from the multichannel signal; aligning the at least two channels using the broadband alignment parameter and the plurality of narrowband alignment parameters to acquire aligned channels; calculating a mid-signal and a side signal using the aligned channels; encoding the mid-signal to acquire an encoded mid-signal and encoding the side signal to acquire an encoded side signal; and generating an encoded multi-channel signal including the encoded mid-signal, the encoded side signal, information on the broadband alignment parameter and information on the plurality of narrowband alignment parameters.
Another embodiment may have an encoded multichannel signal including an encoded mid-signal, an encoded side signal, information on a broadband alignment parameter and information on a plurality of narrowband alignment parameters.
According to another embodiment, an apparatus for decoding an encoded multi-channel signal including an encoded mid-signal, an encoded side signal, information on a broadband alignment parameter and information on a plurality of narrowband alignment parameters, may have: a signal decoder for decoding the encoded mid-signal to acquire a decoded mid-signal and for decoding the encoded side signal to acquire a decoded side signal; a signal processor for calculating a decoded first channel and decoded second channel from the decoded mid-signal and the decoded side signal; and a signal de-aligner for de-aligning the decoded first channel and the decoded second channel using the information on the broadband alignment parameter and the information on the plurality of narrowband alignment parameters to acquire a decoded multi-channel signal.
According to another embodiment, a method for decoding an encoded multi-channel signal including an encoded mid-signal, an encoded side signal, information on a broadband alignment parameter and information on a plurality of narrowband alignment parameters, may have the steps of: decoding the encoded mid-signal to acquire a decoded mid-signal and decoding the encoded side signal to acquire a decoded side signal; calculating a decoded first channel and decoded second channel from the decoded mid-signal and the decoded side signal; and de-aligning the decoded first channel and the decoded second channel using the information on the broadband alignment parameter and the information on the plurality of narrowband alignment parameters to acquire a decoded multi-channel signal.
According to another embodiment, a non-transitory digital storage medium having a computer program stored thereon to perform the method for encoding a multi-channel signal including at least two channels, the method including: determining a broadband alignment parameter and a plurality of narrowband alignment parameters from the multichannel signal; aligning the at least two channels using the broadband alignment parameter and the plurality of narrowband alignment parameters to acquire aligned channels; calculating a mid-signal and a side signal using the aligned channels; encoding the mid-signal to acquire an encoded mid-signal and encoding the side signal to acquire an encoded side signal; and generating an encoded multi-channel signal including the encoded mid-signal, the encoded side signal, information on the broadband alignment parameter and information on the plurality of narrowband alignment parameters; when said computer program is run by a computer.
According to another embodiment, a non-transitory digital storage medium having a computer program stored thereon to perform the method for decoding an encoded multi-channel signal including an encoded mid-signal, an encoded side signal, information on a broadband alignment parameter and information on a plurality of narrowband alignment parameters, the method including: decoding the encoded mid-signal to acquire a decoded mid-signal and decoding the encoded side signal to acquire a decoded side signal; calculating a decoded first channel and decoded second channel from the decoded mid-signal and the decoded side signal; and de-aligning the decoded first channel and the decoded second channel using the information on the broadband alignment parameter and the information on the plurality of narrowband alignment parameters to acquire a decoded multi-channel signal; when said computer program is run by a computer.
An apparatus for encoding a multi-channel signal having at least two channels comprises a parameter determiner to determine a broadband alignment parameter on the one hand and a plurality of narrowband alignment parameters on the other hand. These parameters are used by a signal aligner for aligning the at least two channels using these parameters to obtain aligned channels. Then, a signal processor calculates a mid-signal and a side signal using the aligned channels and the mid-signal and the side signal are subsequently encoded and forwarded into an encoded output signal that additionally has, as parametric side information, the broadband alignment parameter and the plurality of narrowband alignment parameters.
On the decoder-side, a signal decoder decodes the encoded mid-signal and the encoded side signal to obtain decoded mid and side signals. These signals are then processed by a signal processor for calculating a decoded first channel and a decoded second channel. These decoded channels are then de-aligned using the information on the broadband alignment parameter and the information on the plurality of narrowband parameters included in an encoded multi-channel signal to obtain the decoded multi-channel signal.
In a specific implementation, the broadband alignment parameter is an inter-channel time difference parameter and the plurality of narrowband alignment parameters are inter channel phase differences.
The present invention is based on the finding that specifically for speech signals where there is more than one speaker, but also for other audio signals where there are several audio sources, the different places of the audio sources that both map into two channels of the multi-channel signal can be accounted for using a broadband alignment parameter such as an inter-channel time difference parameter that is applied to the whole spectrum of either one or both channels. In addition to this broadband alignment parameter, it has been found that several narrowband alignment parameters that differ from subband to subband additionally result in a better alignment of the signal in both channels.
Thus, a broadband alignment corresponding to the same time delay in each subband together with a phase alignment corresponding to different phase rotations for different subbands results in an optimum alignment of both channels before these two channels are then converted into a mid/side representation which is then further encoded. Due to the fact that an optimum alignment has been obtained, the energy in the mid-signal is as high as possible on the one hand and the energy in the side signal is as small as possible on the other hand so that an optimum coding result with a lowest possible bitrate or a highest possible audio quality for a certain bitrate can be obtained.
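The effect of the broadband alignment on the side signal energy can be illustrated by the following minimal sketch. It is not taken from the embodiments; the block length, the delay value, the test signal and all function names are assumptions chosen for illustration, and a naive DFT is used only to keep the example self-contained. The same time delay in every spectral line corresponds to a phase that grows linearly with the line index.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) DFT, sufficient for a short illustration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Naive inverse DFT matching dft() above."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def shift_spectrum(spec, delay):
    """Circularly delay a block by `delay` samples (integer delay assumed)
    by applying a phase that is linear in the bin index k."""
    n = len(spec)
    return [spec[k] * cmath.exp(-2j * math.pi * k * delay / n) for k in range(n)]

def side_energy(left, right):
    """Energy of the side signal (L - R) / 2."""
    return sum(((l - r) / 2) ** 2 for l, r in zip(left, right))

N, ITD = 32, 3  # hypothetical block length and inter-channel delay in samples
left = [math.sin(2 * math.pi * 3 * t / N) + 0.5 * math.sin(2 * math.pi * 7 * t / N + 1.0)
        for t in range(N)]
right = [left[(t - ITD) % N] for t in range(N)]  # right channel arrives ITD samples later

# Broadband alignment: advance the right channel by ITD samples in the DFT domain.
aligned_right = [v.real for v in idft(shift_spectrum(dft(right), -ITD))]

side_before = side_energy(left, right)
side_after = side_energy(left, aligned_right)
```

In this toy case the two channels differ only by a pure delay, so the side energy after alignment drops to numerical noise. With real recordings a residual remains, which is where the narrowband phase parameters and the coded side signal come in.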
Specifically for conversational speech material, it appears that there are typically speakers being active at two different places. Additionally, the situation is such that, normally, only one speaker is speaking from the first place and then the second speaker is speaking from the second place or location. The influence of the different locations on the two channels such as a first or left channel and a second or right channel is reflected by different time of arrivals and, therefore, a certain time delay between both channels due to the different locations, and this time delay is changing from time to time. Generally, this influence is reflected in the two channel signals as a broadband de-alignment that can be addressed by the broadband alignment parameter.
On the other hand, other effects, particularly coming from reverberation or further noise sources can be accounted for by individual phase alignment parameters for individual bands that are superposed on the broadband different arrival times or broadband de-alignment of both channels.
In view of that, the usage of both, a broadband alignment parameter and a plurality of narrowband alignment parameters on top of the broadband alignment parameter result in an optimum channel alignment on the encoder-side for obtaining a good and very compact mid/side representation while, on the other hand, a corresponding de-alignment subsequent to a decoding on the decoder side results in a good audio quality for a certain bitrate or in a small bitrate for a certain required audio quality.
An advantage of the present invention is that it provides a new stereo coding scheme much more suitable for a conversion of stereo speech than the existing stereo coding schemes. In accordance with the invention, parametric stereo technologies and joint stereo coding technologies are combined particularly by exploiting the inter-channel time difference occurring in channels of a multi-channel signal specifically in the case of speech sources but also in the case of other audio sources.
Several embodiments provide useful advantages as discussed later on.
The new method is a hybrid approach mixing elements from a conventional M/S stereo and parametric stereo. In a conventional M/S, the channels are passively downmixed to generate a Mid and a Side signal. The process can be further extended by rotating the channels using a Karhunen-Loeve transform (KLT), also known as Principal Component Analysis (PCA), before summing and differencing the channels. The Mid signal is coded by a primary core coder while the Side is conveyed to a secondary coder. Evolved M/S stereo can further use prediction of the Side signal by the Mid channel coded in the present or the previous frame. The main goal of rotation and prediction is to maximize the energy of the Mid signal while minimizing the energy of the Side. M/S stereo is waveform preserving and is in this respect very robust to any stereo scenario, but can be very expensive in terms of bit consumption.
For highest efficiency at low bit-rates, parametric stereo computes and codes parameters like Inter-channel Level Differences (ILDs), Inter-channel Phase Differences (IPDs), Inter-channel Time Differences (ITDs) and Inter-channel Coherence (ICs). They compactly represent the stereo image and are cues of the auditory scene (source localization, panning, width of the stereo ...). The aim is then to parametrize the stereo scene and to code only a downmix signal, which can be spatialized once again at the decoder with the help of the transmitted stereo cues.
Our approach mixes the two concepts. First, the stereo cues ITD and IPD are computed and applied to the two channels. The goal is to represent the time difference in broadband and the phase in different frequency bands. The two channels are then aligned in time and phase, and M/S coding is then performed. ITD and IPD were found to be useful for modeling stereo speech and are a good replacement for KLT-based rotation in M/S. Unlike in pure parametric coding, the ambience is no longer modeled by the ICs but directly by the Side signal, which is coded and/or predicted. It was found that this approach is more robust, especially when handling speech signals.
The computation and processing of ITDs is a crucial part of the invention. ITDs were already exploited in the conventional Binaural Cue Coding (BCC), but in a way that it was inefficient once ITDs change over time. For avoiding this shortcoming, specific windowing was designed for smoothing the transitions between two different ITDs and being able to seamlessly switch from one speaker to another positioned at different places.
Further embodiments are related to the procedure that, on the encoder-side, the parameter determination for determining the plurality of narrowband alignment parameters is performed using channels that have already been aligned with the earlier determined broadband alignment parameter.
Correspondingly, the narrowband de-alignment on the decoder-side is performed before the broadband de-alignment is performed using the typically single broadband alignment parameter.
In further embodiments, it is advantageous that, either on the encoder-side but even more importantly on the decoder-side, some kind of windowing and overlap-add operation or any kind of crossfading from one block to the next one is performed subsequent to all alignments and, specifically, subsequent to a time-alignment using the broadband alignment parameter. This avoids any audible artifacts such as clicks when the time or broadband alignment parameter changes from block to block.
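A minimal sketch of such a crossfade in the overlap range of two consecutive blocks is given below. The squared-sine weighting is an assumption chosen for illustration (it is what results when a sine window is applied at both analysis and synthesis); the function name and the overlap length are hypothetical.

```python
import math

def crossfade(prev_tail, next_head):
    """Blend the overlapping samples of two consecutive blocks.

    prev_tail: last samples of the previous block (aligned with the old parameter)
    next_head: first samples of the next block (aligned with the new parameter)
    The weights w and 1 - w always sum to one, so a signal that is identical
    in both blocks passes through unchanged.
    """
    n = len(prev_tail)
    out = []
    for i in range(n):
        w = math.sin(0.5 * math.pi * (i + 0.5) / n) ** 2  # rises smoothly from 0 to 1
        out.append((1.0 - w) * prev_tail[i] + w * next_head[i])
    return out

OVERLAP = 8  # hypothetical overlap length in samples
unchanged = crossfade([1.0] * OVERLAP, [1.0] * OVERLAP)
fade = crossfade([1.0] * OVERLAP, [0.0] * OVERLAP)
```

Because the transition is spread over the whole overlap range, a jump of the time alignment between two blocks shows up as a gradual blend rather than as a discontinuity in the output samples.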
In other embodiments, different spectral resolutions are applied. Particularly, the channel signals are subjected to a time-spectral conversion having a high frequency resolution such as a DFT spectrum, while the parameters such as the narrowband alignment parameters are determined for parameter bands having a lower spectral resolution. Typically, a parameter band comprises more than one spectral line of the signal spectrum, i.e., a set of spectral lines from the DFT spectrum. Furthermore, the parameter bands increase in width from low frequencies to high frequencies in order to account for psychoacoustic issues.
Further embodiments relate to an additional usage of a level parameter such as an inter-level difference or other procedures for processing the side signal such as stereo filling parameters, etc. The encoded side signal can be represented by the actual side signal itself, or by a prediction residual signal obtained using the mid signal of the current frame or any other frame, or by a side signal or a side prediction residual signal in only a subset of bands and prediction parameters only for the remaining bands, or even by prediction parameters for all bands without any high frequency resolution side signal information. Hence, in the last alternative above, the encoded side signal is only represented by a prediction parameter for each parameter band or only a subset of parameter bands so that for the remaining parameter bands there does not exist any information on the original side signal.
Furthermore, it is advantageous to have the plurality of narrowband alignment parameters not for all parameter bands reflecting the whole bandwidth of the broadband signal but only for a set of lower bands such as the lower 50 percent of the parameter bands. On the other hand, stereo filling parameters are not used for the couple of lower bands, since, for these bands, the side signal itself or a prediction residual signal is transmitted in order to make sure that, at least for the lower bands, a waveform-correct representation is available. On the other hand, the side signal is not transmitted in a waveform-exact representation for the higher bands in order to further decrease the bitrate, but the side signal is typically represented by stereo filling parameters.
Furthermore, it is advantageous to perform the entire parameter analysis and alignment within one and the same frequency domain based on the same DFT spectrum. To this end, it is furthermore advantageous to use the generalized cross correlation with phase transform (GCC-PHAT) technology for the purpose of inter-channel time difference determination. In an embodiment of this procedure, a smoothing of a correlation spectrum based on information on a spectral shape, the information being a spectral flatness measure, is performed in such a way that the smoothing will be weak in the case of noise-like signals and will become stronger in the case of tone-like signals.
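The GCC-PHAT principle can be sketched as follows. This is a simplified illustration under stated assumptions, not the embodiment's implementation: the naive DFT, the sign convention (a positive lag means the right channel is delayed), the mapping from spectral flatness to the smoothing factor, and all names are hypothetical.

```python
import cmath
import math
import random

def dft(x):
    """Naive O(n^2) DFT, sufficient for a short illustration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def gcc_phat_lag(left, right, max_lag):
    """Return the lag (in samples) by which `right` is delayed relative to `left`."""
    n = len(left)
    cross = [a.conjugate() * b for a, b in zip(dft(left), dft(right))]
    # Phase transform: discard the magnitude, keep only the phase of each bin.
    phat = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Inverse DFT of the whitened cross-spectrum, evaluated at this lag only.
        val = sum(p * cmath.exp(2j * math.pi * k * lag / n)
                  for k, p in enumerate(phat)).real
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

def spectral_flatness(x):
    """Geometric mean over arithmetic mean of the power spectrum (0 = tonal, 1 = flat)."""
    power = [abs(v) ** 2 + 1e-12 for v in dft(x)]
    geo = math.exp(sum(math.log(p) for p in power) / len(power))
    return geo / (sum(power) / len(power))

def smooth_cross_spectrum(previous, current, flatness):
    """Recursive smoothing over blocks. The mapping alpha = 1 - flatness is an
    assumption: tonal input (low flatness) is smoothed strongly, noise weakly."""
    alpha = 1.0 - flatness
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(previous, current)]

rng = random.Random(0)
N, DELAY = 64, 5  # hypothetical block length and true delay
left = [rng.uniform(-1.0, 1.0) for _ in range(N)]
right = [left[(t - DELAY) % N] for t in range(N)]
estimated = gcc_phat_lag(left, right, max_lag=10)
tone = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
```

For the circularly delayed noise block the whitened cross-spectrum is a pure linear phase, so the correlation has a single sharp peak at the true delay. For a tonal block only a few bins carry reliable phase, which is the case in which stronger smoothing over time is useful.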
Furthermore, it is advantageous to perform a special phase rotation, where the channel amplitudes are accounted for. Particularly, the phase rotation is distributed between the two channels for the purpose of alignment on the encoder-side and, of course, for the purpose of de-alignment on the decoder-side where a channel having a higher amplitude is considered as a leading channel and will be less affected by the phase rotation, i.e., will be less rotated than a channel with a lower amplitude.
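One way to distribute the rotation so that the louder channel is rotated less is sketched below. The split rule used here (rotating both channels onto the phase of their sum) is an assumption chosen because it has the described property; it is shown per spectral value for simplicity, and the function and variable names are hypothetical.

```python
import cmath
import math

def split_rotation(l_bin, r_bin):
    """Rotate two complex spectral values onto a common phase.

    ipd is the phase difference between left and right. beta is the part of
    the rotation assigned to the left channel; the right channel receives the
    remainder ipd - beta. With c = |L| / |R|, a dominant left channel (large c)
    yields a small beta, so the leading channel is hardly rotated.
    """
    ipd = cmath.phase(l_bin * r_bin.conjugate())
    c = abs(l_bin) / max(abs(r_bin), 1e-12)  # amplitude ratio, guarded against zero
    beta = math.atan2(math.sin(ipd), math.cos(ipd) + c)
    l_rot = l_bin * cmath.exp(-1j * beta)
    r_rot = r_bin * cmath.exp(1j * (ipd - beta))
    return l_rot, r_rot, beta, ipd

# Hypothetical example: left is four times louder than right, phases differ by 0.8 rad.
l_in = cmath.rect(2.0, 0.9)
r_in = cmath.rect(0.5, 0.1)
l_out, r_out, beta, ipd = split_rotation(l_in, r_in)
```

On the decoder-side the same two angles would be applied with opposite sign to restore the original phase relation.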
Furthermore, the sum-difference calculation is performed using an energy scaling with a scaling factor that is derived from energies of both channels and is, additionally, bounded to a certain range in order to make sure that the mid/side calculation is not affecting the energy too much. On the other hand, however, it is to be noted that, for the purpose of the present invention, this kind of energy conservation is not as critical as in known procedures, since time and phase were aligned beforehand. Therefore, the energy fluctuations due to the calculation of a mid-signal and a side signal from left and right (on the encoder side) or due to the calculation of a left and a right signal from mid and side (on the decoder-side) are not as significant as in the known technology.
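A bounded energy scaling of this kind can be sketched as follows. The particular scaling rule (mid energy matched to the mean energy of the two channels) and the bounds 0.8 and 1.25 are assumptions for illustration only; the names are hypothetical.

```python
import math

def mid_side_scaled(left, right, lo=0.8, hi=1.25):
    """Sum-difference calculation with a bounded energy scaling factor.

    The unbounded factor would make the mid energy equal to the mean of the
    two channel energies. Clamping it to [lo, hi] (assumed bounds) keeps the
    mid/side calculation from changing the energy too much, e.g. when the
    channels nearly cancel in the sum.
    """
    e_l = sum(abs(v) ** 2 for v in left)
    e_r = sum(abs(v) ** 2 for v in right)
    e_sum = sum(abs(a + b) ** 2 for a, b in zip(left, right))
    scale = math.sqrt(2.0 * (e_l + e_r) / e_sum) if e_sum > 0.0 else hi
    scale = min(max(scale, lo), hi)
    mid = [scale * (a + b) / 2 for a, b in zip(left, right)]
    side = [scale * (a - b) / 2 for a, b in zip(left, right)]
    return mid, side, scale

# Identical channels: no scaling is needed and the side signal vanishes.
mid_a, side_a, scale_a = mid_side_scaled([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# Out-of-phase channels: the sum cancels, so the factor is held at the upper bound.
mid_b, side_b, scale_b = mid_side_scaled([1.0, 0.0], [-1.0, 0.0])
# Orthogonal channels: the unbounded factor would be sqrt(2), which is clamped.
mid_c, side_c, scale_c = mid_side_scaled([1.0, 0.0], [0.0, 1.0])
```

After time and phase alignment the channels are close to the first case, which is why the scaling factor rarely approaches its bounds.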
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
The signal aligner may be configured to align the channels from the multi-channel signal using the broadband alignment parameter, before the parameter determiner 100 actually calculates the narrowband parameters. Therefore, in this embodiment, the signal aligner 200 sends the broadband aligned channels back to the parameter determiner 100 via a connection line 15. Then, the parameter determiner 100 determines the plurality of narrowband alignment parameters from a multi-channel signal that has already been aligned with respect to the broadband characteristic. In other embodiments, however, the parameters are determined without this specific sequence of procedures.
Specifically, the multi-channel encoder further comprises a time-spectrum converter 150 for converting a time domain multi-channel signal into a spectral representation of the at least two channels within the frequency domain.
Furthermore, as illustrated at 152, the parameter determiner, the signal aligner and the signal processor illustrated at 100, 200 and 300 in
Furthermore, the multi-channel encoder and, specifically, the signal processor further comprises a spectrum-time converter 154 for generating a time domain representation of the mid-signal at least.
The spectrum time converter additionally may convert a spectral representation of the side signal also determined by the procedures represented by block 152 into a time domain representation, and the signal encoder 400 of
The time-spectrum converter 150 of
In step 156, each channel is windowed using the analysis window with overlap ranges. Specifically, each channel is windowed using the analysis window in such a way that a first block of the channel is obtained. Subsequently, a second block of the same channel is obtained that has a certain overlap range with the first block and so on, such that subsequent to, for example, five windowing operations, five blocks of windowed samples of each channel are available that are then individually transformed into a spectral representation as illustrated at 157 in
In step 158, which is performed by the parameter determiner 100 of
Specifically, the operations of the steps 304 and 305 result in a kind of cross fading from one block of the mid-signal or the side signal to the next block of the mid-signal or the side signal, so that, even when parameter changes occur, such as changes of the inter-channel time difference parameter or the inter-channel phase difference parameter, these will nevertheless not be audible in the time domain mid/side signals obtained by step 305 in
The new low-delay stereo coding is a joint Mid/Side (M/S) stereo coding exploiting some spatial cues, where the Mid-channel is coded by a primary mono core coder, and the Side-channel is coded in a secondary core coder. The encoder and decoder principles are depicted in
The stereo processing is performed mainly in Frequency Domain (FD). Optionally some stereo processing can be performed in Time Domain (TD) before the frequency analysis. It is the case for the ITD computation, which can be computed and applied before the frequency analysis for aligning the channels in time before pursuing the stereo analysis and processing. Alternatively, ITD processing can be done directly in frequency domain. Since usual speech coders like ACELP do not contain any internal time-frequency decomposition, the stereo coding adds an extra complex modulated filter-bank by means of an analysis and synthesis filter-bank before the core encoder and another stage of analysis-synthesis filter-bank after the core decoder. In the embodiment, an oversampled DFT with a low overlapping region is employed. However, in other embodiments, any complex valued time-frequency decomposition with similar temporal resolution can be used.
The stereo processing consists of computing the spatial cues: the inter-channel Time Difference (ITD), the inter-channel Phase Differences (IPDs) and the inter-channel Level Differences (ILDs). ITD and IPDs are used on the input stereo signal for aligning the two channels L and R in time and in phase. ITD is computed in broadband or in time domain while IPDs and ILDs are computed for each or a part of the parameter bands, corresponding to a non-uniform decomposition of the frequency space. Once the two channels are aligned, a joint M/S stereo is applied, where the Side signal is then further predicted from the Mid signal. The prediction gain is derived from the ILDs.
The Mid signal is further coded by a primary core coder. In the embodiment, the primary core coder is the 3GPP EVS standard, or a coding derived from it which can switch between a speech coding mode, ACELP, and a music mode based on an MDCT transformation. ACELP and the MDCT-based coder may be supported by Time Domain Bandwidth Extension (TD-BWE) and/or Intelligent Gap Filling (IGF) modules, respectively.
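How a prediction gain can follow from a level difference is sketched below. The relation g = (c - 1) / (c + 1) with c = 10^(ILD/20) is an assumption for this illustration; it follows from the idealized case in which the aligned channels differ only by a level ratio c, so that M = (L + R) / 2 and S = (L - R) / 2 satisfy S = g M. Function names are hypothetical.

```python
import math

def ild_db(left, right):
    """Level difference in dB between two aligned bands (energies assumed non-zero)."""
    e_l = sum(abs(v) ** 2 for v in left)
    e_r = sum(abs(v) ** 2 for v in right)
    return 10.0 * math.log10(e_l / e_r)

def prediction_gain(ild):
    """Gain predicting Side from Mid for channels that differ only in level."""
    c = 10.0 ** (ild / 20.0)
    return (c - 1.0) / (c + 1.0)

def side_residual(left, right, gain):
    """Side signal after removing the part predicted from the Mid signal."""
    mid = [(a + b) / 2 for a, b in zip(left, right)]
    side = [(a - b) / 2 for a, b in zip(left, right)]
    return [s - gain * m for s, m in zip(side, mid)]

# Hypothetical aligned band: the right channel is the left channel at half the level.
left = [1.0, -2.0, 3.0, 0.5]
right = [0.5 * v for v in left]
gain = prediction_gain(ild_db(left, right))
residual = side_residual(left, right, gain)
```

In this idealized case the residual is zero, so only the gain needs to be conveyed. Any ambience or misalignment that is not a pure level difference remains in the residual.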
The Side signal is first predicted by the Mid channel using prediction gains derived from ILDs. The residual can be further predicted by a delayed version of the Mid signal or directly coded by a secondary core coder, performed in the embodiment in MDCT domain. The stereo processing at encoder can be summarized by
In particular, the signal is received by an input interface 600. Connected to the input interface 600 are a signal decoder 700, and a signal de-aligner 900. Furthermore, a signal processor 800 is connected to a signal decoder 700 on the one hand and is connected to the signal de-aligner on the other hand.
In particular, the encoded multi-channel signal comprises an encoded mid-signal, an encoded side signal, information on the broadband alignment parameter and information on the plurality of narrowband parameters. Thus, the encoded multi-channel signal on line 50 can be exactly the same signal as output by the output interface 500 of
However, importantly, it is to be noted here that, in contrast to what is illustrated in
Thus, the information on the alignment parameters can be the alignment parameters as used by the signal aligner 200 in
The input interface 600 of
The signal decoder is configured for decoding the encoded mid-signal and for decoding the encoded side signal to obtain a decoded mid-signal on line 701 and a decoded side signal on line 702. These signals are used by the signal processor 800 for calculating a decoded first channel signal or decoded left signal and for calculating a decoded second channel or a decoded right channel signal from the decoded mid signal and the decoded side signal, and the decoded first channel and the decoded second channel are output on lines 801, 802, respectively. The signal de-aligner 900 is configured for de-aligning the decoded first channel on line 801 and the decoded second channel on line 802 using the information on the broadband alignment parameter and additionally using the information on the plurality of narrowband alignment parameters to obtain a decoded multi-channel signal, i.e., a decoded signal having at least two decoded and de-aligned channels on lines 901 and 902.
In step 914, any further processing is performed that comprises using a windowing or any overlap-add operation or, generally, any cross-fade operation in order to obtain, at 915a or 915b, an artifact-reduced or artifact-free decoded signal, i.e., decoded channels that do not have any artifacts although there have been, typically, time-varying de-alignment parameters for the broadband on the one hand and for the plurality of narrowbands on the other hand.
In particular, the signal processor 800 from
The signal processor furthermore comprises a mid/side to left/right converter 820 in order to calculate from a mid-signal M and a side signal S a left signal L and a right signal R.
However, importantly, in order to calculate L and R by the mid/side-left/right conversion in block 820, the side signal S need not necessarily be used. Instead, as discussed later on, the left/right signals are initially calculated only using a gain parameter derived from an inter-channel level difference parameter ILD. Generally, the prediction gain can also be considered to be a form of an ILD. The gain can be derived from ILD but can also be directly computed. It is advantageous to not compute ILD anymore, but to compute the prediction gain directly and to transmit and use the prediction gain in the decoder rather than the ILD parameter.
Therefore, in this implementation, the side signal S is only used in the channel updater 830 that operates in order to provide a better left/right signal using the transmitted side signal S as illustrated by bypass line 821.
Therefore, the converter 820 operates using a level parameter obtained via a level parameter input 822 and without actually using the side signal S, but the channel updater 830 then operates using the side signal 821 and, depending on the specific implementation, using a stereo filling parameter received via line 831. The signal de-aligner 900 then comprises a phase de-aligner and energy scaler 910. The energy scaling is controlled by a scaling factor derived by a scaling factor calculator 940. The scaling factor calculator 940 is fed by the output of the channel updater 830. Based on the narrowband alignment parameters received via input 911, the phase de-alignment is performed and, in block 920, based on the broadband alignment parameter received via line 921, the time-de-alignment is performed. Finally, a spectrum-time conversion 930 is performed in order to finally obtain the decoded signal.
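The two-stage reconstruction described above (a conversion driven only by the level parameter, followed by an update with the transmitted side information) can be sketched as follows. The formulas are assumptions for illustration: the transmitted side information is taken to be a residual of a prediction S = g M + residual with a gain g, and stereo filling, energy scaling, phase de-alignment and time de-alignment are omitted. All names are hypothetical.

```python
def convert_mid_to_left_right(mid, gain):
    """Stage corresponding to the converter: uses only the mid signal and the gain."""
    left = [(1.0 + gain) * m for m in mid]
    right = [(1.0 - gain) * m for m in mid]
    return left, right

def update_channels(left, right, side_residual):
    """Stage corresponding to the channel updater: refines both channels with
    the transmitted side residual, added to one channel and subtracted from the other."""
    new_left = [l + s for l, s in zip(left, side_residual)]
    new_right = [r - s for r, s in zip(right, side_residual)]
    return new_left, new_right

# Hypothetical decoded values for one band.
mid = [0.75, -1.5]
gain = 1.0 / 3.0
residual = [0.1, 0.0]

left0, right0 = convert_mid_to_left_right(mid, gain)
left1, right1 = update_channels(left0, right0, residual)
```

Bands for which no residual is transmitted would simply keep the output of the first stage, which is why the first stage must be usable on its own.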
Specifically, the narrowband de-aligned channels are input into the broadband de-alignment functionality corresponding to block 920 of
When
Furthermore, the DFT operations in blocks 810 correspond to element 810 in
Subsequently,
Additionally, the spectrum is also divided into different parameter bands. Each parameter band has at least one spectral line and may have more than one. Additionally, the bandwidth of the parameter bands increases from lower to higher frequencies. Typically, the broadband alignment parameter is a single broadband alignment parameter for the whole spectrum, i.e., for a spectrum comprising all the bands 1 to 6 in the exemplary embodiment in
Furthermore, the plurality of narrowband alignment parameters are provided so that there is a single alignment parameter for each parameter band. This means that the alignment parameter for a band applies to all the spectral values within the corresponding band.
Furthermore, in addition to the narrowband alignment parameters, level parameters are also provided for each parameter band.
In contrast to the level parameters that are provided for each and every parameter band from band 1 to band 6, it is advantageous to provide the plurality of narrowband alignment parameters only for a limited number of lower bands such as bands 1, 2, 3 and 4.
Additionally, stereo filling parameters are provided for a certain number of bands excluding the lower bands, such as, in the exemplary embodiment, bands 4, 5 and 6. For the lower parameter bands 1, 2 and 3, side signal spectral values are available and, consequently, no stereo filling parameters exist for these lower bands, where waveform matching is obtained using either the side signal itself or a prediction residual signal representing the side signal.
As already stated, there exist more spectral lines in higher bands such as, in the embodiment in
Nevertheless,
As illustrated, the level parameter ILD is provided for each of 12 bands and is quantized to a quantization accuracy represented by five bits per band.
Furthermore, the narrowband alignment parameters IPD are only provided for the lower bands up to a border frequency of 2.5 kHz. Additionally, the inter-channel time difference or broadband alignment parameter is only provided as a single parameter for the whole spectrum but with a very high quantization accuracy represented by eight bits for the whole band.
Furthermore, quite coarsely quantized stereo filling parameters, represented by three bits per band, are provided, but not for the lower bands below 1 kHz since, for the lower bands, actually encoded side signal or side signal residual spectral values are included.
Subsequently, a processing on the encoder side is summarized with respect to
ILD parameters, i.e., level parameters and phase parameters (IPD parameters), are calculated for each parameter band on the shifted L and R representations as illustrated at step 171. This step corresponds to step 160 of
In the final step 175, the time domain mid-signal m and, optionally, the residual signal are coded. This procedure corresponds to what is performed by the signal encoder 400 in
At the decoder in the inverse stereo processing, the Side signal is generated in the DFT domain and is first predicted from the Mid signal as:
$$\widehat{Side} = g \cdot Mid$$
where g is a gain computed for each parameter band and is a function of the transmitted Inter-channel Level Differences (ILDs).
The residual of the prediction, Side − g·Mid, can then be refined in two different ways:
The two types of coding refinement can be mixed within the same DFT spectrum. In the embodiment, the residual coding is applied on the lower parameter bands, while residual prediction is applied on the remaining bands. In the embodiment, the residual coding is as depicted in
1. Time-Frequency Analysis: DFT
It is important that the extra time-frequency decomposition performed by the DFTs for the stereo processing allows a good auditory scene analysis while not significantly increasing the overall delay of the coding system. By default, a time resolution of 10 ms (twice as fine as the 20 ms framing of the core coder) is used. The analysis and synthesis windows are identical and symmetric. The window is represented at a sampling rate of 16 kHz in
2. Stereo Parameters
Stereo parameters can be transmitted at most at the time resolution of the stereo DFT. At minimum, the resolution can be reduced to the framing resolution of the core coder, i.e. 20 ms. By default, when no transient is detected, parameters are computed every 20 ms over 2 DFT windows. The parameter bands constitute a non-uniform and non-overlapping decomposition of the spectrum following roughly 2 times or 4 times the Equivalent Rectangular Bandwidths (ERB). By default, a 4 times ERB scale is used for a total of 12 bands for a frequency bandwidth of 16 kHz (32 kHz sampling rate, Super Wideband stereo).
3. Computation of ITD and Channel Time Alignment
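The ITDs are computed by estimating the Time Delay of Arrival (TDOA) using the Generalized Cross Correlation with Phase Transform (GCC-PHAT):
$$ITD = \arg\max\left(\mathrm{IDFT}\left(\frac{L(f)\,R^{*}(f)}{\left|L(f)\,R^{*}(f)\right|}\right)\right)$$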
where L and R are the frequency spectra of the left and right channels, respectively. The frequency analysis can be performed independently of the DFT used for the subsequent stereo processing or can be shared. The pseudo-code for computing the ITD is the following:
In block 451, a DFT analysis of the time domain signals for a first channel (l) and a second channel (r) is performed. This DFT analysis will typically be the same DFT analysis as has been discussed in the context of steps 155 to 157 in
A cross-correlation is then performed for each frequency bin as illustrated in block 452.
Thus, a cross-correlation spectrum is obtained for the whole spectral range of the left and the right channels.
In step 453, a spectral flatness measure is then calculated from the magnitude spectra of L and R and, in step 454, the larger spectral flatness measure is selected. However, the selection in step 454 does not necessarily have to be the selection of the larger one; the single SFM can also be determined by calculating and selecting the SFM of only the left channel or only the right channel, or by calculating a weighted average of both SFM values.
In step 455, the cross-correlation spectrum is then smoothed over time depending on the spectral flatness measure.
The spectral flatness measure may be calculated by dividing the geometric mean of the magnitude spectrum by the arithmetic mean of the magnitude spectrum. Thus, the values for SFM are bounded between zero and one.
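The ratio described above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `spectral_flatness` and the small constant `eps` are assumptions of the sketch and do not come from the embodiment.

```python
import math

def spectral_flatness(magnitudes, eps=1e-12):
    """Geometric mean divided by arithmetic mean of a magnitude spectrum."""
    n = len(magnitudes)
    # eps (an assumption of this sketch) keeps log() finite for zero-valued bins.
    log_sum = sum(math.log(m + eps) for m in magnitudes)
    geometric_mean = math.exp(log_sum / n)
    arithmetic_mean = sum(magnitudes) / n + eps
    return geometric_mean / arithmetic_mean

# A flat (noise-like) spectrum gives a value close to one,
# a spectrum with one dominant line (tone-like) gives a value close to zero.
flat = spectral_flatness([1.0] * 64)
tonal = spectral_flatness([1.0] + [1e-6] * 63)
```

Because the geometric mean never exceeds the arithmetic mean, the returned value stays between zero and one, in line with the bound stated above.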
In step 456, the smoothed cross-correlation spectrum is then normalized by its magnitude and, in step 457, an inverse DFT of the normalized and smoothed cross-correlation spectrum is calculated. In step 458, a certain time domain filtering may be performed. This time domain filtering can also be omitted depending on the implementation, but it is advantageous, as will be outlined later on.
In step 459, an ITD estimation is performed by peak-picking of the filtered generalized cross-correlation function and by performing a certain thresholding operation.
If a certain threshold is not reached, then the ITD is set to zero and no time alignment is performed for the corresponding block.
The ITD computation can also be summarized as follows. The cross-correlation is computed in the frequency domain before being smoothed depending on the Spectral Flatness Measure (SFM). The SFM is bounded between 0 and 1. In the case of noise-like signals, the SFM will be high (i.e. around 1) and the smoothing will be weak. In the case of tone-like signals, the SFM will be low and the smoothing will become stronger. The smoothed cross-correlation is then normalized by its amplitude before being transformed back to the time domain. The normalization corresponds to the phase transform of the cross-correlation, and is known to show better performance than the normal cross-correlation in low noise and relatively high reverberation environments. The so-obtained time domain function is first filtered for achieving a more robust peak picking. The index corresponding to the maximum amplitude corresponds to an estimate of the time difference between the left and right channels (ITD). If the amplitude of the maximum is lower than a given threshold, then the estimate of the ITD is not considered reliable and is set to zero.
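The summarized steps can be illustrated with the following Python sketch. It is not the pseudo-code of the embodiment: the function names, the naive DFT, the smoothing recursion `(1 - sfm) * previous + sfm * current`, and the values of `max_lag` and `threshold` are all assumptions chosen for illustration, and the optional time domain filtering of step 458 is left out.

```python
import cmath
import math
import random

def dft(x, inverse=False):
    """Naive O(N^2) DFT, adequate for a short illustrative frame."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def flatness(mags, eps=1e-12):
    """Spectral flatness: geometric mean over arithmetic mean."""
    geo = math.exp(sum(math.log(m + eps) for m in mags) / len(mags))
    return geo / (sum(mags) / len(mags) + eps)

def estimate_itd(left, right, prev_cross=None, max_lag=8, threshold=0.2):
    """Return (itd_in_samples, smoothed_cross_spectrum) for one frame."""
    L, R = dft(left), dft(right)
    cross = [a * b.conjugate() for a, b in zip(L, R)]
    # A single SFM is taken here as the larger of the two channel values.
    sfm = max(flatness([abs(v) for v in L]), flatness([abs(v) for v in R]))
    if prev_cross is not None:
        # Hypothetical recursion: high SFM -> weak smoothing, low SFM -> strong.
        cross = [(1.0 - sfm) * p + sfm * c for p, c in zip(prev_cross, cross)]
    # Phase transform: keep only the phase of each bin.
    phat = [c / (abs(c) + 1e-12) for c in cross]
    gcc = [v.real for v in dft(phat, inverse=True)]
    n = len(gcc)
    # Search lags -max_lag..max_lag; negative lags wrap to the end of gcc.
    best = max(range(-max_lag, max_lag + 1), key=lambda lag: abs(gcc[lag % n]))
    if abs(gcc[best % n]) < threshold:
        best = 0  # peak not reliable: no time alignment for this frame
    return best, cross

random.seed(0)
frame = [random.gauss(0.0, 1.0) for _ in range(64)]
delayed = frame[-3:] + frame[:-3]  # circular delay of 3 samples
itd, _ = estimate_itd(delayed, frame)
```

With the sign convention of this sketch (cross-spectrum formed as L times the conjugate of R), a positive result means that the left channel lags the right channel.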
If the time alignment is applied in Time Domain, the ITD is computed in a separate DFT analysis. The shift is done as follows:
This requires an extra delay at the encoder, which is equal at most to the maximum absolute ITD that can be handled. The variation of the ITD over time is smoothed by the analysis windowing of the DFT.
Alternatively, the time alignment can be performed in the frequency domain. In this case, the ITD computation and the circular shift are in the same DFT domain, a domain shared with the other stereo processing. The circular shift is given by:
Zero padding of the DFT windows is needed for simulating a time shift with a circular shift. The size of the zero padding corresponds to the maximum absolute ITD which can be handled. In the embodiment, the zero padding is split uniformly on both sides of the analysis windows, by adding 3.125 ms of zeros on both ends. The maximum absolute possible ITD is then 6.25 ms. In an A-B microphone setup, this corresponds in the worst case to a maximum distance of about 2.15 meters between the two microphones. The variation in ITD over time is smoothed by synthesis windowing and overlap-add of the DFT.
It is important that the time shift is followed by a windowing of the shifted signal. This is a main distinction from the conventional Binaural Cue Coding (BCC), where the time shift is applied on a windowed signal but is not windowed further at the synthesis stage. As a consequence, in BCC any change in ITD over time produces an artificial transient/click in the decoded signal.
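Why zero padding lets a circular shift behave like a true time shift can be seen in the following Python sketch. It uses the textbook phase-ramp form of a circular shift with an integer delay, which is an assumption and not necessarily the exact expression of the embodiment; the function names and the speed of sound of 343 m/s are likewise assumptions.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) DFT, adequate for a short illustrative frame."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def shift_in_dft_domain(frame, pad, delay):
    """Delay `frame` by an integer `delay` (|delay| <= pad) using a phase ramp."""
    if abs(delay) > pad:
        raise ValueError("delay exceeds the zero padding; the shift would wrap around")
    padded = [0.0] * pad + list(frame) + [0.0] * pad
    n = len(padded)
    spectrum = dft(padded)
    # Multiplying bin k by exp(-j*2*pi*k*delay/n) rotates the frame circularly.
    rotated = [v * cmath.exp(-2j * math.pi * k * delay / n) for k, v in enumerate(spectrum)]
    return [v.real for v in dft(rotated, inverse=True)]

# The samples move into the zero-padded region instead of wrapping around.
delayed = shift_in_dft_domain([1.0, 2.0, 3.0, 4.0], pad=2, delay=2)
advanced = shift_in_dft_domain([1.0, 2.0, 3.0, 4.0], pad=2, delay=-2)

# Arithmetic behind the figures quoted above.
max_itd_ms = 2 * 3.125                    # zeros on both ends of the window
distance_m = max_itd_ms / 1000.0 * 343.0  # assumed speed of sound in m/s
```

With the assumed speed of sound, the 6.25 ms limit corresponds to roughly 2.14 m, which agrees with the figure of about 2.15 meters.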
4. Computation of IPDs and Channel Rotation
The IPDs are computed after time aligning the two channels, for each parameter band or at least up to a given ipd_max_band, depending on the stereo configuration.
$$IPD[b] = \mathrm{angle}\left(\sum_{k=band\_limits[b]}^{band\_limits[b+1]-1} L[k]\,R^{*}[k]\right)$$
The IPDs are then applied to the two channels for aligning their phases:
where $\beta = \operatorname{atan2}(\sin(IPD_i[b]), \cos(IPD_i[b]) + c)$ and $c = 10^{ILD_i[b]/20}$.
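A minimal Python sketch of the per-band IPD (the angle of the cross-spectrum summed over the band) and of the rotation angle β is given below. The function names are hypothetical, and c is passed in directly as a linear value so that the sketch does not depend on how c is obtained from the ILD.

```python
import cmath
import math

def ipd_per_band(L, R, band_limits, ipd_max_band):
    """Angle of the summed cross-spectrum L[k] * conj(R[k]) in each band."""
    ipds = []
    for b in range(ipd_max_band):
        lo, hi = band_limits[b], band_limits[b + 1]
        acc = sum(L[k] * R[k].conjugate() for k in range(lo, hi))
        ipds.append(cmath.phase(acc))
    return ipds

def rotation_angle(ipd, c):
    """beta = atan2(sin(IPD), cos(IPD) + c) for a linear level ratio c."""
    return math.atan2(math.sin(ipd), math.cos(ipd) + c)

# Two bands of four bins each, with a constant phase offset per band.
L = [cmath.rect(1.0, 0.5)] * 4 + [cmath.rect(2.0, -1.0)] * 4
R = [1.0 + 0.0j] * 8
ipds = ipd_per_band(L, R, band_limits=[0, 4, 8], ipd_max_band=2)
```

For c = 1 (equal levels) the angle β is half of the IPD, so the phase correction is shared equally between the channels. For c = 0, β equals the IPD, and β tends towards zero as c grows.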
5. Sum-Difference and Side Signal Coding
The sum-difference transformation is performed on the time and phase aligned spectra of the two channels in a way that the energy is conserved in the Mid signal.
The scaling factor a is bounded between 1/1.2 and 1.2, i.e. −1.58 and +1.58 dB. The limitation avoids artifacts when adjusting the energy of M and S. It is worth noting that this energy conservation is less important when time and phase have been aligned beforehand. Alternatively, the bounds can be increased or decreased.
The side signal S is further predicted with M:
where $c = 10^{ILD_i[b]/20}$.
The residual signal S′(f) can be modeled by two means: either by predicting it with the delayed spectrum of M or by coding it directly in the MDCT domain.
6. Stereo Decoding
The Mid signal M and Side signal S are first converted to the left and right channels L and R as follows:
$$L_i[k] = M_i[k] + g\,M_i[k], \quad \text{for } band\_limits[b] \le k < band\_limits[b+1],$$
$$R_i[k] = M_i[k] - g\,M_i[k], \quad \text{for } band\_limits[b] \le k < band\_limits[b+1],$$
where the gain g per parameter band is derived from the ILD parameter:
where $c = 10^{ILD_i[b]/20}$.
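As a plausibility check rather than a statement of the exact expression used in the embodiment, the relation between g and c can be worked out from the two conversion equations above, assuming that c denotes the amplitude ratio between the left and right channels within the band and that |g| < 1:

```latex
\frac{|L_i[k]|}{|R_i[k]|}
  = \frac{|M_i[k] + g\,M_i[k]|}{|M_i[k] - g\,M_i[k]|}
  = \frac{1+g}{1-g} = c
\quad\Longrightarrow\quad
g = \frac{c-1}{c+1}
```

Under this assumption, g = 0 when both channels have the same level (c = 1), and g approaches +1 or −1 as the left or the right channel dominates.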
For parameter bands below cod_max_band, the two channels are updated with the decoded Side signal:
$$L_i[k] = L_i[k] + cod\_gain_i \cdot S_i[k], \quad \text{for } 0 \le k < band\_limits[cod\_max\_band],$$
$$R_i[k] = R_i[k] - cod\_gain_i \cdot S_i[k], \quad \text{for } 0 \le k < band\_limits[cod\_max\_band],$$
For higher parameter bands, the side signal is predicted and the channels updated as:
$$L_i[k] = L_i[k] + cod\_pred_i[b] \cdot M_{i-1}[k], \quad \text{for } band\_limits[b] \le k < band\_limits[b+1],$$
$$R_i[k] = R_i[k] - cod\_pred_i[b] \cdot M_{i-1}[k], \quad \text{for } band\_limits[b] \le k < band\_limits[b+1],$$
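The three update rules of the stereo decoding up to this point can be combined into one band-wise loop, as in the following Python sketch. The function name `upmix_frame` and its argument layout are hypothetical, real-valued test data stand in for complex DFT bins, and the subsequent complex scaling and time shifting are not covered.

```python
def upmix_frame(mid, mid_prev, side, gains, cod_gain, cod_pred,
                band_limits, cod_max_band):
    """Band-wise M/S to L/R conversion followed by the side-signal update.

    mid, mid_prev : spectra of the current and previous frame (M_i, M_{i-1})
    side          : decoded side (or residual) spectrum, used below cod_max_band
    gains         : per-band gain g
    cod_pred      : per-band prediction gain, used from cod_max_band upwards
    """
    n = band_limits[-1]
    left, right = [0.0] * n, [0.0] * n
    for b in range(len(band_limits) - 1):
        for k in range(band_limits[b], band_limits[b + 1]):
            # Initial conversion using only the mid signal and the gain g.
            left[k] = mid[k] + gains[b] * mid[k]
            right[k] = mid[k] - gains[b] * mid[k]
            if b < cod_max_band:
                # Lower bands: update with the decoded side signal.
                update = cod_gain * side[k]
            else:
                # Higher bands: side predicted from the previous mid spectrum.
                update = cod_pred[b] * mid_prev[k]
            left[k] += update
            right[k] -= update
    return left, right

left, right = upmix_frame(
    mid=[1.0, 1.0, 1.0, 1.0], mid_prev=[2.0, 2.0, 2.0, 2.0],
    side=[0.5, 0.5, 0.5, 0.5], gains=[0.5, 0.0], cod_gain=1.0,
    cod_pred=[0.0, 0.25], band_limits=[0, 2, 4], cod_max_band=1)
```

Since every update is added to one channel and subtracted from the other, the sum of the two output channels always equals twice the mid signal.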
Finally, the channels are multiplied by a complex value aiming to restore the original energy and the inter-channel phase of the stereo signal:
where a is defined and bounded as defined previously, where $\beta = \operatorname{atan2}(\sin(IPD_i[b]), \cos(IPD_i[b]) + c)$, and where atan2(x, y) is the four-quadrant inverse tangent of x over y.
Finally, the channels are time shifted either in the time domain or in the frequency domain, depending on the transmitted ITDs. The time domain channels are synthesized by inverse DFTs and overlap-adding.
Specific features of the invention relate to the combination of spatial cues and sum-difference joint stereo coding. Specifically, the spatial cues ITD and IPD are computed and applied on the stereo channels (left and right). Furthermore, sum-difference (M/S) signals are calculated and a prediction of S with M may be applied.
On the decoder side, the broadband and narrowband spatial cues are combined together with sum-difference joint stereo coding. In particular, the side signal is predicted with the mid-signal using at least one spatial cue such as ILD and an inverse sum-difference is calculated for getting the left and right channels and, additionally, the broadband and the narrowband spatial cues are applied on the left and right channels.
The encoder may have a windowing and overlap-add operation with respect to the time aligned channels after processing using the ITD. Furthermore, the decoder additionally has a windowing and overlap-add operation of the shifted or de-aligned versions of the channels after applying the inter-channel time difference.
The computation of the inter-channel time difference with the GCC-PHAT method is a particularly robust method.
The new procedure is advantageous over conventional technology since it achieves low bit-rate coding of stereo audio or multi-channel audio at low delay. It is specifically designed for being robust to different natures of input signals and different setups of the multichannel or stereo recording. In particular, the present invention provides a good quality for low bit-rate stereo speech coding.
The procedures find use in the distribution or broadcasting of all types of stereo or multichannel audio content, such as speech and music alike, with constant perceptual quality at a given low bit rate. Such application areas are digital radio, internet streaming or audio communication applications.
An inventively encoded audio signal can be stored on a digital storage medium or a non-transitory storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier or a non-transitory storage medium.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
16152450 | Jan 2016 | EP | regional |
16152453 | Jan 2016 | EP | regional |
This application is a continuation of copending International Application No. PCT/EP2017/051205, filed Jan. 20, 2017, which is incorporated herein by reference in its entirety, and additionally claims priority from European Applications Nos. EP 16 152 453.3, filed Jan. 22, 2016 and EP 16 152 450.9, filed Jan. 22, 2016, all of which are incorporated herein by reference in their entirety.
Number | Date | Country | |
---|---|---|---|
20180322883 A1 | Nov 2018 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/EP2017/051205 | Jan 2017 | US |
Child | 16034206 | US |