The present disclosure relates to sound coding, in particular but not exclusively to a method and device for multi-channel comfort noise injection in a decoded sound signal at the decoder of a sound codec, in particular but not exclusively a stereo sound codec.
In the present disclosure and the appended claims:
Historically, conversational telephony has been implemented with handsets having only one transducer to output sound only to one of the user's ears. In the last decade, users have started to use their portable handset in conjunction with a headphone to receive the sound over their two ears mainly to listen to music but also, sometimes, to listen to speech. Nevertheless, when a portable handset is used to transmit and receive conversational speech, the content is still mono but presented to the user's two ears when a headphone is used.
With the newest 3GPP (3rd Generation Partnership Project) speech coding Standard, designated Enhanced Voice Services (EVS), as described in Reference [1], of which the full content is incorporated herein by reference, the quality of the coded sound, for example speech and/or audio, that is transmitted and received through a portable handset has been significantly improved. The next natural step is to transmit stereo information such that the receiver gets as close as possible to a real-life audio scene that is captured at the other end of the communication link.
Efficient stereo coding techniques have been developed and used for low bitrates. As a non-limitative example, the so-called parametric stereo coding constitutes one efficient technique for low bitrate stereo coding.
Parametric stereo encodes the two, left and right, channels as a mono signal using a common mono codec plus a certain amount of stereo side information (corresponding to stereo parameters) which represents a stereo image. The two input left and right channels are down-mixed into a mono signal, for example by summing the left and right channels and dividing the sum by 2. The stereo parameters are then computed, usually in a transform domain, for example in the Discrete Fourier Transform (DFT) domain, and are related to so-called binaural or inter-channel cues. The binaural cues (References [2] and [3], of which the full content is incorporated herein by reference) comprise Interaural Level Difference (ILD), Interaural Time Difference (ITD) and Interaural Correlation (IC). Depending on the signal characteristics, stereo scene configuration, etc., some or all binaural cues are coded and transmitted to the decoder. Information about which binaural cues are coded and transmitted is sent as signaling information, which is usually part of the stereo side information. Also, the binaural cues can be quantized (coded) using the same or different coding techniques, which results in a variable number of bits being used. In addition to the quantized binaural cues, the stereo side information may contain, usually at medium and higher bitrates, a quantized residual signal that results from the down-mixing, for example obtained by calculating a difference between the left and right channels and dividing the difference by 2. The binaural cues, residual signal and signalling information may be coded using an entropy coding technique, e.g. an arithmetic encoder. Additional information about arithmetic encoders may be found, for example, in Reference [1]. In general, parametric stereo coding is most efficient at lower and medium bitrates.
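By way of a non-limitative illustration of the down-mixing just described, the following Python sketch forms the mono down-mixed signal and the residual (side) signal from a pair of left/right frames; the function and variable names are illustrative only, and an actual parametric stereo encoder may use a different (for example adaptive) down-mix.

import numpy as np

def passive_downmix(left, right):
    """Illustrative passive down-mix into a mono signal and a residual (side) signal."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    mono = 0.5 * (left + right)       # mono down-mixed signal coded by the common mono core codec
    residual = 0.5 * (left - right)   # residual signal, optionally quantized as stereo side information
    return mono, residual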
Further, in recent years, the generation, recording, representation, coding, transmission, and reproduction of audio have been moving towards an enhanced, interactive and immersive experience for the listener. The immersive experience can be described, for example, as a state of being deeply engaged or involved in a sound scene while sounds are coming from all directions. In immersive audio (also called 3D (Three-Dimensional) audio), the sound image is reproduced in all three dimensions around the listener, taking into consideration a wide range of sound characteristics like timbre, directivity, reverberation, transparency and accuracy of (auditory) spaciousness. Immersive audio is produced for a particular sound playback or reproduction system such as a loudspeaker-based system, an integrated reproduction system (sound bar) or headphones. Then, interactivity of a sound reproduction system may include, for example, an ability to adjust sound levels, change positions of sounds, or select different languages for the reproduction.
In recent years, 3GPP started working on the development of a 3D sound codec for immersive services called IVAS (Immersive Voice and Audio Services), based on the EVS codec (see Reference [4], of which the full content is incorporated herein by reference).
The present disclosure relates to a method implemented in a multi-channel sound decoder for injecting multi-channel comfort noise in a decoded multi-channel sound signal, comprising: estimating background noise in a decoded mono down-mixed signal; and calculating, in response to the estimated background noise, comfort noise for each of a plurality of channels of the decoded multi-channel sound signal, and injecting the calculated comfort noise in the respective channels of the decoded multi-channel sound signal.
The present disclosure is also concerned with a device implemented in a multi-channel sound decoder for injecting comfort noise in a decoded multi-channel sound signal, comprising: an estimator of background noise in a decoded mono down-mixed signal; and an injector of comfort noise for calculating, in response to the estimated background noise, comfort noise for each of a plurality of channels of the decoded multi-channel sound signal and for injecting the calculated comfort noise in the respective channels of the decoded multi-channel sound signal.
The foregoing and other objects, advantages and features of the method and device for multi-channel comfort noise injection will become more apparent upon reading of the following non-restrictive description of illustrative embodiments thereof, given by way of example only with reference to the accompanying drawings.
In the appended drawings:
The present disclosure generally relates to multi-channel, for example stereo, comfort noise injection techniques in a sound decoder.
A stereo comfort noise injection technique will be described, by way of non-limitative example only, with reference to a parametric stereo sound decoder in an IVAS coding framework referred to throughout the present disclosure as IVAS codec (or IVAS sound codec). However, it is within the scope of the present disclosure to incorporate such multi-channel comfort noise injection techniques in any other types of multi-channel sound decoder and codec.
Mobile communication scenarios involving stereophonic signal capture may use low-bitrate parametric stereo coding as described, for example, in References [2] or [3]. In a low-bitrate parametric stereo encoder, a single transmission channel is usually used to transmit the mono down-mixed sound signal. The down-mixing process is designed to extract a signal from a principal direction of incoming sound. The quality of representation of the mono down-mixed signal is to a large extent determined by the underlying core codec. Due to the limitations of the available bit budget, the quality of the decoded mono down-mixed signal is often mediocre, especially in the presence of background noise as described in Reference [5], of which the full content is herein incorporated by reference. As a non-limitative example, in the case of a CELP-based core codec, the available bit budget is distributed among coding of various components such as the spectral envelope, adaptive codebook, fixed codebook, adaptive codebook gain, and fixed codebook gain of the excitation signal. In active segments of a noisy speech signal, the amount of bits allocated to coding of the fixed codebook is not sufficient for a transparent representation thereof. Spectral holes can be observed in the spectrogram of the synthesized sound signal in certain frequency regions, for example between the formants. When listening to the synthesized sound signal, the background noise is perceived as intermittent, thereby reducing the performance of the parametric stereo encoder.
A technical effect of the method and device according to the present disclosure for stereo comfort noise injection in a decoded sound signal at the decoder of a sound codec, in particular but not exclusively a parametric stereo decoder, is to reduce the negative effect of insufficient background noise representation in the codec. The decoded sound signal is analyzed during inactive segments where background noise is assumed to be present without speech. A long-term estimate of the spectral envelope of the background noise is calculated and stored in the memory of the decoder. A synthetically-made copy of the background noise is then generated in active segments of the decoded sound signal and injected in this decoded sound signal. The method and device for stereo comfort noise injection according to the present disclosure are different from the so-called “comfort noise addition” applied in, for example, the EVS codec (Reference [1]). The differences include, amongst others, at least the following aspects:
The disclosed method and device for stereo comfort noise injection can be part of the parametric stereo decoder of an IVAS sound codec.
As already mentioned, the stereo comfort noise injection device and method are described, by way of non-limitative example only, with reference to a parametric stereo decoder in an IVAS sound codec.
Referring to
The demultiplexer 101 recovers from the received bitstream (a) the coded mono down-mixed signal 131, for example in time-domain and (b) the coded stereo parameters 132 such as the above mentioned ILD, ITD and/or IC binaural cues and possibly the above mentioned quantized residual signal resulting from the down-mixing.
The parametric stereo decoding method 150 of
According to a non-limitative example, the core decoder 102 may be a CELP (Code-Excited Linear Prediction)-based core codec. The core decoder 102 then uses CELP technology to obtain a decoded mono down-mixed signal 133, in time-domain, from the received coded mono down-mixed signal 131.
It is within the scope of the present disclosure to use other types of core decoder technologies such as ACELP (Algebraic Code-Excited Linear Prediction), TCX (Transform-Coded eXcitation) or GSC (Generic audio Signal Coder).
Additional information about CELP, ACELP, TCX and GSC decoders may be found, for example, in Reference [1].
Referring to
Obviously, the stereo parameters decoder 110 uses decoding technique(s) corresponding to those that have been used to code the stereo parameters 132.
For example, if the above-mentioned binaural cues, residual signal and signalling information are coded using an entropy coding technique, e.g. arithmetic coding, the decoder 110 uses corresponding entropy/arithmetic decoding techniques to recover these binaural cues, residual signal and signalling information.
Referring to
The calculator 104 transforms the time-domain, decoded mono down-mixed signal 133 into a frequency-domain mono down-mixed signal 135. For that purpose, the calculator 104 uses a frequency transform such as a Discrete Fourier Transform (DFT) or a Discrete Cosine Transform (DCT).
The parametric stereo decoding method 150 comprises an operation 155 of stereo up-mixing the frequency-domain mono down-mixed signal 135 from the frequency transform calculator 104 and the decoded stereo parameters 145 from the stereo parameters decoder 110 to produce frequency-domain left channel 136 and right channel 137 of the decoded stereo sound signal. To perform operation 155, the parametric stereo decoder 100 comprises a stereo up-mixer 105.
An example of stereo up-mixing of the frequency-domain mono down-mixed signal 135 from the frequency transform calculator 104 and the decoded stereo parameters 145 from the stereo parameters decoder 110 to produce frequency-domain left channel 136 and right channel 137 is described for example in Reference [2], Reference [3], and Reference [6], of which the full content is incorporated herein by reference.
The parametric stereo decoding method 150 comprises an operation 157 of inverse frequency transforming the up-mixed frequency-domain left 138 and right 139 channels. To perform operation 157, the parametric stereo decoder 100 comprises an inverse frequency transform calculator 107.
Specifically, the calculator 107 inverse transforms the frequency-domain left channel 138 and right channel 139 into time-domain left channel 140 and right channel 141. For example, if the calculator 104 uses a discrete Fourier transform, the calculator 107 uses an inverse discrete Fourier transform. If the calculator 104 uses a DCT transform, the calculator 107 uses an inverse DCT transform.
Additional information regarding parametric stereo encoders and decoders can be found, for example, in Reference [2], [3] and [6].
As described herein below, the parametric stereo decoding method 150 of
Referring to
The background noise estimator 103 of the parametric stereo decoder 100 of
The information about speech inactivity may be calculated in the parametric stereo encoder (not shown) of the IVAS sound codec using a Voice Activity Detection (VAD) algorithm similar to that used in the EVS codec (Reference [1]) and transmitted to the parametric stereo decoder 100 as a binary VAD flag fVAD in the bitstream received by the demultiplexer 101. Alternatively, the binary VAD flag fVAD can be coded as part of an encoder type parameter, for example as described in the EVS codec (Reference [1]). The encoder type parameter in the EVS codec is selected from the following set of signal classes: INACTIVE, UNVOICED, VOICED, GENERIC, TRANSITION and AUDIO. When the decoded encoder type parameter is INACTIVE the VAD flag fVAD is “0”. In all other cases the VAD flag is “1”. If the binary VAD flag fVAD is not transmitted in the bitstream and it cannot be deduced from the encoder type parameter, it can be calculated explicitly in the background noise estimator 103 by running the VAD algorithm on the decoded mono down-mixed signal 133. The VAD flag fVAD in the parametric stereo decoder 100 may be expressed using, for example, the following relation (1):
with n being the sample index of the decoded mono down-mixed signal 133 and N the total number of samples in the current frame (length of the current frame). The decoded mono down-mixed signal 133 is denoted as md(n), n=0, . . . , N−1.
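As a minimal sketch of the mapping from the decoded encoder type parameter to the binary VAD flag described above (relation (1) itself is not reproduced here, and the function name below is illustrative only):

def vad_flag_from_encoder_type(encoder_type: str) -> int:
    """Illustrative derivation of the binary VAD flag from the decoded encoder type parameter."""
    # The flag is "0" only for the INACTIVE signal class and "1" for all other classes
    # (UNVOICED, VOICED, GENERIC, TRANSITION, AUDIO).
    return 0 if encoder_type == "INACTIVE" else 1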
The estimation of the background noise envelope by analyzing the decoded mono down-mixed signal 133 during speech inactivity will be described herein after in sections 3.1.1 to 3.1.5.
The background noise estimator 103 converts the decoded mono down-mixed signal 133 to frequency-domain using a DFT transform. The DFT transformation process 200 is illustrated in the schematic diagram of
To reduce the effects of spectral leakage occurring at frame borders, the decoded mono down-mixed signal 133 is first multiplied with a tapered window, for example the normalized sine window 204. The raw sine window ws(n) may be expressed using the following relation (2):
The sine window ws(n) is normalized (wsn(n)) using, for example, the following relation (3):
The decoded mono down-mixed signal 133 (md(n)) is windowed (mw(n)) with the normalized sine window wsn(n) using, for example, the following relation (4):
mw(n)=md(n)wsn(n),n=0, . . . ,2N−1 (4)
The windowed decoded mono down-mixed signal mw(n) is then transformed with the DFT transform 201 using, for example, the following relation (5):
As the input, decoded mono down-mixed signal 133 is real, its spectrum (see 205 in
As can be seen from relation (6), the power spectrum (see 206 in
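By way of a non-limitative illustration, the following Python sketch reproduces the windowing and DFT analysis just described; the exact sine-window normalization of relations (2) and (3) and the power-spectrum normalization of relation (6) are not reproduced in the text above, so the constants used below are assumptions rather than the codec's actual values.

import numpy as np

def mono_power_spectrum(md_2n):
    """Windowed DFT analysis of 2N samples of the decoded mono down-mixed signal (sketch)."""
    md_2n = np.asarray(md_2n, dtype=float)
    two_n = md_2n.size                              # 2N samples, e.g. 640 at 16 kHz with 20 ms frames
    n = np.arange(two_n)
    w_s = np.sin(np.pi * (n + 0.5) / two_n)         # assumed raw sine window, relation (2)
    w_sn = w_s / np.sqrt(np.sum(w_s ** 2) / two_n)  # assumed normalization, relation (3)
    m_w = md_2n * w_sn                              # windowing, relation (4)
    spectrum = np.fft.rfft(m_w)                     # DFT; only half the bins kept since the input is real
    power = (np.abs(spectrum) ** 2) / two_n         # assumed power-spectrum normalization, relation (6)
    return power                                    # P(k), k = 0, ..., N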
The normalized power spectrum P(k) is compressed in the frequency domain by compacting frequency bins into frequency bands. As a non-limitative example, let's assume that the decoded mono down-mixed signal 133 is sampled at a sampling frequency of 16 kHz and the length of a frame is 20 ms. The total number of samples in every frame is N=320 and the length of the FFT (Fast Fourier Transform used to calculate the DFT) transform is 2N=640. Let's denote the total number of frequency bands as B. The process 300 of compacting spectral bins in frequency bands is illustrated in
The human auditory system is more sensitive to spectral content at low frequencies. Therefore, in the example partitioning scheme used here, the lowest frequency bins, up to the bin of index kBIN, form single-bin frequency bands and are copied unchanged into the compressed power spectrum:
N(k)=P(k),k=0, . . . ,kBIN (7)
For frequencies higher than kBIN, the background noise estimator 103 compresses the bin-wise power spectrum by means of spectral averaging of the frequency bins of the power spectrum P(k) in the corresponding frequency band. This is done by first calculating a mean N0(b) of the power spectrum P(k) in each frequency band using, for example, the following relation (8):
where b represents the frequency band and the range klow(b),khigh(b) identifies the set of frequency bins of a bth frequency band, of which klow(b) is the lowest frequency bin and khigh(b) is the highest frequency bin. In the exemplary case of a number N=320 of frequency bins, the assignment of frequency bins to frequency bands is defined in Table 1, where kmid(b) represents the middle frequency bin of a frequency band b.
The above described spectral averaging of relation (8) tends to reduce the variance of the background noise. To compensate for the loss of variance, the background noise estimator 103 adds random gaussian noise to the mean power spectrum. This is done as follows. First, the background noise estimator 103 calculates a variance σ(b) of the random gaussian noise in each frequency band b using, for example, the following relation (9):
The random gaussian noise generated by the background noise estimator 103 has zero mean and a variance calculated using Equation (9) in each frequency band. The generated random gaussian noise is denoted as N (0,σb2). The addition N(b) of the generated random gaussian noise to the compressed power spectrum can then be expressed using relation (10):
N(b)=N0(b)+N(0,σb2),b=kBIN+1, . . . ,B−1 (10)
The values of the compressed power spectrum are limited to a minimum of 10−5. The addition of random gaussian noise to the mean power spectrum is only performed after an initialization procedure, which is explained later in the present disclosure.
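A minimal Python sketch of the compression of the power spectrum into frequency bands described above is given below; the band edges klow(b) and khigh(b) of Table 1 and the per-band noise variance σ(b) of relation (9) are taken as given inputs, and the function name is illustrative only.

import numpy as np

def compress_power_spectrum(power, k_bin, k_low, k_high, noise_variance, add_noise=True, rng=None):
    """Sketch of compacting the bin-wise power spectrum P(k) into frequency bands (relations (7)-(10))."""
    if rng is None:
        rng = np.random.default_rng()
    power = np.asarray(power, dtype=float)
    compressed = list(power[:k_bin + 1])                  # single-bin partitions, relation (7)
    for b in range(len(k_low)):                           # band edges as per Table 1 (not reproduced here)
        n0 = power[k_low[b]:k_high[b] + 1].mean()         # spectral averaging, relation (8)
        if add_noise:                                     # only after the initialization procedure is completed
            n0 += rng.normal(0.0, np.sqrt(noise_variance[b]))   # relation (10); variance per relation (9), taken as given
        compressed.append(n0)
    return np.maximum(np.array(compressed), 1e-5)         # values of the compressed spectrum are floored at 1e-5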
The background noise estimator 103 smoothes the compressed power spectrum N(b) in the frequency domain by means of non-linear IIR filtering. The IIR filtering operation depends on the VAD flag fVAD. As a general rule, the smoothing is stronger during inactive segments and weaker during active segments of the decoded stereo sound signal. The smoothed compressed power spectrum is denoted as Ñ(b), b=0, . . . , B−1.
For inactive segments of the decoded stereo sound signal, when the VAD flag fVAD is “0” in the current frame, the IIR smoothing is performed using, for example, the following relation (11):
where the index m in brackets has been added to denote the current frame. In the first line of relation (11) fast downward update of the compressed power spectrum is performed in single-bin partitions using a forgetting factor α of 0.8. In the second line of relation (11) only slow upward update is performed for all bands of the compressed power spectrum using a factor α of 1.05. The third line of relation (11) represents the default IIR filter configuration using a forgetting factor α of 0.95 for all cases other than those described by the conditions of the first and second lines of relation (11).
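Relation (11) itself is not reproduced in the text above; the following Python sketch therefore shows only one plausible reading of the three branches just described, and the exact branch conditions as well as the form of the slow upward update are assumptions.

import numpy as np

def smooth_inactive(n_prev, n_cur, k_bin):
    """One possible reading of the three-branch IIR smoothing used in inactive frames."""
    n_prev = np.asarray(n_prev, dtype=float)
    n_cur = np.asarray(n_cur, dtype=float)
    out = np.empty_like(n_prev)
    for b in range(n_prev.size):
        if b <= k_bin and n_cur[b] < n_prev[b]:
            out[b] = 0.8 * n_prev[b] + 0.2 * n_cur[b]      # fast downward update in single-bin partitions
        elif n_cur[b] > n_prev[b]:
            out[b] = min(1.05 * n_prev[b], n_cur[b])       # slow upward update, all bands
        else:
            out[b] = 0.95 * n_prev[b] + 0.05 * n_cur[b]    # default smoothing
    return out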
For active segments of the decoded stereo sound signal, when the VAD flag fVAD is “1” in the current frame, the background noise estimator 103 performs IIR smoothing only in some selected frequency bands. The smoothing operation is performed with an IIR filter having a forgetting factor proportional to the ratio between the total energy of the compressed power spectrum and the total energy of the smoothed compressed power spectrum.
The total energy EN of the compressed power spectrum can be calculated using, for example, the following relation (12):
The total energy ẼN of the smoothed compressed power spectrum can be calculated using, for example, the following relation (13):
The ratio renr between the total energy EN of the compressed power spectrum and the total energy ẼN of the smoothed compressed power spectrum can be calculated using, for example, the following relation (14):
where ε is a small constant value added to avoid division by zero, for example ε=10−7.
If the energy ratio renr is lower than 0.5, it means that the total energy EN of the compressed power spectrum is significantly lower than the total energy ẼN of the smoothed compressed power spectrum. In this case, the smoothed compressed power spectrum Ñ[m](b) in the current frame m is updated using, for example, the following relation (15):
Ñ[m](b)=renr·Ñ[m-1](b)+(1−renr)·N[m](b),b=0, . . . ,B−1 if N[m](b)<Ñ[m](b) (15)
Thus, in all bands where significant energy drop is detected in the current frame the energy of the smoothed compressed power spectrum Ñ[m](b) is updated rather quickly, in proportion to the energy ratio renr.
If the energy ratio renr is higher than or equal to 0.5 the smoothed compressed power spectrum Ñ[m](b) is updated only in frequency bands above 2275 Hz. This corresponds to b≥50 in this illustrative embodiment. First, the background noise estimator 103 calculates a short-term average of the smoothed compressed power spectrum Ñ[m](b) using, for example, the following relation (16):
ÑST[m](b)=0.9·ÑST[m-1](b)+0.1·N[m](b),b=50, . . . ,B−1 (16)
where ÑST[0](b)=0 for b=50, . . . , B−1. The short-term smoothed compressed power spectrum is updated in every frame, regardless of the value of renr. The background noise estimator 103 updates the smoothed compressed power spectrum Ñ[m](b) in frames where renr≥0.5 using, for example, the following relation (17):
Ñ[m](b)=0.7·Ñ[m-1](b)+0.3·ÑST[m](b),b=50, . . . ,B−1 if ÑST[m](b)<Ñ[m-1](b) (17)
Again, only downward update (energy drop is detected in the current frame) is allowed but the update is slower compared to the case when renr<0.5.
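The following Python sketch illustrates the active-frame smoothing described in this section; the bodies of relations (12) to (14) are not reproduced above, so the total energies are taken here as plain sums of the band values, which is what the prose describes, and the function name is illustrative only.

import numpy as np

def smooth_active(n_prev, n_cur, n_st_prev, b_start=50, eps=1e-7):
    """Sketch of the energy-ratio driven smoothing used in active frames (relations (12)-(17))."""
    n_prev = np.asarray(n_prev, dtype=float)
    n_cur = np.asarray(n_cur, dtype=float)
    n_st = np.asarray(n_st_prev, dtype=float).copy()
    hi = np.arange(n_prev.size) >= b_start                   # frequency bands above about 2275 Hz
    n_st[hi] = 0.9 * n_st[hi] + 0.1 * n_cur[hi]              # short-term average, relation (16), every frame
    e_n = n_cur.sum()                                        # total energy of the compressed spectrum, relation (12)
    e_sm = n_prev.sum()                                      # total energy of the smoothed spectrum, relation (13)
    r_enr = e_n / (e_sm + eps)                               # energy ratio, relation (14)
    n_out = n_prev.copy()
    if r_enr < 0.5:
        drop = n_cur < n_prev                                # bands where a significant energy drop is detected
        n_out[drop] = r_enr * n_prev[drop] + (1.0 - r_enr) * n_cur[drop]   # relation (15)
    else:
        drop = hi & (n_st < n_prev)                          # only downward update, high bands, relation (17)
        n_out[drop] = 0.7 * n_prev[drop] + 0.3 * n_st[drop]
    return n_out, n_st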
The update of the smoothed compressed power spectrum Ñ[m](b), as described in this section 3.1.3, is modified during an initialization procedure, which will be explained in the next section of the present disclosure.
The background noise estimation operation 153 requires proper initialization.
The background noise estimator 103 uses a counter cCNI of consecutive inactive frames (fVAD=“0”) in which the smoothed compressed power spectrum Ñ[m](b) is updated. The counter cCNI is initialized to 0 (block 401 in
Referring to
As can be seen, the initialization procedure 400 is completed after the smoothed compressed power spectrum Ñ[m](b) has been updated in a given number of consecutive inactive frames. This is controlled by the parameter cMAX. As a non-limitative example, the parameter cMAX is set to 5. Setting the parameter cMAX to a higher value may lead to an initialization procedure 400 of the background noise estimation operation 153 which is more stable but which requires a longer period of time to complete the initialization. As the smoothed compressed power spectrum Ñ[m](b) is used for stereo comfort noise injection and also during Discontinuous Transmission (DTX) operation it is not advisable to extend the initialization period too much. Further information about the DTX operation can be found, for example, in Reference [1].
During the initialization procedure 400, the background noise estimator 103 updates (sub-operation 403) the smoothed compressed power spectrum Ñ[m](b) with the successive IIR filter using, for example, the following relation (18):
in which [m] is the frame index and ÑST[0](b)=0 for b=0, . . . , B−1. Thus, the forgetting factor α=1/(cCNI+1) decreases with the counter cCNI, that is, with the number of inactive frames in which the smoothed compressed power spectrum Ñ[m](b) has already been updated. With this initialization procedure 400, the smoothed compressed power spectrum Ñ[m](b) contains meaningful spectral information about the background noise. If, for example, DTX operation is detected in the decoder before the initialization procedure is completed, it is still possible to use the smoothed compressed power spectrum Ñ[m](b) as an estimate of the background noise.
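Relation (18) is not reproduced in the text above; the following sketch assumes a simple running (cumulative) average, which is consistent with a forgetting factor α=1/(cCNI+1) giving equal weight to each of the first consecutive inactive frames, and the function name is illustrative only.

def init_update(n_prev, n_cur, c_cni):
    """Assumed running-average update of the smoothed compressed power spectrum during initialization."""
    alpha = 1.0 / (c_cni + 1)                         # forgetting factor decreasing with the counter cCNI
    return [(1.0 - alpha) * p + alpha * c for p, c in zip(n_prev, n_cur)]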
Similarly to power spectrum compression as illustrated in
P̃(k)=Ñ(k),k=0, . . . ,kBIN (19)
For frequencies higher than kBIN, the background noise estimator 103 expands the band-wise compressed power spectrum by means of linear interpolation in the logarithmic domain as described in Reference [1]. For that purpose, the background noise estimator 103 first calculates a multiplicative increment βmult using, for example, the following relation (20):
where b identifies the frequency band and kmid(b) the middle bin of the bth band. The expanded power spectrum is then calculated for all b=kBIN+1, . . . , B−1 using, for example, the following relation (21):
P̃(k)=Ñ(b−1)·[βmult(b)]^(k−kmid(b−1)) (21)
In relations (20) and (21), the frame index [m] has been omitted for simplicity.
As the expanded power spectrum P̃(k) is calculated according to relations (19) and (21) during inactive frames, it represents an estimation of the background noise in the decoded mono down-mixed signal 133.
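A minimal Python sketch of the expansion described above is given below; since relation (20) is not reproduced in the text, the form of the multiplicative increment βmult(b) is an assumption (a geometric interpolation between the band levels anchored at the middle bins), and the function name is illustrative only.

import numpy as np

def expand_power_spectrum(n_smoothed, k_bin, k_mid, num_bins):
    """Sketch of expanding the band-wise spectrum back to frequency bins (relations (19)-(21))."""
    n_smoothed = np.asarray(n_smoothed, dtype=float)
    p_exp = np.zeros(num_bins)
    p_exp[:k_bin + 1] = n_smoothed[:k_bin + 1]        # single-bin partitions copied directly, relation (19)
    for b in range(k_bin + 1, n_smoothed.size):
        k0, k1 = k_mid[b - 1], k_mid[b]               # interpolation anchored at the middle bins of the bands
        beta = (n_smoothed[b] / n_smoothed[b - 1]) ** (1.0 / (k1 - k0))   # assumed form of relation (20)
        for k in range(k0 + 1, min(k1, num_bins - 1) + 1):
            p_exp[k] = n_smoothed[b - 1] * beta ** (k - k0)               # relation (21)
    return p_exp                                      # bins above the last middle bin are left at zero here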
Referring back to
The stereo Comfort Noise Injection (CNI) technology of operation 156 is based on the Comfort Noise Addition (CNA) technology, originally developed and integrated in the 3GPP EVS Codec (Reference [1]). The purpose of the CNA in the EVS codec is to compensate for the loss of energy arising from ACELP-based coding of noisy speech signals (Reference [5]). The loss of energy is especially noticeable at low bitrates, when the number of available bits in the ACELP encoder is insufficient to encode the fixed contribution (fixed codebook index and gain) of the excitation. As a result, the energy of the decoded signal in spectral valleys between speech formants is lower than the energy in the original signal. This leads to an undesirable effect of “noise attenuation”, negatively perceived by the listeners. Addition of random noise with proper level and spectral shape efficiently covers the spectral valleys, thereby boosting the noise floor and resulting in an uninterrupted perception of the background noise. In the EVS decoder, comfort noise is generated and added to the decoded signal in the frequency domain.
It is possible to generate and inject the comfort noise into the decoded mono down-mixed signal 133 of the parametric stereo decoder 100. However, the decoded mono down-mixed signal 133 is converted into the left channel 136 and the right channel 137 during the stereo up-mixing operation 155. As the spatial properties of the dominant sound, represented by the decoded mono down-mixed signal 133, and the spatial properties of the surrounding (background) noise can be very different, this could lead to undesirable spatial unmasking effects. To circumvent this problem, the comfort noise is generated after the stereo up-mixing operation 155 and injected separately into the left channel 136 and the right channel 137. The spatial properties of the background noise are estimated directly in the decoder, during inactive segments.
Assuming the decoder 100 is running in a non-DTX operation mode, the spatial properties of the background noise can be estimated during inactive segments of the decoded stereo sound signal, signaled by a VAD flag fVAD set to “0”. The key spatial parameter is the inter-channel coherence (ICC). As the estimation of the ICC parameter involves conversion of the decoded stereo signal (left channel and right channel) to the frequency domain, it may be too complex to calculate such an ICC parameter. A reasonable approximation of the ICC parameter is the inter-channel correlation (IC) parameter, which can be calculated in the time domain. The IC parameter may be calculated by the stereo comfort noise injector 106 using, for example, the following relation (22):
where l(n) and r(n) are respectively the left channel and the right channel of the decoded stereo sound signal in time domain calculated from the left channel 136 and right channel 137 in frequency domain using the frequency transform inverse to that used in calculator 104, N is the number of samples in a current frame, [m] is the frame index, and the index LR refers to left (L) and right (R) to show that the parameter IC relates to correlation between the left and right channels.
A second spatial parameter estimated in the decoder 100 is the inter-channel level difference (ILD). The stereo comfort noise injector 106 may calculate the parameter ILD by expressing a ratio cLR between the energy of the left channel l(n) and the energy of the right channel r(n) of the decoded stereo sound signal in the current frame using, for example, the following relation (23):
and then calculate the ILD parameter using, for example, the following relation (24):
As both the IC and ILD spatial parameters are calculated from a single frame, their fluctuation is high. Therefore, the stereo comfort noise injector 106 smooths the IC and ILD spatial parameters by means of IIR filtering. The smoothed inter-channel correlation (IC) parameter, denoted ĪC̄LR, may be calculated using, for example, the following relation (25):
ĪC̄LR[m]=0.95·ĪC̄LR[m-1]+0.05·ICLR[m] (25)
and the smoothed inter-channel level difference (ILD) parameter, denoted ĪL̄D̄LR, may be calculated using, for example, the following relation (26):
ĪL̄D̄LR[m]=0.9·ĪL̄D̄LR[m-1]+0.1·ILDLR[m] (26)
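Relations (22) to (24) are not reproduced in the text above; the following Python sketch therefore assumes a standard normalized cross-correlation for the IC parameter and a level-difference measure bounded in [−1, 1] for the ILD parameter, which is one possible choice consistent with the (1±ĪL̄D̄LR) mixing gains used later, while the smoothing coefficients follow relations (25) and (26).

import numpy as np

def estimate_spatial_params(left, right, ic_prev, ild_prev, eps=1e-12):
    """Sketch of frame-wise IC/ILD estimation in the time domain and their IIR smoothing."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    e_l, e_r = np.sum(left ** 2), np.sum(right ** 2)
    ic = np.sum(left * right) / np.sqrt(e_l * e_r + eps)    # assumed normalized cross-correlation (relation (22))
    c_lr = e_l / (e_r + eps)                                # energy ratio between the channels (relation (23))
    ild = (c_lr - 1.0) / (c_lr + 1.0)                       # assumed bounded level-difference measure (relation (24))
    ic_smoothed = 0.95 * ic_prev + 0.05 * ic                # relation (25)
    ild_smoothed = 0.9 * ild_prev + 0.1 * ild               # relation (26)
    return ic_smoothed, ild_smoothed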
During the initialization procedure 400, the smoothed spatial parameters are not IIR-filtered but are set directly to the values calculated in the current frame, using, for example, the following relations (27):
ĪC̄LR[m]=ICLR[m], if fCNI=0
ĪL̄D̄LR[m]=ILDLR[m], if fCNI=0 (27)
The initial values for ĪC̄LR[0] and ĪL̄D̄LR[0] are “0”.
The stereo comfort noise injector 106 generates and injects the stereo comfort noise in the frequency domain, as described in the following non-restrictive example of implementation.
The previous non-limitative implementation example, where the decoded mono down-mixed signal is sampled at 16 kHz and the background noise is estimated in the frequency range of 0 to 8000 Hz, is followed here. For successful background noise injection in the up-mix domain (left channel 136 and right channel 137), the sampling rate of the left channel 136 and the right channel 137 needs to be at least 16 kHz. It is assumed, as a non-limitative example, that the left 136 and right 137 channels of the decoded stereo sound signal are sampled at 32 kHz with a number M=640 of samples by frame. This corresponds to an FFT length of 640 samples spanning 20 ms, which is also the frame length in the parametric stereo decoder 100. Thus, the frequency resolution of the background noise spectrum P̃(k) is 25 Hz whereas the frequency resolution of the spectrum of the left channel 136 and the right channel 137 of the decoded stereo sound signal is 50 Hz. The mismatch of frequency resolution can be resolved during stereo comfort noise generation by averaging the level of the background noise in two adjacent spectral bins, as explained in the following description.
The stereo comfort noise injector 106 generates two random signals with Gaussian Probability Density Functions (PDF) using, for example, the following relations (28):
G1(k)˜N(0,1)
G2(k)˜N(0,1) (28)
for k=0, . . . , M−1, M being the number of samples by frame. The two random signals G1(k) and G2(k) are mixed together to create a left channel and a right channel of the stereo comfort noise. The mixing is designed to match the spatial properties of the estimated background noise represented by the smoothed inter-channel correlation (IC) parameter described in relation (25) and the smoothed inter-channel level difference (ILD) parameter described in relation (26). The stereo comfort noise injector 106 calculates a mixing factor γ using, for example, the following relation (29):
The spectral envelope of the stereo comfort noise (comfort noise for the left and right channels) is controlled with the expanded power spectrum (estimated background noise in the decoded mono down-mixed signal 133) calculated in relations (19) and (21). Also, the frequency resolution of the expanded power spectrum is reduced by a factor of 2.
The minimum and the maximum level in each pair of adjacent frequency bins of the expanded power spectrum P̃(k) may be expressed using, for example, the following relations (30):
P̃min(k)=min(P̃(2k),P̃(2k+1)), for k=0, . . . ,N/2−1
P̃max(k)=max(P̃(2k),P̃(2k+1)), for k=0, . . . ,N/2−1 (30)
where N is the number of frequency bins and k is the frequency bin index.
The stereo comfort noise injector 106 then carries out a reduction of the frequency resolution using, for example, the following relations (31):
Thus, according to relation (31), the level of the comfort noise for injection in the frequency domain left channel 136 and right channel 137 is set to the minimum level in two adjacent frequency bins of the expanded power spectrum P̃(k) if the ratio between the maximum P̃max(k) and minimum P̃min(k) values of the expanded power spectrum P̃(k) in adjacent frequency bins exceeds a threshold of 1.2. This prevents excessive comfort noise injection in signals with strong tilt of the estimated background noise. In all other situations, the level of the stereo comfort noise is set to an average level across the two adjacent frequency bins.
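The following Python sketch illustrates the reduction of frequency resolution just described; relation (31) is not reproduced in the text, so the sketch simply implements the min/average rule with the 1.2 threshold stated above, and the function name is illustrative only.

import numpy as np

def reduce_noise_resolution(p_exp, ratio_threshold=1.2):
    """Sketch of halving the frequency resolution of the expanded power spectrum (relations (30)-(31))."""
    p_exp = np.asarray(p_exp, dtype=float)
    n2 = (p_exp.size // 2) * 2                          # use an even number of bins
    p_even, p_odd = p_exp[0:n2:2], p_exp[1:n2:2]        # P̃(2k) and P̃(2k+1)
    p_min = np.minimum(p_even, p_odd)                   # relation (30)
    p_max = np.maximum(p_even, p_odd)
    strong_tilt = p_max > ratio_threshold * p_min       # ratio between adjacent bins exceeds the 1.2 threshold
    return np.where(strong_tilt, p_min, 0.5 * (p_even + p_odd))   # minimum, otherwise average (relation (31))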
The stereo comfort noise injector 106 scales the level of the stereo comfort noise with a scaling factor rscale(k) calculated using a factor N/2 reflecting the new frame length and a global gain gscale using, for example, the following relation (32):
where N is the number of frequency bins, k is the frequency bin index, and gscale is the global gain that will be described herein after in the present disclosure.
The mixing of two random signals with Gaussian PDF can be described, for example, by the following pair of equations (33):
NL(k)=rscale(k)[(1+ĪL̄D̄LR[m])G1(k)+γG2(k)], for k=0, . . . ,N/2−1
NR(k)=rscale(k)[(1−ĪL̄D̄LR[m])G1(k)−γG2(k)], for k=0, . . . ,N/2−1 (33)
where NL(k) and NR(k) are the generated comfort noise signals for injection in the left 136 and right 137 channels, respectively. In Equation (33), the generated comfort noise signals NL(k) and NR(k) have the correct level and spatial characteristics corresponding to the estimated Inter-channel Level Difference (ILD) parameter and the inter-channel correlation (IC/ICC) parameter. The stereo comfort noise injector 106 finally injects the generated comfort noise signals NL(k) and NR(k) in the left 136 (L(k)) and right 137 (R(k)) channels of the decoded stereo sound signal using, for example, the following relation (34):
L(k)=L(k)+NL(k), for k=0, . . . ,N/2−1
R(k)=R(k)+NR(k), for k=0, . . . ,N/2−1 (34)
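A minimal Python sketch of the comfort noise generation, mixing and injection of relations (28), (33) and (34) is given below; the mixing factor γ of relation (29) and the scaling factor rscale(k) of relation (32) are taken as given inputs, the spectra are treated as real-valued arrays for simplicity, and the function name is illustrative only.

import numpy as np

def inject_stereo_comfort_noise(left_spec, right_spec, r_scale, ild_smoothed, gamma, rng=None):
    """Sketch of comfort noise generation, mixing and injection (relations (28), (33) and (34))."""
    if rng is None:
        rng = np.random.default_rng()
    left_spec = np.asarray(left_spec, dtype=float).copy()
    right_spec = np.asarray(right_spec, dtype=float).copy()
    r_scale = np.asarray(r_scale, dtype=float)
    k = r_scale.size                                           # number of bins to fill (N/2 in the text)
    g1 = rng.standard_normal(k)                                # first Gaussian random signal, relation (28)
    g2 = rng.standard_normal(k)                                # second Gaussian random signal
    n_l = r_scale * ((1.0 + ild_smoothed) * g1 + gamma * g2)   # relation (33); gamma per relation (29), taken as given
    n_r = r_scale * ((1.0 - ild_smoothed) * g1 - gamma * g2)
    left_spec[:k] += n_l                                       # injection into the left channel, relation (34)
    right_spec[:k] += n_r                                      # injection into the right channel
    return left_spec, right_spec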
In the case of a parametric stereo encoder as described in Reference [6], it is possible to code and transmit the IC/ICC and ILD parameters in the bitstream. Then, the transmitted IC/ICC and ILD parameters are used by the stereo comfort noise injector 106 instead of the parameters estimated in Section 3.2.1. Usually, in a parametric stereo encoder, the parameters IC/ICC and ILD are calculated and encoded in the frequency domain per critical band.
The decoded IC/ICC and ILD parameters can be denoted, for example, as follows:
ICCPS[m](b), ILDPS[m](b), b=0, . . . ,BPS−1 (35)
where the subscript PS indicates Parametric Stereo and BPS represents the number of frequency bands b used by the parametric stereo encoder. Also, the maximum frequency bin of the parametric stereo encoder may be expressed as the last bin index of the last frequency band, as follows:
kmax_PS=max(k(BPS−1)) (36)
Similarly, the mixing factor γ expressed in relation (29) may be calculated per frequency band with the decoded stereo parameters IC/ICC and ILD using, for example, the following relation (37):
where ICCPS[m](b) is the decoded inter-channel coherence parameter in the bth band, defined in relation (35) and ILDPS[m](b) is the decoded inter-channel level difference parameter in the bth band, defined in Equation (35).
The stereo comfort noise injector 106 then performs the mixing process using, for example, the following relation (38):
NL(k)=rscale(k)[(1+ILDPS[m](bk))G1(k)+γ(bk)G2(k)], for k=0, . . . , min(kmax_PS,N/2−1)
NR(k)=rscale(k)[(1−ILDPS[m](bk))G1(k)−γ(bk)G2(k)], for k=0, . . . , min(kmax_PS,N/2−1) (38)
where γ(bk) is the mixing factor of the bk-th frequency band containing the kth frequency bin. Thus, a single value of the mixing factor is used when generating the comfort noise signals NL(k) and NR(k) in all frequency bins belonging to a same frequency band. The comfort noise signals NL(k) and NR(k) are generated only up to the maximum frequency bin supported by the parametric stereo encoder, expressed by min(kmax_PS,N/2−1).
The stereo comfort noise injector 106 injects the generated comfort noise signals NL(k) and NR(k) in the left 136 (L(k)) and right 137 (R(k)) channels of the decoded stereo sound signal again using, for example, relation (34).
When the IVAS sound codec operates in DTX mode, the background noise estimation described in Section 3.1 is not performed. Instead, the information about the spectral envelope of the background noise is decoded from a Silence Insertion Descriptor (SID) frame and converted into power spectrum representation. This can be done in various ways depending on the SID/DTX scheme used by the codec. For example, the TD-CNG or FD-CNG technology from the EVS codec (Reference [1]) may be used as they both contain information about background noise envelope.
Also, the IC/ICC and ILD spatial parameters may be transmitted as part of SID frames. In that case the decoded spatial parameters are used in stereo comfort noise generation and injection as described in Section 3.2.3.
To prevent abrupt changes in the level of the injected stereo comfort noise, the stereo comfort noise injector 106 applies a fade-in fade-out strategy for noise injection. For that purpose, a soft VAD parameter is used. This is achieved by a smoothing of the binary VAD flag fVAD using, for example, the following relation (39):
where Ṽfact represents the soft VAD parameter, fVAD represents the non-smoothed binary VAD flag, and [m] is the frame index.
From relation (39), it can be seen that the soft VAD parameter is limited in the range from 0 to 1. The soft VAD parameter rises more quickly when the VAD flag fVAD changes from 0 to 1 and less quickly when it drops from 1 to 0. Thus, the fade-out period is longer than the fade-in period.
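Relation (39) is not reproduced in the text above; the following sketch therefore assumes an asymmetric first-order smoothing of the binary VAD flag with a faster rise than decay, limited to the range 0 to 1, and the attack and release coefficients are illustrative assumptions only.

def soft_vad(v_prev, f_vad, attack=0.2, release=0.05):
    """Assumed asymmetric smoothing of the binary VAD flag into a soft VAD parameter (relation (39))."""
    coef = attack if f_vad > v_prev else release       # the parameter rises faster than it falls
    v = (1.0 - coef) * v_prev + coef * float(f_vad)
    return min(max(v, 0.0), 1.0)                       # limited to the range 0 to 1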
During the initialization procedure 400, the soft VAD parameter is reset to zero using, for example, the following relation (40):
Ṽfact[m]=0, if fCNI=0 (40)
The initial value for Ṽfact is 0.
The level of the stereo comfort noise is controlled globally with the global gain gscale used in relation (32). The stereo comfort noise injector 106 initializes the global gain gscale to “0” and updates it in each frame using, for example, the following relation (41):
gscale[m]=0.8·Ṽfact[m] (41)
where Ṽfact is the soft VAD parameter calculated in Equation (39). During the initialization period, when fCNI=0, the global gain gscale is reset to “0”. Thus, the global gain gscale closely follows the soft VAD parameter Ṽfact[m], thereby applying a fade-in fade-out effect to the injected stereo comfort noise.
The parametric stereo decoder including the device for stereo comfort noise injection may be implemented as a part of a mobile terminal, as a part of a portable media player, or in any similar device. The parametric stereo decoder including the device for stereo comfort noise injection (identified as 500 in
The input 502 is configured to receive the bitstream (
The processor 506 is operatively connected to the input 502, to the output 504, and to the memory 508. The processor 506 is realized as one or more processors for executing code instructions in support of the functions of the various elements and operations of the above described parametric stereo decoder and decoding method, including the device and method for stereo comfort noise injection as shown in the accompanying figures and/or as described in the present disclosure.
The memory 508 may comprise a non-transient memory for storing code instructions executable by the processor 506, specifically, a processor-readable memory storing non-transitory instructions that, when executed, cause a processor to implement the elements and operations of the parametric stereo decoder and decoding method, including the device and method for stereo comfort noise injection. The memory 508 may also comprise a random access memory or buffer(s) to store intermediate processing data from the various functions performed by the processor 506.
Those of ordinary skill in the art will realize that the description of the parametric stereo decoder and decoding method, including the device and method for stereo comfort noise injection are illustrative only and are not intended to be in any way limiting. Other embodiments will readily suggest themselves to such persons with ordinary skill in the art having the benefit of the present disclosure. Furthermore, the disclosed parametric stereo decoder and decoding method, including the device and method for stereo comfort noise injection may be customized to offer valuable solutions to existing needs and problems of encoding and decoding sound, for example stereo sound.
In the interest of clarity, not all of the routine features of the implementations of the parametric stereo decoder and decoding method, including the device and method for stereo comfort noise injection are shown and described. It will, of course, be appreciated that in the development of any such actual implementation of the parametric stereo decoder and decoding method, including the device and method for stereo comfort noise injection, numerous implementation-specific decisions may need to be made in order to achieve the developer's specific goals, such as compliance with application-, system-, network- and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be appreciated that a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the field of sound processing having the benefit of the present disclosure.
In accordance with the present disclosure, the elements, processing operations, and/or data structures described herein may be implemented using various types of operating systems, computing platforms, network devices, computer programs, and/or general purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used. Where a method comprising a series of operations and sub-operations is implemented by a processor, computer or a machine and those operations and sub-operations may be stored as a series of non-transitory code instructions readable by the processor, computer or machine, they may be stored on a tangible and/or non-transient medium.
Elements and processing operations of the parametric stereo decoder and decoding method, including the device and method for stereo comfort noise injection as described herein may comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein.
In the parametric stereo decoder and decoding method, including the device and method for stereo comfort noise injection, the various processing operations and sub-operations may be performed in various orders and some of the processing operations and sub-operations may be optional.
Although the present disclosure has been described hereinabove by way of non-restrictive, illustrative embodiments thereof, these embodiments may be modified at will within the scope of the appended claims without departing from the spirit and nature of the present disclosure.
The present disclosure mentions the following references, of which the full content is incorporated herein by reference: