The present disclosure relates to sound coding, in particular but not exclusively to a method and device for audio band-width detection and a method and device for audio band-width switching in a sound codec.
In the present disclosure and the appended claims:
Historically, conversational telephony has been implemented with handsets having only one transducer to output sound only to one of the user's ears. In the last decade, users have started to use their portable handset in conjunction with a headphone to receive the sound over their two ears mainly to listen to music but also, sometimes, to listen to speech. Nevertheless, when a portable handset is used to transmit and receive conversational speech, the content is still mono but presented to the user's two ears when a headphone is used.
With the newest 3GPP (3rd Generation Partnership Project) speech coding standard, Codec for Enhanced Voice Services (EVS), as described in Reference [1], of which the full content is incorporated herein by reference, the quality of the coded sound, for example speech and/or audio, transmitted and received through a portable handset has been significantly improved. The next natural step is to transmit stereo information such that the receiver gets as close as possible to a real-life audio scene captured at the other end of the communication link.
In audio codecs, transmission of stereo information is commonly used.
For conversational speech codecs, a mono signal is the norm. When a stereo signal is transmitted, the bitrate often needs to be doubled since both the left and right channels of the stereo signal are coded using a mono codec. To reduce the bitrate, efficient stereo coding techniques have been developed; non-limitative examples of such techniques are discussed in the following paragraphs.
A first stereo coding technique is called parametric stereo. Parametric stereo encodes the two (left and right) channels as a mono signal using a common mono codec plus a certain amount of stereo side information (corresponding to stereo parameters) which represents a stereo image. The two input left and right channels are down-mixed into the mono signal, and the stereo parameters are then computed, usually in a transform domain, for example in the Discrete Fourier Transform (DFT) domain, and are related to so-called binaural or inter-channel cues. The binaural cues (Reference [3], of which the full content is incorporated herein by reference) comprise Interaural Level Difference (ILD), Interaural Time Difference (ITD) and Interaural Correlation (IC). Depending on the signal characteristics, stereo scene configuration, etc., some or all binaural cues are coded and transmitted to the decoder. Information about which binaural cues are coded and transmitted is sent as signaling information, which is usually part of the stereo side information. A particular binaural cue can also be quantized using different coding techniques, which results in a variable number of bits being used. Then, in addition to the quantized binaural cues, the stereo side information may contain, usually at medium and higher bitrates, a quantized residual signal that results from the down-mixing. The residual signal can be coded using an entropy coding technique, e.g. an arithmetic encoder. In general, parametric stereo coding is most efficient at lower and medium bitrates. Parametric stereo with parameters computed in the DFT domain will be referred to in this disclosure as DFT stereo.
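As a non-limitative illustration, the following C sketch shows how one inter-channel cue, a per-band level difference, could be computed from the left and right DFT spectra. The code is hypothetical (it is not taken from any codec source) and the band boundaries and logarithmic form are assumptions made for the example:

    #include <math.h>

    /* Energy of one channel's complex DFT spectrum in the band [start, end). */
    static float band_energy(const float *re, const float *im, int start, int end)
    {
        float e = 1e-12f; /* small floor to avoid a division by zero / log(0) */
        int k;
        for (k = start; k < end; k++)
        {
            e += re[k] * re[k] + im[k] * im[k];
        }
        return e;
    }

    /* Level difference in dB between the left and right channels in one band;
     * a positive value means the left channel is louder in that band. */
    float compute_band_level_difference(const float *re_l, const float *im_l,
                                        const float *re_r, const float *im_r,
                                        int band_start, int band_end)
    {
        const float e_l = band_energy(re_l, im_l, band_start, band_end);
        const float e_r = band_energy(re_r, im_r, band_start, band_end);
        return 10.0f * log10f(e_l / e_r);
    }

Such per-band cues would then be quantized and transmitted as part of the stereo side information.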
Another stereo coding technique operates in the time domain. This stereo coding technique mixes the two input (left and right) channels into a so-called primary channel and a secondary channel. For example, following the method described in Reference [4], of which the full content is incorporated herein by reference, the time-domain mixing can be based on a mixing ratio, which determines the respective contributions of the two input left and right channels upon production of the primary channel and the secondary channel. The mixing ratio is derived from several metrics, e.g. normalized correlations of the input left and right channels with respect to a mono version of the stereo sound signal, or a long-term correlation difference between the two input left and right channels. The primary channel can be coded by a common mono codec while the secondary channel can be coded by a lower-bitrate codec. The secondary channel coding may exploit coherence between the primary and secondary channels and might re-use some parameters from the primary channel. This time-domain stereo will be referred to in this disclosure as TD stereo. In general, TD stereo is most efficient at lower and medium bitrates for coding speech signals.
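As a non-limitative illustration, the following C sketch shows a time-domain down-mix driven by a mixing ratio. The code is hypothetical and its mixing equations are an assumption made for the example; the actual equations of Reference [4] may differ:

    /* Down-mix left/right channels into primary/secondary channels.
     * ratio is the mixing ratio in [0, 1] derived from correlation metrics. */
    void td_stereo_downmix(const float *left, const float *right,
                           float *primary, float *secondary,
                           float ratio, int frame_len)
    {
        int n;
        for (n = 0; n < frame_len; n++)
        {
            primary[n]   = ratio * left[n] + (1.0f - ratio) * right[n];
            secondary[n] = (1.0f - ratio) * left[n] - ratio * right[n];
        }
    }

Note that, for a mixing ratio of 0.5, this example reduces to the classical Mid/Side down-mix.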
A third stereo coding technique operates in the Modified Discrete Cosine Transform (MDCT) domain. It is based on joint coding of both left and right channels, computing a global ILD and applying Mid/Side (M/S) processing in a whitened spectral domain. It uses several tools adapted from TCX (Transform Coded eXcitation) coding in MPEG (Moving Picture Experts Group) codecs, as described for example in References [7] and [8] of which the full contents are incorporated herein by reference, e.g. TCX core coding, TCX LTP (Long-Term Prediction) analysis, TCX noise filling, Frequency-Domain Noise Shaping (FDNS), stereophonic Intelligent Gap Filling (IGF), and/or adaptive bit allocation between channels. In general, this third stereo coding technique is efficient for encoding all kinds of audio content at medium and high bitrates. The MDCT-domain stereo coding technique will be referred to in this disclosure as MDCT stereo.
Further, in recent years, the generation, recording, representation, coding, transmission, and reproduction of audio has been moving towards an enhanced, interactive and immersive experience for the listener. The immersive experience can be described, for example, as a state of being deeply engaged or involved in a sound scene while sounds are coming from all directions. In immersive audio (also called 3D (Three-Dimensional) audio), the sound image is reproduced in all three dimensions around the listener, taking into consideration a wide range of sound characteristics like timbre, directivity, reverberation, transparency and accuracy of (auditory) spaciousness. Immersive audio is produced for a particular sound playback or reproduction system such as a loudspeaker-based system, an integrated reproduction system (sound bar) or headphones. Interactivity of a sound reproduction system may then include, for example, an ability to adjust sound levels, change positions of sounds, or select different languages for the reproduction.
There exist three fundamental approaches to achieve an immersive experience.
A first approach to achieve an immersive experience is a channel-based audio approach using multiple spaced microphones to capture sounds from different directions, wherein one microphone corresponds to one audio channel in a specific loudspeaker layout. Each recorded channel is then supplied to a loudspeaker in a given location. Examples of channel-based audio approaches are stereo, 5.1 surround, 5.1+4, etc. In general, channel-based audio is coded by multiple core coders, where the number of core coders usually corresponds to the number of recorded channels. For example, the channels are coded by multiple stereo coders using e.g. the TD stereo or MDCT stereo coding technique. The channel-based audio will be referred to in this disclosure as the Multi-Channel (MC) format approach.
A second approach to achieve an immersive experience is a scene-based audio approach which represents a desired sound field over a localized space as a function of time by a combination of dimensional components. The sound signals representing the scene-based audio (SBA) are independent of the positions of the audio sources, while the sound field is transformed to a chosen layout of loudspeakers at the renderer. An example of scene-based audio is ambisonics. There exist several SBA coding techniques, of which the best known is probably Directional Audio Coding (DirAC) as described for example in Reference [6] of which the full content is incorporated herein by reference. A DirAC encoder analyses the ambisonics input signals in the Complex Low Delay Filter Bank (CLDFB) domain, estimates spatial parameters (metadata) like direction and diffuseness grouped in time and frequency slots, and down-mixes the input channels into a lower number of so-called transport channels (typically 1, 2, or 4 channels). A DirAC decoder then decodes the spatial metadata, derives direct and diffuse signals from the transport channels and renders them into loudspeaker or headphone setups to accommodate different listening configurations. Another example of an SBA coding technique, targeting mostly mobile capture devices, is the Metadata-Assisted Spatial Audio (MASA) format as described for example in Reference [9] of which the full content is incorporated herein by reference. In the MASA approach, the MASA metadata (e.g. direction, energy ratio, spread coherence, distance, surround coherence, all in several time-frequency slots) are generated in a MASA analyzer, quantized, coded, and passed into the bit-stream, while the MASA audio channel(s) are treated as mono or multi-channel transport signals coded by the core encoder(s). At the MASA decoder, the MASA metadata then guide the decoding and rendering process to recreate the output spatial sound.
The third approach to achieve an immersive experience is an object-based audio approach which represents an auditory scene as a set of individual audio elements (for example a singer, drums, a guitar, etc.) accompanied by information such as their position, so they can be rendered (translated) by a sound reproduction system at their intended locations. This gives the object-based audio approach great flexibility and interactivity, because each object is kept discrete and can be individually manipulated. Each audio object consists of an audio stream, i.e. a waveform, with associated metadata, and can thus also be seen as an Independent Stream with metadata (ISm).
Each of the above described audio approaches to achieve an immersive experience presents pros and cons. It is thus common that, instead of only one audio approach, several audio approaches are combined in a complex audio system to create an immersive auditory scene. An example can be an audio system that combines scene-based or channel-based audio with object-based audio, for example ambisonics with a few discrete audio objects.
In recent years, 3GPP (3rd Generation Partnership Project) started working on the development of a 3D (Three-Dimensional) sound codec for immersive services called IVAS (Immersive Voice and Audio Services), based on the EVS codec (see Reference [5] of which the full content is incorporated herein by reference).
According to a first aspect, the present disclosure relates to a device for detecting, in an encoder part of a sound codec, an audio band-width of a sound signal to be coded, comprising: an analyser of the sound signal; and a final audio band-width decision module for delivering a final decision about the detected audio band-width; wherein, in the encoder part of the sound codec, the final audio band-width decision module is located upstream of the sound signal analyser.
According to a second aspect, the present disclosure provides a method for detecting, in an encoder part of a sound codec, an audio band-width of a sound signal to be coded, comprising: analysing the sound signal; and finally deciding about the detected audio band-width using the result of the analysis of the sound signal; wherein, in the encoder part of the sound codec, the final decision about the detected audio band-width is made upstream of the analysis of the sound signal.
The present disclosure is also concerned with a device for switching from a first audio band-width to a second audio band-width of a sound signal to be coded, comprising, in an encoder part of a sound codec: a final audio band-width decision module for delivering a final decision about a detected audio band-width of the sound signal to be coded; a counter of frames where audio band-width switching occurs, the counter of frames being responsive to the detected audio band-width final decision from the final audio band-width decision module; and an attenuator responsive to the counter of frames for attenuating the sound signal prior to encoding of the sound signal.
According to a still further aspect, the present disclosure provides a method for switching from a first audio band-width to a second audio band-width of a sound signal to be coded, comprising, in an encoder part of a sound codec: delivering a final decision about a detected audio band-width of the sound signal to be coded; counting frames where audio band-width switching occurs in response to the detected audio band-width final decision; and attenuating, in response to the count of frames, the sound signal prior to encoding of the sound signal.
The foregoing and other objects, advantages and features of the method and device for audio band-width detection and the method and device for audio band-width switching will become more apparent upon reading of the following non-restrictive description of illustrative embodiments thereof, given by way of example only with reference to the accompanying drawings.
In the Appended Drawings:
The present disclosure describes audio band-width detection and audio band-width switching techniques.
The audio band-width detection and audio band-width switching techniques are described, by way of non-limitative example only, with reference to an IVAS coding framework referred to throughout this disclosure as IVAS codec (or IVAS sound codec). However, it is within the scope of the present disclosure to incorporate such audio band-width detection and audio band-width switching techniques in any other sound codec.
Specifically, the present disclosure describes a method and device for audio band-width detection using an audio band-width detection algorithm implemented in the IVAS codec baseline, and a method and device for audio band-width switching using an audio band-width switching algorithm also implemented in the IVAS codec baseline.
The Audio Band-width Detection (BWD) algorithm in IVAS is similar to the BWD algorithm in EVS and is applied in its original form in the ISm, DFT stereo and TD stereo modes. However, no BWD has previously been applied in the MDCT stereo mode. The present disclosure describes a new BWD which is used in the MDCT stereo mode (including higher-bitrate DirAC, higher-bitrate MASA, and multi-channel format). The goal is to introduce the BWD to modes where it was missing, i.e. to use BWD consistently in all IVAS operating points.
The present disclosure further describes the Audio Band-width Switching (BWS) algorithm used in the IVAS coding framework while keeping the computational complexity as low as possible.
Speech and audio codecs (sound codecs) traditionally expect to receive an input sound signal with an effective audio band-width close to the Nyquist frequency. When the effective audio band-width of the input sound signal is significantly lower than the Nyquist frequency, these traditional codecs usually do not work optimally, because they waste a portion of the available bit budget to represent empty frequency bands.
Today's codecs are designed to be flexible in terms of coding miscellaneous audio material at a large range of bitrates and band-widths. An example of a state-of-the-art speech and audio codec is the EVS codec standardized in 3GPP [1]. This codec consists of a multi-rate codec capable of efficiently compressing voice, music, and mixed-content signals. In order to keep a high subjective quality for all audio material, it comprises a number of different coding modes. These modes are selected depending on a given bitrate, input sound signal characteristics (e.g. speech/music, voiced/unvoiced), signal activity, and audio band-width. In order to select the best coding mode, the EVS codec uses BWD. BWD in the EVS codec is designed to detect changes in the effective audio band-width of the input sound signal. Consequently, the EVS codec can be flexibly re-configured to encode only the perceptually meaningful frequency content and distribute the available bit budget in an optimal manner. In the present disclosure, the BWD used in the EVS codec is further elaborated in the context of the IVAS coding framework.
Reconfiguration of the codec as a consequence of a BWD change improves the codec's performance. However, this reconfiguration might introduce artifacts if the reconfiguration and its related coding mode switching are not carefully and properly treated. The artifacts are usually related to an abrupt change of the high-frequency (HF) content (in general, HF designates frequency content above 8 kHz). The disclosed Band-Width Switching (BWS) algorithm thus smooths the switching and ensures that the BWD change is seamless and pleasant rather than annoying.
2.1 Background
As illustrated in
In
Additional information about the IVAS sound codec can be found, for example, in Reference [5].
The BWD algorithm in the IVAS codec 305 is based on computing energies in certain spectral regions and comparing them to certain thresholds. In the IVAS sound codec 305, the audio band-width detection method and device operate on the CLDFB values (ISm, TD stereo) or DFT values (DFT stereo). In the AMR-WB IO (Adaptive Multi-Rate WideBand InterOperable) mode, as described in Reference [1] in relation to the EVS codec, the audio band-width detection method and device use Discrete Cosine Transform (DCT) values to determine the audio band-width of the input sound signal.
The BWD algorithm itself comprises several operations:
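1) computation of energy values per frequency band of the spectrum;

2) update of long-term counters based on these energy values; and

3) final decision about the detected audio band-width.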
The first two operations 1) and 2) above are integrated into an operation 306 of BWD analysis performed by a BWD analyser 356 integrated into the sound signal core encoding stage 304, and the last operation 3) forms an operation 307 of final BWD decision performed by a final audio band-width decision module (processor) 357 integrated into the sound signal pre-processing stage 303. As can be seen in
In the description below, as a non-limitative example of implementation, the following audio band-widths/modes are defined: narrow-band (NB, 0-4 kHz), wide-band (WB, 0-8 kHz), super-wide-band (SWB, 0-16 kHz) and full-band (FB, 0-24 kHz).
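As a non-limitative illustration, these audio band-widths could be represented in C as follows (hypothetical code; the names and values are illustrative):

    /* Audio band-widths ordered from the narrowest to the broadest,
     * so that a larger value always means a broader band-width. */
    typedef enum
    {
        AUDIO_BW_NB  = 0, /* narrow-band,     0-4 kHz  */
        AUDIO_BW_WB  = 1, /* wide-band,       0-8 kHz  */
        AUDIO_BW_SWB = 2, /* super-wide-band, 0-16 kHz */
        AUDIO_BW_FB  = 3  /* full-band,       0-24 kHz */
    } AUDIO_BANDWIDTH;

This ordering is convenient because the broader of two band-widths can be selected with a simple maximum operation, as used for the joint decision of Section 2.3.6.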
2.2 BWD Signals
In order to keep the BWD algorithm computationally efficient, the method and device for audio band-width detection reuse, as much as possible, signal buffers and parameters available from the earlier EVS pre-processing stage (see Reference [1]). In the EVS primary mode this comprises complex modulated low delay filter bank (CLDFB) values, a local VAD parameter (i.e. a voice activity decision without hangover), and a long-term estimate of the total noise energy, as discussed below.
The CLDFB (see 308 in
On the other hand, the CLDFB module is not present in the EVS AMR-WB IO mode, where the DCT is computed to determine the audio band-width of the input signal in the BWD. The DCT values are obtained by first applying a Hanning window to, in the non-restrictive example of implementation, the 320 samples of the sound signal 320 sampled at the input sampling rate. The windowed signal is then transformed to the DCT domain and finally decomposed into several frequency sub-bands depending on the input sampling rate. It should be noted that a constant analysis window length is used over all sampling rates in order to keep the computational complexity reasonably low.
More details on the CLDFB-based BWD are found in Reference [2], of which the full content is incorporated herein by reference.
In the MDCT stereo mode, the computationally demanding CLDFB is not needed, which renders a CLDFB-based BWD inefficient. Thus, a new BWD algorithm for MDCT stereo is disclosed herein, which saves a substantial amount of the computational complexity of the CLDFB and BWD in the pre-processing stage 303.
The method and device for audio band-width detection in the MDCT stereo coding mode can lead to a higher quality, since bits are not assigned to the high-band part of the spectrum if it has no content or if the audio band-width is limited by a command line or another external request. Moreover, the method and device for audio band-width detection are run continuously in order to ease bitrate switching, which may involve switching between different stereo coding technologies. Further, the method and device for audio band-width detection in the MDCT stereo mode enable applying BWD in higher-bitrate DirAC, higher-bitrate MASA, and multi-channel (MC) format.
The method and device for audio band-width detection in the MDCT stereo mode are described below.
2.3 BWD in MDCT Stereo
In order not to increase the computational complexity related to the BWD (including the CLDFB or another transform), the BWD analyser 356 in the MDCT stereo mode is not applied in the front pre-processing stage 303 to the CLDFB values, but is applied later, in the TCX core encoder 358, to the MDCT values already available there.
The TCX core encoder 358 performs several operations: a switching decision between the long MDCT-based TCX transform (TCX20) and the short MDCT-based TCX transform (TCX10), core signal analysis (TCX-LTP, MDCT, Temporal Noise Shaping (TNS), Linear Prediction Coefficients (LPC) analysis, etc.), envelope quantization and FDNS, fine quantization of the core spectrum, and IGF (many of these operations are also part of the EVS codec, as described in Section 5.3.3.2 of Reference [1]). The core signal analysis includes a windowing and an MDCT calculation which are applied based on the transform and overlap lengths.
The method and device for audio band-width detection use the MDCT spectrum as an input to the BWD algorithm. In order to simplify the algorithm, the operation 306 of BWD analysis is performed only in frames which are selected as TCX20 frames and are not transition frames; this means that BWD analysis is performed in frames of a given duration and is skipped in frames shorter or longer than this given duration. This ensures that the length of the MDCT spectrum always corresponds to the length of the frame in samples at the input sampling rate. Also, no BWD is applied to the Low-Frequency Effects (LFE) channel in the MC format mode; the LFE channel contains only low frequencies, e.g. 0-120 Hz, and thus does not require a full-range core encoder. Also, as is well known in the art, the input sound signal 310/320 is sampled at a given sampling rate and processed in groups of samples called “frames”, divided into a number of “sub-frames”.
In the case of the MDCT energy vector, there are nine frequency bands of interest, the width of each band being 1500 Hz. One to four frequency bands are assigned to each of the spectral regions, as defined in Table 1.
In the above Table 1, nb (narrow-band), wb (wide-band), swb (super-wide-band) and fb (full-band), in lower-case letters, represent the respective spectral regions, i is the index of the frequency band, idx_start is an energy band start index, and idx_end is an energy band end index.
2.3.1 MDCT Spectrum Energy Computation
The operation 306 of BWD analysis is slightly adjusted in the present disclosure from the EVS native BWD algorithm (see Reference [1]) to take into account the fact that the MDCT spectrum, of length equal to the frame length in samples at the input sampling rate, must be considered. Thus, the DCT-based path of the EVS native BWD algorithm (as used in the EVS AMR-WB IO mode) is employed, while the former DCT spectrum length of 320 samples (which is the same at all input sampling rates in EVS) is scaled proportionally to the input sampling rate in the MDCT stereo mode of IVAS.
The energy E_bin(i) of the MDCT spectrum of the input sound signal 320 in the MDCT stereo mode is thus computed in the nine frequency bands as follows:
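In a non-limitative example, one formulation consistent with the definitions below (the exact IVAS relation may differ in detail) is:

E_bin(i) = (1/b_width) · Σ_{k=idx_start(i)}^{idx_end(i)} S²(k),  i = 0, . . ., 8,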
where i is the index of the frequency band, S(k) is the MDCT spectrum, idx_start is the energy band start index as defined in Table 1, idx_end is the energy band end index as defined in Table 1, and the width of each energy band is b_width = 60 samples (which corresponds to 1500 Hz regardless of the sampling rate).
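As a non-limitative illustration, the per-band energy computation described above may be sketched in C as follows (hypothetical code, not the actual IVAS source):

    #define BWD_NUM_BANDS  9
    #define BWD_BAND_WIDTH 60 /* 60 MDCT bins = 1500 Hz per band */

    /* Compute the per-band energies E_bin(i) of the MDCT spectrum S(k).
     * idx_start holds the nine band start indices of Table 1. */
    void bwd_band_energies(const float *mdct_spec, const short *idx_start,
                           float *e_bin)
    {
        int i, k;
        for (i = 0; i < BWD_NUM_BANDS; i++)
        {
            float e = 0.0f;
            for (k = idx_start[i]; k < idx_start[i] + BWD_BAND_WIDTH; k++)
            {
                e += mdct_spec[k] * mdct_spec[k];
            }
            e_bin[i] = e / BWD_BAND_WIDTH; /* per-bin mean energy (assumption) */
        }
    }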
The above calculation is implemented in the source code as follows, wherein the mark “###” identifies portions of the IVAS source code used in the method and device for audio band-width detection that are new with respect to the EVS source code:
2.3.2 Mean and Maximum Energy Values Per Frequency Band
The BWD analyser 356 converts the energy values E_bin(i) in the frequency bands to the log domain using, for example, the following relation:
E(i) = log_10[E_bin(i)],  i = 0, . . ., 8,   (1)
where i is the index of the frequency band.
The BWD analyser 356 uses the log energies E(i) per frequency band to calculate mean energy values per spectral region using, for example, the following relations:
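In a non-limitative example, with the band-to-region assignment of Table 1, the mean energy values may be written generically as:

Ē_reg = (1/N_reg) · Σ_{i∈reg} E(i),   (2)

where reg denotes a spectral region of Table 1 and N_reg is the number of frequency bands assigned to that region.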
Finally, the BWD analyser 356 uses the log energies E(i) per frequency band to calculate the maximum energy values per spectral region using, for example, the following relations:
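In a non-limitative example, the maximum energy values may be written generically as:

E_reg^max = max_{i∈reg} E(i),   (3)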
where spectral regions nb, wb, swb and fb are defined in Table 1.
2.3.3 Long-Term Counters
The BWD analyser 356 updates long-term values of the mean energy values for the spectral regions nb, wb and swb using, for example, the following relations:
Ē_nb = λ·Ē_nb + (1 − λ)·Ē_nb^[−1],  Ē_wb = λ·Ē_wb + (1 − λ)·Ē_wb^[−1],  Ē_swb = λ·Ē_swb + (1 − λ)·Ē_swb^[−1]   (4)
where λ = 0.25 is an example of the update factor and the superscript [−1] denotes a parameter value from the previous frame. The update takes place only if the local VAD decision indicates that the input sound signal 320 is active, or if the long-term background noise level is higher than 30 dB. This ensures that the parameters are updated only in frames having perceptually meaningful content. Reference is made to [2] for additional information about parameters/concepts such as the local VAD decision, active signal, and long-term background noise.
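As a non-limitative illustration, the gated long-term update of Equation (4) may be sketched in C as follows (hypothetical code, not the actual IVAS source):

    #define BWD_LAMBDA       0.25f /* update factor lambda of Equation (4) */
    #define BWD_NOISE_THR_DB 30.0f /* long-term background noise threshold */

    typedef struct
    {
        float lt_mean_nb;  /* long-term mean energy, nb spectral region  */
        float lt_mean_wb;  /* long-term mean energy, wb spectral region  */
        float lt_mean_swb; /* long-term mean energy, swb spectral region */
    } BWD_STATE;

    /* Update the long-term means only in perceptually meaningful frames. */
    void bwd_update_long_term(BWD_STATE *st, float mean_nb, float mean_wb,
                              float mean_swb, int local_vad, float lt_noise_db)
    {
        if (local_vad || lt_noise_db > BWD_NOISE_THR_DB)
        {
            st->lt_mean_nb  = BWD_LAMBDA * mean_nb  + (1.0f - BWD_LAMBDA) * st->lt_mean_nb;
            st->lt_mean_wb  = BWD_LAMBDA * mean_wb  + (1.0f - BWD_LAMBDA) * st->lt_mean_wb;
            st->lt_mean_swb = BWD_LAMBDA * mean_swb + (1.0f - BWD_LAMBDA) * st->lt_mean_swb;
        }
    }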
The BWD analyser 356 then compares the long-term mean energy values from Equation (4) to certain thresholds, while also taking into account the current maximum values per spectral region from Equation (3). Depending on the result of the comparisons, the BWD analyser 356 increases or decreases counters for each spectral region wb, swb and fb as illustrated in
2.3.4 Final Audio Band-Width Decision
In
The final audio band-width decision from
2.3.5 Newly Added Code
In the source code, the newly added code (marked by “###” sequence) may be as follows—the following excerpt is from function ivas_mdct_core_whitening_enc( ) of the IVAS sound codec:
Because the computation related to the BWD analysis operation 306 takes place at the beginning of TCX core encoding (see 358) in the current frame, the final BWD decision operation 307 is postponed to the front pre-processing (see 303) of the next frame. Thus, the former EVS BWD algorithm is split into two parts (see 306 and 307); the BWD analysis operation 306 (i.e. computing energy values per frequency band and updating long-term counters) is done at the beginning of the current TCX core coding, and the final BWD decision operation 307 is done only in the next frame, before the TCX core encoding starts.
2.3.6 BWD Information in CPE
In MDCT stereo coding, the final BWD decision from the decision module 357 about the input, and thus coded, audio band-width is made not separately for each of the two channels but as a joint decision for both channels. In other words, in MDCT stereo coding, both channels are always coded using the same audio band-width, and the information about the coded audio band-width is transmitted only once per Channel Pair Element (CPE) (a CPE is a coding element that encodes two channels by means of a stereo coding technique). If the final BWD decision differs between the two CPE channels, both CPE channels are coded using the broader audio band-width BW of the two channels. For example, if the detected audio band-width BW is the WB band-width for the first channel and the SWB band-width for the second channel, the coded audio band-width BW of the first channel is rewritten to the SWB band-width and the SWB band-width information is transmitted in the bit-stream. The only exception is the case where one of the MDCT stereo channels corresponds to the LFE channel; the coded audio band-width is then set to the audio band-width of the other (non-LFE) channel. This applies mostly in the MC format mode, when multiple MC channels are coded using several MDCT stereo CPEs.
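As a non-limitative illustration, the joint CPE decision described above may be sketched in C as follows (hypothetical code, not the actual IVAS source), assuming the band-widths are encoded as ordered values so that a larger value means a broader band-width:

    /* Joint band-width decision for the two channels of one CPE. */
    int cpe_joint_bwidth(int bw_ch0, int bw_ch1, int ch0_is_lfe, int ch1_is_lfe)
    {
        /* if one channel is the LFE channel, follow the other channel */
        if (ch0_is_lfe)
        {
            return bw_ch1;
        }
        if (ch1_is_lfe)
        {
            return bw_ch0;
        }
        /* otherwise both channels are coded with the broader band-width */
        return (bw_ch0 > bw_ch1) ? bw_ch0 : bw_ch1;
    }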
The final audio band-width decision module 357 may use the logic of
Referring to
The audio band-width information from blocks 405, 408 and 410 is coded by the MDCT core encoder 358 (
In the source code of the IVAS sound codec, the final BW decision logic may look as follows, where the newly added code is marked by the “###” sequence:
The above function is run at the Core Codec configuration block, i.e. at the end of the front pre-processing, and before TCX core coding starts.
It is noted that the same principle of joint audio band-width information coding can be used in other stereo coding techniques which code two channels using two core encoders, such as TD stereo.
3.1 Background
In the EVS codec, a change of the audio band-width BW may happen as a consequence of a bitrate change or a coded audio band-width change. When a change from wide-band (WB) to super-wide-band (SWB), or from SWB to WB, occurs, an audio band-width switching post-processing is performed at the decoder in order to improve the perceptual quality for end users. A smoothing is applied for switching from WB to SWB, and a blind audio band-width extension is employed for switching from SWB to WB. A summary of the EVS BWS algorithm is given in the following paragraph, while more information can be found in Section 6.3.7 of Reference [1].
First, in EVS, an audio band-width switching detector receives the transmitted BW information, detects in response to this BW information whether audio band-width switching occurs (Section 6.3.7.1 of Reference [1]), and accordingly updates a few counters. Then, in the case of switching from SWB to WB, the High-Band (HB) part of the spectrum (HB > 8 kHz) is estimated in the next frames based on the last-frame SWB Band-Width Extension (BWE) technology. The HB spectrum is faded out over 40 frames, while a time-domain signal at the output sampling rate is used to perform an estimation of the SWB BWE parameters. On the other hand, in the case of switching from WB to SWB, the HB part of the spectrum is faded in over 20 frames.
3.2 Issues
In IVAS, the BWS technique as used in EVS can be implemented in the decoder, but it is never applied due to bitrate limitations in the EVS native BWS algorithm. Moreover, the EVS native BWS algorithm does not support BWS in the TCX core. Finally, the EVS native BWS algorithm cannot be applied in DFT stereo CNG (Comfort Noise Generation) frames because the time-domain signal needed to perform the algorithm's estimation is not available.
3.3 BWS in IVAS
In the IVAS sound codec, a new and different BWS algorithm is thus implemented.
First, the BWS algorithm is implemented in the encoder part of the IVAS sound codec. This choice has the advantage of a very low complexity foot-print for the IVAS BWS algorithm compared to the EVS native one.
Another design choice is that the BWS algorithm in IVAS is implemented only for switching from a lower BW to a higher BW (for example from WB to SWB). In this direction, the switching is relatively fast (see Section 2.3.4 above) and the resulting abrupt HF content change can be annoying. The new and different BWS algorithm is thus designed to smooth such switching. On the other hand, no special treatment is implemented for switching from a higher BW to a lower BW because, in this direction, there is practically no important HF content in the spectrum, so the change of the spectrum content is not unnaturally abrupt and annoying.
3.4 Proposed BWS
The proposed BWS algorithm used by the method 500 and device 550 of
3.4.1 BWS Technique
The BWS mechanism of the method and device for audio band-width switching of
First, the calculator 552 updates a counter of frames cnt_bwidth_sw in which audio band-width switching occurs and attenuation is applied at the end of the pre-processing, for each IVAS transport channel, based on the final BWD decision 307, as follows.
The calculator 552 initially sets the value of the counter of frames cnt_bwidth_sw to an initialization value of “0”. When a BW change from a lower audio band-width to a higher audio band-width (typically from WB to SWB or FB) is detected in response to a final BWD decision from the final audio band-width decision module 357, the value of the counter of frames is increased by 1. In the following frames, the counter is increased by 1 in every frame until it reaches its maximum value B_tran, as defined hereinafter. When the counter reaches its maximum value B_tran, it is reset to 0 and a new BW switching can be detected.
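As a non-limitative illustration, the counter update may be sketched in C as follows (hypothetical code, not the actual IVAS source; the exact reset timing relative to the attenuation stage is an assumption):

    #define BWS_TRAN_PERIOD 5 /* B_tran: number of attenuated frames */

    /* Update the BWS frame counter once per frame for one transport channel. */
    void bws_update_counter(short *bwidth_sw_cnt, int last_bwidth, int bwidth)
    {
        if (*bwidth_sw_cnt == 0 && bwidth > last_bwidth)
        {
            *bwidth_sw_cnt = 1; /* switching to a higher band-width detected */
        }
        else if (*bwidth_sw_cnt > 0)
        {
            (*bwidth_sw_cnt)++; /* ongoing transition period */
            if (*bwidth_sw_cnt > BWS_TRAN_PERIOD)
            {
                *bwidth_sw_cnt = 0; /* transition over; allow a new detection */
            }
        }
    }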
In the source code, the newly added code (marked by a “###” sequence) may be as follows. The code excerpt is found at the end of function core_switching_pre_enc( ) of the IVAS sound codec:
Next, when the counter cnt_bwidth_sw, whether or not updated by the calculator 552, is larger than 0 as determined by the comparator 553, the attenuator 554 applies to the sound signal in frame i an attenuation factor β_i (507) defined, for example, as follows:
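In a non-limitative example, a fade-in factor consistent with the counter behaviour described above (the exact IVAS relation may differ) is:

β_i = (i + 1) / B_tran,  i = 0, . . ., B_tran − 1,

with i = cnt_bwidth_sw − 1 indexing the frames of the transition period, so that β_i grows from 1/B_tran to 1 over B_tran frames.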
In the above, cnt_bwidth_sw is the above-mentioned audio band-width switching frame counter (bwidth_sw_cnt in the source code above) and B_tran (macro BWS_TRAN_PERIOD in the source code above) is a BWS transition period which corresponds to the number of frames where the attenuation is applied after BW switching from a lower BW to a higher BW. The constant B_tran was found experimentally and was set to 5 in the IVAS framework.
In TCX and HQ core frames (HQ stands for the High Quality MDCT coder in EVS, see Section 5.3.4 of Reference [1]), the high-band gain of the spectrum X_M(k) of length L, as defined in Section 5.3.2 of Reference [1], is controlled and the high-band (HB) part of the spectrum X_M(k), right after the time-to-frequency domain transformation, is updated (faded in) by the attenuator 554 using, for example, the following relation:
X′_M(k + L_WB) = β_i · X_M(k + L_WB),  i = 0, . . ., B_tran − 1,
where L_WB is the length of the spectrum corresponding to the WB audio band-width, i.e. L_WB = 320 samples in the example of IVAS with a frame length of 20 ms (normal HQ, or TCX20 frame), L_WB = 80 samples in transient frames, and L_WB = 160 samples in TCX10 frames, and k is the sample index in the range [0, K − L_WB − 1] where K is the length of the whole spectrum in the particular transform sub-mode (normal, transient, TCX20, TCX10).
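As a non-limitative illustration, applying the above relation to the HB part of an MDCT spectrum may be sketched in C as follows (hypothetical code, not the actual IVAS source):

    /* Fade in the high-band part of the MDCT spectrum during the BWS
     * transition: bins above the WB length l_wb are scaled by beta,
     * while the WB part of the spectrum is left untouched. */
    void bws_fade_in_high_band(float *spec, int spec_len, int l_wb, float beta)
    {
        int k;
        for (k = l_wb; k < spec_len; k++)
        {
            spec[k] *= beta;
        }
    }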
In ACELP core with time-domain BWE (TBE) frames, the attenuator 554 applies the attenuation factor β_i to the SWB gain shape parameters of the HB part of the spectrum before these parameters are further processed. The temporal gain shape parameters g_s(j) are defined in Section 5.2.6.1.14.2 of Reference [1] and consist of four values. Thus, in an example of implementation:
g′_s(j) = β_i · g_s(j),  i = 0, . . ., B_tran − 1,

where j = 0, . . ., 3 is the gain shape number.
In ACELP core with frequency-domain BWE (FD-BWE) frames, the high-band gain of the transformed original input signal X_M(k) of length L, as defined in Section 5.2.6.2.1 of Reference [1], is controlled and the HB part of the MDCT spectrum is updated by the attenuator 554 using, for example, the following relation:
X′_M(k + L_WB) = β_i · X_M(k + L_WB),  i = 0, . . ., B_tran − 1
Note that NB coding is not considered in IVAS, and SWB to FB switching is not treated, as its subjective and objective impact is negligible. However, the same principles as above can be used to cover all BWS scenarios.
The attenuated sound signal from the attenuator 554 is then encoded in the core encoder 555. If the counter cnt_bwidth_sw, whether or not updated by the calculator 552, is not larger than 0 as determined by the comparator 553, the sound signal is encoded in the core encoder 555 without attenuation.
4. Hardware Implementation
The encoder part of an IVAS sound codec 305 using the audio band-width detection method and device and the audio band-width switching method and device may be implemented as a part of a mobile terminal, as a part of a portable media player, or in any similar device. The encoder part of an IVAS sound codec 305 using the audio band-width detection method and device and the audio band-width switching method and device (identified as 800 in
The input 802 is configured to receive the input sound signal 320 of
The processor 806 is operatively connected to the input 802, to the output 804, and to the memory 808. The processor 806 is realized as one or more processors for executing code instructions in support of the functions of the various components of the encoder part of an IVAS sound codec 305 using the audio band-width detection method and device and the audio band-width switching method and device as illustrated in
The memory 808 may comprise a non-transient memory for storing code instructions executable by the processor(s) 806, specifically, a processor-readable memory comprising/storing non-transitory instructions that, when executed, cause a processor(s) to implement the operations and components of the above described encoder part of an IVAS sound codec 305 using the audio band-width detection method and device and the audio band-width switching method and device as described in the present disclosure. The memory 808 may also comprise a random access memory or buffer(s) to store intermediate processing data from the various functions performed by the processor(s) 806.
Those of ordinary skill in the art will realize that the description of the encoder part of an IVAS sound codec 305 using the audio band-width detection method and device and the audio band-width switching method and device is illustrative only and is not intended to be in any way limiting. Other embodiments will readily suggest themselves to such persons with ordinary skill in the art having the benefit of the present disclosure. Furthermore, the disclosed encoder part of an IVAS sound codec 305 using the audio band-width detection method and device and the audio band-width switching method and device may be customized to offer valuable solutions to existing needs and problems of encoding and decoding sound.
In the interest of clarity, not all of the routine features of the implementations of the encoder part of an IVAS sound codec 305 using the audio band-width detection method and device and the audio band-width switching method and device are shown and described. It will, of course, be appreciated that in the development of any such actual implementation of the encoder part of an IVAS sound codec 305 using the audio band-width detection method and device and the audio band-width switching method and device, numerous implementation-specific decisions may need to be made in order to achieve the developer's specific goals, such as compliance with application-, system-, network- and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be appreciated that a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the field of sound processing having the benefit of the present disclosure.
In accordance with the present disclosure, the components/processors/modules, processing operations, and/or data structures described herein may be implemented using various types of operating systems, computing platforms, network devices, computer programs, and/or general purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used. Where a method comprising a series of operations and sub-operations is implemented by a processor, computer or a machine and those operations and sub-operations may be stored as a series of non-transitory code instructions readable by the processor, computer or machine, they may be stored on a tangible and/or non-transient medium.
The encoder part of an IVAS sound codec 305 using the audio band-width detection method and device and the audio band-width switching method and device as described herein may use software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein.
In the encoder part of an IVAS sound codec 305 using the audio band-width detection method and device and the audio band-width switching method and device as described herein, the various operations and sub-operations may be performed in various orders and some of the operations and sub-operations may be optional.
Although the present disclosure has been described hereinabove by way of non-restrictive, illustrative embodiments thereof, these embodiments may be modified at will within the scope of the appended claims without departing from the spirit and nature of the present disclosure.
The present disclosure mentions the following references, of which the full content is incorporated herein by reference:
Filing Document: PCT/CA2021/051442; Filing Date: 10/14/2021; Country: WO

Related Application Data: Number 63092178; Date: Oct 2020; Country: US