The present disclosure relates to a unified time-domain/frequency-domain coding device and method using a mixed time-domain and frequency-domain coding mode for coding an input sound signal, and to a corresponding decoder device and decoding method.
In the present disclosure and the appended claims:
A state-of-the-art conversational codec can represent a clean speech signal with very good quality at a bitrate of around 8 kbps and approach transparency at a bitrate of 16 kbps. However, at bitrates below 16 kbps, low processing delay conversational codecs, which most often code an input speech signal in the time domain, are not suitable for generic audio signals such as music and reverberant speech. To overcome this drawback, switched codecs have been introduced, basically using a time-domain approach for coding speech-dominated input sound signals and a frequency-domain approach for coding generic audio signals. However, such switched solutions typically require a longer processing delay, needed both for speech-music classification and for calculating a transform to the frequency domain.
To overcome the above drawback related to longer processing delay, a more unified time-domain and frequency-domain coding model has been proposed in U.S. Pat. No. 9,015,038 (See Reference [1] of which the full content is incorporated herein by reference). This unified time-domain and frequency-domain coding model is part of the EVS (Enhanced Voice Services) sound codec standardized by 3GPP (3rd Generation Partnership Project) as described in Reference [2], of which the full content is incorporated herein by reference. In recent years, 3GPP started working on developing a 3D (Three-Dimensional) sound codec for immersive services called IVAS (Immersive Voice and Audio Services), based on the EVS codec (See reference [3] of which the full content is incorporated herein by reference).
To make the coding model even more efficient for a specific kind of signal, a coding mode has been added to efficiently allocate the available bits between time-domain and frequency-domain and between low and high frequency. The additional coding mode is triggered by a new speech/music classifier whose output allows for an unclear category for signals that cannot be clearly classified as either music or speech (See Reference [4] of which the full content is incorporated herein by reference).
The present disclosure relates to a unified time-domain/frequency-domain coding method for coding an input sound signal. The method comprises: classifying the input sound signal into one of a plurality of sound signal categories, wherein the sound signal categories comprise an unclear signal type category showing that the nature of the input sound signal is unclear; selecting one of a plurality of coding sub-modes for coding the input sound signal if the input sound signal is classified in the unclear signal type category; and mixed time-domain/frequency-domain coding the input sound signal using the selected coding sub-mode.
The present disclosure also relates to a unified time-domain/frequency-domain coding method for coding an input sound signal, comprising: classifying the input sound signal into one of a plurality of sound signal categories, wherein the sound signal categories comprise an unclear signal type category showing that the nature of the input sound signal is unclear; and mixed time-domain/frequency-domain coding the input sound signal in response to classification of the input sound signal in the unclear signal type category. Mixed time-domain/frequency-domain coding the input sound signal comprises a frequency band selection and bit allocation for selecting frequency bands to quantize and for distributing a bit budget available to quantization between the selected frequency bands.
According to the present disclosure, there is further provided a unified time-domain/frequency-domain coding device for coding an input sound signal, comprising: a classifier of the input sound signal into one of a plurality of sound signal categories, wherein the sound signal categories comprise an unclear signal type category showing that the nature of the input sound signal is unclear; a selector of one of a plurality of coding sub-modes for coding the input sound signal if the input sound signal is classified in the unclear signal type category; and a mixed time-domain/frequency-domain encoder for coding the input sound signal using the selected coding sub-mode.
The present disclosure is still further concerned with a unified time-domain/frequency-domain coding device for coding an input sound signal, comprising: a classifier of the input sound signal into one of a plurality of sound signal categories, wherein the sound signal categories comprise an unclear signal type category showing that the nature of the input sound signal is unclear; and a mixed time-domain/frequency-domain encoder for coding the input sound signal in response to classification of the input sound signal in the unclear signal type category. The mixed time-domain/frequency-domain encoder comprises a selector of frequency bands and allocator of bits for selecting frequency bands to quantize and for distributing a bit budget available to quantization between the selected frequency bands.
The present disclosure provides a sound signal decoding method comprising: receiving a bitstream conveying information usable to reconstruct a mixed time-domain/frequency-domain excitation representative of a sound signal classified in an unclear signal type category showing that the nature of the sound signal is unclear, wherein the information includes one of a plurality of coding sub-modes used for coding the sound signal classified in the unclear signal type category; reconstructing the mixed time-domain/frequency-domain excitation in response to the information conveyed in the bitstream, including the coding sub-mode used for coding the input sound signal; converting the mixed time-domain/frequency-domain excitation to time-domain; and filtering the mixed time-domain/frequency-domain excitation converted to time-domain through a synthesis filter to produce a synthesized version of the sound signal.
The present disclosure proposes a sound signal decoding method comprising: receiving a bitstream conveying information usable to reconstruct a mixed time-domain/frequency-domain excitation representative of a sound signal (a) classified in an unclear signal type category showing that the nature of the sound signal is unclear and (b) coded using (i) frequency bands selected for quantization and (ii) a bit budget available to quantization distributed between the frequency bands; reconstructing the mixed time-domain/frequency-domain excitation in response to the information conveyed in the bitstream, wherein reconstructing the mixed time-domain/frequency-domain excitation comprises selecting the frequency bands used for quantization and the distribution of the bit budget available to quantization between the frequency bands; converting the mixed time-domain/frequency-domain excitation to time-domain; and filtering the mixed time-domain/frequency-domain excitation converted to time-domain through a synthesis filter to produce a synthesized version of the sound signal.
In accordance with the present disclosure, there is provided a sound signal decoder comprising: a receiver of a bitstream conveying information usable to reconstruct a mixed time-domain/frequency-domain excitation representative of a sound signal classified in an unclear signal type category showing that the nature of the sound signal is unclear, wherein the information includes one of a plurality of coding sub-modes used for coding the sound signal classified in the unclear signal type category; a re-constructor of the mixed time-domain/frequency-domain excitation in response to the information conveyed in the bitstream, including the coding sub-mode used for coding the input sound signal; a converter of the mixed time-domain/frequency-domain excitation to time-domain; and a synthesis filter for filtering the mixed time-domain/frequency-domain excitation converted to time-domain to produce a synthesized version of the sound signal.
The present disclosure is still further concerned with a sound signal decoder comprising: a receiver of a bitstream conveying information usable to reconstruct a mixed time-domain/frequency-domain excitation representative of a sound signal (a) classified in an unclear signal type category showing that the nature of the sound signal is unclear and (b) coded using (i) frequency bands selected for quantization and (ii) a bit budget available to quantization distributed between the frequency bands; a re-constructor of the mixed time-domain/frequency-domain excitation in response to the information conveyed in the bitstream, wherein the re-constructor selects the frequency bands used for quantization and the distribution of the bit budget available to quantization between the frequency bands; a converter of the mixed time-domain/frequency-domain excitation to time-domain; and a synthesis filter for filtering the mixed time-domain/frequency-domain excitation converted to time-domain to produce a synthesized version of the sound signal.
The foregoing and other features will become more apparent upon reading of the following non-restrictive description of illustrative embodiments of the unified time-domain/frequency-domain coding method, the unified time-domain/frequency-domain coding device, the decoding method and decoder device, given by way of example only with reference to the accompanying drawings.
In the appended drawings:
The present disclosure proposes a unified time-domain and frequency-domain coding model which improves synthesis quality for generic audio signals such as, for example, music and/or reverberant speech, without increasing the processing delay and the bitrate. This unified time-domain and frequency-domain coding model comprises:
To achieve a low processing delay and low bitrate conversational sound codec that improves the synthesis quality of generic audio signals such as, for example, music and/or reverberant speech, the frequency-domain coding mode is integrated as close as possible to a CELP (Code-Excited Linear Prediction) time-domain coding mode. For that purpose, the frequency-domain coding mode uses a frequency transform performed in the LP (Linear Prediction) residual domain. This allows switching nearly without artifact from one frame, for example a 20 ms frame, to another. As well known in the art of sound codecs, the input sound signal is sampled at a given sampling rate and processed by groups of these samples called “frames”, usually divided into a number of “sub-frames”. Here, the integration of the two (2) time-domain and frequency-domain coding modes is sufficiently close to allow dynamic reallocation of the bit budget to another coding mode if it is determined that the current coding mode is not sufficiently efficient.
One feature of the proposed unified time-domain and frequency-domain coding model is a variable time support of the time-domain component, which varies from a quarter frame (sub-frame) to a complete frame on a frame-by-frame basis. As a non-limitative illustrative example, a frame may represent 20 ms of input sound signal. Such a frame corresponds to 320 samples of the input sound signal if the inner sampling rate of the sound codec is 16 kHz or to 256 samples per frame if the inner sampling rate of the codec is 12.8 kHz. Then a sub-frame (quarter of a frame in the present example) represents 80 or 64 samples depending on the inner sampling rate of the sound codec. In the present non-restrictive illustrative embodiment, the inner sampling rate of the sound codec is 12.8 kHz giving a frame length of 256 samples and a sub-frame length of 64 samples of the input sound signal.
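As a purely illustrative sketch of the arithmetic in the preceding paragraph, the frame and sub-frame lengths can be derived from the inner sampling rate as shown below. The function name `frame_layout` and its parameters are hypothetical and are not part of the disclosure.

```python
def frame_layout(sampling_rate_hz, frame_ms=20, subframes_per_frame=4):
    """Return (samples per frame, samples per sub-frame).

    Assumes a 20 ms frame split into four sub-frames, as in the
    non-limitative example of the text.
    """
    frame_len = sampling_rate_hz * frame_ms // 1000
    return frame_len, frame_len // subframes_per_frame


# Inner sampling rates mentioned in the text.
print(frame_layout(16000))  # (320, 80)
print(frame_layout(12800))  # (256, 64)
```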
The variable time support makes it possible to capture major temporal events with a minimum bitrate to create a basic time-domain excitation contribution. At very low bitrate, the time support is usually the entire frame. In that case, the time-domain contribution of the excitation is composed only of the adaptive codebook; corresponding adaptive-codebook (pitch) information and gain are then transmitted once per frame. When more bitrate is available, it is possible to capture more temporal events by shortening the time support and increasing the bitrate allocated to the time-domain coding mode. Eventually, when the time support is sufficiently short (down to a quarter of a frame (sub-frame)), and the available bitrate is sufficiently high, the time-domain contribution of the excitation may include, for each sub-frame, the adaptive-codebook contribution with the corresponding adaptive-codebook gain, a fixed-codebook contribution with a corresponding fixed-codebook gain, or both the adaptive-codebook and fixed-codebook contributions with the corresponding gains. Alternatively, it is also possible to transport, for each half of a frame (sub-frame), an adaptive-codebook contribution with the corresponding adaptive-codebook gain and a fixed-codebook contribution with the corresponding fixed-codebook gain; this has the advantage of not consuming too much bitrate while still being able to code temporal events. Parameters describing codebook indices and gains are then transmitted for each sub-frame.
At low bitrate, conversational sound codecs are incapable of properly coding higher frequencies. This causes an important degradation of the synthesis quality when the input sound signal includes music and/or reverberant speech. To solve this issue, a feature is added to compute the efficiency of the time-domain excitation contribution. In some cases, whatever the input bitrate and the time frame support are, the time-domain excitation contribution is not valuable. In those cases, all the bits are reallocated to the next step of frequency-domain coding. But most of the time, the time-domain excitation contribution is valuable only up to a certain frequency (hereinafter the “cut-off frequency”). In these cases, the time-domain excitation contribution is filtered out above the cut-off frequency. The filtering operation keeps the valuable information coded with the time-domain excitation contribution and removes the non-valuable information above the cut-off frequency. In a non-restrictive illustrative embodiment, the filtering is performed in frequency-domain by setting the frequency bins above a certain frequency (cut-off frequency) to zero.
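The frequency-domain filtering described above can be sketched as follows. This is a minimal illustration only: the function name `lowpass_bins`, the uniform bin spacing, and the convention that bin k sits at k times the bin width are assumptions made for the example and are not taken from the disclosure.

```python
def lowpass_bins(spectrum, cutoff_hz, bin_width_hz):
    """Zero every frequency bin located above cutoff_hz.

    Assumption: bin k represents frequency k * bin_width_hz.
    Bins at or below the cut-off frequency are kept unchanged.
    """
    return [
        coeff if k * bin_width_hz <= cutoff_hz else 0.0
        for k, coeff in enumerate(spectrum)
    ]


# Toy spectrum of 8 bins spaced 100 Hz apart, cut off at 300 Hz.
filtered = lowpass_bins([1.0] * 8, cutoff_hz=300, bin_width_hz=100)
```

Only the bins at 0, 100, 200 and 300 Hz survive in this toy case; everything above the cut-off frequency is set to zero.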
The variable time support in combination with the variable cut-off frequency makes the bit allocation inside the unified time-domain and frequency-domain coding model very dynamic. The bitrate after the quantization of the LP filter can be allocated entirely to the time domain or entirely to the frequency domain or somewhere in between. The bitrate allocation between the time and frequency domains is conducted as a function of the number of sub-frames used for the time-domain excitation contribution, of the available bit budget, and of the cut-off frequency computed. To make the unified time-domain and frequency-domain coding model even more efficient for a specific kind of input sound signal, specific coding sub-modes are added to efficiently allocate the available bits between the time domain and the frequency domain, and between low and high frequencies. These added specific coding sub-modes are determined using a new speech/music audio classifier producing an output allowing for an unclear signal category (signals that cannot be clearly classified as either music or speech).
To create a total excitation which will match more efficiently the input LP residual, the frequency-domain coding mode is applied. A feature is that frequency-domain coding is performed on a vector which contains a difference between a frequency representation (frequency transform) of the input LP residual and a frequency representation (frequency transform) of the filtered time-domain excitation contribution up to the cut-off frequency, and which contains a frequency representation (frequency transform) of the input LP residual itself above that cut-off frequency. A smooth spectrum transition is inserted between both segments just above the cut-off frequency. In other words, the high-frequency part of the frequency representation of the time-domain excitation contribution is first zeroed out above the cut-off frequency. A transition region between the unchanged part of the spectrum and the zeroed part of the spectrum of the time-domain excitation contribution is inserted just above the cut-off frequency to ensure a smooth transition between both parts of the spectrum. This modified spectrum of the time-domain excitation contribution is then subtracted from the frequency representation of the input LP residual. The resulting spectrum thus corresponds to the difference of both spectra below the cut-off frequency, and to the frequency representation of the LP residual above it, with some transition region. The cut-off frequency, as mentioned hereinabove, can vary from one frame to another.
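The construction of the vector to be frequency-coded can be sketched as follows. The disclosure only states that a smooth transition is inserted just above the cut-off frequency; the linear ramp, its length, and the names `build_difference_vector`, `cutoff_bin` and `transition_bins` are assumptions chosen for illustration.

```python
def build_difference_vector(residual_spec, excitation_spec, cutoff_bin, transition_bins=4):
    """Return residual minus the (tapered) time-domain excitation spectrum.

    Below cutoff_bin the excitation is subtracted in full (weight 1).
    Over the next transition_bins bins the weight ramps linearly to 0
    (assumed shape). Above that, the residual spectrum is kept as is.
    """
    out = []
    for k, (res, exc) in enumerate(zip(residual_spec, excitation_spec)):
        if k < cutoff_bin:
            weight = 1.0
        elif k < cutoff_bin + transition_bins:
            weight = 1.0 - (k - cutoff_bin + 1) / (transition_bins + 1)
        else:
            weight = 0.0
        out.append(res - weight * exc)
    return out


vector = build_difference_vector([2.0] * 8, [1.0] * 8, cutoff_bin=2, transition_bins=3)
# vector == [1.0, 1.0, 1.25, 1.5, 1.75, 2.0, 2.0, 2.0]
```

In this toy case the first two bins hold the full difference, the next three bins form the transition region, and the remaining bins hold the residual spectrum itself.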
Whatever the frequency quantization method (frequency-domain coding mode) chosen, there is always a possibility of pre-echo especially with long windows. In the herein disclosed technique, the used windows are square windows, so that the extra window length compared to the coded input sound signal is zero (0), i.e. no overlap-add is used. While this corresponds to the best window to reduce any potential pre-echo, some pre-echo may still be audible on temporal attacks. Many techniques exist to solve such pre-echo problem but the present disclosure proposes a simple feature for cancelling this pre-echo problem. This feature is based on a memory-less time-domain coding mode which is derived from the “Transition Mode” of ITU-T Recommendation G.718; Reference [5], sections 6.8.1.4 and 6.8.4.2 of which the full content is incorporated herein by reference. The idea behind this feature is to take advantage of the fact that the proposed unified time-domain and frequency-domain coding model is integrated to the LP residual domain, which allows for switching without artifact almost at any time. When an input sound signal is considered as generic audio (music and/or reverberant speech) and when a temporal attack is detected in a frame, then this frame only is encoded with the memory-less time-domain coding mode. This memory-less time-domain coding mode will take care of the temporal attack thus avoiding the pre-echo that could be introduced when using frequency-domain coding of that frame.
In the proposed unified time-domain and frequency-domain coding model, the above mentioned adaptive codebook, one or more fixed codebooks (for example an algebraic codebook, a Gaussian codebook, etc.), i.e. the so called time-domain codebooks, and the frequency-domain quantization (frequency-domain coding mode) can be seen as a codebook library, and the bits can be distributed among all the available codebooks, or a subset thereof. This means for example that if the input sound signal is a clean speech, all the bits will be allocated to the time-domain coding mode, basically reducing the coding to the legacy CELP scheme. On the other hand, for some music segments, all the bits allocated to encode the input LP residual are sometimes best spent in the frequency-domain, for example in transform-domain. Furthermore, specific cases can be added in which (a) the time-domain uses a larger part of the total available bitrate to code more time-domain events while still maintaining bits to code some of the frequency information or (b) low frequency content is prioritized over high frequency content and vice versa.
As indicated in the foregoing description, temporal support for the time-domain and frequency-domain coding modes does not need to be the same. While the bits spent on the different time-domain coding operations (adaptive and algebraic codebook searches) are usually distributed on a sub-frame basis (typically a quarter of a frame, or 5 ms of time support), the bits allocated to the frequency-domain coding mode are distributed on a frame basis (typically 20 ms of time support) to improve frequency resolution.
The bit budget allocated to the time-domain CELP coding mode can also be dynamically controlled depending on the input sound signal. In some cases, the bit budget allocated to the time-domain CELP coding mode can be zero, effectively meaning that the entire bit budget is attributed to the frequency-domain coding mode. The choice of working in the LP residual domain both for the time-domain and the frequency-domain coding modes has two (2) main benefits. First, this is compatible with the time-domain CELP coding mode, which has proved efficient for coding speech signals. Consequently, no artifact is introduced due to the switching between the two types of coding modes (time-domain and frequency-domain coding modes). Second, the lower dynamics of the LP residual with respect to the original input sound signal, and its relative flatness, make it easier to use a square window for the frequency transforms, thus permitting use of a non-overlapping window.
In a non-limitative example where the inner sampling rate of the codec is 12.8 kHz (meaning 256 samples per frame), similarly as in the ITU-T recommendation G.718 (Reference [5]), the length of the sub-frames used in the time-domain CELP coding mode can vary from a typical ¼ of the frame length (5 ms) to a half frame (10 ms) or a complete frame length (20 ms). The sub-frame length decision is based on the available bitrate and on an analysis of the input sound signal, particularly the spectral dynamics of this input sound signal. The sub-frame length decision can be performed in a closed-loop manner. To save on complexity, it is also possible to make the sub-frame length decision in an open-loop manner. The sub-frame length decision can also be controlled by the nature of the input sound signal as detected by a signal classifier, for example a speech/music classifier. The sub-frame length can be changed from frame to frame.
Once the length of the sub-frames is chosen in a current frame, a standard closed-loop pitch analysis is performed and the first contribution to the excitation signal is selected from the adaptive codebook. Then, depending on the available bit budget and the characteristics of the input sound signal (for example in the case of an input speech signal), a second contribution from one or several fixed codebooks can be added before conversion in the transform domain. The resulting excitation contribution is the time-domain excitation contribution. On the other hand, at very low bitrates and in the case of a generic audio signal, it is often better to skip the fixed codebook stage and use all the remaining bits for the transform-domain coding. The transform-domain coding can be for example a frequency-domain coding mode. As described above, the sub-frame length can be one fourth of the frame, one half of the frame, or one frame long. The fixed-codebook contribution is used only if the sub-frame length is equal to ¼ of the frame length. In case the sub-frame length is decided to be half a frame or the entire frame long, then only the adaptive-codebook contribution is used to represent the time-domain excitation contribution, and all remaining bits are allocated to the frequency-domain coding mode. Alternatively, an additional coding mode will be described where the fixed codebook can be used when the sub-frame length is equal to half the frame length. This addition has been made to improve the quality of particular kinds of input sound signals containing a temporal event while keeping an acceptable bit budget to code the frequency-domain excitation contribution.
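The rule relating sub-frame length to the codebooks that may contribute to the time-domain excitation can be summarized with the following sketch. The function `allowed_codebooks` and the flag `half_frame_fixed_mode` (standing for the additional coding mode mentioned above) are hypothetical names. The sketch only reports which codebooks are permitted; whether the fixed codebook is actually used still depends on the bit budget and on the input sound signal.

```python
def allowed_codebooks(subframes_per_frame, half_frame_fixed_mode=False):
    """List the codebooks permitted for a given sub-frame layout.

    subframes_per_frame: 4 (quarter frame), 2 (half frame) or 1 (full frame).
    half_frame_fixed_mode: hypothetical flag for the additional coding mode
    in which the fixed codebook may be used with half-frame sub-frames.
    """
    if subframes_per_frame not in (1, 2, 4):
        raise ValueError("sub-frame layout must be 1, 2 or 4 per frame")
    books = ["adaptive"]  # the adaptive codebook is always the first contribution
    if subframes_per_frame == 4 or (subframes_per_frame == 2 and half_frame_fixed_mode):
        books.append("fixed")
    return books
```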
Once the computation of the time-domain excitation contribution is completed, its efficiency needs to be assessed and quantized. If the gain of the coding in time-domain is very low, it is more efficient to remove the time-domain excitation contribution altogether and to use all the bits for the frequency-domain coding mode. On the other hand, for example in the case of a clean input speech signal, the frequency-domain coding mode is not needed, and all the bits are allocated to the time-domain coding mode. But often the coding in time-domain is efficient only up to a certain frequency. This frequency corresponds to the above mentioned cut-off frequency of the time-domain excitation contribution. Determination of such cut-off frequency ensures that the entire time-domain coding is helping to get a better final synthesis rather than working against the frequency-domain coding.
The cut-off frequency can be estimated in the frequency domain. To compute the cut-off frequency, the spectrums of both the LP residual and the time-domain excitation contribution are first split into a predefined number of frequency bands in each of which a number of frequency bins are defined. The number of frequency bands and the number of frequency bins covered by each frequency band can vary from one implementation to another. For each of the frequency bands, a normalized correlation is computed between the frequency representation of the time-domain excitation contribution and the frequency representation of the LP residual, and the correlation is smoothed between adjacent frequency bands. As a non-limitative example, the per-band correlations are lower limited to 0.5 and normalized between 0 and 1, and an average correlation is then computed as the average of the correlations for all the frequency bands. For the purpose of a first estimation of the cut-off frequency, the average correlation is then scaled between 0 and half the internal sampling rate (half the internal sampling rate corresponding to the normalized correlation value of 1). At very low bitrate or for the additional coding sub-modes as described herein below, the average correlation is doubled before finding the cut-off frequency. This is done for cases where it is known that the time-domain excitation contribution would be needed even if the correlation is not very high because of the low bitrate being used, or because the type of input sound signal would not allow for a high correlation. The first estimation of the cut-off frequency is then found as the upper bound of the frequency band being closest to the value of the scaled average correlation. In an example of implementation, sixteen (16) frequency bands at a 12.8 kHz internal sampling rate are defined for correlation computation.
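The first estimation of the cut-off frequency can be sketched as follows. Several simplifications are assumed: the smoothing between adjacent bands is omitted, the sixteen band edges used in the example are uniform 400 Hz bands chosen for illustration rather than the band layout of an actual implementation, and the doubled average is clamped to 1 (the text does not say how values above 1 are handled). All function and parameter names are hypothetical.

```python
import math


def band_correlation(residual_band, excitation_band):
    """Normalized correlation between two spectra over one frequency band."""
    num = sum(r * e for r, e in zip(residual_band, excitation_band))
    den = math.sqrt(
        sum(r * r for r in residual_band) * sum(e * e for e in excitation_band)
    )
    return num / den if den > 0.0 else 0.0


def estimate_cutoff(band_corrs, band_upper_hz, fs_hz=12800, double_average=False):
    """First estimate of the cut-off frequency from per-band correlations."""
    # Lower-limit each correlation to 0.5, then map [0.5, 1] onto [0, 1].
    limited = [2.0 * (max(c, 0.5) - 0.5) for c in band_corrs]
    avg = sum(limited) / len(limited)
    if double_average:  # very low bitrate or additional coding sub-modes
        avg = min(2.0 * avg, 1.0)  # clamping is an assumption
    target_hz = avg * fs_hz / 2.0  # scale between 0 and half the sampling rate
    # Pick the band whose upper bound is closest to the scaled average.
    return min(band_upper_hz, key=lambda f: abs(f - target_hz))


# Illustrative layout: 16 uniform bands of 400 Hz up to 6400 Hz.
bands = [400 * (i + 1) for i in range(16)]
cutoff = estimate_cutoff([1.0] * 8 + [0.0] * 8, bands)  # 3200
```

With perfect correlation in the lower half of the bands and none in the upper half, the average is 0.5 and the estimate lands on the band edge at 3200 Hz. Doubling the average pushes the estimate to the top band.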
Taking advantage of the psychoacoustic property of the human ear, the reliability of the estimation of the cut-off frequency may be improved by comparing the estimated position of the 8th harmonic frequency of the pitch to the cut-off frequency estimated by the correlation computation. If this position is higher than the cut-off frequency estimated by the correlation computation, the cut-off frequency is modified to correspond to the position of the 8th harmonic frequency of the pitch. If one of the additional coding sub-modes is used, the cut-off frequency has a minimum value above or equal to, for example, 2775 Hz (7th band). The final value of the cut-off frequency is then quantized and transmitted to a distant decoder. In an example of implementation, 3 or 4 bits are used for such quantization, giving 8 or 16 possible cut-off frequencies depending on the bitrate.
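The refinement and quantization of the cut-off frequency can be sketched as follows. It is assumed here that the pitch is available as a lag in samples, that the 8th harmonic is snapped to the upper bound of the band containing it, and that quantization amounts to sending the index of the nearest entry in a table of candidate frequencies. None of these details, nor the names used, come from the disclosure.

```python
def refine_cutoff(cutoff_hz, pitch_lag_samples, band_upper_hz, fs_hz=12800, min_cutoff_hz=0.0):
    """Raise the cut-off to cover the 8th pitch harmonic, then apply a floor."""
    harmonic_hz = 8.0 * fs_hz / pitch_lag_samples  # position of the 8th harmonic
    if harmonic_hz > cutoff_hz:
        # Assumed: use the upper bound of the band that contains the harmonic.
        cutoff_hz = next(
            (f for f in band_upper_hz if f >= harmonic_hz), band_upper_hz[-1]
        )
    # Floor used when one of the additional coding sub-modes is active.
    return max(cutoff_hz, min_cutoff_hz)


def quantize_cutoff(cutoff_hz, table_hz):
    """Return the index of the nearest candidate (3 bits: 8 entries, 4 bits: 16)."""
    return min(range(len(table_hz)), key=lambda i: abs(table_hz[i] - cutoff_hz))


bands = [400 * (i + 1) for i in range(16)]  # illustrative 4-bit table
cutoff = refine_cutoff(1200, pitch_lag_samples=64, band_upper_hz=bands)  # 1600
index = quantize_cutoff(cutoff, bands)  # 3
```

A pitch lag of 64 samples at 12.8 kHz puts the 8th harmonic at 1600 Hz, which is above the initial estimate of 1200 Hz, so the cut-off frequency is raised.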
Once the cut-off frequency is known, frequency quantization of the frequency-domain excitation contribution is performed. First the difference between the frequency representation (frequency transform) of the input LP residual and the frequency representation (frequency transform) of the time-domain excitation contribution is determined. Then a new vector is created, consisting of this difference up to the cut-off frequency, and a smooth transition to the frequency representation of the input LP residual for the remaining spectrum. A frequency quantization is then applied to the whole new vector. In an example of implementation, the quantization consists of coding the sign and the position of dominant (most energetic) spectral pulses. The number of pulses to be quantized per frequency band is related to the bitrate available for the frequency-domain coding mode. If the available bits are insufficient to cover all the frequency bands, the remaining bands are filled with noise only.
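A much simplified sketch of coding the sign and position of the dominant spectral pulses in one frequency band is given below. It keeps unit-amplitude pulses at the most energetic bins and is not the quantizer of any actual codec; the name `quantize_band_pulses` and the unit-pulse representation are assumptions.

```python
def quantize_band_pulses(band, num_pulses):
    """Keep the sign and position of the num_pulses largest-magnitude bins.

    Returns a list of the same length as band holding +1, -1 or 0.
    Bins that are exactly zero never receive a pulse.
    """
    order = sorted(range(len(band)), key=lambda k: abs(band[k]), reverse=True)
    pulses = [0] * len(band)
    for k in order[:num_pulses]:
        if band[k] != 0.0:
            pulses[k] = 1 if band[k] > 0.0 else -1
    return pulses


pulses = quantize_band_pulses([0.1, -3.0, 0.5, 2.0], num_pulses=2)
# pulses == [0, -1, 0, 1]
```

The number of pulses per band would be driven by the bitrate available for the frequency-domain coding mode; here it is simply passed in as a parameter.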
Frequency quantization of a frequency band using the quantization method described in the previous paragraph does not guarantee that all frequency bins within this band are quantized. This is especially true at low bitrates where the number of spectral pulses quantized per frequency band is relatively low. To prevent the appearance of audible artifacts due to these non-quantized bins, some noise is added to fill these gaps. As at low bitrates the quantized spectral pulses should dominate the spectrum rather than the inserted noise, the noise spectrum amplitude corresponds only to a fraction of the amplitude of the pulses. The amplitude of the added noise in the spectrum is higher when the bit budget available is low (allowing more noise) and lower when the bit budget available is high.
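The noise filling described above can be sketched as follows. The mapping from bit budget to noise level (a linear interpolation between two fractions), the thresholds, and the use of uniformly distributed noise are all assumptions made for illustration; the disclosure only states that the noise amplitude is a fraction of the pulse amplitude and decreases as the bit budget increases.

```python
import random


def noise_fraction(bit_budget, low_bits=100, high_bits=400, max_frac=0.5, min_frac=0.1):
    """Noise level as a fraction of the pulse amplitude (hypothetical mapping)."""
    t = (bit_budget - low_bits) / (high_bits - low_bits)
    t = min(max(t, 0.0), 1.0)  # clamp to [0, 1]
    return max_frac + t * (min_frac - max_frac)


def fill_noise(quantized, bit_budget, rng):
    """Replace non-quantized (zero) bins by low-level noise."""
    peak = max((abs(v) for v in quantized), default=0.0)
    level = noise_fraction(bit_budget) * peak
    return [v if v != 0 else rng.uniform(-level, level) for v in quantized]


rng = random.Random(0)  # seeded for reproducibility
filled = fill_noise([0, 1, 0, -1], bit_budget=100, rng=rng)
```

Quantized pulses are left untouched, and each gap receives noise whose magnitude never exceeds the chosen fraction of the largest pulse.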
In the frequency-domain coding mode, gains are computed for each frequency band to match the energy of the non-quantized signal to the quantized signal. The gains are vector quantized and applied per band to the quantized signal. When, for example, the unified time-domain and frequency-domain coding model changes the bit allocation from a time-domain only coding mode to a mixed time-domain/frequency-domain coding mode, the per band excitation spectrum energy of the time-domain only coding mode does not match the per band excitation spectrum energy of the mixed time-domain/frequency-domain coding mode. This energy mismatch can create some switching artifacts especially at low bitrate. To reduce any audible degradation created by this bit reallocation, a long-term gain can be computed for each band and can be applied to correct the energy of each frequency band for a few frames after the switching from the time-domain only coding mode to the mixed time-domain/frequency-domain coding mode.
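The per-band energy matching can be sketched as follows. The gain is taken here as the square root of the ratio between the energy of the non-quantized band and the energy of the quantized band, which is one natural reading of the text. Vector quantization of the gains and the long-term gain correction are omitted, and the function names are hypothetical.

```python
import math


def band_gain(original_band, quantized_band):
    """Gain that gives the quantized band the energy of the original band."""
    e_orig = sum(x * x for x in original_band)
    e_quant = sum(x * x for x in quantized_band)
    return math.sqrt(e_orig / e_quant) if e_quant > 0.0 else 0.0


def apply_band_gain(original_band, quantized_band):
    """Scale the quantized band by its energy-matching gain."""
    gain = band_gain(original_band, quantized_band)
    return [gain * x for x in quantized_band]


scaled = apply_band_gain([3.0, 4.0], [0.0, 1.0])
# The original band has energy 25, so the single pulse is scaled to 5.0.
```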
After the completion of the frequency-domain coding mode, the frequency-domain excitation contribution is added to the frequency representation (frequency transform) of the time-domain excitation contribution, and the sum of these two (2) excitation contributions is transformed back to time-domain to form the total excitation. Finally, the synthesized signal is computed by filtering the total excitation through an LP synthesis filter.
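The reconstruction chain described above can be sketched as follows. An orthonormal DCT-II and its inverse are used as the frequency transform purely as an assumption for the example, and the LP synthesis filter starts from zero memory, whereas a real codec carries its filter memory from frame to frame. All names are hypothetical.

```python
import math


def dct(x):
    """Orthonormal DCT-II (assumed transform, naive O(N^2) form)."""
    n = len(x)
    return [
        math.sqrt((1.0 if k == 0 else 2.0) / n)
        * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def idct(spec):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(spec)
    return [
        sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n)
            * spec[k]
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n)
        )
        for i in range(n)
    ]


def lp_synthesis(excitation, lp_coeffs):
    """Filter through 1/A(z), with lp_coeffs = [1, a1, ..., aM], zero initial memory."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i in range(1, len(lp_coeffs)):
            if n - i >= 0:
                acc -= lp_coeffs[i] * out[n - i]
        out.append(acc)
    return out


def synthesize(td_excitation, fd_contribution_spec, lp_coeffs):
    """Add both contributions in the frequency domain, go back to time, then filter."""
    total_spec = [a + b for a, b in zip(dct(td_excitation), fd_contribution_spec)]
    total_excitation = idct(total_spec)
    return total_excitation, lp_synthesis(total_excitation, lp_coeffs)
```

Because the transform is linear, summing in the frequency domain and transforming back gives the same total excitation as summing the two contributions in the time domain; the frequency-domain form matches the order of operations in the text.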
In one embodiment, while the CELP coding memories are updated on a sub-frame basis using only the time-domain excitation contribution, the total excitation is used to update those memories at frame boundaries.
In another possible implementation, the CELP coding memories are updated on a sub-frame basis and also at the frame boundaries using only the time-domain excitation contribution. This results in an embedded structure where the frequency-domain coded signal constitutes an upper quantization layer independent from the core CELP layer. In this particular case, the fixed codebook is always used in order to update the adaptive codebook content. However, the frequency-domain coding mode can apply to the whole frame. This embedded approach works for bit rates around 12 kbps and higher.
The unified time-domain/frequency-domain CELP coding device 100 comprises a pre-processor 102 (
The pre-processor 102 conducts a first level of analysis to classify the input sound signal 101 between speech and non-speech (generic audio (music or reverberant speech)), for example in a manner similar to that described in Reference [6], of which the full content is incorporated herein by reference, or with any other reliable speech/non-speech discrimination methods.
After this first level of analysis, the pre-processor 102 performs a second level of analysis of input signal parameters to allow the use of time-domain CELP coding (no frequency-domain coding) on some sound signals with strong non-speech characteristics, but that are still better encoded with a time-domain approach. When an important variation of energy occurs, this second level of analysis allows the unified time-domain/frequency-domain CELP coding device 100 to switch into a memory-less time-domain coding mode, generally called Transition Mode in Reference [7], of which the full content is incorporated herein by reference.
During this second level of analysis, the signal classifier 204 calculates and uses a variation σC of a smoothed version Cst of an open-loop pitch correlation from the open-loop pitch analyzer 203, a current total frame energy Etot (total energy of the input sound signal in the current frame) and a difference between the current total frame energy and the previous total frame energy Ediff. First, the signal classifier 204 computes the variation of the smoothed open loop pitch correlation using, for example, the following relation:
where:
When, during the first level of analysis, the signal classifier 204 classifies a frame as non-speech, the following verifications are performed by the signal classifier 204 to determine, in the second level of analysis, if it is really safe to use a mixed time-domain/frequency-domain coding mode. Sometimes, it is however better to encode the current frame with the time-domain coding mode only, using one of the time-domain approaches estimated by the pre-processing function of the time-domain coding mode. In particular, it might be better to use the memory-less time-domain coding mode to reduce to a minimum any possible pre-echo that can be introduced with a mixed time-domain/frequency-domain coding mode.
As a non-limitative implementation of a first verification whether the mixed time-domain/frequency-domain coding mode should be used, the signal classifier 204 calculates a difference between the current total frame energy and the previous frame total energy. When the difference Ediff between the current total frame energy Etot and the previous frame total energy is higher than, for example, 6 dB, this corresponds to a so-called “temporal attack” in the input sound signal 101. In such a situation, the speech/non-speech decision and the selected coding mode are overridden and a memory-less time-domain coding mode is forced. More specifically, the unified time-domain/frequency-domain CELP coding device 100 comprises a time/time-frequency coding selector 103 (
As a non-limitative implementation of a second verification whether the mixed time-domain/frequency-domain coding mode should be used, when the difference Ediff between the current total frame energy Etot and the previous frame total energy is below or equal to 6 dB, but:
Otherwise, the time/time-frequency coding selector 103 selects the mixed time-domain/frequency-domain coding mode as disclosed in the following description.
The second verification can be summarized, for example when the non-speech input sound signal is music, using the following pseudo code:
where Etot is the current total frame energy expressed as:
where x(i) represents the samples of the input sound signal in the current frame, N is the number of samples of the input sound signal by frame, and Ediff is the difference between the current total frame energy Etot and the last previous frame total energy.
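The first verification can be sketched as follows. The relation for Etot is not reproduced above, so this sketch assumes a mean-square energy expressed in dB; the function names and the small logarithm floor are hypothetical and serve illustration only.

```python
import math

ATTACK_THRESHOLD_DB = 6.0  # example threshold quoted in the description


def total_frame_energy_db(frame):
    """Assumed form of Etot: mean-square energy of the frame, in dB."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, 1e-12))  # floor avoids log(0)


def is_temporal_attack(current_frame, previous_energy_db):
    """Return (attack_flag, Etot) for the current frame.

    attack_flag is True when Ediff = Etot - previous energy exceeds 6 dB,
    in which case a memory-less time-domain coding mode would be forced.
    """
    e_tot = total_frame_energy_db(current_frame)
    e_diff = e_tot - previous_energy_db
    return e_diff > ATTACK_THRESHOLD_DB, e_tot
```

The returned Etot would be kept by the caller as the previous frame energy for the next frame.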
Specifically, the unified time-domain/frequency-domain CELP coding method 750 comprises an operation 752 of pre-processing the input sound signal 101 as described in Reference [4] to obtain the parameters required to classify this input sound signal. To perform operation 752, the mixed time-domain/frequency-domain CELP coding device 700 comprises the pre-processor 702.
The unified time-domain/frequency-domain CELP coding method 750 comprises an operation 751 of classifying the input sound signal 101 into speech, music and unclear signal type categories using the parameters from pre-processor 702 in a manner similar to that also described in Reference [4], or using any other reliable speech/music and unclear signal type discrimination methods. The unclear signal type category shows that the nature of the input sound signal 101 is unclear and, in particular, that the input sound signal 101 is not classified as speech nor music. To perform operation 751, the unified time-domain/frequency-domain CELP coding device 700 comprises a sound signal classifier 701.
If the sound signal classifier 701 classifies the input sound signal 101 into the music category, a frequency-domain encoder 703 performs an operation 753 of coding the input sound signal 101 using frequency-domain coding as described, for example, in Reference [2]. The frequency-domain encoded music signal can then be synthesized in a music synthesis operation 754 performed by a synthesizer 704 to recover the music signal.
In the same manner, if the sound signal classifier 701 classifies the input sound signal 101 into the speech category, a time-domain encoder 705 performs an operation 755 of coding the input sound signal 101 using time-domain coding as described, for example, in Reference [2]. The time-domain encoded speech signal can then be synthesized in a synthesis filtering operation 756 performed by a synthesizer 706 including a synthesis filter to recover the speech signal.
Accordingly, the unified time-domain/frequency-domain coding device 700 and method 750 maximize the performance of time-domain coding only and frequency-domain coding only by respectively limiting their usage to input sound signals having clear speech characteristics and input sound signals having clear music characteristics. This increases the overall quality of all types of input sound signals at low to medium bitrates.
Coding sub-modes have been designed as part of the unified time-domain and frequency-domain coding model to efficiently code input sound signals that are classified neither as speech nor as music (unclear signal type category). Two (2) bits are used to signal three (3) coding sub-modes identified by corresponding sub-mode flags. A fourth sub-mode allows for a backward interoperability to the legacy unified time-domain and frequency-domain coding model (EVS).
As illustrated in
The coding sub-modes are identified by a sub-mode flag Ftfsm. In the non-limitative implementation of
The selected coding sub-mode, for example the sub-mode flag Ftfsm, is transmitted into the bitstream to a distant decoder. The path chosen inside the decoder depends on signaling bits included in the bitstream. Once the decoder detects the presence of a frame coded using mixed time-domain/frequency-domain coding, the sub-mode flag Ftfsm is decoded from the bitstream. If the detected sub-mode flag Ftfsm is “0”, then the EVS backward interoperable legacy unified time-domain and frequency-domain coding model will be used to decode the remaining part of the bitstream. On the other hand, if the sub-mode flag Ftfsm is different from “0”, sub-mode decoding is performed. The decoder will replicate the procedure followed by the encoder, in particular the bit distribution between time-domain and frequency-domain and the bit allocation in the different frequency bands as described later in section 6.2.
In typical CELP, input sound signal samples are processed in frames of 10-30 ms and these frames are divided into sub-frames for adaptive codebook and fixed codebook analysis. For example, a frame of 20 ms (256 samples when the internal sampling rate is 12.8 kHz) can be used and divided into 4 sub-frames of 5 ms. A variable sub-frame length is a feature used to integrate time-domain and frequency-domain into one coding mode. The sub-frame length can vary from a typical ¼ of the frame length to half of the frame length or a complete frame length. Of course, the use of another number of sub-frames (sub-frame length) can possibly be implemented.
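As a worked example of the frame arithmetic above, the following minimal sketch computes the frame and sub-frame lengths in samples for one, two or four sub-frames. The function name and its defaults are illustrative only.

```python
def subframe_layout(frame_ms=20, sampling_rate_hz=12800, n_subframes=4):
    """Return (frame length, sub-frame length) in samples."""
    frame_len = frame_ms * sampling_rate_hz // 1000
    if frame_len % n_subframes:
        raise ValueError("frame length must be a multiple of the sub-frame count")
    return frame_len, frame_len // n_subframes


# 20 ms at 12.8 kHz gives 256 samples; four sub-frames of 5 ms give 64 samples each.
print(subframe_layout())               # (256, 64)
print(subframe_layout(n_subframes=2))  # (256, 128)
print(subframe_layout(n_subframes=1))  # (256, 256)
```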
The parameter analysis operation 152 of the unified time-domain/frequency-domain CELP coding method 150 comprises, as illustrated in
The decision as to the length of the sub-frames (the number of sub-frames), or the time support, is determined by the calculator 210 based on the available bitrate and on the input sound signal analysis, in particular the high spectral dynamic of the input sound signal 101 from the analyzer 209 and the open-loop pitch analysis including the smoothed open loop pitch correlation Cst from analyzer 203. The high spectral dynamic analyzer 209 is responsive to the information from the spectral analyzer 202 to determine high spectral dynamic of the input sound signal 101. The high spectral dynamic is computed, for example as described in ITU-T recommendation G.718, Reference [5], section 6.7.2.2, as an input spectrum without noise floor giving a representation of the input spectrum dynamic. When the average spectral dynamic of the input sound signal 101 in the frequency band between 4.4 kHz and 6.4 kHz as determined by the analyzer 209 is below, for example, 9.6 dB and the last frame was considered as having a high spectral dynamic, the input sound signal 101 is no longer considered as having high spectral dynamic. In that case, more bits can be allocated to the frequencies below, for example, 4 kHz, by adding more sub-frames to the time-domain coding mode or by forcing more pulses in the lower frequency part of the frequency-domain coding mode.
On the other hand, if an increase of the average spectral dynamic of the input sound signal 101 against the average spectral dynamic of the last frame that was not considered as having a high spectral dynamic as determined by the analyser 209 is greater than, for example, 4.5 dB, the input sound signal 101 is considered as having high spectral dynamic content above, for example, 4 kHz. In that case, depending on the available bitrate, some additional bits are used for coding the high frequencies of the input sound signal 101 to allow one or more frequency pulses coding.
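The two rules above behave like a decision with hysteresis. The sketch below is one possible reading, assuming that the flag keeps its previous value when neither rule fires and that the reference value is refreshed on every frame not considered as having a high spectral dynamic; these two points and all names are assumptions, not statements of the described device.

```python
LEAVE_HIGH_BELOW_DB = 9.6   # example value from the description
ENTER_HIGH_RISE_DB = 4.5    # example value from the description


def update_high_dynamic(avg_dynamic_db, was_high, last_non_high_db):
    """Return (is_high, updated reference of the last non-high frame).

    avg_dynamic_db: average spectral dynamic in the 4.4-6.4 kHz band.
    """
    if was_high and avg_dynamic_db < LEAVE_HIGH_BELOW_DB:
        is_high = False
    elif avg_dynamic_db - last_non_high_db > ENTER_HIGH_RISE_DB:
        is_high = True
    else:
        is_high = was_high  # assumption: otherwise keep the previous decision
    if not is_high:
        last_non_high_db = avg_dynamic_db
    return is_high, last_non_high_db
```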
The sub-frame length as determined by the calculator 210 (
While the case with one or two sub-frames limits the time-domain coding to an adaptive codebook contribution only (with coded pitch lag and pitch gain), i.e. no fixed codebook is used in that case, the case with four (4) sub-frames allows for adaptive and fixed codebook contributions if the available bit budget is sufficient. The four (4) sub-frame case is allowed at bitrates starting from around 16 kbps up. Because of bit budget limitations, the time-domain excitation contribution consists only of the adaptive codebook contribution at lower bitrates. A fixed-codebook contribution can be added at higher bit rates, for example starting at 24 kbps. For all cases the time-domain coding efficiency will be evaluated afterward to decide up to which frequency (the above mentioned cut-off frequency) such time-domain coding is valuable.
The alternative implementation of
The sound signal classifier 701 determines that the number of sub-frames is four (4) unless the sub-mode flag Ftfsm is set to “1” or “2” (selection of the first or second coding sub-mode), meaning that the content of the input sound signal 101 is closer to speech (“speech” like characteristics or likelihood of a temporal attack is/are detected in the input sound signal 101) and the available bitrate is below 15 kbps. Specifically:
In the unified time-domain/frequency-domain CELP coding device 100 and method 150 (
When the mixed time-domain/frequency-domain coding mode is used, a closed-loop pitch analysis followed, if needed, by a fixed algebraic codebook search is performed. For that purpose, the mixed time-domain/frequency domain coding method 170/770 comprises an operation 155 of calculating the time-domain excitation contribution. To perform operation 155, the mixed time-domain/frequency domain encoder 120/720 comprises a calculator of time-domain excitation contribution 105. The calculator 105 itself comprises an analyzer 211 (
When the closed-loop pitch analysis has been completed in operation 261 and a fixed-codebook contribution is used, the calculator of time-domain excitation contribution 105 comprises a fixed algebraic codebook 212 searched during an operation 262 of fixed codebook search to find the best fixed-codebook parameters usually comprising a fixed-codebook index and a fixed-codebook gain. The fixed-codebook index and gain form the fixed-codebook contribution. The fixed-codebook index is encoded and transmitted to the distant decoder. The fixed-codebook gain is also quantized and transmitted to the distant decoder. The fixed-algebraic codebook and searching thereof are believed to be well known to those of ordinary skill in the art of CELP coding and, therefore, will not be further described in the present disclosure.
The adaptive-codebook index and gain and, if used, the fixed-codebook index and gain form the time-domain CELP excitation contribution.
During the frequency-domain coding of the mixed time-domain/frequency-domain coding mode, two signals are represented in transform-domain, for example in frequency-domain. In one embodiment, the time-to-frequency transform can be achieved using a 256 points type II (or type IV) DCT (Discrete Cosine Transform) giving a resolution of 25 Hz with an inner sampling rate of 12.8 kHz but any other suitable transform could be used. In the case another transform is used, the frequency resolution (defined above), the number of frequency bands and the number of frequency bins per band (defined further below) might need to be revised accordingly.
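The 25 Hz figure follows from dividing the 6.4 kHz audio bandwidth (half of the 12.8 kHz sampling rate) by the 256 transform coefficients. The sketch below uses a direct, unnormalized type II DCT written for readability rather than speed; the scaling convention and the function names are assumptions.

```python
import math


def dct_ii(x):
    """Direct (O(N^2)) unnormalized type II DCT of a real sequence."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]


def bin_resolution_hz(sampling_rate_hz, n_points):
    """Frequency spacing between consecutive DCT coefficients."""
    return (sampling_rate_hz / 2.0) / n_points


print(bin_resolution_hz(12800, 256))  # 25.0
```

With this convention, coefficient k corresponds to a frequency of about k times 25 Hz, so a 1 kHz component falls in coefficient 40.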
As indicated in the foregoing description, in the unified time-domain/frequency-domain CELP coding device 100 and method 150 (
where res(n) is the input LP residual, etd(n) is the time-domain excitation contribution, and N is the frame length. In a possible implementation, the frame length is 256 samples for a corresponding inner sampling rate of 12.8 kHz. The time-domain excitation contribution is given by the following relation:
where v(n) is the adaptive-codebook contribution, b is the adaptive-codebook gain, c(n) is the fixed-codebook contribution, and g is the fixed-codebook gain. It should be noted that the time-domain excitation contribution may consist only of the adaptive codebook contribution as described in the foregoing description.
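Based on the symbol definitions above, the time-domain excitation contribution can be sketched as a gain-weighted sum of the two codebook contributions, etd(n) = b·v(n) + g·c(n). This form is inferred from those definitions, and the function name and its optional arguments are hypothetical.

```python
def time_domain_excitation(v, b, c=None, g=0.0):
    """Return etd(n) = b*v(n) + g*c(n).

    When no fixed-codebook contribution is used (c is None), the result
    reduces to the adaptive-codebook contribution alone.
    """
    if c is None:
        return [b * vn for vn in v]
    if len(c) != len(v):
        raise ValueError("v and c must have the same length")
    return [b * vn + g * cn for vn, cn in zip(v, c)]
```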
With sound signal samples classified as generic audio (
An operation 265 of estimating the cut-off frequency of the time-domain excitation contribution is first completed by the calculator 215 (
For this illustrative example, the number of frequency bins j per band Bb, the cumulative frequency bins per band CBb, and the normalized cross-correlation Cc(i) per frequency band i are defined, for example, as follows, for a 20 ms frame at 12.8 kHz internal sampling rate:
where Bb is the number of frequency bins j per band Bb, CBb is the cumulative frequency bins per band, Cc(i) is the normalized cross-correlation per frequency band i, S′f
The calculator of cut-off frequency 215 comprises a smoother 304 (
where, in an illustrative embodiment,
The calculator of cut-off frequency 215 further comprises a calculator 305 (
The calculator 215 of cut-off frequency also comprises a cut-off frequency module 306 (
In the above relations, ftc
At low bitrate, where the normalized average
The precision of the cut-off frequency may be improved by adding the following component to the computation. For that purpose, the cut-off frequency module 306 comprises an extrapolator 410 (
where Fs=12800 Hz is the internal sampling rate or frequency, Nsub is the number of sub-frames in a frame, and T(i) is the adaptive-codebook index or pitch lag for sub-frame i.
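A minimal sketch of a harmonic estimate built from these symbols is given below. It assumes that the relation averages the pitch lags T(i) over the Nsub sub-frames, converts the average lag into a fundamental frequency Fs divided by that lag, and multiplies it by a harmonic number taken here as eight. The exact relation is not reproduced above, so this form and the function name are assumptions.

```python
FS_HZ = 12800.0  # internal sampling rate


def eighth_harmonic_hz(pitch_lags, harmonic=8):
    """Estimate a pitch harmonic from the per-sub-frame pitch lags T(i)."""
    n_sub = len(pitch_lags)
    mean_lag = sum(pitch_lags) / n_sub   # average lag in samples
    return harmonic * FS_HZ / mean_lag   # harmonic number times the fundamental
```

For example, a constant lag of 64 samples corresponds to a 200 Hz fundamental and to an estimate of 1600 Hz.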
The cut-off frequency module 306 comprises a finder 409 (
(h8
The index of that band will be called i8
The cut-off frequency module 306 finally comprises a selector 411 (
ftc=max(Lf(i8
When coding sub-modes are used, in the case of the unified time-domain/frequency-domain coding device 700 and method 750 of
ftc=max(max(Lf(i8
As illustrated in
As a non-limitative, illustrative example, when the cut-off frequency ftc from the selector 411 is below or equal to 775 Hz, the analyzer 415 considers that the cost of the time-domain excitation contribution is too high. The selector 416 then selects all the frequency bins of the frequency representation of the time-domain excitation contribution to be zeroed, and the zeroer 417 forces all the frequency bins to zero and also forces the cut-off frequency ftc to zero. All bits allocated to the time-domain excitation contribution are then reallocated to the frequency-domain coding mode. Otherwise, the analyzer 415 forces the selector 416 to choose the high-frequency bins above the cut-off frequency ftc for being zeroed by the filter (zeroer) 418.
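The decision described in this example can be sketched as follows, assuming 25 Hz frequency bins and assuming that the bin located exactly at the cut-off frequency is the first one to be zeroed. That boundary convention and the names are assumptions.

```python
MIN_USEFUL_CUTOFF_HZ = 775.0  # example threshold from the description
BIN_WIDTH_HZ = 25.0           # 256-point DCT at 12.8 kHz


def apply_cutoff(f_exc, f_tc):
    """Return (filtered spectrum, effective cut-off frequency).

    f_exc: frequency representation of the time-domain excitation contribution.
    """
    if f_tc <= MIN_USEFUL_CUTOFF_HZ:
        # Time-domain contribution judged too costly: drop it entirely.
        return [0.0] * len(f_exc), 0.0
    first_zeroed = int(f_tc / BIN_WIDTH_HZ)
    kept = list(f_exc[:first_zeroed])
    return kept + [0.0] * (len(f_exc) - first_zeroed), f_tc
```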
Finally, the calculator 215 of cut-off frequency comprises a quantizer 309 (
ftcQ = {0, 1175, 1575, 1975, 2375, 2775, 3175, 3575} (values in Hz)
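Since the set contains eight candidate values, the quantized cut-off frequency can be signaled with a three-bit index. The sketch below assumes a nearest-value rule for picking the entry; the selection rule actually used is not reproduced here, so both the rule and the function name are assumptions.

```python
CUTOFF_TABLE_HZ = (0, 1175, 1575, 1975, 2375, 2775, 3175, 3575)


def quantize_cutoff(f_tc):
    """Return (index, quantized cut-off) using a nearest-entry rule."""
    index = min(
        range(len(CUTOFF_TABLE_HZ)),
        key=lambda i: abs(CUTOFF_TABLE_HZ[i] - f_tc),
    )
    return index, CUTOFF_TABLE_HZ[index]
```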
Many mechanisms could be used by the selector 411 to stabilize the choice of the final cut-off frequency ftc and to prevent the quantized version ftcQ from switching between 0 and 1175 in inappropriate signal segments. To achieve this, as a non-restrictive example, the analyzer 415 is responsive to the long-term average pitch gain Glt 412 from the closed loop pitch analyzer 211 (
where Col is the open-loop pitch correlation 413 and Cst corresponds to the smoothed version of the open-loop pitch correlation 414 defined as Cst=0.9·Col+0.1·Cst. Further, Glt (item 412 of
Once the cut-off frequency ftc of the time-domain excitation contribution is determined, frequency-domain coding is performed. To perform such frequency-domain coding, the mixed time-domain/frequency domain coding method 170/770 comprises a subtracting operation 159, a frequency quantizing operation 160 and an adding operation 161. The mixed time-domain/frequency domain encoder 120/720 comprises a subtractor or calculator 109, a frequency quantizer 110 and an adder 111 to perform the operations 159, 160 and 161, respectively.
The subtractor or calculator 109 (
The downscaling of the difference vector fd by the downscale factor 603 can be performed with any type of fade-out function. It can be shortened to only a few frequency bins, or it can be omitted altogether when the available bit budget is judged sufficient to prevent energy oscillation artifacts when the cut-off frequency ftc is changing. For example, with a 25 Hz resolution, corresponding to 1 frequency bin fbin=25 Hz in a 256-point DCT at 12.8 kHz internal sampling rate, the difference vector can be built as:
where fres, fexc and ftc have been defined in the foregoing description.
In the unified time-domain/frequency-domain CELP coding method 750 as illustrated in
Specifically,
To make the best possible use of the bits available for the frequency quantization, the band selection and bit allocation operation 757 comprises a first operation 951 of pre-fixing a fraction of the available bit budget (see 900) for quantizing the lower frequencies of the difference vector fd as a function of the quantized cut-off frequency ftcQ from the cut-off frequency finder and filter 108. To perform operation 951, an estimator 901 uses, for example, the following relation:
where PBlf is the fraction of the available bits allocated to frequency quantizing of the lower frequencies of the difference vector fd. In this example, the lower frequencies refer to the first five (5) frequency bands, or the first two (2) kHz. The term Lf(ftcQ) refers to the number of frequency bins up to the quantized cut-off frequency ftcQ.
Then, the estimator 901 adjusts the fraction of the available bits allocated to frequency quantizing of the lower frequencies PBlf based on the coding sub-mode flag Ftfsm. If the coding sub-mode flag Ftfsm is set to “2” (
Another parameter that affects the overall number of bits per frequency band available for frequency quantizing the difference vector fd, is an estimated maximum number NBmx of frequency bands of this difference vector fd to quantize. In the presently described illustrative example, at an internal sampling rate of 12.8 kHz, the maximum total number Ntt of frequency bands is sixteen (16).
When the coding sub-modes are used, the band selection and bit allocation operation 757 comprises an operation 952 of estimating the maximum number NBmx of frequency bands of the difference vector fd to quantize. To perform operation 952, an estimator 902 sets, if the coding sub-mode flag Ftfsm is set to “1” (first coding sub-mode being selected), the maximum number NBmx of frequency bands to “10”. If the coding sub-mode flag Ftfsm is set to “2” (second coding sub-mode being selected), then the estimator 902 sets the maximum number NBmx of frequency bands to “9”. If the coding sub-mode flag Ftfsm is set to “3” (third coding sub-mode being selected), then the estimator 902 sets the maximum number NBmx of frequency bands to “13”. The estimator 902 then readjusts the maximum number NBmx of frequency bands to quantize as a function of the bit budget available for the frequency quantization of the difference vector fd using, for example, the following relations:
where BF represents the number of bits available for frequency quantization of the difference vector fd (see 900), BT is the total bitrate available to code the channel under processing (see 900), Ftfsm is the sub-mode flag (see 900), and Ntt is the maximum total number of frequency bands.
The estimator 902 can further reduce the maximum number of frequency bands of the difference vector fd to quantize in relation to the number of bits allocated to quantizing of middle and higher frequency bands of the difference vector fd. For the purpose of such limitation, the last lower frequency band and the first frequency band thereafter are assumed to have a similar number of bits mb or roughly 17% of the bits PBlf allocated to frequency quantizing of the lower frequencies. For the last frequency band to be quantized, a minimum number of 4.5 bits mp is used to quantize at least one (1) frequency pulse. If the available bitrate BT is greater than or equal to 15 kbps, then the minimum number of bits mp will be nine (9) to allow for the quantizing of more pulses per frequency band. However, if the total available bitrate BT is below 15 kbps but the sub-mode flag Ftfsm is set to “3”, meaning content having similarities to music, then the number of bits mp of the last frequency band to be frequency quantized will be 6.75 to allow for a more precise quantization. Then, the estimator 902 computes a corrected maximum number of frequency bands N′Bmx using, for example, the following relation:
where N′Bmx corresponds to the corrected maximum number of frequency bands to quantize, NBmx is the estimated maximum number of frequency bands, the number “5” represents the minimum number of frequency bands, BF represents the number of bits available for frequency quantization of the difference vector fd, PBlf is the fraction of bits allocated to quantizing of the five (5) lower frequency bands, mp is the minimum number of bits allocated to frequency quantize a frequency band, and mb the number of bits allocated to quantizing the first frequency band after the five (5) lower frequency bands.
After the computation of the maximum number of frequency bands, the estimator 902 may perform an additional verification such that mp remains lower than or equal to mb. While this additional verification is an optional step, at low bitrate, it helps to allocate the bits more efficiently between the frequency bands of the difference vector fd.
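The choice of the per-band bit numbers mp and mb described above can be sketched as follows. The sketch assumes that mb equals 17% of the bits reserved for the lower frequencies (the fraction PBlf multiplied by BF) and that the optional verification is implemented by clamping mp to mb; both points, as well as the names, are assumptions.

```python
# Initial maximum number of bands per sub-mode flag Ftfsm, as listed above.
INITIAL_MAX_BANDS = {1: 10, 2: 9, 3: 13}


def band_bit_minimums(total_bitrate_bps, submode_flag, b_f, p_blf):
    """Return (mp, mb) for the given bitrate, sub-mode flag and bit budget."""
    if total_bitrate_bps >= 15000:
        m_p = 9.0
    elif submode_flag == 3:
        m_p = 6.75
    else:
        m_p = 4.5
    m_b = 0.17 * p_blf * b_f   # roughly 17% of the low-frequency bits
    m_p = min(m_p, m_b)        # optional verification: keep mp <= mb
    return m_p, m_b
```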
The band selection and bit allocation operation 757 comprises an operation 953 of calculating low frequency bits. To perform operation 953, a calculator 903 is provided. If the computation of the maximum number of frequency bands N′Bmx leads to a smaller number of frequency bands to quantize, the calculator 903 re-allocates to the quantizing of the lower frequency bands the portion of bits that was previously reserved for the higher frequency bands and is no longer needed there, using, for example, the following relation:
where BLF corresponds to the bits allocated to the five (5) lower frequency bands, BF corresponds to the number of bits available for frequency quantizing the lower frequencies of the difference vector fd, PBlf is the above mentioned fraction of bits from estimator 901 allocated, for example, to frequency quantizing of the five (5) lower frequency bands, mp is the minimum number of bits allocated to quantize a frequency band, and mb the number of bits allocated to quantizing the first frequency band after the five (5) lower frequency bands.
The band selection and bit allocation operation 757 comprises an operation 954 of frequency band characterization. To perform operation 954, the band selector and bit allocator 707 comprises a frequency band characterizer 904 which, once the bitrate is distributed between the lower frequency bands and the rest of the frequency bands, performs a dual sorting of the frequency bands to decide the importance of each band. The first sorting comprises finding whether one or more bands have a lower energy compared to their neighbor frequency bands. When this happens, the characterizer 904 marks these bands such that only the pre-determined minimum number of bits mp can be allocated to frequency quantizing these low energy frequency bands, even if the available bit budget is high. The second sorting comprises performing a position sorting of the middle and higher energy frequency bands, for example in decreasing energy order. The first and second sortings (dual sorting) are not performed for the lower frequency bands but are performed up to the maximum number of frequency bands N′Bmx. The operation 954 of frequency band characterization can be summarized as follows:
where Ppb(i) is set to “1” for frequency bands where only the minimum number of bits mp will be used, EP
The energy E(i) of each frequency band of the difference vector fd is computed in a calculator 708 and corresponding operation 758 of
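The dual sorting of operation 954 can be sketched as follows. The sketch assumes that a band counts as a low energy band when its energy is strictly below that of both adjacent bands, and that the last band is never marked because it has no upper neighbor; these conventions and the names are assumptions.

```python
def characterize_bands(energy, n_low_bands=5, n_max_bands=None):
    """Return (min_bits_only flags, remaining bands sorted by energy).

    energy: per-band energies E(i) of the difference vector.
    The first n_low_bands bands are left out of both sortings.
    """
    n_max = len(energy) if n_max_bands is None else n_max_bands
    min_bits_only = [0] * n_max
    for i in range(n_low_bands, n_max):
        left = energy[i - 1]
        right = energy[i + 1] if i + 1 < len(energy) else energy[i]
        if energy[i] < left and energy[i] < right:
            min_bits_only[i] = 1  # only the minimum number of bits mp here
    ranked = sorted(
        (i for i in range(n_low_bands, n_max) if not min_bits_only[i]),
        key=lambda i: energy[i],
        reverse=True,  # decreasing energy order
    )
    return min_bits_only, ranked
```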
The band selection and bit allocation operation 757 comprises an operation 955 of final distribution of bits per frequency band. To perform operation 955, the band selector and bit allocator 707 comprises a bits per frequency band final distributor 905.
Once the frequency bands have been characterized, the distributor 905 allocates the bitrate or number of bits BF available to frequency quantize the difference vector fd among selected frequency bands.
In the non-limitative example, for the first five (5) lower frequency bands, the distributor 905 linearly distributes the bits BLF allocated to frequency quantize the lower frequencies, with the first lowest frequency band receiving 23% of the bits BLF and the fifth (5th) lower frequency band receiving the last 17% of the bits BLF. In this manner, the lower frequencies of the spectrum of the difference vector fd can be quantized with sufficient accuracy to recover a better quality synthesis of the input sound signal 101.
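A linear progression from 23% to 17% over five bands gives shares of 23%, 21.5%, 20%, 18.5% and 17%, which add up to 100% of the bits BLF. The following minimal sketch computes these shares; the function names are illustrative.

```python
def low_band_shares(first=0.23, last=0.17, n_bands=5):
    """Linearly decreasing share of the low-frequency bits per band."""
    step = (first - last) / (n_bands - 1)
    return [first - i * step for i in range(n_bands)]


def distribute_low_band_bits(b_lf):
    """Split the bits BLF over the five lower frequency bands."""
    return [share * b_lf for share in low_band_shares()]
```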
The distributor 905 distributes the remaining bits BF allocated to frequency quantize the difference vector fd over the other, middle and higher frequency bands as a linear function, again taking into consideration the previous frequency band energy characterization (operation 954). In this manner, more bits can be allocated to higher energy frequency bands and fewer bits to the frequency bands having a lower energy compared to the energy of their neighbor frequency bands, thereby making a more relevant use of the available bits by quantizing with more precision the more important portions of the spectrum of the difference vector fd. As a non-limitative example, the following relation illustrates how the bit distribution (operation 955) can be performed:
where Bp(i) represents the number of bits allocated per frequency band i, BF represents the number of bits available to frequency quantize the difference vector fd, BLF corresponds to the bitrate or bits allocated to the five (5) lower frequency bands, mp is the minimum number of bits to quantize a frequency pulse in a frequency band, Ppb(i) contains the position where the minimum number mp of bits will be used, and N′Bmx is the maximal number of frequency bands to be quantized.
If, after operation 955, there are some bits not allocated, the distributor 905 will allocate them to the lower frequency bands. As a non-limitative example, the distributor 905 will allocate one remaining bit per frequency band starting from the fifth (5th) band and going back to the first band and repeating this procedure if needed to allocate all the remaining bits.
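The round-robin allocation of leftover bits can be sketched as follows, assuming an integer number of leftover bits; the function name is illustrative.

```python
def spread_leftover_bits(bits_per_band, leftover, n_low_bands=5):
    """Give one bit at a time to the lower bands, from the fifth band
    back to the first, and repeat until no leftover bit remains."""
    bits = list(bits_per_band)
    band = n_low_bands - 1
    while leftover > 0:
        bits[band] += 1
        leftover -= 1
        band = band - 1 if band > 0 else n_low_bands - 1
    return bits


print(spread_leftover_bits([10] * 8, 7))  # [11, 11, 11, 12, 12, 10, 10, 10]
```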
Later, the distributor 905 may have to floor, truncate or round the number of bits per frequency band depending on the algorithm being used to perform the quantizing of the frequency pulses and potential fixed-point implementation.
The mixed time-domain/frequency-domain CELP coding method 170/770 comprises an operation of frequency quantizing 160 (
The difference vector fd can be quantized using several methods. In every case, frequency pulses have to be searched for and quantized. In one possible implementation, the frequency quantizer 110 searches for the most energetic pulses of the difference vector fd across the spectrum. The method to search the pulses can be as simple as splitting the spectrum into frequency bands and allowing a certain number of pulses per frequency band. The number of pulses per frequency bands depends on the bit budget available and on the position of the frequency band inside the spectrum. Typically, more pulses are allocated to the lower frequencies.
Depending on the bitrate available, the quantization of the frequency pulses can be performed by the frequency quantizer 110 using different techniques. In one embodiment, at bitrate below 12 kbps, a simple search and quantization scheme can be used to code the position and sign of the pulses. This scheme is described herein below as a non-limitative example.
For frequencies lower than 3175 Hz, the simple search and quantization scheme uses an approach based on factorial pulse coding (FPC) which is described in the literature, for example in Reference [8], of which the full content is incorporated herein by reference.
More specifically, referring to
As illustrated in
The searcher 609 searches frequency pulses through all the frequency bands for the frequencies lower than 3175 Hz. The FPC coder 610 then processes the frequency pulses. The finder 611 determines the most energetic pulses for frequencies equal to and larger than 3175 Hz, and the quantizer 612 codes the position and sign of the found, most energetic pulses. If more than one (1) pulse is allowed within a frequency band then the amplitude of the pulse previously found is divided by 2 and the search is again conducted over the entire frequency band. Each time a pulse is found, its position and sign are stored for quantization and the bit packing stage. The following pseudo code illustrates, as a non-limitative example, this simple search and quantization scheme:
where NBD is the number of frequency bands (NBD=16 in the illustrative example), Np is the number of pulses i to be coded in a frequency band k, Bb is the number of frequency bins per frequency band, CBb is the cumulative frequency bins per band as defined previously in Section 5), pp represents the vector containing the pulse position found, ps represents the vector containing the sign of the pulse found and pmax represents the energy of the pulse found.
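The per-band pulse search described above can be sketched as follows. The sketch selects the bin with the largest magnitude, which is equivalent to selecting the largest energy, and halves the amplitude of each found pulse before searching the same band again. It is a simplified illustration under those assumptions and not the pseudo code of the described quantizer.

```python
def search_band_pulses(f_d, band_start, band_len, n_pulses):
    """Return (positions, signs) of n_pulses pulses found in one band.

    f_d: difference vector; band_start corresponds to CBb and
    band_len to Bb for the band under consideration.
    """
    work = list(f_d[band_start:band_start + band_len])
    positions, signs = [], []
    for _ in range(n_pulses):
        best = max(range(band_len), key=lambda j: abs(work[j]))
        positions.append(band_start + best)
        signs.append(1 if work[best] >= 0 else -1)
        work[best] /= 2.0  # halve the found pulse, then search again
    return positions, signs
```

Because the amplitude is halved rather than removed, a strong bin can receive several pulses, as in `search_band_pulses([0.1, -4.0, 1.5, 0.2], 0, 4, 3)`, which returns positions `[1, 1, 2]` and signs `[-1, -1, 1]`.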
At bitrates above 12 kbps, the selector 504 determines that all the spectrum is to be quantized using FPC (
Then, the FPC processor 608 or the quantizer of position and sign of pulses 612 obtains the quantized difference vector fdQ by adding the number of pulses nb_pulses with the pulse sign ps to each of the position pp found. For each frequency band the quantized difference vector fdQ can be written using, for example, the following pseudo code:
The frequency bands are quantized with more or less precision; the quantization method described in the previous section does not guarantee that all frequency bins within the frequency bands are quantized. This is especially the case at low bitrates where the number of pulses quantized per frequency band is relatively low. To prevent the appearance of audible artifacts due to these unquantized frequency bins, the frequency quantizer 110 comprises a noise filler 507 (
The noise filler 507 comprises an adder 613 (
In the illustrative embodiment, in the estimator 614, the noise level is directly related to the coding bitrate. For example, at 6.60 kbps the estimator 614 sets the noise level N′L to 0.4 times the amplitude of the frequency pulses coded in a specific frequency band, and progressively decreases it to a value of 0.2 times the amplitude of the frequency pulses coded in a frequency band at 24 kbps. The adder 613 injects the noise only into section(s) of the spectrum where a certain number of consecutive frequency bins has a very low energy, for example when the cumulative bin energy of half of a frequency band is below 0.5. For a specific frequency band i, the noise is injected for example as follows:
where, for a band i, CBb is the cumulative number of frequency bins per frequency band, Bb is the number of frequency bins in a specific band i, N′L is the level of the added noise, and rand is a random number generator whose output is limited to the range −1 to 1.
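The noise filling described above may be sketched, as a non-limitative Python example, as follows. Several points are assumptions of this sketch rather than statements of the disclosure: the noise level is interpolated linearly between the two stated bitrates, each band is tested in two halves, the scaling of the level by the coded pulse amplitude is left to the caller, and the names `noise_level` and `fill_noise` are hypothetical.

```python
import random


def noise_level(bitrate_kbps):
    """Noise level factor: 0.4 at 6.6 kbps down to 0.2 at 24 kbps.

    Linear interpolation between the two end points is an assumption.
    """
    lo_rate, hi_rate = 6.6, 24.0
    rate = min(max(bitrate_kbps, lo_rate), hi_rate)
    return 0.4 - 0.2 * (rate - lo_rate) / (hi_rate - lo_rate)


def fill_noise(fdq, band_start, band_len, level, rng=None, threshold=0.5):
    """Add noise in [-level, level] to each half band whose energy is very low."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch repeatable
    out = list(fdq)
    for start, length in zip(band_start, band_len):
        half = length // 2
        for lo, hi in ((start, start + half), (start + half, start + length)):
            energy = sum(out[j] * out[j] for j in range(lo, hi))
            if energy < threshold:
                for j in range(lo, hi):
                    out[j] += level * rng.uniform(-1.0, 1.0)
    return out
```

Half bands that already contain a coded pulse exceed the threshold and are left untouched, so the noise only fills the empty regions of the spectrum.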
Referring to
Once the quantized difference vector fdQ, including the noise fill if needed, is found, calculator 615 computes the gain per band for each frequency band. The per band gain for a specific band Gb(i) is defined as the ratio of the energy of the unquantized difference vector fd to the energy of the quantized difference vector fdQ in the log domain using, for example, the following relations:
where CBb and Bb are defined hereinabove in Section 5.
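The per band gain can be illustrated with the following minimal Python sketch. It assumes a base-10 logarithm of the energy ratio and adds a small constant `eps` to guard against empty bands; both choices, as well as the name `per_band_gain`, are assumptions of this sketch and not taken from the relations of the disclosure.

```python
import math


def per_band_gain(fd, fdq, band_start, band_len, eps=1e-12):
    """Log-domain ratio of the energy of fd to the energy of fdQ, per band."""
    gains = []
    for start, length in zip(band_start, band_len):
        e_fd = sum(x * x for x in fd[start:start + length])
        e_fdq = sum(x * x for x in fdq[start:start + length])
        # eps (an assumption) avoids log(0) and division by zero.
        gains.append(math.log10((e_fd + eps) / (e_fdq + eps)))
    return gains
```

A band whose quantized energy already equals the unquantized energy yields a gain of zero in the log domain.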
The per band gain quantizer 616 vector quantizes the per band frequency gains. Prior to vector quantization, at low bitrate, the last gain (corresponding to the last frequency band) is quantized separately, and the remaining fifteen (15) per band gains (when, for example, a number 16 of frequency bands is used) are divided by the quantized last gain. Then, the normalized fifteen (15) remaining gains are vector quantized by the quantizer 616. At higher bitrate, the mean of the per band gains is quantized first and then removed from all per band gains of the, for example, sixteen (16) frequency bands prior to the vector quantization of those per band gains. The vector quantization being used can be a standard minimization in the log domain of the distance between the vector containing the per band gains and the entries of a specific codebook.
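The two normalizations and the codebook search can be sketched in Python as shown below, as a non-limitative example. The squared Euclidean distance, the function names and the toy codebook used in the usage note are all assumptions; the actual codebooks and distance measure are not specified here.

```python
def normalize_low_rate(gains, quantized_last_gain):
    """Low bitrate: divide all gains but the last by the quantized last gain."""
    return [g / quantized_last_gain for g in gains[:-1]]


def normalize_high_rate(gains, quantized_mean):
    """Higher bitrate: remove the quantized mean from every per band gain."""
    return [g - quantized_mean for g in gains]


def nearest_codevector(vector, codebook):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, entry in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index
```

With a hypothetical two-entry codebook `[[0.0, 0.0], [1.0, 1.0]]`, the vector `[0.9, 1.2]` maps to index 1.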
In the frequency-domain coding mode, gains are computed in the calculator 615 for each frequency band to match the energy of the quantized vector fdQ to that of the unquantized vector fd. The gains are vector quantized in quantizer 616 and applied per frequency band (operation 559) to the quantized vector fdQ through a multiplier 509 (
Alternatively, it is also possible to use the FPC coding scheme at bitrates below 12 kbps for the whole spectrum by selecting only some of the frequency bands to be quantized. Before performing the selection of the frequency bands, the energy Ed of the frequency bands of the unquantized difference vector fd is quantized using quantizer 616. The energy is computed using, for example, the following relation:
where CBb and Bb are defined hereinabove in Section 5.
To perform the quantization of the frequency band energy E′d, first the average energy over the first 12 frequency bands out of the sixteen bands being used is quantized and subtracted from all the sixteen (16) band energies. Then all the frequency bands are vector quantized in groups of 3 or 4 bands. The vector quantization being used can be a standard minimization in the log domain of the distance between the vector containing the gains per band and the entries of a specific codebook. If not enough bits are available, it is possible to quantize only the first 12 frequency bands and to extrapolate the last four (4) frequency bands using an average of the previous three (3) frequency bands or by any other method.
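The mean removal and the extrapolation of the last bands may be sketched in Python as follows, as a non-limitative example. The sketch assumes that all four extrapolated bands receive the same value, namely the average of bands 10 to 12; the description would equally allow other extrapolation rules. The function names are hypothetical and the quantization of the mean itself is omitted.

```python
def remove_mean(energies, n_mean_bands=12):
    """Subtract the average of the first n_mean_bands energies from all bands."""
    mean = sum(energies[:n_mean_bands]) / n_mean_bands
    return mean, [e - mean for e in energies]


def extrapolate_last_bands(quantized_first, n_total=16, n_avg=3):
    """Fill the missing upper bands with the average of the last n_avg coded bands."""
    avg = sum(quantized_first[-n_avg:]) / n_avg
    return list(quantized_first) + [avg] * (n_total - len(quantized_first))
```

In a complete coder the mean would be quantized before subtraction so that the decoder can add back exactly the same value.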
Once the energy of the frequency bands of the unquantized difference vector is quantized, it becomes possible to sort the energies in decreasing order in such a way that the sorting can be replicated on the decoder side. During the sorting, all the energy bands below 2 kHz are always kept and then only the most energetic bands will be passed to the FPC scheme for coding frequency pulse amplitudes and signs. With this approach the FPC scheme codes a smaller vector but covers a wider frequency range. In other words, it takes fewer bits to cover important energy events over the entire spectrum.
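A band selection of this kind can be sketched in Python as below, as a non-limitative example. The sketch assumes that a band counts as "below 2 kHz" when its starting frequency is below 2 kHz, and that ties between equal quantized energies are broken by band index so that the decoder reproduces the same ordering; the name `select_bands` and its arguments are hypothetical.

```python
def select_bands(quantized_energy, band_start_hz, n_selected, keep_below_hz=2000.0):
    """Return the sorted indices of the bands passed to the pulse coder.

    Bands starting below keep_below_hz are always kept; the remaining slots
    go to the most energetic of the other bands.
    """
    kept = [k for k, f in enumerate(band_start_hz) if f < keep_below_hz]
    others = [k for k in range(len(quantized_energy)) if k not in kept]
    # Decreasing quantized energy, band index as a deterministic tie-break.
    others.sort(key=lambda k: (-quantized_energy[k], k))
    n_extra = max(0, n_selected - len(kept))
    return sorted(kept + others[:n_extra])
```

Since only quantized energies enter the sort, encoder and decoder operate on identical inputs and arrive at the same selection without additional signaling.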
In the particular case of implementation of the unified time-domain/frequency-domain coding device 700 and method 750 of
After the pulse quantization process, a noise fill similar to what has been described earlier is performed. Then, a gain adjustment factor Ga is computed per frequency band to match the energy EdQ of the quantized difference vector fdQ to the quantized energy E′d of the unquantized difference vector fd. Then this per band gain adjustment factor is applied to the quantized difference vector fdQ. This can be expressed as follows:
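The energy matching can be illustrated with the following minimal Python sketch. It is written under the assumption that both energies are expressed in the linear domain as sums of squares, in which case the amplitude gain is the square root of the energy ratio; if the energies are kept in the log domain the expression changes accordingly. The name `apply_gain_adjustment` is hypothetical.

```python
import math


def apply_gain_adjustment(fdq, target_energy, band_start, band_len):
    """Scale each band of fdQ so that its energy equals the target energy E'd."""
    out = list(fdq)
    for i, (start, length) in enumerate(zip(band_start, band_len)):
        e_dq = sum(x * x for x in fdq[start:start + length])
        if e_dq <= 0.0:
            continue  # nothing was coded in this band, so nothing can be scaled
        ga = math.sqrt(target_energy[i] / e_dq)  # per band gain adjustment factor
        for j in range(start, start + length):
            out[j] = ga * fdq[j]
    return out
```

After scaling, the sum of squares of each non-empty band equals the corresponding entry of `target_energy`.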
After the completion of the frequency-domain coding stage, the total time-domain/frequency domain excitation is found. For that purpose, the mixed time-domain/frequency-domain CELP coding method 170/770 comprises an operation 161 of adding, using an adder 111 (
The unified time-domain/frequency domain coding method 150/750 comprises an operation 163/756 of producing a synthesized signal by filtering the total time-domain/frequency domain excitation from the IDCT 220 through a LP synthesis filter 113/706 (
The quantized positions and signs of the frequency pulses forming the quantized difference vector fdQ are transmitted to the distant decoder (not shown).
In one non-limitative embodiment, while the CELP coding memories are updated on a sub-frame basis using only the time-domain excitation contribution, the total time-domain/frequency-domain excitation is used to update those memories at frame boundaries. In another possible implementation, the CELP coding memories are updated on a sub-frame basis and also at the frame boundaries using only the time-domain excitation contribution. This results in an embedded structure where the frequency-domain quantized signal constitutes an upper quantization layer independent of the core CELP layer. This presents advantages in certain applications. In this particular case, the fixed codebook is always used to maintain good perceptual quality, and the number of sub-frames is always four (4) for the same reason. However, the frequency-domain analysis can apply to the whole frame. This embedded approach works for bit rates around 12 kbps and higher.
The decoder device 1100 comprises a receiver (not shown) for receiving the bitstream 1101 from the unified time-domain/frequency-domain coding device 700.
If the sound signal coded by the unified time-domain/frequency-domain coding device 700 has been classified as “music”, this is indicated in the bitstream 1101 by corresponding signaling bits and detected by the decoder device 1100 (see 1102). The received bitstream 1101 is then decoded by a “music” decoder 1103, for example a frequency-domain decoder.
If the sound signal coded by the unified time-domain/frequency-domain coding device 700 has been classified as “speech”, this is indicated in the bitstream 1101 by corresponding signaling bits and detected by the decoder device 1100 (see 1104). The received bitstream 1101 is then decoded by a “speech” decoder 1105, for example a time-domain decoder using ACELP (Algebraic Code-Excited Linear Prediction) or more generally CELP (Code-Excited Linear Prediction).
If the sound signal coded by the unified time-domain/frequency-domain coding device 700 has not been classified either as “music” or “speech” (see 1102 and 1104) and the bitrate available for coding the sound signal was equal to or lower than 9.2 kbps (see 1106), this is indicated in the bitstream by the sub-mode flag Ftfsm set to “0”. The received bitstream 1101 is then decoded using the backward coding mode, i.e. the legacy unified time-domain and frequency-domain coding model of
Finally, if the sound signal coded by the unified time-domain/frequency-domain coding device 700 has not been classified either as “music” or “speech” (see 1102 and 1104) and the bitrate available for coding the sound signal was higher than 9.2 kbps (see 1106), this is indicated in the bitstream 1101 by a sub-mode flag Ftfsm set to “1”, “2” or “3”. The received bitstream 1101 is then decoded using the sound signal decoder 1200 and corresponding sound signal decoding method 1250 of
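The decoder selection logic described in the preceding paragraphs may be summarized, as a non-limitative Python sketch, as follows. The function name `select_decoder`, the boolean arguments and the returned labels are hypothetical; only the classification categories and the values of the sub-mode flag Ftfsm come from the description.

```python
def select_decoder(is_music, is_speech, sub_mode_flag):
    """Pick the decoding path from the signaling recovered from the bitstream."""
    if is_music:
        return "music decoder 1103"
    if is_speech:
        return "speech decoder 1105"
    # Unclear signal type category: the sub-mode flag Ftfsm decides.
    if sub_mode_flag == 0:
        return "backward coding mode"       # bitrate equal to or lower than 9.2 kbps
    if sub_mode_flag in (1, 2, 3):
        return "sound signal decoder 1200"  # bitrate higher than 9.2 kbps
    raise ValueError("unexpected sub-mode flag: %r" % (sub_mode_flag,))
```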
As mentioned in the foregoing description, the adaptive-codebook index T and the adaptive-codebook gain b are quantized and transmitted, and therefore received in the bitstream by the receiver (not shown). In the same manner, when used, the fixed-codebook index and the fixed-codebook gain are also quantized and transmitted to the decoder, and therefore received in the bitstream 1101 by the receiver (not shown). The sound signal decoding method 1250 comprises an operation 1256 of calculating a decoded time-domain excitation contribution using the adaptive-codebook index and gain and, if used, the fixed-codebook index and gain, as is commonly done in the art of CELP coding. To perform operation 1256, the sound signal decoder 1200 comprises a calculator 1206 of the decoded time-domain excitation contribution.
The sound signal decoding method 1250 also comprises an operation 1257 of calculating a frequency transform of the decoded time-domain excitation contribution using a DCT, following the same procedure as in operation 156. To perform operation 1257, the sound signal decoder 1200 comprises a calculator 1207 of the frequency transform of the decoded time-domain excitation contribution.
As mentioned in the foregoing description, a quantized version ftcQ of the cut-off frequency is transmitted to the decoder, and therefore received in the bitstream 1101 by the receiver (not shown). The sound signal decoding method 1250 comprises an operation 1258 of filtering the frequency transform of the time-domain excitation contribution from the calculator 1207 using the decoded cut-off frequency ftcQ recovered from the bitstream 1101 and a procedure which is the same as or similar to the previously described filtering operation 266. For completing operation 1258, the sound signal decoder 1200 comprises a filter 1208 of the frequency transform of the time-domain excitation contribution using the recovered cut-off frequency ftcQ. Filter 1208 has the same, or at least a similar, structure as filter 216 of
The filtered frequency transform of the time-domain excitation contribution from filter 1208 is supplied to a positive input of an adder 1209 performing a corresponding adding operation 1259.
The sound signal decoding method 1250 comprises an operation 1260 of calculating the decoded energy and gain per frequency band of the difference vector fd. To perform operation 1260, the sound signal decoder 1200 comprises a calculator 1210. Specifically, the calculator 1210 de-quantizes, using procedures inverse to those as described in the present disclosure for the quantization, the quantized energy per frequency band and quantized gain per frequency band received in the bitstream 1101 by the receiver (not shown) from the unified time-domain/frequency-domain coding device 700.
The sound signal decoding method 1250 comprises an operation 1261 of recovering the frequency quantized difference vector fdQ. To perform operation 1261, the sound signal decoder 1200 comprises a calculator 1211. The calculator 1211 extracts from the bitstream 1101 the quantized positions and signs of the frequency pulses and replicates the selection of the frequency bands to be used for quantization and the bit allocation in the different frequency bands as determined by the operation 757 and allocator 707 and employed by the unified time-domain/frequency-domain coding device 700 for coding the input sound signal. The calculator 1211 uses this replicated information to recover the frequency quantized difference vector fdQ from the extracted frequency pulse quantized positions and signs. Specifically, for that purpose, the sound signal decoder 1200 replicates the procedure used in the unified time-domain/frequency-domain coding device 700 as illustrated in
Specifically:
The sound signal decoding method 1250 comprises an operation 1259 of adding the recovered frequency quantized difference vector fdQ from calculator 1211 and the frequency-transformed and filtered time-domain excitation contribution fexcF from the filter 1208 to form the mixed time-domain/frequency-domain excitation.
As can be appreciated, the estimators 1201 and 1202, calculator 1203, characterizer 1204, distributor 1205, calculators 1206 and 1207, filter 1208, calculators 1210 and 1211, and adder 1209 form a re-constructor of the mixed time-domain/frequency-domain excitation using information conveyed in the bitstream 1101, including the sub-mode flag identifying the one of the coding sub-modes selected and used for coding the sound signal classified in the unclear signal type category.
In the same manner, the operations 1251-1261 form a method of reconstructing the mixed time-domain/frequency-domain excitation using the information conveyed in the bitstream 1101.
The sound signal decoder 1200 comprises a converter 1212 to perform an operation 1262 of transforming the mixed time-domain/frequency-domain excitation back to time-domain using for example the IDCT (Inverse DCT) 220.
Finally, the synthesized sound signal is computed in the decoder 1200 by an operation 1263 of filtering through a LP (Linear Prediction) synthesis filter 1213 the total excitation from the converter 1212. Of course, LP parameters required by the decoder 1200 to reconstruct the synthesis filter 1213 are transmitted from the unified time-domain/frequency-domain coding device 700 and extracted from the bitstream 1101 as well known in the art of CELP coding.
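The last decoding steps, namely adding the two frequency-domain contributions, transforming back to time-domain and filtering through the LP synthesis filter, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the inverse transform is taken as the inverse of an orthonormal DCT-II computed directly in O(N²), the synthesis filter starts from a zero state, the LP coefficients are those of A(z) = 1 + a1·z⁻¹ + ..., and all function names are hypothetical.

```python
import math


def idct_orthonormal(coeffs):
    """Inverse of the orthonormal DCT-II (assumed transform type and scaling)."""
    n = len(coeffs)
    out = []
    for t in range(n):
        s = coeffs[0] / math.sqrt(n)
        for k in range(1, n):
            s += math.sqrt(2.0 / n) * coeffs[k] * math.cos(
                math.pi * (2 * t + 1) * k / (2 * n))
        out.append(s)
    return out


def lp_synthesis(excitation, a):
    """All-pole filter 1/A(z) with A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ..."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc -= ai * y[n - i]  # zero initial filter state is assumed
        y.append(acc)
    return y


def decode_frame(f_exc_filtered, fdq, lp_coeffs):
    """Add both contributions, return to time-domain, then synthesize."""
    mixed = [a + b for a, b in zip(f_exc_filtered, fdq)]  # adding operation 1259
    excitation = idct_orthonormal(mixed)                  # operation 1262
    return lp_synthesis(excitation, lp_coeffs)            # operation 1263
```

In a frame-based decoder the filter memory would be carried over from the previous frame rather than reset to zero.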
The unified time-domain/frequency-domain coding device 100/700 and the decoder device 1100 may be implemented as a part of a mobile terminal, as a part of a portable media player, or in any similar device. The device 100/700 and decoder device 1100 (identified as 1000 in
The input 1002 is configured to receive the input sound signal 101/bitstream 1101 of
The processor 1001 is operatively connected to the input 1002, to the output 1003, and to the memory 1004. The processor 1001 is realized as one or more processors for executing code instructions in support of the functions of the various components of the unified time-domain/frequency-domain coding device 100/700 for coding an input sound signal as illustrated in
The memory 1004 may comprise a non-transient memory for storing code instructions executable by the processor(s) 1001, specifically, a processor-readable memory comprising/storing non-transitory instructions that, when executed, cause a processor(s) to implement the operations and components of the unified time-domain/frequency-domain coding device 100/700 and method 150/750 and the decoder device 1100 and decoding method 1150 described in the present disclosure. The memory 1004 may also comprise a random access memory or buffer(s) to store intermediate processing data from the various functions performed by the processor(s) 1001.
Those of ordinary skill in the art will realize that the description of the unified time-domain/frequency-domain coding device 100/700 and method 150/750 and the decoder device 1100 and decoding method 1150 is illustrative only and is not intended to be in any way limiting. Other embodiments will readily suggest themselves to such persons with ordinary skill in the art having the benefit of the present disclosure. Furthermore, the disclosed unified time-domain/frequency-domain coding device 100/700 and method 150/750, decoder device 1100 and decoding method 1150 may be customized to offer valuable solutions to existing needs and problems of encoding and decoding sound.
In the interest of clarity, not all of the routine features of the implementations of the unified time-domain/frequency-domain coding device 100/700 and method 150/750 and the decoder device 1100 and decoding method 1150 are shown and described. It will, of course, be appreciated that in the development of any such actual implementation of the unified time-domain/frequency-domain coding device 100/700 and method 150/750 and the decoder device 1100 and decoding method 1150, numerous implementation-specific decisions may need to be made in order to achieve the developer's specific goals, such as compliance with application-, system-, network- and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be appreciated that a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the field of sound processing having the benefit of the present disclosure.
In accordance with the present disclosure, the components/processors/modules, processing operations, and/or data structures described herein may be implemented using various types of operating systems, computing platforms, network devices, computer programs, and/or general purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used. Where a method comprising a series of operations and sub-operations is implemented by a processor, computer or a machine and those operations and sub-operations may be stored as a series of non-transitory code instructions readable by the processor, computer or machine, they may be stored on a tangible and/or non-transient medium.
The unified time-domain/frequency-domain coding device 100/700 and method 150/750 and the decoder device 1100 and decoding method 1150 as described herein may use software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein.
In the unified time-domain/frequency-domain coding device 100/700 and method 150/750 and the decoder device 1100 and decoding method 1150 as described herein, the various operations and sub-operations may be performed in various orders and some of the operations and sub-operations may be optional.
Although the present disclosure has been described hereinabove by way of non-restrictive, illustrative embodiments thereof, these embodiments may be modified at will within the scope of the appended claims without departing from the spirit and nature of the present disclosure.
The present disclosure mentions the following references, of which the full content is incorporated herein by reference:
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/CA2022/050006 | 1/5/2022 | WO |

Number | Date | Country
---|---|---
63135171 | Jan 2021 | US