Transform-based audio codecs like AAC, MP3, or TCX generally introduce inter-harmonic quantization noise when processing harmonic audio signals, particularly at low bitrates.
This effect is further worsened when the transform-based audio codec operates at low delay, due to the worse frequency resolution and/or selectivity introduced by a shorter transform size and/or a worse window frequency response.
This inter-harmonic noise is generally perceived as a very annoying “warbling” artifact, which significantly reduces the performance of the transform-based audio codec when subjectively evaluated on highly tonal audio material like some music or voiced speech.
A common solution to this problem is to employ prediction-based techniques, typically prediction using autoregressive (AR) modeling based on the addition or subtraction of past input or decoded samples, either in the transform-domain or in the time-domain.
However, using such techniques in signals with changing temporal structure again leads to unwanted effects such as temporal smearing of percussive musical events or speech plosives or even the creation of impulse trails due to the repetition of a single impulse-like transient. Thus, special care has to be taken for signals that contain both transient and harmonic components or for signals where there is ambiguity between transients and trains of pulses (the latter belonging to a harmonic signal composed of individual pulses of very short duration; such signals are also known as pulse-trains).
Several solutions exist to improve the subjective quality of transform-based audio codecs on harmonic audio signals. All of them exploit the long-term periodicity (pitch) of very harmonic, stationary waveforms, and are based on prediction-based techniques, either in the transform-domain or in the time-domain. Most of the solutions are known as either long-term prediction (LTP) or pitch prediction, characterized by a pair of filters being applied to the signal: a pre-filter in the encoder (usually as a first step in the time or frequency domain) and a post-filter in the decoder (usually as a last step in the time or frequency domain). A few other solutions, however, apply only a single post-filtering process on the decoder side, generally known as harmonic post-filter or bass post-filter. All of these approaches, regardless of being pre- and post-filter pairs or only post-filters, will be denoted as a harmonic filter tool in the following.
Examples of transform-domain approaches are:
Examples of time-domain approaches applying both pre- and post-filtering are:
Examples of time-domain approaches where only post-filtering is applied are:
An example of a transient detector is:
Relevant literature on psychoacoustics:
Springer, Dec. 14, 2006.
All the techniques described in the prior art decide when to enable the prediction filter based on a single threshold decision (e.g. prediction gain [5] or pitch gain [4] or harmonicity, which is basically proportional to the normalized correlation [6]). Furthermore, OPUS [7] employs hysteresis that increases the threshold if the pitch is changing and decreases the threshold if the gain in the previous frame was above a predefined fixed threshold. OPUS [7] also disables the long-term (pitch) predictor if a transient is detected in some specific frame configurations. The reason for this design seems to stem from the general belief that, in a mix of harmonic and transient signal components, the transient dominates the mix, and activating LTP or pitch prediction upon it would, as discussed earlier, subjectively cause more harm than improvement. However, for some mixtures of waveforms which will be discussed hereafter, activating the long-term or pitch predictor on transient audio frames significantly increases the coding quality or efficiency and thus is beneficial. Furthermore, when activating the predictor, it may be beneficial to vary its strength based on instantaneous signal characteristics other than a prediction gain, which is the only approach in the state of the art.
Accordingly, it is an object of the present invention to provide a concept for a harmonicity-dependent controlling of a harmonic filter tool of an audio codec which results in an improved coding efficiency, e.g. improved objective coding gain or better perceptual quality or the like.
According to an embodiment, an apparatus for performing a harmonicity-dependent controlling of a harmonic filter tool of an audio codec may have: a pitch estimator configured to determine a pitch of an audio signal to be processed by the audio codec; a harmonicity measurer configured to determine a measure of harmonicity of the audio signal using the pitch; a temporal structure analyzer configured to determine, depending on the pitch, at least one temporal structure measure measuring a characteristic of a temporal structure of the audio signal; a controller configured to control the harmonic filter tool depending on the temporal structure measure and the measure of harmonicity.
According to an embodiment, an audio encoder or audio decoder may have a harmonic filter tool and the apparatus for performing a harmonicity-dependent controlling of the harmonic filter tool as mentioned above.
According to an embodiment, a system may have: an apparatus for performing a harmonicity-dependent controlling of a harmonic filter tool as mentioned above, wherein the controller is configured to control the harmonic filter tool at units of frames, and the temporal structure analyzer is configured to sample an energy of the audio signal at a sample rate higher than a frame rate of the frames so as to acquire energy samples of the audio signal and to determine the at least one temporal structure measure on the basis of the energy samples; and a transient detector configured to detect transients in an audio signal to be processed by the audio codec on the basis of the energy samples.
Another embodiment may have a transform-based encoder having the system as mentioned above, configured to switch a transform block and/or overlap length depending on the detected transients.
Another embodiment may have an audio encoder having the system as mentioned above, configured to support switching between a transform coded excitation mode and a code excited linear prediction mode depending on the detected transients.
According to an embodiment, a method for performing a harmonicity-dependent controlling of a harmonic filter tool of an audio codec may have the steps of: determining a pitch of an audio signal to be processed by the audio codec; determining a measure of harmonicity of the audio signal using the pitch; determining, depending on the pitch, at least one temporal structure measure measuring a characteristic of a temporal structure of the audio signal; controlling the harmonic filter tool depending on the temporal structure measure and the measure of harmonicity.
Another embodiment may have a non-transitory digital storage medium having a computer program stored thereon to perform the method for performing a harmonicity-dependent controlling of a harmonic filter tool of an audio codec, which method may have the steps of: determining a pitch of an audio signal to be processed by the audio codec; determining a measure of harmonicity of the audio signal using the pitch; determining, depending on the pitch, at least one temporal structure measure measuring a characteristic of a temporal structure of the audio signal; controlling the harmonic filter tool depending on the temporal structure measure and the measure of harmonicity; when said computer program is run by a computer.
It is a basic finding of the present application that the coding efficiency of an audio codec using a controllable (switchable or even adjustable) harmonic filter tool may be improved by performing the harmonicity-dependent controlling of this tool using a temporal structure measure in addition to a measure of harmonicity. In particular, the temporal structure of the audio signal is evaluated in a manner which depends on the pitch. This enables a situation-adapted control of the harmonic filter tool: in situations where a control based solely on the measure of harmonicity would decide against or reduce the usage of this tool, although using the harmonic filter tool would increase the coding efficiency in that situation, the harmonic filter tool is applied, while in other situations where the harmonic filter tool may be inefficient or even destructive, the control reduces the application of the harmonic filter tool appropriately.
Embodiments of the present application are set out below with respect to the figures among which
The following description starts with a first detailed embodiment of a harmonic filter tool control. A brief survey of the thoughts which led to this first embodiment is presented. These thoughts, however, also apply to the subsequently explained embodiments. Thereafter, generalizing embodiments are presented, followed by specific concrete examples for audio signal portions in order to more concretely outline the effects resulting from embodiments of the present application.
The decision mechanism for enabling or controlling a harmonic filter tool of, for example, a prediction-based technique is based on a combination of a harmonicity measure, such as a normalized correlation or prediction gain, and a temporal structure measure, e.g. a temporal flatness measure or energy change.
The decision may, as outlined below, not be dependent just on the harmonicity measure from the current frame, but also on a harmonicity measure from the previous frame and on a temporal structure measure from the current and, optionally, from the previous frame.
The decision scheme may be designed such that the prediction based technique is enabled also for transients, whenever using it would be psychoacoustically beneficial as concluded by a respective model.
Thresholds used for enabling the prediction based technique may be, in one embodiment, dependent on the current pitch instead of on the pitch change.
The decision scheme makes it possible, for example, to avoid the repetition of a specific transient while still allowing the prediction based technique for some transients and for signals with specific temporal structures where a transient detector would normally signal short transform blocks (i.e. the existence of one or more transients).
The decision technique presented below may be applied to any of the prediction-based methods described above, either in the transform-domain or in the time-domain, either pre-filter plus post-filter or post-filter only approaches. Moreover, it can be applied to predictors operating band-limited (with lowpass) or in subbands (with bandpass characteristics).
The overall objective regarding the activating of LTP, pitch prediction, or harmonic post-filtering is that both of the following conditions are achieved:
Determining whether there is an objective benefit to using the filter is usually performed by means of autocorrelation and/or prediction gain measures on the target signal and is well known [1-7].
The measurement of a subjective benefit is also straightforward at least for stationary signals, since perceptual improvement data obtained through listening tests are typically proportional to the corresponding objective measures, i.e. the abovementioned correlation and/or prediction gain.
Identifying or predicting the existence of artifacts caused by the filtering, though, may use more sophisticated techniques than simple comparisons of objective measures like frame type (long transforms for stationary vs. short transforms for transient frames) or prediction gain to certain thresholds, as is done in the state of the art. Essentially, in order to prevent artifacts one has to ensure that the changes the filtering causes in the target waveform do not significantly exceed a time-varying spectro-temporal masking threshold anywhere in time or frequency. The decision scheme in accordance with some of the embodiments presented below, thus, uses the following filter decision and control scheme consisting of three algorithmic blocks to be executed in series for each frame of the audio signal to be coded and/or subjected to the filtering:
In other embodiments described later, the three-block structure is slightly modified.
In other words, harmonicity and T/F envelope measures are obtained in corresponding blocks, which are subsequently used to derive psychoacoustic excitation patterns of both the input and filtered output frames, and finally the filter gain is adapted such that a masking threshold, given by a ratio between the “actual” and the “original” envelope, is not significantly exceeded. To appreciate this, it should be noted that an excitation pattern in this context is very similar to a spectrogram-like representation of the signal being examined, but exhibits temporal smoothing modeled after certain characteristics of human hearing and manifesting itself as “post-masking”.
In order to avoid expensive computations of excitation patterns in the proposed filter-activation decision scheme, low-complexity envelope measures are used as estimates of the characteristics of the excitation patterns. It was found that in the T/F envelope measurement block, data such as segmental energies (SE), temporal flatness measure (TFM), maximum energy change (MEC) or traditional frame configuration info such as the frame type (long/stationary or short/transient) suffice to derive estimates of psychoacoustic criteria. These estimates then can be utilized in the filter gain computation block to determine, with high accuracy, an optimal filter gain to be employed for coding or transmission. In order to prevent a computationally intensive search for the globally optimal gain, a rate-distortion loop over all possible filter gains (or a sub-set thereof) can be substituted by one-time conditional operators. Such “cheap” operators serve to decide whether some filter gain, computed using data from the harmonicity and T/F envelope measurement blocks, shall be set to zero (decision not to use harmonic filtering) or not (decision to use harmonic filtering). Note that the harmonicity measurement block can remain unchanged. A step-by-step realization of this low-complexity embodiment is described hereafter.
As noted, the “initial” filter gain subjected to the one-time conditional operators is derived using data from the harmonicity and T/F envelope measurement blocks. More specifically, the “initial” filter gain may be equal to the product of the time-varying prediction gain (from the harmonicity measurement block) and a time-varying scale factor (from the psychoacoustic envelope data of the T/F envelope measurement block). In order to further reduce the computational load a fixed, constant scale factor such as 0.625 may be used instead of the signal-adaptive time-variant one. This typically retains sufficient quality and is also taken into account in the following realization.
A step-by-step description of a concrete embodiment for controlling of the filter tool is laid out now.
The input signal sHP(n) is input to the time-domain transient detector. The input signal sHP(n) is high-pass filtered. The transfer function of the transient detection's HP filter is given by
$$H_{TD}(z) = 0.375 - 0.5\,z^{-1} + 0.125\,z^{-2} \qquad (1)$$
The signal, filtered by the transient detection's HP filter, is denoted as sTD(n). The HP-filtered signal sTD(n) is segmented into 8 consecutive segments of the same length. The energy of the HP-filtered signal sTD(n) for each segment is calculated as:
where the segment length is the number of samples in a 2.5-millisecond segment at the input sampling frequency.
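The filtering and segmentation steps above can be sketched as follows. This is a minimal illustration, not the codec's implementation: the function names are hypothetical, the segment energy is assumed to be the plain sum of squared samples, and samples before the start of the buffer are assumed to be zero (a real encoder would carry the filter state over from the previous frame).

```python
def hp_filter_td(s_hp):
    """Apply H_TD(z) = 0.375 - 0.5 z^-1 + 0.125 z^-2 as a direct-form FIR filter.

    Assumption: samples before the start of the buffer are treated as zero.
    """
    taps = (0.375, -0.5, 0.125)
    s_td = []
    for n in range(len(s_hp)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * s_hp[n - k]
        s_td.append(acc)
    return s_td


def segment_energies(s_td, num_segments=8):
    """Split the frame into equal-length segments and sum the squared samples of each.

    Assumption: the segment energy is the plain sum of squares.
    """
    seg_len = len(s_td) // num_segments
    return [
        sum(v * v for v in s_td[i * seg_len:(i + 1) * seg_len])
        for i in range(num_segments)
    ]


# Example: a constant (DC) input is rejected after the two-sample filter start-up.
frame = [1.0] * 16
energies = segment_energies(hp_filter_td(frame))
```

The filter taps sum to zero, so a constant input only produces energy during the start-up samples of the first segment. With a 20 ms frame at 12.8 kHz, for instance, each of the 8 segments would hold 32 samples.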
An accumulated energy is calculated using:
$$E_{Acc} = \max\bigl(E_{TD}(i-1),\; 0.8125\,E_{Acc}\bigr) \qquad (3)$$
An attack is detected if the energy of a segment ETD(i) exceeds the accumulated energy by a constant factor attackRatio=8.5 and the attackIndex is set to i:
$$E_{TD}(i) > \mathrm{attackRatio} \cdot E_{Acc} \qquad (4)$$
If no attack is detected based on the criteria above, but a strong energy increase is detected in segment i, the attackIndex is set to i without indicating the presence of an attack. The attackIndex is basically set to the position of the last attack in a frame with some additional restrictions.
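The accumulated-energy update (3) and the attack test (4) can be sketched as below. This is a simplified illustration under stated assumptions: the function name and the way state is passed between frames (`e_prev_last` for E_TD(-1) and `e_acc` for the carried-over accumulated energy) are hypothetical, and the secondary rule for a strong energy increase without an attack, as well as the additional restrictions on the attack index, are deliberately left out.

```python
ATTACK_RATIO = 8.5   # constant factor from the text
DECAY = 0.8125       # forgetting factor of the accumulated energy, equation (3)


def detect_attack(e_td, e_prev_last, e_acc=0.0):
    """Return (attack_index, e_acc) for one frame of segment energies.

    e_td        : segment energies E_TD(0..7) of the current frame
    e_prev_last : E_TD(-1), the last segment energy of the previous frame
    e_acc       : accumulated energy carried over from the previous frame
    attack_index is the index of the last detected attack, or None.
    """
    attack_index = None
    prev = e_prev_last
    for i, energy in enumerate(e_td):
        e_acc = max(prev, DECAY * e_acc)       # equation (3)
        if energy > ATTACK_RATIO * e_acc:      # equation (4)
            attack_index = i
        prev = energy
    return attack_index, e_acc


# Example: a single energy burst in segment 3 of an otherwise flat frame.
index, _ = detect_attack([1.0, 1.0, 1.0, 20.0, 1.0, 1.0, 1.0, 1.0],
                         e_prev_last=1.0, e_acc=1.0)
```

Because the accumulated energy decays slowly after a burst, a second, weaker burst shortly after a strong one is not flagged as a new attack.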
The energy change for each segment is calculated as:
The temporal flatness measure is calculated as:
The maximum energy change is calculated as:
$$MEC(N_{past}, N_{new}) = \max\bigl(E_{chng}(-N_{past}),\, E_{chng}(-N_{past}+1),\, \ldots,\, E_{chng}(N_{new}-1)\bigr) \qquad (7)$$
If the index of Echng(i) or ETD(i) is negative, then it indicates a value from the previous segment, with segment indexing relative to the current frame.
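A minimal sketch of how the two temporal measures could be computed from the segment energies is given below. The maximum follows equation (7). The expressions for the energy change and the temporal flatness measure are not reproduced in this text, so the sketch assumes that the energy change is the ratio of the larger to the smaller of two adjacent segment energies and that the temporal flatness measure is the mean of the energy changes over the same index range; both are assumptions, as are the function names and the small floor `EPS`.

```python
EPS = 1e-12  # hypothetical floor to avoid division by zero for silent segments


def energy_change(e_cur, e_prev):
    """Assumed form: ratio of the larger to the smaller adjacent energy (>= 1)."""
    lo = max(min(e_cur, e_prev), EPS)
    hi = max(max(e_cur, e_prev), EPS)
    return hi / lo


def temporal_measures(energy, n_past, n_new):
    """Return (tfm, mec) over segment indices -n_past .. n_new - 1.

    energy maps a segment index (negative = past segments) to E_TD(i) and
    must contain every index from -n_past - 1 to n_new - 1.
    """
    changes = [energy_change(energy[i], energy[i - 1])
               for i in range(-n_past, n_new)]
    tfm = sum(changes) / len(changes)   # assumed: mean energy change
    mec = max(changes)                  # equation (7)
    return tfm, mec


# Example: flat energies except for one burst in segment 3.
e = {i: 1.0 for i in range(-9, 8)}
e[3] = 16.0
tfm, mec = temporal_measures(e, n_past=8, n_new=8)
```

Under these assumptions a perfectly flat envelope yields the value 1 for both measures, and a single burst raises the maximum much more than the mean.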
Npast is the number of segments from the past frames. It is equal to 0 if the temporal flatness measure is calculated for use in the ACELP/TCX decision. If the temporal flatness measure is calculated for the TCX LTP decision, then it is equal to:
Nnew is the number of segments from the current frame. It is equal to 8 for non-transient frames. For transient frames first the locations of the segments with the maximum and the minimum energy are found:
If ETD(imin)>0.375ETD(imax) then Nnew is set to imax−3, otherwise Nnew is set to 8.
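The selection of Nnew can be sketched as follows. The expressions for the locations of the maximum and minimum energy are not reproduced in this text, so the sketch assumes that both are searched over the segment indices from -Npast to 7; the function name and the dictionary-based indexing are hypothetical.

```python
def select_n_new(energy, n_past, is_transient, num_segments=8):
    """Return Nnew, the number of current-frame segments used for the measures.

    energy maps a segment index (negative = past segments) to E_TD(i).
    Assumption: i_max and i_min are searched over -n_past .. num_segments - 1.
    """
    if not is_transient:
        return num_segments
    indices = range(-n_past, num_segments)
    i_max = max(indices, key=lambda i: energy[i])
    i_min = min(indices, key=lambda i: energy[i])
    if energy[i_min] > 0.375 * energy[i_max]:
        return i_max - 3
    return num_segments


# Example: a mild energy peak in segment 5 shortens the analysis region.
e = {i: 1.0 for i in range(-8, 8)}
e[5] = 2.0
n_new_mild = select_n_new(e, n_past=8, is_transient=True)
```

Note that the result can be negative when the maximum lies in the past segments, in which case only past segments contribute to the temporal measures.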
The overlap length and the transform block length of the TCX are dependent on the existence of a transient and its location.
The transient detector described above basically returns the index of the last attack with the restriction that if there are multiple transients then MINIMAL overlap is more advantageous than HALF overlap which is more advantageous than FULL overlap. If an attack at position 2 or 6 is not strong enough then HALF overlap is chosen instead of the MINIMAL overlap.
One pitch lag (integer part + fractional part) per frame is estimated (frame size e.g. 20 ms). This is done in 3 steps to reduce complexity and improve estimation accuracy.
a. First Estimation of the Integer Part of the Pitch Lag
A pitch analysis algorithm that produces a smooth pitch evolution contour is used (e.g. Open-loop pitch analysis described in Rec. ITU-T G.718, sec. 6.6). This analysis is generally done on a subframe basis (subframe size e.g. 10 ms), and produces one pitch lag estimate per subframe. Note that these pitch lag estimates do not have any fractional part and are generally estimated on a downsampled signal (sampling rate e.g. 6400 Hz). The signal used can be any audio signal, e.g. a LPC weighted audio signal as described in Rec. ITU-T G.718, sec. 6.5.
b. Refinement of the Integer Part of the Pitch Lag
The final integer part of the pitch lag is estimated on an audio signal x[n] running at the core encoder sampling rate, which is generally higher than the sampling rate of the downsampled signal used in a. (e.g. 12.8 kHz, 16 kHz, 32 kHz . . . ). The signal x[n] can be any audio signal e.g. a LPC weighted audio signal.
The integer part of the pitch lag is then the lag Tint that maximizes the autocorrelation function
with d around a pitch lag T estimated in step 1.a.
$$T - \delta_1 \le d \le T + \delta_2$$
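The refinement of the integer lag can be sketched as a small search around the coarse estimate. The autocorrelation expression is not reproduced in this text, so the sketch assumes the common form C(d) = sum of x[n]·x[n−d] over the analysis window; the function names and the window parameters `start` and `length` are hypothetical.

```python
def autocorr(x, start, length, d):
    """Assumed form: C(d) = sum of x[n] * x[n - d] for n in [start, start + length)."""
    return sum(x[n] * x[n - d] for n in range(start, start + length))


def refine_integer_lag(x, start, length, t_coarse, delta1, delta2):
    """Return the lag d in [t_coarse - delta1, t_coarse + delta2] maximizing C(d).

    The caller must ensure start - (t_coarse + delta2) >= 0 so that all
    delayed samples exist in x.
    """
    candidates = range(t_coarse - delta1, t_coarse + delta2 + 1)
    return max(candidates, key=lambda d: autocorr(x, start, length, d))


# Example: an impulse train with a period of 37 samples and a coarse estimate of 35.
x = [1.0 if n % 37 == 0 else 0.0 for n in range(500)]
t_int = refine_integer_lag(x, start=50, length=370, t_coarse=35, delta1=4, delta2=4)
```

Restricting the search to a few lags around the coarse estimate keeps the cost low even though the signal runs at the higher core sampling rate.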
c. Estimation of the Fractional Part of the Pitch Lag
The fractional part is found by interpolating the autocorrelation function C(d) computed in step 2.b. and selecting the fractional pitch lag Tfr which maximizes the interpolated autocorrelation function. The interpolation can be performed using a low-pass FIR filter as described in e.g. Rec. ITU-T G.718, sec. 6.6.7.
If the input audio signal does not contain any harmonic content or if a prediction based technique would introduce distortions in time structure (e.g. repetition of a short transient), then no parameters are encoded in the bitstream. Only 1 bit is sent such that the decoder knows whether it has to decode the filter parameters or not. The decision is made based on several parameters:
Normalized correlation at the integer pitch-lag estimated in step 3.b.
The normalized correlation is 1 if the input signal is perfectly predictable by the integer pitch-lag, and 0 if it is not predictable at all. A high value (close to 1) would then indicate a harmonic signal. For a more robust decision, beside the normalized correlation for the current frame (norm_corr(curr)) the normalized correlation of the past frame (norm_corr(prev)) can also be used in the decision, e.g.:
One example decision is shown in
The principle of the decision logic is depicted in the block diagram in
The “threshold” in
It is obvious from the examples above that the detection of a transient affects which decision mechanism for the long term prediction will be used and what part of the signal will be used for the measurements used in the decision, and not that it directly triggers disabling of the long term prediction.
The temporal measures used for the transform length decision may be completely different from the temporal measures used for the LTP decision or they may overlap or be exactly the same but calculated in different regions.
For low pitched signals the detection of transients is completely ignored if the threshold for the normalized correlation that depends on the pitch lag is reached.
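The normalized correlation at the integer pitch lag, which the decision above relies on, can be sketched as follows. The exact expression is not reproduced in this text, so the sketch uses the usual definition (cross term divided by the geometric mean of the two energies) and clamps negative values to 0 so that the result stays in the range from 0 to 1 described above; the clamping, the function name, and the window parameters are assumptions.

```python
import math


def normalized_correlation(x, start, length, lag):
    """Normalized correlation between x[n] and x[n - lag] over one window.

    Assumptions: standard energy normalization, negative values clamped to 0.
    """
    num = 0.0
    e_cur = 0.0
    e_del = 0.0
    for n in range(start, start + length):
        num += x[n] * x[n - lag]
        e_cur += x[n] * x[n]
        e_del += x[n - lag] * x[n - lag]
    den = math.sqrt(e_cur * e_del)
    if den <= 0.0:
        return 0.0
    return max(0.0, num / den)


# Example: a zero-mean sawtooth with a period of 40 samples.
x = [(n % 40) - 19.5 for n in range(240)]
nc_at_pitch = normalized_correlation(x, start=40, length=160, lag=40)
nc_off_pitch = normalized_correlation(x, start=40, length=160, lag=20)
```

At the true period the value is essentially 1, while at half the period the sawtooth is anti-correlated with its delayed copy and the clamped value is 0.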
The gain is generally estimated on the input audio signal at the core encoder sampling rate, but it can also be any audio signal like the LPC weighted audio signal. This signal is denoted y[n] and can be the same as or different from x[n].
The prediction yp[n] of y[n] is first found by filtering y[n] with the following filter
$$P(z) = B(z, T_{fr})\, z^{-T_{int}}$$
with Tint the integer part of the pitch lag (estimated in step b. above) and B(z,Tfr) a low-pass FIR filter whose coefficients depend on the fractional part of the pitch lag Tfr (estimated in step c. above).
One example of B(z) when the pitch lag resolution is ¼:
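$$
\begin{aligned}
T_{fr} = 0/4:&\quad B(z) = 0.0000\,z^{-2} + 0.2325\,z^{-1} + 0.5349\,z^{0} + 0.2325\,z^{1} \\
T_{fr} = 1/4:&\quad B(z) = 0.0152\,z^{-2} + 0.3400\,z^{-1} + 0.5094\,z^{0} + 0.1353\,z^{1} \\
T_{fr} = 2/4:&\quad B(z) = 0.0609\,z^{-2} + 0.4391\,z^{-1} + 0.4391\,z^{0} + 0.0609\,z^{1} \\
T_{fr} = 3/4:&\quad B(z) = 0.1353\,z^{-2} + 0.5094\,z^{-1} + 0.3400\,z^{0} + 0.0152\,z^{1}
\end{aligned}
$$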
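Applying the prediction filter P(z) with the example coefficients above can be sketched as follows. The coefficient table is taken from the text; the function name, the indexing of the fractional part in quarter samples, and the window parameters are hypothetical choices made for the illustration.

```python
# Coefficients of z^-2, z^-1, z^0, z^1 for T_fr = 0/4, 1/4, 2/4, 3/4 (from the text).
B_TABLE = {
    0: (0.0000, 0.2325, 0.5349, 0.2325),
    1: (0.0152, 0.3400, 0.5094, 0.1353),
    2: (0.0609, 0.4391, 0.4391, 0.0609),
    3: (0.1353, 0.5094, 0.3400, 0.0152),
}
EXPONENTS = (-2, -1, 0, 1)


def predict(y, start, length, t_int, t_fr_quarters):
    """Return yp[n] for n in [start, start + length) using P(z) = B(z, T_fr) z^-T_int.

    A term b * z^e * z^-T_int contributes b * y[n - t_int + e]. The caller
    must ensure that all accessed indices exist in y.
    """
    coeffs = B_TABLE[t_fr_quarters]
    yp = []
    for n in range(start, start + length):
        acc = 0.0
        for b, e in zip(coeffs, EXPONENTS):
            acc += b * y[n - t_int + e]
        yp.append(acc)
    return yp


# Example: the response to a single impulse at n = 10 with T_int = 5, T_fr = 0.
y = [0.0] * 32
y[10] = 1.0
yp = predict(y, start=14, length=4, t_int=5, t_fr_quarters=0)
```

For a zero fractional part the impulse response is symmetric around the integer lag, while the other rows shift the center of gravity by a quarter sample each.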
The gain g is then computed as follows:
and limited between 0 and 1.
Finally, the gain is quantized e.g. on 2 bits, using e.g. uniform quantization.
If the gain is quantized to 0, then no parameters are encoded in the bitstream, only the 1 decision bit (bit=0).
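The gain computation and quantization can be sketched as below. The gain expression is not reproduced in this text, so the sketch assumes the least-squares optimal gain (cross term of y and yp divided by the energy of yp), limited to the range from 0 to 1 as stated. The 2-bit uniform quantizer with levels 0, 1/3, 2/3 and 1 is a hypothetical choice for illustration; an actual codec may use different levels.

```python
def ltp_gain(y, yp):
    """Assumed form: g = sum(y * yp) / sum(yp * yp), limited to [0, 1]."""
    num = sum(a * b for a, b in zip(y, yp))
    den = sum(b * b for b in yp)
    g = num / den if den > 0.0 else 0.0
    return min(1.0, max(0.0, g))


def quantize_gain_2bit(g):
    """Hypothetical uniform 2-bit quantizer: index 0..3 maps to 0, 1/3, 2/3, 1."""
    index = int(g * 3.0 + 0.5)
    return index, index / 3.0


def decision_bit(g):
    """A gain quantized to index 0 means that no filter parameters are sent (bit = 0)."""
    index, _ = quantize_gain_2bit(g)
    return 0 if index == 0 else 1


# Example: the prediction is twice as large as the target, so the gain is 0.5.
g = ltp_gain([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
index, g_q = quantize_gain_2bit(g)
```

A small gain therefore switches the tool off entirely instead of spending bits on a prediction that barely helps.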
The description brought forward so far motivated and outlined the advantages of embodiments of the present application for a harmonicity-dependent control of a harmonic filter tool, including those outlined below, which generalize the step-by-step embodiment above. Some of the description so far was very specific, although the harmonicity-dependent control concept may also advantageously be used in the framework of other audio codecs and may be varied relative to the specific details outlined in the foregoing. For this reason, embodiments of the present application are described again in the following in a more generic manner. Nevertheless, from time to time the following description refers back to the detailed description above in order to show how the generically described elements occurring below may be implemented in accordance with further embodiments. In doing so, it should be noted that all of these specific implementation details may be individually transferred from the above description to the elements described below. Accordingly, whenever the description below refers to the description above, this reference is meant to be independent of further references to the above description.
Thus, a more generic embodiment which emerges from the above detailed description is depicted in
The apparatus 10 further comprises a temporal structure analyzer 24 configured to determine at least one temporal structure measure 26 in a manner dependent on the pitch lag 18, measure 26 measuring a characteristic of a temporal structure of the audio signal 12. For example, the dependency may lie in the positioning of the temporal region within which measure 26 measures the characteristic of a temporal structure of the audio signal 12, as described above and later in more detail. For the sake of completeness, however, it is briefly noted that the dependency of the determination of measure 26 on the pitch lag 18 may also be embodied differently from the description above and below. For example, instead of positioning the temporal portion, i.e. the determination window, in a manner dependent on the pitch lag, the dependency could merely temporally vary the weights with which a respective time interval of the audio signal, within a window positioned independently of the pitch lag relative to the current frame, contributes to the measure 26. Relating to the description below, this may mean that the determination window 36 could be steadily located to correspond to the concatenation of the current and previous frames, and that the pitch-dependently located portion merely functions as a window of increased weight at which the temporal structure of the audio signal influences the measure 26. However, for the time being, it is assumed that the temporal window is positioned according to the pitch lag. Temporal structure analyzer 24 corresponds to the T/F envelope measure calculation block of
Finally, the apparatus of
The mode of operation of apparatus 10 is as follows. In particular, the task of apparatus 10 is to control the harmonic filter tool of an audio codec, and although the above-outlined more detailed description with respect to
As became clear from the above discussion, the harmonic filter tool which is illustrated in
In particular, such a tool 30 is especially useful in low bitrate scenarios where the quantization noise introduced would, without tool 30, lead to audible artifacts in such harmonic phases. It is important, however, that filter tool 30 does not negatively affect other temporal phases of the audio signal which are not predominantly harmonic. Further, as outlined above, filter tool 30 may be of the post-filter approach or pre-filter plus post-filter approach. Pre- and/or post-filters may operate in the transform domain or the time domain. For example, a post-filter of tool 30 may have a transfer function having local maxima arranged at spectral distances corresponding to, or being set dependent on, pitch lag 18. The implementation of pre-filter and/or post-filter in the form of an LTP filter, in the form of, for example, an FIR and IIR filter, respectively, is also feasible. The pre-filter may have a transfer function being substantially the inverse of the transfer function of the post-filter. In effect, the pre-filter seeks to hide the quantization noise within the harmonic component of the audio signal by increasing the quantization noise within the harmonic of the current pitch of the audio signal and the post-filter reshapes the transmitted spectrum accordingly. In case of the post-filter only approach, the post-filter actually modifies the transmitted audio signal so as to filter quantization noise occurring between the harmonics of the audio signal's pitch.
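To illustrate what a transfer function with local maxima at spectral distances set by the pitch lag looks like, the following sketch evaluates the magnitude response of a simple one-tap IIR comb filter H(z) = 1 / (1 − g·z^(−T)). This filter is a hypothetical stand-in chosen for illustration only; it is not claimed to be the post-filter of any particular codec, and the gain of 0.5 and lag of 80 samples are arbitrary example values.

```python
import cmath
import math


def comb_response(gain, lag, omega):
    """Magnitude of H(e^{j omega}) for the illustrative filter H(z) = 1 / (1 - gain * z^-lag)."""
    return abs(1.0 / (1.0 - gain * cmath.exp(-1j * omega * lag)))


# Example: lag of 80 samples, i.e. harmonics at multiples of 2*pi/80 rad/sample.
GAIN = 0.5
LAG = 80
on_harmonic = comb_response(GAIN, LAG, 2.0 * math.pi * 3 / LAG)         # third harmonic
between_harmonics = comb_response(GAIN, LAG, 2.0 * math.pi * 3.5 / LAG)  # midway to the fourth
```

On the harmonics the magnitude is 1/(1 − g), while midway between two harmonics it drops to 1/(1 + g), so components between the harmonics, such as inter-harmonic quantization noise, are attenuated relative to the harmonics themselves.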
It should be noted that
As far as the harmonicity measurer 20 is concerned, it has become clear from the discussion above with respect to
For further details and possible implementations of the pitch estimator 16, reference is made to the section “pitch estimation” brought forward above. Possible implementations of the harmonicity measurer 20 were discussed above with respect to the equation of norm.corr. However, as also described above, the term “harmonicity measure” shall include not only a normalized correlation but also hints at measuring the harmonicity such as a prediction gain of the harmonic filter, wherein that harmonic filter may be equal to or may be different to the pre-filter of filter 230 in case of using the pre/post-filter approach and irrespective of the audio codec using this harmonic filter or as to whether this harmonic filter is merely used by harmonic measurer 20 so as to determine measure 22.
As was described above with respect to
For the time being, it is illustratively assumed that the current frame for which the controlling task of controller 28 is performed, is frame 34a. As was described above and as is illustrated in
The temporally future-heading end 40 of temporal region 36, in turn, may be set by temporal structure analyzer 24 depending on the temporal structure of the audio signal within a temporal candidate region 48 extending from the temporally past-heading end 38 of the temporal region 36 to the temporally future-heading end of the current frame, 44. In particular, as has been discussed above, the temporal structure analyzer 24 may evaluate a disparity measure of energy samples of the audio signal within the temporal candidate region 48 so as to decide on the position of the temporally future-heading end 40 of temporal region 36. In the above specific details presented with respect to
As became clear from the above discussion, the placement of the temporal region 36 dependent on pitch lag 18 is advantageous in that the ability of apparatus 10 to correctly identify situations where the harmonic filter tool 30 may advantageously be used is increased. In particular, the correct detection of such situations is made more reliable, i.e. such situations are detected at higher probability without substantially increasing the number of false positive detections.
As was described above with respect to
As already noted above, the energy samples 52 do not necessarily measure the energy of the audio signal 12 in its original, unmodified version. Rather, the energy samples 52 may measure the energy of the audio signal in some modified domain. In the concrete example above, for example, the energy samples measured the energy of the audio signal as obtained after high-pass filtering the same. Accordingly, the audio signal's energy at a spectrally lower region influences the energy samples 52 less than spectrally higher components of the audio signal. Other possibilities exist, however, as well. In particular, it should be noted that the example where the temporal structure analyzer 24 merely uses one value of the at least one temporal structure measure 26 per sample time instant, in accordance with the examples presented so far, is merely one embodiment. Alternatives exist according to which the temporal structure analyzer determines the temporal structure measure in a spectrally discriminating manner so as to obtain one value of the at least one temporal structure measure per spectral band of a plurality of spectral bands. Accordingly, the temporal structure analyzer 24 would then provide to the controller 28 more than one value of the at least one temporal structure measure 26 for the current frame 34a as determined within the temporal region 36, namely one per such spectral band, wherein the spectral bands partition, for example, the overall spectral interval of spectrogram 32.
The transform-based encoder 70 comprises a transformer 80 which subjects the audio signal 12 to a transform. Transformer 80 may use a lapped transform such as a critically sampled lapped transform, an example of which is the MDCT. In the example of
For the sake of completeness only, it should be noted that the order among transformer 80 and spectral shaper 82 has been chosen in
As illustrated in
In the case of
For the sake of completeness,
As already described above, the control signal 98 or 104 is sent, for example, on a regular basis, such as per frame 34. As to the frames, it is noted that same are not necessarily of equal length. The length of the frames 34 may also vary.
The above description, especially the one with regard to
As became clear from the above description of
In particular, in the example of
Thus,
As has been illustrated above with respect to
As also became clear from the above discussion, a transform-based encoder such as the one depicted in
The size of the region in which temporal measures for the LTP decision are calculated is dependent on the pitch (see equation (8)) and this region is different from the region where temporal measures for the transform length are calculated (usually current frame plus look-ahead).
In the example in
In the example in
In both examples (
Here we discuss the behavior of the LTP for impulse and step transients within a harmonic signal, one example of which is given by the signal's spectrogram in
When coding the signal includes the LTP for the complete signal (because the LTP decision is based only on the pitch gain), the spectrogram of the output looks as presented in
The waveform of the signal, whose spectrogram is in
For short impulse like transients (as the first transient in
In
is above the threshold (1/0.375). For the step like transient in
is below the threshold (1/0.375) and thus only the energies from segments −8, −7 and −6 are used in the calculation of the temporal measures. These different choices of the segments in which the temporal measures are calculated lead to the determination of much higher energy fluctuations for impulse-like transients, and thus to disabling the LTP for impulse-like transients and enabling the LTP for step-like transients.
However, in some cases the usage of the temporal measures may be disadvantageous. The spectrogram in
The LTP decision that is dependent on the Temporal Flatness Measure and on the Maximum Energy Change disables the LTP for this type of signal as it detects huge temporal fluctuations of energy.
This sample is an example of the ambiguity between transients and trains of pulses that form a low-pitched signal.
As can be seen in
As can be seen in the same 600 milliseconds excerpt in
Signals of this kind benefit from the LTP, as there is a clear repetitive structure (equivalent to a clear harmonic structure). Since there is a clear energy fluctuation (that can be seen in
Thus, the above embodiments revealed, inter alia, a concept for a better harmonic filter decision for audio coding. It should be restated in passing that slight deviations from said concept are feasible. In particular, as noted above, the audio signal 12 may be a speech or music signal and may be replaced by a pre-processed version of signal 12 for the purpose of pitch estimation, harmonicity measurement, or temporal structure analysis or measurement. Also, the pitch estimation need not be limited to measurements of pitch lags but, as should be known to those skilled in the art, may also be performed via measurements of a fundamental frequency, in the time or a spectral domain, which can easily be converted into an equivalent pitch lag by way of an equation such as “pitch lag=sampling frequency/pitch frequency”. Thus, generally speaking, the pitch estimator 16 estimates the audio signal's pitch which, in turn, manifests itself in pitch lag and pitch frequency.
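The conversion between pitch frequency and pitch lag stated above can be written as a short sketch. The function names and the example values (a 16 kHz sampling frequency and a 200 Hz fundamental) are assumptions for illustration only and do not appear in the embodiments.

```python
def pitch_lag_from_frequency(sampling_frequency_hz, pitch_frequency_hz):
    """Pitch lag in samples: pitch lag = sampling frequency / pitch frequency."""
    if pitch_frequency_hz <= 0:
        raise ValueError("pitch frequency must be positive")
    return sampling_frequency_hz / pitch_frequency_hz


def pitch_frequency_from_lag(sampling_frequency_hz, pitch_lag_samples):
    """Inverse relation: pitch frequency = sampling frequency / pitch lag."""
    if pitch_lag_samples <= 0:
        raise ValueError("pitch lag must be positive")
    return sampling_frequency_hz / pitch_lag_samples


# Hypothetical example: a 200 Hz fundamental sampled at 16 kHz.
lag = pitch_lag_from_frequency(16000, 200)
frequency = pitch_frequency_from_lag(16000, lag)
```

The resulting lag is in general fractional; whether it is rounded to an integer number of samples or kept at fractional resolution is a design choice that this sketch leaves open.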
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, such as, for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
The inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
The above-described embodiments are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the appended patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
14178810.9 | Jul 2014 | EP | regional |
This application is a continuation of U.S. patent application Ser. No. 16/118,316, filed Aug. 30, 2018, which is a divisional application of U.S. Ser. No. 15/411,662, filed Jan. 20, 2017, now issued as U.S. Pat. No. 10,083,706, which is a continuation of copending International Application No. PCT/EP2015/067160, filed Jul. 27, 2015, which is incorporated herein by reference in its entirety, and additionally claims priority from European Application No. EP 14178810.9, filed Jul. 28, 2014, which is incorporated herein by reference in its entirety. The present application is concerned with the decision on controlling a harmonic filter tool, such as one using the pre/post-filter approach or the post-filter-only approach. Such a tool is, for example, applicable to MPEG-D unified speech and audio coding (USAC) and the upcoming 3GPP EVS codec.
Relation | Number | Date | Country |
---|---|---|---|
Parent | 15411662 | Jan 2017 | US |
Child | 16118316 | | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 16118316 | Aug 2018 | US |
Child | 16885109 | | US |
Parent | PCT/EP2015/067160 | Jul 2015 | US |
Child | 15411662 | | US |