Embodiments according to the invention are related to masking threshold determinators, audio encoders, methods and computer programs for determining a masking threshold information.
The human auditory system comprises distinct characteristics for the processing of acoustic stimuli. One such characteristic is, for example, a frequency separation mechanism, performed by the inner ear (cochlea). Furthermore, human hearing is subject to masking effects, wherein a threshold of audibility for one sound is raised by the presence of another sound. In other words, in some situations, in an acoustic scene, a human is only capable of perceiving a fraction of the acoustic signals that are present in the scene. This effect can be used for determining an efficient representation and, respectively, transmission of an acoustic scene, for example for a reproduction of the scene.
Therefore, it is desired to get a concept for the provision of a representation of an acoustic scene, which makes a better compromise between a perceived quality and/or accuracy, a complexity and extent, for example amount of bits, of the representation of the acoustic scene and a computational complexity and an implementation complexity of the concept. This is achieved by the subject matter of the independent claims of the present application. Further embodiments according to the invention are defined by the subject matter of the dependent claims of the present application.
An embodiment may have a masking threshold determinator, wherein the masking threshold determinator is configured to obtain a plurality of bandpass signals using a plurality of filters having different bandwidths; and wherein the masking threshold determinator is configured to obtain a masking threshold information associated with a given frequency region on the basis of bandpass signal values of at least two bandpass signals.
Another embodiment may have an audio encoder for encoding an input audio signal, having the masking threshold determinator as mentioned above.
According to another embodiment, a method for determining masking threshold information may have the steps of: obtaining a plurality of bandpass signals using a plurality of filters having different bandwidths, and obtaining the masking threshold information associated with a given frequency region on the basis of bandpass signal values of at least two bandpass signals.
Another embodiment may have a non-transitory digital storage medium having stored thereon a computer program for performing the method for determining masking threshold information as mentioned above when the computer program is run by a computer.
Embodiments according to the invention comprise a masking threshold determinator, wherein the masking threshold determinator is configured to obtain a plurality of bandpass signals, e.g. filterbank outputs, e.g. xk, using a plurality of, for example bandpass, filters (e.g. individual filters of a filterbank; e.g. IIR filters; e.g. all-pole Gammatone filters; e.g. digital filters) having different bandwidths (e.g. having bandwidths (e.g. 3 dB bandwidths) which increase monotonically (but not necessarily strictly monotonically) with increasing center frequency). Furthermore, the masking threshold determinator is configured to obtain, e.g., to determine or to derive, a masking threshold information (e.g. a masking threshold value associated with a frequency region or a frequency band or a frequency subband; e.g. yb or a value derived from yb, using a linear or non-linear mapping, wherein, for example, a quantization step value may be derived from yb) associated with a given frequency region, e.g. a frequency band or frequency subband, on the basis of bandpass signal values, e.g. xb−1, xb, xb+1, of at least two bandpass signals (e.g. provided using filters having different center frequencies; e.g. of bandpass signal values of all bandpass signals).
The inventors recognized that using a plurality of filters, for example in the form of bandpass filters, a frequency separation as performed in the inner ear can be modelled efficiently. Therefore, the filters comprise different bandwidths, for example in order to represent a frequency dependent selectivity of the human inner ear. Based thereon, an acoustic signal can be processed, in order to approximate what a human would perceive. Accordingly, based on the bandpass signal values of a plurality of such bandpass signals a masking information can be obtained.
Therefore, the inventors recognized that a precise masking threshold information for a specific frequency region can be obtained based on signal values of at least two of the bandpass signals. This masking information can be further used in order to obtain a representation of an audio scene comprising the acoustic signal, so that other acoustic signals which are masked by the acoustic signal, as indicated by the masking threshold information, may not have to be taken into account, for example for an encoding, since a human may not be able to perceive said other acoustic signal.
Furthermore, using the masking threshold information, for example for a determination of scaling factors of a respective quantization, the encoding can be optimized with respect to a masking of quantization noise. As an example, for a high masked threshold, more narrowband noise can be accepted because it will be inaudible, and therefore, for example, a larger quantizer step size can be used.
According to an embodiment of the invention, the bandpass signal values are complex values (e.g. comprising both a real part and an imaginary part; e.g. having a complex number representation using a separate representation of a real part and of an imaginary part, or using a representation of a magnitude (absolute value) and of a phase). Furthermore, the masking threshold determinator may be configured to take phases of the bandpass signal values into account when obtaining the masking threshold information (e.g. a masking threshold value associated with a frequency region or a frequency band or a frequency subband; e.g. yb or a value derived from yb, using a linear or non-linear mapping, wherein, for example, a quantization step value may be derived from yb) associated with the given frequency region (e.g. a frequency band or frequency subband) on the basis of bandpass signal values (e.g. xb−1, xb, xb+1) of the at least two bandpass signals, e.g. provided using filters having different center frequencies.
The inventors recognized that for a modelling of human acoustic perception, inter alia, IIR filters, for example with a nonlinear phase response, may be suitable, e.g. allowing a good approximation of the frequency selectivity with a limited number of parameters. As an example, all-pole Gammatone filters may be used. The inventors recognized that, for example for such filters or, generally speaking, for filters providing complex signal values, incorporation of a respective phase information may improve the calculation of the masking threshold information.
According to an embodiment of the invention, the masking threshold determinator is configured to apply, e.g., multiply, a phase correction value (e.g., zcorr, e.g., zcorr,u, e.g., zcorr,l, e.g., a phase correction value dependent on the applied bandpass signal) to at least one, e.g., each, of the at least two bandpass signals when obtaining the masking threshold information, e.g., in order to obtain a (static) phase correction. The inventors recognized that a multiplicative phase correction may provide a good trade-off between computational effort and improvement of masking threshold determination.
According to an embodiment of the invention, the phase correction value, e.g. zcorr, describes a phase difference, e.g. psrc-pdest, between transfer functions of, for example bandpass, filters having adjacent passbands at a transition frequency, e.g. at a crossing frequency (for example such that output signals of the filters having adjacent passbands, which are generated in response to an input signal of the masking threshold determinator having the transition frequency, are combined in a constructive manner (with a constructive interference, rather than a destructive interference)). A definition of the correction values based on a phase difference between such transfer functions may be calculated efficiently.
According to an embodiment of the invention, between a first passband, e.g., a source band, of a first, e.g. individual, filter, for example of the plurality of filters, and an adjacent second passband, e.g., a destination band, of a second, e.g. individual, filter, for example of the plurality of filters, e.g. adjacent to the first individual filter, there is a transition frequency (e.g., a crossing frequency, e.g. a transition frequency at which transfer functions of the first filter and of the second filter comprise equal magnitudes (e.g. within a tolerance of +/−30 percent)). Furthermore, a phase correction value, e.g., zcorr, e.g., zcorr,u, e.g., zcorr,l, between the first, for example individual, filter and the, for example adjacent, second, e.g. individual, filter may, for example, be (e.g. calculated using) a combination, e.g. a sum of, or a difference between, a first phase shift, e.g., psrc, between a pole, e.g., z∞,band, of the first passband and the crossing frequency and a second phase shift, e.g., pdest, between a pole, e.g., z∞,band, of the second passband and the crossing frequency.
The inventors recognized that a determination of the phase correction based on a transition frequency may be performed in a robust manner. In contrast to other calculation approaches, for example a determination based on centre frequency differences of the plurality of filters, situations wherein a calculation of the phase correction value is not possible, for example, because of phase differences leading to bands cancelling each other out, may be averted. The inventive approach hence allows a determination of phase correction values based on, for example, only adjacent bands of the plurality of filters. Furthermore, this way, a simple vector of phase correction values for each band may be obtained.
According to an embodiment of the invention, the first, for example individual, filter is assigned a first phase correction value, e.g., zcorr,l, relative to an adjacent filter with a smaller band index and a second phase correction value, e.g., zcorr,u, relative to an adjacent filter with a larger band index. Hence, a calculation of correction values both for upward and downward spreading, as will be discussed in detail later, may be performed. Accordingly, phase correction towards higher and lower frequency bands may be performed, allowing for a good modelling of the human auditory masking characteristics.
According to an embodiment of the invention, a phase correction value zcorr between a source band, e.g. a first passband, and a destination band, e.g. a second passband, is determined as
wherein psrc, which is associated with a delay compensated filter of a source band, is determined by
with φsrc(ω) being the phase response of the filter, wherein dsrc is the delay compensation in samples (e.g. look-ahead) (e.g., derived from τg(ωsrc) which is the group delay at the center frequency of the source band), wherein ωcross is the crossing frequency, for example between the source band and the destination band, wherein j is an imaginary unit, wherein pdest, which is associated with a delay compensated filter of a destination band, is determined by
with φdest(ω) being the phase response of the filter, wherein ωcross is the crossing frequency, for example between the source band and the destination band, and wherein ddest is the delay compensation in samples (e.g. look-ahead) (e.g., derived from τg(ωdest) which is the group delay at the center frequency of the destination band).
According to an embodiment of the invention (e.g. for all-pole Gammatone filters with delay alignment) psrc is determined by
wherein O is a filter order, wherein ωc is a center frequency of the filter of the source band, wherein ωcross is the crossing frequency, wherein z∞,src is a pole of the source band, wherein dsrc is the delay compensation in samples (e.g. look-ahead) (e.g., derived from τg(ωsrc) which is the group delay at the center frequency of the source band), wherein arg is an argument operator which provides an argument of a complex number, wherein j is an imaginary unit, wherein pdest (e.g. for all-pole Gammatone filters with delay alignment) is determined by
wherein O is a filter order, wherein ωc is a center frequency of the filter of the destination band, wherein ωcross is the crossing frequency (e.g. between the source band and the destination band), wherein z∞,dest is a pole of the destination band, wherein ddest is the delay compensation in samples (e.g. look-ahead) (e.g., derived from τg(ωdest) which is the group delay at the center frequency of the destination band), wherein arg is an argument operator which provides an argument of a complex number and wherein j is an imaginary unit.
The inventors recognized that based on the above calculations a phase correction value may be provided based on which a determination of the masking threshold information may be improved efficiently.
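The following sketch illustrates one way in which such a phase correction value could be computed for two adjacent all-pole filters. It is not the equation of the embodiment: the transfer function H(z) = 1/(1 − z∞·z^−1)^O, the sign convention of the correction (chosen so that the corrected source response is in phase with the destination response at the crossing frequency), and the names allpole_response and phase_correction are assumptions made for illustration only.

```python
import cmath


def allpole_response(pole, order, omega, delay=0.0):
    """Response of H(z) = 1 / (1 - pole * z^-1)^order at omega (rad/sample).

    The response is advanced by `delay` samples, which models the delay
    compensation (look-ahead). The transfer function form is an assumption.
    """
    z_inv = cmath.exp(-1j * omega)
    h = 1.0 / (1.0 - pole * z_inv) ** order
    return h * cmath.exp(1j * omega * delay)


def phase_correction(pole_src, pole_dest, order, omega_cross,
                     d_src=0.0, d_dest=0.0):
    """Unit-magnitude factor built from the phase difference p_src - p_dest.

    Assumed sign convention: multiplying the source band by the returned
    value aligns its phase with the destination band at omega_cross.
    """
    p_src = cmath.phase(allpole_response(pole_src, order, omega_cross, d_src))
    p_dest = cmath.phase(allpole_response(pole_dest, order, omega_cross, d_dest))
    return cmath.exp(-1j * (p_src - p_dest))


# Hypothetical example values: two neighbouring fourth-order filters.
pole_src = 0.95 * cmath.exp(0.30j)
pole_dest = 0.95 * cmath.exp(0.35j)
z_corr = phase_correction(pole_src, pole_dest, 4, 0.325, d_src=3.0, d_dest=2.0)
```

Since the correction depends only on the filter design and not on the signal, such values may be computed once per pair of adjacent bands and stored.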
According to an embodiment of the invention, the masking threshold determinator is configured to apply a non-linear mapping to respective magnitudes, e.g. by mapping |x| onto |x|^α, of complex values of the bandpass signal while keeping respective phase values, e.g. arg(x), of the complex values, e.g. x, unchanged except for an optional, for example static, phase correction, which may, for example, be independent from x, e.g., performing non-linear spreading.
The inventors recognized that based on a mapping to a non-linear domain, a subsequent processing of the respective magnitudes for the determination of the masking threshold information may be improved. The inventors further recognized that based on the mapped magnitudes a subsequent spreading of bandpass signals may be performed efficiently (e.g. in the non-linear domain), in order to approximate the processing of the human auditory system and to obtain the masking threshold information.
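As a minimal sketch of such a mapping, the function below compresses the magnitude of a complex bandpass value to |x|^α while leaving arg(x) unchanged, which corresponds to x·|x|^(α−1). The function name compress_keep_phase and the handling of a zero input are assumptions for illustration; the optional phase correction is omitted.

```python
import cmath


def compress_keep_phase(x, alpha=0.3):
    """Map |x| onto |x|**alpha while keeping the phase of x unchanged."""
    mag = abs(x)
    if mag == 0.0:
        # The phase of zero is undefined; returning zero is an assumption.
        return 0j
    return x * mag ** (alpha - 1.0)


x = 3 + 4j                     # magnitude 5
x_nl = compress_keep_phase(x)  # magnitude 5**0.3, same phase as x
```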
According to an embodiment of the invention, the masking threshold determinator is configured to obtain a masking threshold information associated with a given frequency region on the basis of magnitudes (e.g., and not a phase) of the bandpass signal values of the at least two bandpass signals (e.g., configured to perform magnitude spreading), e.g. while leaving phases of the bandpass signal values unconsidered.
The inventors recognized that the masking threshold information may, for example exclusively, e.g. with regard to a phase information, be determined based on the magnitudes of the bandpass signals. Hence, the determination may be performed with limited computational effort.
According to an embodiment of the invention, the masking threshold determinator is configured to combine bandpass signal values, for example associated with a frequency region or a frequency band or a frequency subband, of different bandpass signals, which may, for example, be output by different bandpass filters, in order to obtain the masking threshold information, e.g. yb.
The inventors recognized that this way, as an example, after frequency decomposition via the plurality of filters, a spreading of the frequency spectrum may be implemented, in order to approximate human hearing characteristics. Based on a filtering and subsequent spreading of the filter outputs, the inventors recognized that an accurate masking threshold information may be determined.
According to an embodiment of the invention, the masking threshold determinator is configured to obtain the masking threshold information, e.g. yb, associated with the given frequency region (e.g. a given frequency band or a given frequency subband; e.g. designated with index b) using a weighted combination (e.g. a weighted linear combination, e.g. a weighted combination according to equation (7)) of bandpass signal values, e.g. xk, or non-linearly processed versions of the bandpass signal values, e.g. |xk|^α, associated with (or for example of) a plurality of frequency regions (e.g. including the given frequency region and one or more other frequency regions; e.g. frequency bands having indices between 0 and B−1).
The inventors recognized that using a weighted combination, human auditory characteristics may, for example, be included in the determination of the masking information by choice of perceptually adapted weights. Hence, the masking information may be determined as a good approximation of human hearing characteristics.
According to an embodiment of the invention, the masking threshold determinator is configured to perform a weighting of bandpass signal values, or of non-linearly processed versions of the bandpass signal values that decreases with increasing difference between a center frequency of the given frequency region and center frequencies of one or more other frequency regions.
The inventors recognized that a good approximation of human hearing characteristics can be achieved if an impact of bandpass signal values that are further away (at least in one frequency direction, e.g. in an increasing frequency direction and/or in a decreasing frequency direction) from the given frequency region in the weighted combination is smaller than an impact of bandpass signal values that are closer.
According to an embodiment of the invention, the weighting reduces exponentially with increasing difference between frequency band index of given frequency band and frequency band index of other frequency band. The inventors recognized that an exponential reduction approximates human auditory characteristics well, so that a perceptually accurate masking threshold information can be obtained based on the weighted combination.
According to an embodiment of the invention, the bandpass signal values or non-linearly processed versions of the bandpass signal values of frequency regions below (e.g., having a lower frequency than) the given frequency region are weighted differently (e.g., less, e.g., using smaller weights, e.g., using an upper spreading factor u, e.g., u=0.1^0.25≈0.56) than bandpass signal values or non-linearly processed versions of the bandpass signal values of frequency regions above (e.g., having a higher frequency than) the given frequency region (e.g., using larger weights, e.g., using a lower spreading factor l, e.g., l=0.1^0.3875≈0.41).
The inventors recognized that although a weighting decrease in one frequency direction, starting from a given frequency band, hence e.g. in increasing or decreasing frequency direction, may be beneficial, such a weighting decrease and/or the weighting itself may additionally be different for the decreasing frequency direction in contrast to the increasing frequency direction. Hence, an additional degree of freedom may be incorporated with which the masking threshold determination can be adapted further towards the real human hearing perception.
According to an embodiment of the invention, the masking threshold determinator is configured to obtain the masking threshold information based on, e.g., in form of, a spreading output, e.g. y (e.g. comprising at least one masking threshold value associated with a frequency region, e.g., yb), which is determined using the equation
wherein yb designates a spreading output associated with a frequency band b, wherein xk designates a bandpass signal value or a magnitude value of a bandpass signal value of a frequency band having band index k, wherein u designates an upper spreading factor, e.g., u=0.1^0.25≈0.56, wherein l designates a lower spreading factor, e.g., l=0.1^0.3875≈0.41, wherein k designates a running variable, and wherein B designates a number of frequency bands, e.g., 96 bands.
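A spreading of this kind can be sketched as follows. The exact equation of the embodiment is not reproduced here; the form used below, y_b = Σ_{k≤b} x_k·u^(b−k) + Σ_{k>b} x_k·l^(k−b), is an assumption that is merely consistent with the symbol definitions above (bands at or below b contribute with the upper spreading factor, bands above b with the lower spreading factor). The function name spread_linear is hypothetical.

```python
UPPER = 0.1 ** 0.25    # u, approximately 0.56
LOWER = 0.1 ** 0.3875  # l, approximately 0.41


def spread_linear(x, upper=UPPER, lower=LOWER):
    """Weighted combination of band values with exponentially decaying weights.

    Assumed form: bands k <= b are weighted with upper**(b - k),
    bands k > b are weighted with lower**(k - b).
    """
    num_bands = len(x)
    y = []
    for b in range(num_bands):
        from_below = sum(x[k] * upper ** (b - k) for k in range(b + 1))
        from_above = sum(x[k] * lower ** (k - b) for k in range(b + 1, num_bands))
        y.append(from_below + from_above)
    return y


# A single excited band spreads towards higher bands with u and
# towards lower bands with l.
y = spread_linear([0.0, 0.0, 1.0, 0.0, 0.0])
```

The direct double loop costs on the order of B² operations; because the weights are exponential in the band distance, the same result could presumably also be obtained with two recursive passes over the bands.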
According to an embodiment of the invention, the masking threshold determinator is configured to obtain the masking threshold information based on, e.g., in form of, a spreading output, e.g., ynl,b (e.g. comprising at least one masking threshold value associated with a frequency region, e.g., ynl,b), which is determined using the equation
wherein ynl,b designates a spreading output associated with a frequency band having band index b, wherein xnl,k designates a non-linearly processed version of a bandpass signal value or a non-linearly processed version of a magnitude value of a bandpass signal value of a frequency band having band index k (e.g. xnl=|x|^α (e.g., for non-linear magnitude spreading) or xnl=x*|x|^(α−1) (e.g. for non-linear complex spreading)), wherein α designates a non-linear exponent value, e.g., α=0.3, wherein k designates a running variable, wherein unl,k,b designates an upper spreading factor depending on the variables k and b, wherein lnl,k,b designates a lower spreading factor depending on the variables k and b, and wherein B designates a number of frequency bands, e.g., 96 bands.
The above calculations allow for an efficient determination of the masking threshold information, for example, using linear or non-linear spreading.
According to an embodiment of the invention, at least one of the upper spreading factor unl,k,b and the lower spreading factor lnl,k,b is dependent on a difference between the running variable k and the band index b, e.g., unl(k−b) and/or unl(b−k), e.g., lnl(k−b) and/or lnl(b−k). This may allow for an efficient implementation of the frequency dependent weighting, in particular a weighting decreasing with increasing distance to a given frequency band, which may be different in an increasing frequency direction in contrast to a decreasing frequency direction.
According to an embodiment of the invention, at least one of the upper spreading factor unl,k,b and the lower spreading factor lnl,k,b reduces exponentially with increasing difference between the running variable k and the band index b.
According to an embodiment of the invention, the masking threshold determinator is configured to obtain the masking threshold information based on, e.g., in form of, a spreading output, e.g., ynl (e.g. comprising at least one masking threshold value associated with a frequency region, e.g., ynl,b), which is determined using the equation
wherein ynl,b designates a spreading output associated with a frequency band having band index b, wherein xnl,k designates a non-linearly processed version of a bandpass signal value or a non-linearly processed version of a magnitude value of a bandpass signal value of a frequency band having band index k (e.g. xnl=|x|^α (e.g., for non-linear magnitude spreading) or xnl=x*|x|^(α−1) (e.g. for non-linear complex spreading)), wherein α designates a non-linear exponent value, e.g., α=0.3, wherein u designates an upper spreading factor, e.g., u=0.1^0.25≈0.56, wherein l designates a lower spreading factor, e.g., l=0.1^0.3875≈0.41, wherein k designates a running variable, and wherein B designates a number of frequency bands, e.g., 96 bands.
This may allow for an efficient implementation of a non-linear spreading. The exponent value α may provide a further degree of freedom, allowing human hearing characteristics to be approximated efficiently.
According to an embodiment of the invention, the masking threshold determinator is configured to obtain the masking threshold information on the basis of a spreading output mapped back into a linear domain. This may facilitate the determination of the masking information.
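A non-linear magnitude spreading followed by a mapping back into the linear domain may, for example, be sketched as below. This is an illustration under stated assumptions, not the equation of the embodiment: the magnitudes are compressed with |x|^α, combined with exponentially decaying weights in the same assumed form as for linear spreading, and expanded again with the exponent 1/α. The parameter scale is a hypothetical placeholder for any additional scaling, and the function name is likewise hypothetical.

```python
def spread_nonlinear_magnitude(x, alpha=0.3, upper=0.1 ** 0.25,
                               lower=0.1 ** 0.3875, scale=1.0):
    """Spread compressed magnitudes and map the result back to the linear domain."""
    x_nl = [abs(v) ** alpha for v in x]  # compression, phases are discarded
    num_bands = len(x_nl)
    thresholds = []
    for b in range(num_bands):
        y_nl = sum(x_nl[k] * upper ** (b - k) for k in range(b + 1))
        y_nl += sum(x_nl[k] * lower ** (k - b) for k in range(b + 1, num_bands))
        thresholds.append(scale * y_nl ** (1.0 / alpha))  # back-mapping
    return thresholds


# One excited band with magnitude 2: the band itself maps back to 2, and the
# next higher band receives 2 * u**(1 / alpha).
t = spread_nonlinear_magnitude([0j, 0j, 2 + 0j, 0j])
```

Because the decay is applied in the compressed domain, an attenuation of u per band there corresponds to u^(1/α) per band after the back-mapping, i.e. a considerably steeper slope in the linear domain.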
According to an embodiment of the invention, the masking threshold determinator is configured to obtain the masking threshold information using, in case of xnl being defined as xnl=|x|^α (e.g., for non-linear magnitude spreading), the equation
with a scaling factor snl,mag for non-linear magnitude spreading
with B(ω) as a Bark width of frequency ω and ωc,b as a center frequency of a filterbank band b, and the spreading output ynl, or, in case of xnl being defined as xnl=x*|x|^(α−1) (e.g. for non-linear complex spreading), the equation
with a scaling factor snl,mag for non-linear complex spreading
with B(ω) as a Bark width of frequency ω and ωc,b as a center frequency of a filterbank band b, and the spreading output ynl.
The inventors recognized that the non-linear spreading may be improved based on a scaling, e.g. downscaling using scaling factor snl,mag. This factor can depend on the non-linear spreading exponent α and may allow the masking thresholds of the non-linear spreading to be in an appropriate amplitude range for further processing.
According to an embodiment of the invention, a filter subset comprises at least two, e.g., two, three, four, or more, of the plurality of filters, wherein the filters of the filter subset jointly cover an equivalent rectangular bandwidth, ERB (e.g., filterbank characteristics are based on the Bark scale). This may allow providing a simple and efficient covering of the acoustic frequency range.
According to an embodiment of the invention, center frequencies of the filters of the filterbank are chosen such that there are at least three filters per Bark, and/or the filters of the filterbank comprise 3 dB bandwidths which are smaller than a third of a Bark. The inventors recognized that usage of at least three filters per Bark, and/or usage of filters having the above bandwidth characteristics, allows for a good trade-off between a complexity of the inventive processing, e.g. with regard to filter coefficients and computational costs, and an accuracy of the determined masking threshold.
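To illustrate a placement of three filters per Bark, the following sketch generates center frequencies that are equally spaced on a Bark scale. The description does not state which Bark approximation is used; the sketch assumes Traunmüller's approximation z = 26.81·f/(1960 + f) − 0.53 and its inverse, and the function names as well as the covered range of 1 to 24 Bark are hypothetical choices.

```python
def hz_to_bark(f_hz):
    """Traunmueller's approximation of the Bark scale (an assumed choice)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z_bark):
    """Inverse of hz_to_bark, valid for z_bark < 26.28."""
    return 1960.0 * (z_bark + 0.53) / (26.28 - z_bark)


def center_frequencies(z_min=1.0, z_max=24.0, filters_per_bark=3):
    """Center frequencies in Hz, spaced 1 / filters_per_bark Bark apart."""
    count = int(round((z_max - z_min) * filters_per_bark)) + 1
    return [bark_to_hz(z_min + i / filters_per_bark) for i in range(count)]


fc = center_frequencies()  # 70 center frequencies, one third of a Bark apart
```

With such a spacing, the 3 dB bandwidth of each filter would then be chosen smaller than one third of the local Bark width, so that neighbouring passbands meet near a crossing frequency between the two center frequencies.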
According to an embodiment of the invention, the filters of the filterbank are chosen to have an asymmetric transfer function, and/or the filterbank is an all-pole Gammatone filterbank, and/or the filterbank is a complex-valued all-pole Gammatone filterbank. The inventors recognized that one or more of the above characteristics may allow improving and/or facilitating the determination of the masking information.
According to an embodiment of the invention, for the given frequency region, an attenuation of a filter of the filterbank at a frequency one Bark above and/or one Bark below its center frequency is at least 40 dB, or, for example, lies between 50 and 80 dB. The inventors recognized that using such an attenuation or damping may allow a good approximation of human hearing.
According to an embodiment of the invention, at least one of the plurality of filters comprises an All-pole gammatone filter, wherein zeros, e.g., factors of a numerator, from a pole-zero decomposition, e.g., a rational fraction of factorized polynomial functions, of a basic gammatone filter are discarded and optionally, poles with negative imaginary parts are additionally discarded. All-pole gammatone filters may allow for a good approximation of the filtering as performed in the inner ear of a human with limited implementation effort.
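A complex-valued all-pole filter of order O with a single repeated pole can be realized as a cascade of O first-order recursions, as sketched below. This is only an illustration of the all-pole structure: the choice of the pole from center frequency and bandwidth as well as any gain normalization are left out, and the function name allpole_cascade is hypothetical.

```python
def allpole_cascade(signal, pole, order=4):
    """Filter `signal` with H(z) = 1 / (1 - pole * z^-1)^order.

    Each stage is the first-order recursion y[n] = x[n] + pole * y[n-1].
    Keeping only the pole with positive imaginary part (and no zeros)
    yields a complex-valued output for a real-valued input.
    """
    out = [complex(v) for v in signal]
    for _ in range(order):
        state = 0j
        stage = []
        for v in out:
            state = v + pole * state
            stage.append(state)
        out = stage
    return out


# The impulse response of a second-order cascade is (n + 1) * pole**n.
h = allpole_cascade([1.0, 0.0, 0.0, 0.0], pole=0.5j, order=2)
```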
According to an embodiment, the masking threshold determinator is configured to subsample the bandpass signals of the plurality of filters; and to obtain the masking threshold information associated with a given frequency region on the basis of bandpass signal values of at least two subsampled bandpass signals.
It is to be noted that optionally, according to embodiments, thresholds (and hence in general the masking threshold information or a portion thereof) may be obtained at a different spectral resolution, e.g. MDCT and/or AAC scale factor band, e.g. via interpolation. Such a step may, for example, be a final processing step, e.g. after an averaging in time, e.g. a time averaging.
According to an embodiment, the masking threshold determinator is configured to perform an averaging, e.g. of the subsampled masking threshold information, within, e.g. over, a given time interval in order to determine the masking threshold information, e.g. for the given time interval.
According to an embodiment, the masking threshold determinator is configured to determine an averaged masking threshold information using an averaging of the masking threshold information over the given time interval.
According to an embodiment, the masking threshold determinator is configured to perform an averaging of the masking threshold information in a non-linear domain.
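An averaging in a non-linear domain may, for example, be sketched as follows: the threshold values of a time interval are compressed, averaged, and mapped back. Using the spreading exponent α for the compression and the name average_nonlinear are assumptions made for illustration; the description does not specify the mapping used for the averaging.

```python
def average_nonlinear(thresholds, alpha=0.3):
    """Average non-negative threshold values of one band in a compressed domain."""
    if not thresholds:
        raise ValueError("at least one threshold value is required")
    mean_nl = sum(t ** alpha for t in thresholds) / len(thresholds)
    return mean_nl ** (1.0 / alpha)


# For alpha < 1 the result does not exceed the arithmetic mean, so short
# loud segments raise the averaged threshold less than in a linear average.
avg = average_nonlinear([1.0, 0.0], alpha=0.5)  # 0.25 instead of 0.5
```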
According to an embodiment, the masking threshold determinator is configured to apply a non-linear mapping to respective magnitudes of complex values of the bandpass signal while keeping respective phase values of the complex values unchanged except for an optional phase correction, and to perform an averaging based on the non-linearly mapped magnitudes of the bandpass signal.
According to an embodiment, the masking threshold determinator is configured to perform an additional processing, e.g. for considering postmasking, in a non-linear domain.
According to an embodiment, the masking threshold determinator is configured to apply a non-linear mapping to respective magnitudes of complex values of the bandpass signal while keeping respective phase values of the complex values unchanged except for an optional phase correction; and to perform an additional processing of the non-linearly mapped magnitudes of the bandpass signal in order to consider postmasking.
Embodiments according to the invention comprise an audio encoder for encoding an input audio signal, comprising the masking threshold determinator according to any of the preceding claims.
According to an embodiment of the invention, the audio encoder is configured to adjust a quantization step for quantizing the input audio signal, or a preprocessed version thereof, in dependence on the masking threshold.
Using the inventive determination of the masking information, a quantization step size may be chosen, so as to consider only, or at least substantially, the acoustically relevant portion of an audio signal that is to be encoded. Hence, an optimization of the quantization step size may not be hindered by signal portions that are not perceivable by a human because of masking.
According to an embodiment of the invention, the audio encoder is configured to determine, for the given frequency region, the larger value of the obtained masking threshold information, e.g., associated with the given frequency region, and a pre-determined threshold in quiet, e.g., associated with the given frequency region (which may, for example, be referred to as masking threshold in quiet, although it is not necessarily a masking threshold in quiet but, for example, only a threshold in quiet; irrespective of that, both expressions may optionally be used herein, e.g. in an interchangeable fashion), and to select a quantizer step size for the given frequency region based on the determined larger value, and, optionally, to encode an audio signal using the selected quantizer step size. In other words, the determined masking threshold may be compared to a threshold in quiet and the larger of the two may be selected in order to determine an optimized quantization step size for the encoding. Although a quality of the decoded audio signal may not necessarily be improved, based on the above features, the thresholds may be adapted in an improved manner to the human auditory system (e.g. human hearing, e.g. auditory sense). Based thereon, it may be prevented that the thresholds become unnecessarily small and a bitrate of a respective encoded audio signal may be prevented from becoming too large.
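The selection of the larger threshold and a subsequent choice of a quantizer step size may be sketched as follows. The mapping from threshold to step size is not specified in the description; the sketch assumes that the threshold is an allowed noise power and uses the common approximation that a uniform quantizer with step size Δ produces a noise power of Δ²/12. Both function names are hypothetical.

```python
import math


def effective_threshold(masking_threshold, threshold_in_quiet):
    """Larger of the obtained masking threshold and the threshold in quiet."""
    return max(masking_threshold, threshold_in_quiet)


def quantizer_step(allowed_noise_power):
    """Step size whose uniform quantization noise power equals the threshold.

    Assumes noise power = step**2 / 12 (uniform quantizer approximation).
    """
    return math.sqrt(12.0 * allowed_noise_power)


# In a band where the masking threshold falls below the threshold in quiet,
# the threshold in quiet determines the step size.
threshold = effective_threshold(1e-3, 1e-2)
step = quantizer_step(threshold)
```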
Embodiments according to the invention comprise a method for determining masking threshold information. The method comprises obtaining a plurality of bandpass signals, e.g. filterbank outputs; e.g. xk, using a plurality of filters (e.g. individual filters of a filterbank; e.g. IIR filters; e.g. all-pole Gammatone filters; e.g. digital filters) having different bandwidths (e.g. having bandwidths (e.g. 3 dB band-widths) which increase monotonically (but not necessarily strictly monotonically) with in-creasing center frequency). The method further comprises obtaining, e.g., determining or deriving, the masking threshold information (e.g. a masking threshold value associated with a frequency region or a frequency band or a frequency subband; e.g. yb or a value derived from yb, using a linear or non-linear mapping, wherein, for example, a quantization step value may be derived from yb) associated with a given frequency region, e.g. a frequency band or frequency subband, on the basis of bandpass signal values, e.g. xb−1, xb, xb+1, of at least two bandpass signals (e.g. provided using filters having different center frequencies; e.g. of bandpass signal values of all bandpass signals).
The method as described above is based on the same considerations as the above-described masking threshold determinator. The method can optionally be supplemented by any of the features and functionalities which are also described with regard to the masking threshold determinator.
Embodiments according to the invention comprise a computer program for performing any of the methods as disclosed herein when the computer program runs on a computer.
The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various embodiments of the invention are described with reference to the following drawings, in which:
Equal or equivalent elements or elements with equal or equivalent functionality are denoted in the following description by equal or equivalent reference numerals even if occurring in different figures.
In the following description, a plurality of details is set forth to provide a more thorough explanation of embodiments of the present invention. However, it will be apparent to those skilled in the art that embodiments of the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form rather than in detail in order to avoid obscuring embodiments of the present invention. In addition, features of the different embodiments described hereinafter may be combined with each other, unless specifically noted otherwise.
Optionally, the plurality of bandpass signals 111 may comprise complex values, or in other words, bandpass signal values may be provided in the form of complex values, hence having a real part and an imaginary part, comprising a magnitude and phase information. Optionally, threshold analyzer 120 is configured to analyze the phases of the at least two bandpass signals, in order to determine the masking threshold information 121 associated with the given frequency region. Therefore, threshold analyzer 120 may be configured to apply a phase correction value to at least one of the at least two bandpass signals 111 when obtaining the masking threshold information 121. Optional implementations of an inventive phase correction will be further discussed in the following. It is to be noted that a corresponding phase correction determinator may be part of a respective threshold analyzer or a separate unit, as shown in the following.
In the following, further aspects and embodiments of the invention are discussed and other embodiments are explained in other words. In the following, contents are structured in different sections, namely, a first section “Introduction” (e.g. 1.), a second section “Methods” (e.g. 2.), comprising a first subsection “Masking” (e.g. 2.1), a second subsection “All-pole gammatone filter” (e.g. 2.2) and a third subsection “Frequency domain Spreading” (e.g. 2.3.), having a first sub-subsection “Phase correction” (e.g. 2.3.1.), a second sub-subsection “Linear spreading” (e.g. 2.3.2) and a third sub-subsection “Nonlinear spreading” (e.g. 2.3.3.), a third section “Results”, comprising a first subsection “Time resolution” (e.g. 3.1), a second subsection “Magnitude response” (e.g. 3.2), a third subsection “Masked threshold estimation” (e.g. 3.3) and a fourth subsection “Listening test” (e.g. 3.4), a section “Summary of optional aspects regarding non-linear spreading (e.g. disclosing conclusions with regard to embodiments)” (e.g. 3.5), a section “Example temporal/spectral magnitude responses” (e.g. 3.6) (e.g. comprising an appendix, e.g. appendix A, e.g. showing examples of magnitude responses), and a section “Optional aspects regarding subsampling and temporal averaging” (e.g. 3.7).
Reference is made to
Again, referring to
Hence, as discussed before, a quantization step size for the encoding of audio signal 301 may be optimized with regard to masking threshold information 341. As an example, the quantization may be optimized so that perceptually relevant acoustic information is quantized with more bits than perceptually less relevant information. Therefore, the comparison of the threshold in quiet 324 and the threshold 321, representing an approximation of the masking effects happening in a human ear for audio signal 301, may allow determining whether acoustic information is below a general perception threshold (e.g. 324) or below a relative threshold (e.g. 321), so that such information may be neglected or only coarsely quantized for the encoding since it may not be audible anyway.
As an example, a masking threshold determinator, e.g. 100, 200, may hence comprise the filterbank 310 (e.g. representing the plurality of filters 110, 210) and a threshold analyzer, comprising the spreading unit 322 and a comparator (not shown). Inputs of the threshold analyzer may hence be a plurality of bandpass signals 311 and optionally a threshold in quiet 324, in order to output the information 341 about the masking threshold, e.g. as a result of a comparison as explained before.
In line with the above, embodiments comprise an audio encoder for encoding an input audio signal, e.g. 301, the encoder comprising a masking threshold determinator according to any of the embodiments as disclosed herein. As explained before, the audio encoder may be configured to adjust a quantization step, e.g. using a quantization and coding unit 360, for quantizing the input audio signal 301, or a preprocessed version thereof (e.g. a transformed version thereof), in dependence on the masking threshold 341.
In particular, an encoder according to embodiments may be configured to determine for the given frequency region a larger value of the obtained masking threshold information, e.g. 321, and a pre-determined threshold in quiet, e.g. 324 (e.g. referred to as masking threshold in quiet), to select, e.g. using a quantization and coding unit 360, a quantizer step size for the given frequency region based on the determined larger value, and, optionally, to encode the audio signal, e.g. 301, using the selected quantizer step size.
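The selection described above may be illustrated by the following minimal sketch. It is not taken from the embodiments: the function name `select_step_sizes`, the parameter `scale` and the mapping from a threshold energy to a step size (the uniform quantizer noise model, noise power = step²/12) are assumptions made only for illustration.

```python
import numpy as np

def select_step_sizes(masking_threshold, threshold_in_quiet, scale=1.0):
    """Per frequency region: take the larger threshold, then derive a step size.

    Assumption: both thresholds are energies in linear units, and a uniform
    quantizer with step size s produces noise of power s**2 / 12.
    """
    masking_threshold = np.asarray(masking_threshold, dtype=float)
    threshold_in_quiet = np.asarray(threshold_in_quiet, dtype=float)
    # The larger of the two values is the effective threshold of the region.
    effective = np.maximum(masking_threshold, threshold_in_quiet)
    # Choose the step size so that the quantization noise equals the threshold.
    return np.sqrt(12.0 * scale * effective)

# Hypothetical values for three frequency regions.
steps = select_step_sizes([1e-4, 4e-2, 1e-6], [1e-5, 1e-5, 1e-5])
```

In the third region the masking threshold lies below the threshold in quiet, so the threshold in quiet determines the step size and the step size is prevented from becoming unnecessarily small.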
Hence, the audio signal may be encoded in a bitstream 361, which may consequently be processed in a corresponding decoder, e.g. comprising a decoding and dequantization unit 380, in order to provide a decoded version of the audio signal 381.
As indicated in
Reference is made to
The human ear can be separated into three main parts: the outer, middle, and inner ear (
Here, the important task of frequency separation may be performed: different tones cause different regions of the basilar membrane to vibrate. Which frequencies are transferred to and concentrated at which area of the basilar membrane is exemplarily shown in
In this figure, it is visible that the spectral analysis performed in the cochlea 506 has a frequency dependent selectivity often described by critical bandwidth. According to an aspect of the invention, the frequency separation mechanism of the human ear is imitated, for example, by an All-pole gammatone filterbank. This will be further explained in the following section.
Another example of a mechanism of the basilar membrane which may be used in this project is the effect of one sound being made inaudible by another sound. This so-called masking effect can be experienced in many situations of the daily life. For example, the sound of a car engine may be masked by music from the car's radio. Masking has been defined as the process by which the threshold of audibility for one sound is raised by the presence of another (masking) sound [3]. The threshold of audibility of a sound in the presence of a masking sound is called masked threshold. It is the sound pressure level of a test sound which is necessary to be just audible in the presence of a masker. In nearly all cases this masked threshold lies above the threshold in quiet. If the frequencies of the masker and the test tone are very different, the masked threshold and the threshold in quiet are identical [2]. There are multiple ways to achieve a masking effect: Firstly, the masker and the test sound may be played at the same time. An example for this would be an orchestra, in which a loud instrument might mask another instrument which remains faint. This is called simultaneous masking. Secondly, the test sound could be played before or after the masker, which would be classified as non-simultaneous masking. In this case the test sound has to be a short burst or sound impulse. While no strong masking effects can be measured if the test sound is presented before the masker is switched on (premasking), pronounced effects occur if the test sound is presented after the masker is switched off (postmasking) [2].
For simultaneous masking, maximum masking is usually obtained if both signals have the same center frequency. This is shown exemplarily in
The figure shows an example of the dependence of the masked threshold of a test tone on the level of noise centered at 1 kHz. Here, it is clearly visible that for all levels of noise the masking effect increases the closer the frequency of the test tone gets to the center frequency of the noise [2]. The biggest masking effect can thus be achieved if masker and test sound both have frequencies in the same critical band. In this work, simultaneous masking effects may be used, for example, to optimize the quantizer step sizes of the audio encoder.
Referring to
All-pole gammatone filter
Reference is made to
A gammatone filter (GTF) may be defined in the time domain by the multiplication of a sinusoidal signal with a gamma distribution:
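As an illustration only, the commonly used textbook form of such an impulse response, a gamma-distribution envelope t^(n-1)·exp(-2πbt) multiplied by a cosine at the center frequency, may be sketched as follows. The function name, the parameter names and the example values (order 4, b = 125 Hz) are assumptions and are not taken from the embodiments.

```python
import numpy as np

def gammatone_ir(fc, b, order=4, fs=48000, duration=0.05, phase=0.0):
    """Sampled gammatone impulse response (textbook form, unnormalized).

    fc: center frequency in Hz, b: decay (bandwidth) parameter in Hz.
    """
    t = np.arange(int(duration * fs)) / fs
    # Gamma-distribution shaped envelope.
    envelope = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t)
    # Sinusoidal carrier at the center frequency.
    return envelope * np.cos(2.0 * np.pi * fc * t + phase)

g = gammatone_ir(fc=1000.0, b=125.0)
```

The envelope reaches its maximum at t = (order − 1)/(2πb), i.e. at roughly 3.8 ms for the example values.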
Because of this simple description gammatone filters provide, for example, the possibility of an efficient digital or analog implementation. Even though the basic gammatone filter is very popular, it may have limitations as a basilar membrane model. Firstly, the magnitude transfer function of the GTF is inherently nearly symmetric as shown in
A filter that provides a closer time-domain match to basilar membrane mechanical impulse measurements is, for example, the All-pole gammatone filter (APGF), a close relative of the GTF. The APGF is defined by simply discarding the zeros from a pole-zero decomposition of a basic GTF. And a complex valued APGF may be obtained, for example, by additionally discarding the poles with negative imaginary parts. In the resulting impulse responses the cosine terms may then be replaced by complex exponentials.
∈z=zeros of the pole-zero decomposition, ∈p=poles of the pole-zero decomposition
Therefore, the APGF does not have the simple time-domain description of the GTF. Still, the APGF has several properties that make it attractive for applications in auditory modeling. Firstly, the APGF shows a realistic asymmetry in the frequency domain, unlike the GTF. The impulse response and the magnitude transfer function of an APGF and a GTF in comparison are shown exemplarily in
Hence, it is to be noted that according to embodiments, at least one of the plurality of filters comprises an all-pole gammatone filter, wherein zeros from a pole-zero decomposition of a basic gammatone filter are discarded and optionally, poles with negative imaginary parts are additionally discarded.
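A complex valued all-pole filter of this kind may, for example, be sketched as a cascade of identical complex one-pole sections, since only one pole of multiplicity N remains after discarding the zeros and the poles with negative imaginary parts. The discrete-time realization below (pole placed at exp((−2πb + j2πfc)/fs), unit gain per section at fc) is an assumption made for illustration, not the implementation of the embodiments; `complex_apgf` and `b_hz` are hypothetical names.

```python
import numpy as np

def complex_apgf(x, fc, b_hz, order=6, fs=48000):
    """Filter x with `order` identical complex one-pole sections (no zeros)."""
    pole = np.exp((-2.0 * np.pi * b_hz + 2j * np.pi * fc) / fs)
    gain = 1.0 - abs(pole)  # normalizes each section to unit gain at fc
    y = np.asarray(x, dtype=complex)
    for _ in range(order):
        out = np.empty_like(y)
        state = 0.0 + 0.0j
        for n, sample in enumerate(y):
            # One-pole recursion: y[n] = gain * x[n] + pole * y[n - 1]
            state = gain * sample + pole * state
            out[n] = state
        y = out
    return y

fs = 48000
n = np.arange(4800)
# A tone at the center frequency passes, a tone 500 Hz above is attenuated.
on_band = complex_apgf(np.exp(2j * np.pi * 1000.0 * n / fs), fc=1000.0, b_hz=40.0)
off_band = complex_apgf(np.exp(2j * np.pi * 1500.0 * n / fs), fc=1000.0, b_hz=40.0)
```

The output is complex valued, so that both a magnitude and a phase are available for each bandpass signal value.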
Accordingly, in general, filters of a filterbank according to embodiments may be chosen to have an asymmetric transfer function, as shown in
Reference is made to
In this project, All-pole gammatone filters, for example, with the order N=6 may be used. Because the purpose of the APGF filterbank is to mimic the frequency separation mechanism of the inner ear, the filterbank characteristics may be based on the Bark scale. The Bark scale was, for example, approximated using the following equations:
Bn=approximation for mapping from frequency to Bark number, Bn=approximation for mapping from frequency to Bark width, f=frequency.
Reference is made to
As an example, a filterbank according to embodiments comprises or consists, for example, of 96 bands with a frequency resolution of, for example, 4 bands per Bark. Examples of the frequency response of an APGF filterbank are shown in
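A band layout with 96 bands and 4 bands per Bark may be sketched as follows. Note that the frequency-to-Bark mapping used here is the Traunmüller approximation, which is swapped in as a stand-in because it is easily inverted; it is not necessarily the approximation used in the embodiments. The placement of the centers at half-band offsets is likewise an assumption.

```python
import numpy as np

def hz_to_bark(f):
    """Traunmueller approximation (stand-in, see lead-in)."""
    f = np.asarray(f, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark."""
    z = np.asarray(z, dtype=float)
    return 1960.0 * (z + 0.53) / (26.28 - z)

bands_per_bark = 4
num_bands = 96
# Band centers equally spaced on the Bark axis (assumed half-band offset).
band_bark = (np.arange(num_bands) + 0.5) / bands_per_bark
center_hz = bark_to_hz(band_bark)
```

With these assumptions the 96 bands span 24 Bark and all center frequencies remain below the Nyquist frequency of a 48 kHz system.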
To align the outputs of the different filterbank bands the group delay of every band was, for example, calculated dependent on, for example, the respective center frequency. This group delay was, for example, then weighted and stored as, for example, the look-ahead for every band respectively. For the ratio of the look-ahead to the group delay of the filters a factor of 0.7 was, for example, chosen.
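The look-ahead computation may be sketched as follows, assuming a band that is realized as a cascade of identical complex one-pole sections (an assumed discrete-time realization; `apgf_response` and `group_delay` are hypothetical helper names). The group delay is obtained numerically as the negative derivative of the phase response at the center frequency and is then weighted with the factor 0.7.

```python
import numpy as np

def apgf_response(omega, pole, order):
    """Frequency response of `order` identical one-pole sections.

    omega is in radians per sample; the overall gain is irrelevant here.
    """
    return (1.0 / (1.0 - pole * np.exp(-1j * omega))) ** order

def group_delay(omega, pole, order, d_omega=1e-6):
    """Negative phase derivative via a central difference, in samples."""
    h_plus = apgf_response(omega + d_omega, pole, order)
    h_minus = apgf_response(omega - d_omega, pole, order)
    return -np.angle(h_plus / h_minus) / (2.0 * d_omega)

fs, fc, b_hz, order = 48000.0, 1000.0, 40.0, 6
pole = np.exp((-2.0 * np.pi * b_hz + 2j * np.pi * fc) / fs)
omega_c = 2.0 * np.pi * fc / fs
tau_g = group_delay(omega_c, pole, order)  # group delay at the center frequency
look_ahead = 0.7 * tau_g                   # weighted look-ahead of this band
```

For this realization the group delay at the center frequency has the closed form order·r/(1 − r) with r = |pole|, which may serve as a check of the numerical value.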
The decomposition of the masking sound spectrum into simple masker components is, for example, performed by the All-pole gammatone filterbank. The filterbank outputs are then, for example, smeared out over frequency to result, for example, in the overall masked threshold. This spreading or superposition can be implemented linearly or nonlinearly. Psychoacoustic measurements have shown that linear addition of masker components often results in a much lower overall threshold than determined experimentally, while nonlinear superposition matches the measured thresholds more closely [6]. Also, the nonlinear spreading aims to mimic the nonlinear behavior of the ear in the processing of sound. As the filterbank may yield complex valued outputs, the (nonlinear or linear) spreading can, for example, be performed using, for example, only the magnitude values of the filterbank outputs or using the complex values. Because the All-pole gammatone filters are, for example, IIR filters with a nonlinear phase response, a phase correction factor may be implemented for (nonlinear and linear) spreading with complex values. Also, the group delay of each filter, which is defined as the negative derivative of the phase response with respect to frequency, may be taken into consideration. This will be further explained in the following section.
Hence, it is to be noted that the masking threshold determinator according to embodiments may be configured to obtain a masking threshold information associated with a given frequency region on the basis of magnitudes of the bandpass signal values of the at least two bandpass signals.
Reference is made to
The first approach to correcting the phase shift for complex spreading was to calculate, for example, the phase difference, for example, between the center frequencies of all filterbank bands, e.g. as shown as an example, differences φ(ω) 1020 of center frequencies 1012, 1014. This is visualized exemplarily in
N=number of poles, M=number of zeros
Because the APGF does not contain any zeros, this equation can be reduced to:
Here, ωc is the center frequency of the source band, z∞,n is the pole of the destination band, and O is the order of the APGF filterbank. Because every band of the APGF contains one band specific pole value as many times as the order of the filter indicates, equation (1) can be simplified to equation (3). The approach of using the center frequencies of the bands to calculate the phase shift was less successful because, for example, the phase difference between the center frequencies of two adjacent bands in the used APGF filterbank is often quite big and can even be close to π. When calculating the phase shift, for example, using the phase response of equation (1) or (3), a phase difference of π in two adjacent bands leads to the bands canceling each other out. Therefore, this approach did not yield useful results.
Reference is made to
To solve this problem, in the second approach to calculate the phase shift, for example, the frequency of the bands' estimated crossing point, e.g. 1030, was used instead of the center frequency (
Wsrc=center frequency of the source band, Wdest=center frequency of the destination band, u=upper spreading factor.
The phase shift between respectively the source band and the destination band and the crossing point is then calculated, for example, using formula (1) or (3). Here, also the look-ahead dband of the respective band may be taken into account.
which, for the APGF, resolves to
pband=overall phase response of the source or destination band at the crossing frequency ωcross, z∞,band=pole of the source or destination band, and e.g. dband=0.7*τg(ωband) with τg(ωband) being the group delay at the center frequency of the respective band.
The overall phase correction value is then calculated, for example, as:
The correction values may be calculated for all bands, for example, for both upward and downward spreading. In the case of upward spreading the destination band is, for example, the band adjacent to the source band with the next larger index, in the case of downward spreading the destination band is, for example, the band adjacent to the source band with the next smaller index.
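The idea of compensating the phase difference at the crossing frequency may be sketched as follows. This is a hedged illustration under stated assumptions: the bands are modelled as cascades of identical complex one-pole sections, the crossing frequency is found by a simple grid search for equal magnitudes instead of the estimate used in the embodiments, and the sign convention of the correction value (rotating the source band onto the phase of the destination band) is assumed. All function names are hypothetical.

```python
import numpy as np

def apgf_response(omega, pole, order):
    """Response of `order` one-pole sections, unit gain at the pole angle."""
    return ((1.0 - abs(pole)) / (1.0 - pole * np.exp(-1j * omega))) ** order

def crossing_frequency(pole_src, pole_dst, order, num=20001):
    """Grid search between the band centers for equal magnitude responses."""
    lo, hi = sorted((np.angle(pole_src), np.angle(pole_dst)))
    grid = np.linspace(lo, hi, num)
    diff = (np.abs(apgf_response(grid, pole_src, order))
            - np.abs(apgf_response(grid, pole_dst, order)))
    return grid[np.argmin(np.abs(diff))]

def phase_correction(pole_src, pole_dst, order, d_src=0.0, d_dst=0.0):
    """Unit-magnitude correction value for spreading from src into dst."""
    w = crossing_frequency(pole_src, pole_dst, order)
    # A look-ahead of d samples advances the phase by w * d.
    p_src = np.angle(apgf_response(w, pole_src, order)) + w * d_src
    p_dst = np.angle(apgf_response(w, pole_dst, order)) + w * d_dst
    return w, np.exp(1j * (p_dst - p_src))

fs, order = 48000.0, 6
pole_src = np.exp((-2.0 * np.pi * 30.0 + 2j * np.pi * 1000.0) / fs)
pole_dst = np.exp((-2.0 * np.pi * 32.0 + 2j * np.pi * 1060.0) / fs)
w_cross, z_corr = phase_correction(pole_src, pole_dst, order)
h_src = apgf_response(w_cross, pole_src, order)
h_dst = apgf_response(w_cross, pole_dst, order)
```

After multiplication with `z_corr`, the contribution of the source band adds constructively to the destination band at the crossing frequency, instead of possibly cancelling it.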
Reference is made to
In one example of a linear spreading operation, the output value, e.g. a value of bandpass signals 1211, of one filterbank band is, for example, weighted by a spreading function and then added, for example, to its neighboring bands. The spreading function is, for example, a two sided exponential, meaning that going upwards and going downwards from the source band, the weighted value of the source band that gets added to the destination band gets, for example, exponentially smaller the further away the destination band is from the source band. The linear spreading implemented in this project is, for example, based on the pseudo code published in [7]. One change that was made in this project as compared to the code suggested in [7] is that, for example, the upward and downward spreading functions both are implemented as level independent and thus for the upper and the lower slope of the spreading function, for example, constant values were used. The upper slope su was, for example, set to 20 dB and the lower slope sl was set, for example, to 31 dB.
The upward spreading factor u and the downward spreading factor I are, for example, defined in the following. The defined amount of filterbank bands per one bark is represented by the letter β.
The upward and downward spreading functions are thus defined, for example, as ux and dx. The spreading is, for example, implemented as:
x=spreading input, y=spreading output, u=upper spreading factor, I=lower spreading factor, B=number of bands.
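A two sided exponential spreading of this kind may be sketched with two recursive passes over the bands. The conversion of the slopes into per-band factors is an assumption (slopes interpreted as dB per Bark acting on amplitudes, divided by the number of bands per Bark); the names `spreading_factors` and `linear_spread` are hypothetical.

```python
import numpy as np

def spreading_factors(upper_slope_db=20.0, lower_slope_db=31.0, bands_per_bark=4):
    """Per-band attenuation factors (assumed: dB per Bark, amplitude domain)."""
    upper = 10.0 ** (-upper_slope_db / (20.0 * bands_per_bark))
    lower = 10.0 ** (-lower_slope_db / (20.0 * bands_per_bark))
    return upper, lower

def linear_spread(x, upper, lower):
    """Two sided exponential spreading over the band index."""
    x = np.asarray(x, dtype=complex if np.iscomplexobj(x) else float)
    up = x.copy()
    for b in range(1, len(x)):            # spread towards larger band indices
        up[b] = x[b] + upper * up[b - 1]
    down = x.copy()
    for b in range(len(x) - 2, -1, -1):   # spread towards smaller band indices
        down[b] = x[b] + lower * down[b + 1]
    return up + down - x                  # x was counted in both passes

upper, lower = spreading_factors()
x = np.zeros(21)
x[10] = 1.0
y = linear_spread(x, upper, lower)
```

For a single excited band, the output decays by the factor `upper` per band above the source band and by the factor `lower` per band below it.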
For magnitude linear spreading the spreading input x is, for example, the magnitude value of the filterbank's complex output. If complex linear spreading is performed the complex filterbank output is, for example, used as the input x. As previously mentioned, because of the nonlinear phase response of the APGF filter a phase correction may be added to the algorithm. The linear complex spreading can, for example, be described through the pseudo code shown in
Reference is made to
A non-linear spreading operation follows, for example, the same concept as a linear spreading operation. The difference between linear and nonlinear spreading is, for example, that the nonlinear model uses, for example, a compressive exponential characteristic prior to the addition of masker components [6]. This means, for example, that first, the input value is mapped to the nonlinear domain, then, for example, spreading is performed, for example, with a spreading factor that is adapted to the nonlinear domain. The resulting value is then, for example, mapped back to the linear domain. Similar to linear spreading, nonlinear spreading can either be performed, for example, on the complex output values of the filterbank 1310, e.g. APGF filterbank, or only, for example, on the magnitudes of those outputs. If non-linear magnitude spreading is performed, the mapping, e.g. 1352, to the nonlinear domain can, for example, be implemented as follows for the output values of all filterbank bands, e.g. 1311:
x=spreading input, α=nonlinear spreading factor, xnl=spreading input in the nonlinear domain.
In the nonlinear domain, the spreading is performed, for example, the same way as in linear spreading (see equation 7). Here, similarly to the spreading input value, the upper and lower spreading factors both, for example, may be mapped in the nonlinear domain:
The spreading is then, for example, performed analogously to equation 7, for example, with xnl as the spreading input, unl as the upper spreading factor and Inl as the lower spreading factor. The spreading output value ynl then, for example, may be mapped back into the linear domain, for example, to result in the final spreading output y. Because the nonlinear spreading yields very big results with a nonlinear exponent value of, for example, α=0.3, additionally to the mapping, the spreading output value ynl may be scaled down, for example, using the scaling factor snl,mag. This is just an empirically found scaling which takes into account, for example, α and the ratio of the Bark width of the first filterbank band to the Bark width of the concerning band.
α=nonlinear exponent, ωc,b=center frequency of filterbank band b, B(ω)=Bark width of frequency ω, |y|=spreading output mapped back into the linear domain.
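The sequence of mapping, spreading and mapping back may be sketched for magnitudes as follows. The power-law form of the mapping (magnitude raised to α, spreading factors raised to α, output raised to 1/α) is an assumption consistent with the description, and the empirical scaling factor snl,mag is deliberately omitted because its formula is not reproduced here. Function names are hypothetical.

```python
import numpy as np

def two_sided_spread(x, upper, lower):
    """Two sided exponential spreading over the band index."""
    x = np.asarray(x, dtype=float)
    up, down = x.copy(), x.copy()
    for b in range(1, len(x)):
        up[b] = x[b] + upper * up[b - 1]
    for b in range(len(x) - 2, -1, -1):
        down[b] = x[b] + lower * down[b + 1]
    return up + down - x

def nonlinear_magnitude_spread(x, upper, lower, alpha=0.3):
    """Compress, spread with compressed factors, expand (no empirical scaling)."""
    x_nl = np.abs(x) ** alpha                         # map to nonlinear domain
    y_nl = two_sided_spread(x_nl, upper ** alpha, lower ** alpha)
    return y_nl ** (1.0 / alpha)                      # map back to linear domain

upper, lower = 10.0 ** -0.25, 10.0 ** (-31.0 / 80.0)  # assumed per-band factors
single = np.zeros(11)
single[5] = 1.0
pair = np.zeros(11)
pair[5] = pair[6] = 1.0
y_single = nonlinear_magnitude_spread(single, upper, lower)
y_pair = nonlinear_magnitude_spread(pair, upper, lower)
```

For a single masker component the result equals that of linear spreading, while for several components the nonlinear superposition exceeds their linear sum. This also indicates why an additional scaling of the output may be desirable.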
If nonlinear spreading is performed, for example, on complex values and not only on the magnitudes of the filterbank outputs, the mapping from the linear into the nonlinear domain may be adapted. Here, it may be taken into account that, for example, only the magnitude of the complex value may be mapped into the nonlinear domain, for example, while the phase remains unchanged:
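A mapping that compresses only the magnitude and leaves the phase unchanged may be sketched as follows. The power-law form with exponent α and the inverse mapping are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def to_nonlinear_domain(x, alpha=0.3):
    """Compress the magnitude of complex values, keep the phase unchanged."""
    x = np.asarray(x, dtype=complex)
    return np.abs(x) ** alpha * np.exp(1j * np.angle(x))

def to_linear_domain(y_nl, alpha=0.3):
    """Inverse mapping: expand the magnitude, keep the phase unchanged."""
    y_nl = np.asarray(y_nl, dtype=complex)
    return np.abs(y_nl) ** (1.0 / alpha) * np.exp(1j * np.angle(y_nl))

x = np.array([3.0 + 4.0j, -2.0j, 0.5])
x_nl = to_nonlinear_domain(x)
```

Because the phase is preserved, a phase correction value can be applied in the nonlinear domain in the same way as in the linear domain.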
Then the spreading is performed, for example, similarly to equation 7, again using, for example, upper and lower spreading factors that are, for example, also mapped to the nonlinear domain as shown in equation 11. The phase correction values (e.g. as explained in the context of
Reference is made to
The influence of different values for the nonlinear exponent α in nonlinear spreading is, for example, shown in
The performance of the different spreading functions that were applied, for example, to the APGF filterbank output were tested using, for example, multiple noise and tonal input audios. The audios were first separated into their frequency components, for example, by the APGF filterbank, then a spreading function was applied to this output. Here, a sampling rate of, for example, 48 kHz was used. The spreading output was then used, for example, to calculate the masked threshold of the input sample, which was then used, for example, to alter the quantizer step size of the encoder. To plot the result the encoded signal was again decoded and/or analyzed.
In this project the following examples of spreading functions were implemented and tested:
Reference is made to
In
Reference is made to
To show the magnitude response of the All-pole gammatone filterbank for input noise and sine signals, spectrogram plots were used. For
Reference is made to
Reference is made to
For
The same can be seen in table 1: for exemplary magnitude and complex nonlinear spreading the maximum magnitude value for an input sine signal is considerably smaller than the exemplary maximum magnitude value for a narrowband noise signal with, for example, 1, 2 or 3 bark. In comparison the maximum magnitude values for exemplary magnitude linear spreading, complex linear spreading and no spreading for all noise and the sine signal all are in a very narrow range of values. This is also shown exemplarily in table 2.
In table 2 the difference between the maximum output values for all different exemplary spreading functions for the input sine signal and the input noise signals is shown. Here it is again visible that the maximum magnitude value for the exemplary linear spreading functions and no spreading is very similar for the sine and the noise signals, the biggest difference being −3.76972 dB (Noise 1 bark with no spreading). For complex and magnitude nonlinear spreading, for example, a different behavior can be seen: the difference between the maximum magnitude values of the noise signals and the sine signal, for example, is big in comparison to the values for the other spreading functions with a maximum difference of 25.6695 dB (noise 3 bark with magnitude nonlinear spreading).
Reference is made to
Reference is made to
To evaluate the different exemplary spreading methods (nonlinear complex, nonlinear magnitude, linear complex, linear magnitude) a webMUSHRA listening test was performed. For the test, audio files which were encoded using the exemplary masked thresholds estimated by the different exemplary spreading functions were used. The quantization was adjusted to an average bitrate based on the perceptual entropy of, for example, 16 kBit/s. To achieve this, for each exemplary spreading function a scaling factor for the masked thresholds was calculated. With the spreading-specific scaling factor, all different exemplary spreading kinds yield the same perceptual entropy for an input “all_mono” file. The scaling factors calculated this way were then used on the test items to achieve an average bitrate of, for example, 16 kBit/s for a large set of audio samples. All input test audios had a sample rate of, for example, 48 kHz and were encoded using an MDCT with a frame length of, for example, 1024 samples. The listening test consists of, for example, 8 test audios, to each of which all exemplary spreading functions were applied. For each audio, the listeners could listen to the original audio and the test audios for the different exemplary spreading functions and then rate the quality of the test audios on a scale of, for example, 0 to 100. The original audio was always included among the test audios to ensure reliable results. If a participant of the listening test rated an original audio below, for example, 90, that participant's results were excluded from the evaluation. The results shown in
The listening test results (
In this project, inter alia, 4 different kinds of frequency domain spreading were applied to the output of, for example, an All-pole gammatone filterbank. The resulting frequency spectrum was then used, for example, to estimate the masked threshold of the input sample which was then used, for example, to optimize the quantizer step size of the encoder. The implemented spreading functions are, for example, linear magnitude spreading, linear complex spreading, nonlinear magnitude spreading, and nonlinear complex spreading. The goal for the spreading was, for example, to match the frequency separation mechanism of the ear as closely as possible. Therefore, for example, the nonlinear spreading aimed to mimic the nonlinear properties of the human ear. The results of the performed listening test indicate that this was successful, as the test audios for which, for example, nonlinear spreading was used on average had a higher score than, for example, the audios on which linear spreading was used. While there appears to be no clear winner for the magnitude and complex spreading, the results show that frequency domain spreading with complex values works with the phase correction mechanism that was implemented in this project which has been disclosed above.
The new method presented herein comprises calculating masking thresholds for, for example, tonal and non-tonal signals, for example, without explicit tonality estimation. The tonality may be taken into account using a non-linear spreading. Complex valued IIR filters (e.g., an All-pole gammatone filterbank) with a smaller bandwidth compared to common gammatone filters may be used. Spreading with non-linearity may be applied to magnitudes or complex values with phase corrections. The choice of the phase correction is important. Compensating, respectively, the phase differences at the transition regions, at which the magnitude frequency responses of two adjacent filters are equal, has proven advantageous. Furthermore, one may bear in mind that with complex values, the non-linearity is applied, for example, only to the magnitudes such that the phases are not changed thereby. Such spreading may result in a significantly larger time resolution of the resulting superposition than the spreading with magnitudes. A special aspect of the method relates to a combination of complex valued IIR filters (e.g., with smaller bandwidths compared to common gammatone filterbanks) and spreading with non-linearity (onto magnitudes) and phase correction.
The present invention relates to the technical field of audio coding. For example, the present invention may be used in an audio encoder using current or future standards or proprietary coding methods. Hence, such an inventive audio encoder according to embodiments may, for example be configured to use current or future standards or proprietary coding methods.
The present invention relates to the technical field of psychoacoustic models with a gammatone filterbank and, e.g., non-linear spreading. The present invention may include processing complex values. The present invention may include phase correction. The present invention may comprise or be used in audio coding.
Example temporal/spectral magnitude responses
In the following further examples of magnitude responses, in the form of schematic spectrograms are shown in
Furthermore, in general, the masking threshold determinator can be configured to obtain the masking threshold information for each sample.
To reduce complexity, subsampling can, for example, be performed after the filterbank and before applying the non-linear mapping and spreading.
To obtain a masking threshold information for a given time interval, the (optionally subsampled) masking threshold information can optionally be averaged within this time interval.
This can be further improved by already performing the averaging of the masking threshold information in the non-linear domain.
Furthermore, additional processing (for considering, for example, postmasking) can, for example, be as well applied in the non-linear domain.
The mapping back to the linear domain in audio coding applications may, for example, just be required for the control of quantizer stepsizes.
This, e.g. the inventive, processing order can be motivated by the non-linear processing in the cochlea, which affects all following stages in the human auditory system.
Usual time intervals in audio encoders may, for example, correspond to the lengths of the MDCT windows, e.g. 2048 samples (long blocks) or 256 samples (short blocks).
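The processing order described above (subsampling after the filterbank, mapping into the non-linear domain, averaging within a time interval in the non-linear domain, mapping back only at the end) may be sketched as follows. The subsampling factor of 8, the power-law mapping with α = 0.3 and the function name are assumptions; the spreading step, which would take place between mapping and averaging, is omitted for brevity.

```python
import numpy as np

def interval_threshold(magnitudes, alpha=0.3, subsample=8, interval=256):
    """Average per-band magnitudes over time intervals in the non-linear domain.

    magnitudes: array of shape (bands, samples) with non-negative values.
    Returns an array of shape (bands, number of complete intervals).
    """
    sub = np.asarray(magnitudes, dtype=float)[:, ::subsample]  # subsampling
    per_interval = interval // subsample
    usable = (sub.shape[1] // per_interval) * per_interval
    blocks = sub[:, :usable].reshape(sub.shape[0], -1, per_interval)
    mean_nl = (blocks ** alpha).mean(axis=2)   # averaging in non-linear domain
    return mean_nl ** (1.0 / alpha)            # map back to the linear domain

rng = np.random.default_rng(0)
mags = rng.random((4, 2048))                   # hypothetical filterbank magnitudes
thr = interval_threshold(mags)
```

With α < 1, the average formed in the non-linear domain is never larger than the plain linear average of the same values, so short peaks within an interval raise the resulting value less than with linear averaging.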
In the following, reference is made to
Furthermore, masking threshold determinator 3200 comprises a threshold analyzer 3220 and a phase correction determinator 3230. These elements may, for example, have the functionalities as explained in the context of
As shown, the plurality of filters 3210 are configured to receive an audio signal 3201. The filters may, for example, be ordered into different subsets, so that a filter subset comprises at least two of the plurality of filters 3210, and so that the filters of the filter subset jointly cover an equivalent rectangular bandwidth, ERB.
In particular, the filters may, for example, be implemented so as to cover respective frequency intervals of critical bands on a bark scale. In particular, center frequencies of the filters of the filterbank 3210 may be chosen, such that there are at least three filters per Bark, and/or such that the filters of the filterbank comprise 3 dB bandwidths which are smaller than a third of a Bark. Referring to
Examples for such filter functions, e.g. transfer functions and impulse responses are shown in
As a general feature, optionally, the plurality of bandpass signals 3211, obtained by the filtering of audio signal 3201, may be complex valued.
For a given frequency region an attenuation of a filter of the filterbank 3210 at a frequency one Bark above and/or one Bark below its center frequency may, for example, be at least 40 dB.
As an optional feature, the plurality of bandpass signals 3211, e.g. at least two, may be subsampled using subsampling unit 3240. It is to be noted that such a subsampling is optional. To reduce complexity, such a subsampling can be performed after the filterbank 3210, and before further processing, for example comprising applying a non-linear mapping and spreading.
Using phase correction determinator 3230, the masking threshold determinator 3200 may optionally be configured to determine a phase correction value 3231 based on or using phases of the bandpass signal values. In particular, threshold analyzer 3220 may be configured to apply the phase correction value 3231 to at least one of the at least two bandpass signals in order to obtain the masking threshold information 3221.
As explained before, the phase correction value 3231 may, for example, describe a phase difference between transfer functions of filters having adjacent passbands at a transition frequency, e.g. ωcross. As shown in the excerpt 3232 from
Accordingly, a plurality of phase correction values for the different filters of the plurality of filters 3210 may, for example, be determined. For a filter of the plurality of filters 3210, a first phase correction value may, for example, be assigned relative to an adjacent filter with a smaller band index and a second phase correction value may be assigned relative to an adjacent filter with a larger band index.
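As a rough illustration of a phase correction value that describes a phase difference between the transfer functions of adjacent filters at a transition frequency, the following sketch may be considered. It does not reproduce equation 6 or the gammatone formula of the description. Instead, it assumes a hypothetical one-pole complex resonator as the filter, takes the transition frequency as the midpoint between the two center frequencies, and fixes the sign convention such that the correction rotates the source response onto the destination response. All names are illustrative.

```python
import cmath

def one_pole_response(omega, omega_c, r=0.95):
    # Frequency response of a hypothetical one-pole complex resonator with
    # its pole at r * exp(j * omega_c); stands in for the actual filters.
    return (1.0 - r) / (1.0 - r * cmath.exp(1j * (omega_c - omega)))

def phase_correction(omega_c_src, omega_c_dst, r=0.95):
    """Unit-magnitude phase correction from a source band to a destination band.

    Assumption: the transition frequency is the midpoint of the two
    center frequencies (in radians per sample).
    """
    omega_cross = 0.5 * (omega_c_src + omega_c_dst)
    h_src = one_pole_response(omega_cross, omega_c_src, r)
    h_dst = one_pole_response(omega_cross, omega_c_dst, r)
    ratio = h_dst / h_src
    return ratio / abs(ratio)  # keep only the phase difference
```

In this sketch, the correction towards the neighbor with the smaller band index and the correction towards the neighbor with the larger band index are obtained by calling `phase_correction` with the respective neighbor as destination. Swapping source and destination yields the complex conjugate value.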
As an example, a phase correction value zcorr between a source band and a destination band may be determined according to equation 6, e.g. as explained in the context of
and/or, for example for an all-pole gammatone filter, according to
For the sake of brevity, for the definition of the respective further parameters and variables, reference is made to the above discussion, in particular of
As shown in
As an optional feature, using mapping unit 3222, analyzer 3220 may be configured to apply a non-linear mapping to respective magnitudes of complex values of the bandpass signal 3211, while keeping respective phase values of the complex values unchanged, except for an optional phase correction, e.g. as performed in spreading unit 3224 based on the phase correction value(s) 3231.
In other words, the input values of analyzer 3220 may optionally be mapped or transformed into a non-linear domain, for the further processing in spreading unit 3224.
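A minimal sketch of such a mapping is given below, assuming the form x_nl = x·|x|^(α−1), which compresses the magnitude to |x|^α and leaves the phase unchanged. The function name and the exponent value used in the usage note are assumptions for illustration only.

```python
def map_nonlinear(x, alpha):
    """Map a complex bandpass value into the non-linear domain.

    The magnitude becomes |x| ** alpha while the phase is kept.
    A zero input is returned as zero to avoid raising 0 to a
    negative power when alpha < 1.
    """
    mag = abs(x)
    if mag == 0.0:
        return 0j
    return x * mag ** (alpha - 1.0)
```

For instance, `map_nonlinear(3 + 4j, 0.5)` has a magnitude of about 2.236 (the square root of 5) and the same phase as `3 + 4j`.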
Furthermore, using spreading unit 3224, after frequency decomposition in filterbank 3210, spreading functions may be applied to the frequency spectrum to model the frequency separation mechanism of the human inner ear. The spreading unit 3224 may be configured to perform linear magnitude spreading, linear complex spreading, nonlinear magnitude spreading and/or nonlinear complex spreading. It is to be noted that a respective linear spreading may be performed without any mapping to a non-linear domain and hence without a mapping unit 3222 (accordingly units 3226 and 3228 may, for example, not be present either).
In general, spreading unit 3224 may be configured to combine bandpass signal values of different bandpass signals 3211 in order to obtain the masking threshold information 3221. Therefore, for a given frequency region, a weighted combination of bandpass signal values, or non-linearly processed versions of the bandpass signal values, associated with a plurality of frequency regions, may be used. An example for an algorithm for the linear case is shown in the excerpt 3225 from
in the non-linear case.
Optionally, the weighting may decrease with increasing difference between a center frequency of the given frequency region and center frequencies of one or more other frequency regions. In particular, the weighting may reduce exponentially with increasing difference between a frequency band index of the given frequency band and a frequency band index of another frequency band. Furthermore, bandpass signal values or non-linearly processed versions of the bandpass signal values of frequency regions below the given frequency region may, for example, be weighted differently than bandpass signal values or non-linearly processed versions of the bandpass signal values of frequency regions above the given frequency region.
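The weighted combination described above may be sketched as follows. The sketch assumes a plain per-band decay factor raised to the band index difference, with one factor for contributions from bands below the given band and another for contributions from bands above it. The parameter names and numerical values are hypothetical and are not the spreading factors of the description.

```python
def spread(values, weight_from_below, weight_from_above):
    """Combine band values with exponentially decaying weights.

    values: one (real or complex) value per band, ordered by band index.
    weight_from_below: per-band decay for contributions of lower bands.
    weight_from_above: per-band decay for contributions of higher bands.
    """
    n = len(values)
    out = []
    for b in range(n):          # destination band
        acc = 0.0
        for k in range(n):      # source band
            if k < b:
                w = weight_from_below ** (b - k)
            elif k > b:
                w = weight_from_above ** (k - b)
            else:
                w = 1.0
            acc += w * values[k]
        out.append(acc)
    return out
```

As a usage example, `spread([0, 0, 1, 0, 0], 0.5, 0.1)` spreads a single excited band asymmetrically: the bands above it receive 0.5 and 0.25, whereas the bands below it receive approximately 0.1 and 0.01.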
Optionally, at least one of the upper spreading factor, e.g. u_nl,k,b, and the lower spreading factor, e.g. l_nl,k,b, may, for example, be dependent on a difference between the running variable k and the band index b. As another optional feature, at least one of the upper spreading factor, e.g. u_nl,k,b, and the lower spreading factor, e.g. l_nl,k,b, may, for example, reduce exponentially with increasing difference between the running variable k and the band index b. Hence, based on such a difference between the running variable k and the band index b, the spreading unit 3224 may be configured to determine a spreading output, using the equation
As another optional feature, a spreading output of spreading unit 3224, may be mapped back, in the non-linear case, to a linear domain, using mapping unit 3226. This mapping can optionally be performed after the averaging of the masking threshold in the non-linear domain. Optionally, the spreading output in linear domain may be scaled, e.g. scaled down, using scaling unit 3228 in order to obtain a masking threshold representing masking effects of audio signal 3201.
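The mapping back to the linear domain and the subsequent scaling may be sketched as shown below. The inverse mapping assumes that the forward mapping compressed magnitudes to |x|^α while keeping the phase. The multiplicative form of the scaling and the names `map_to_linear`, `scaled_threshold` and `scale` are assumptions of this sketch.

```python
def map_to_linear(y_nl, alpha):
    """Invert the mapping x_nl = x * |x| ** (alpha - 1).

    The magnitude is expanded to |y_nl| ** (1 / alpha); the phase is kept.
    """
    mag = abs(y_nl)
    if mag == 0.0:
        return 0.0 * y_nl
    return y_nl * mag ** (1.0 / alpha - 1.0)

def scaled_threshold(y_nl, alpha, scale):
    # Hypothetical scaling: a constant factor applied to the magnitude
    # of the spreading output after it was mapped back to the linear domain.
    return scale * abs(map_to_linear(y_nl, alpha))
```

With `alpha = 0.5`, a non-linear spreading output of magnitude 2 maps back to a linear magnitude of 4, so that `scaled_threshold(2.0, 0.5, 0.1)` evaluates to approximately 0.4.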
Here, reference is made to
As an example, the averaging of the masking threshold information may hence be performed in a non-linear domain, for example, using an averaging unit 3310 or in the linear domain using averaging unit 3320. Hence, as shown in
As an example, using the scaling unit 3228, the masking threshold information may be obtained using, in case of x_nl being defined as x_nl = |x|^α, the equation
with a scaling factor s_nl,mag for non-linear magnitude spreading and the spreading output y_nl, or, in case of x_nl being defined as x_nl = x·|x|^(α−1), the equation
with a scaling factor s_nl,mag for non-linear complex spreading and the spreading output y_nl.
Furthermore, using the optional comparator 3229, a masking threshold obtained using the spreading unit 3224 may be compared to a threshold in quiet 3250 (e.g. referred to as masking threshold in quiet 3250), in order to select the larger one of the two as the masking threshold information 3221.
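The comparison with the threshold in quiet amounts to a per-band maximum, as the following minimal sketch shows. It assumes that both thresholds are available as one real value per band in the same (linear) domain; the function name and the example values are hypothetical.

```python
def apply_threshold_in_quiet(masking_threshold, threshold_in_quiet):
    """Select, per band, the larger of the two thresholds."""
    if len(masking_threshold) != len(threshold_in_quiet):
        raise ValueError("both thresholds need one value per band")
    return [max(m, q) for m, q in zip(masking_threshold, threshold_in_quiet)]
```

For example, `apply_threshold_in_quiet([0.2, 0.01, 0.5], [0.05, 0.05, 0.05])` returns `[0.2, 0.05, 0.5]`, i.e. the threshold in quiet takes over only in the band where the signal-dependent threshold falls below it.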
It should be noted that individual aspects described herein can be used individually or in combination. Thus, details can be added to each of said individual aspects without adding details to another one of said aspects.
It should also be noted that the present disclosure describes, explicitly or implicitly, features usable in an audio encoder (Encoder for providing an encoded representation of audio signals) and in an audio decoder (Decoder for providing decoded audio signals on the basis of an encoded representation). Thus, any of the features described herein can be used in the context of an audio encoder and in the context of an audio decoder.
Moreover, features and functionalities disclosed herein relating to a method can also be used in an apparatus (configured to perform such functionality). Furthermore, any features and functionalities disclosed herein with respect to an apparatus can also be used in a corresponding method. In other words, the methods disclosed herein can optionally be supplemented by any of the features and functionalities described with respect to the apparatuses.
Also, any of the features and functionalities described herein can be implemented in hardware or in software, or using a combination of hardware and software, as will be described in the section “implementation alternatives”.
Although some aspects are described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The apparatus described herein, or any components of the apparatus described herein, may be implemented at least partially in hardware and/or in software.
The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The methods described herein, or any components of the apparatus described herein, may be performed at least partially by hardware and/or by software.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
22183700.8 | Jul 2022 | EP | regional |
This application is a continuation of copending International Application No. PCT/EP2023/068858, filed Jul. 7, 2023, which is incorporated herein by reference in its entirety, and additionally claims priority from European Application No. 22183700.8, filed Jul. 7, 2022, which is also incorporated herein by reference in its entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/EP2023/068858 | Jul 2023 | WO |
Child | 19012760 | | US |