The present document relates to audio forensics, notably the blind detection of traces of parametric audio encoding/decoding in audio signals. In particular, the present document relates to the detection of parametric frequency extension audio coding, such as spectral band replication (SBR) or spectral extension (SPX), and/or the detection of parametric stereo coding from uncompressed waveforms such as PCM (pulse code modulation) encoded waveforms.
HE-AAC (high efficiency-advanced audio coding) is an efficient music audio codec at low and moderate bitrates (e.g. 24-96 kb/s for stereo content). In HE-AAC, the audio signal is down-sampled by a factor of two and the resulting lowband signal is AAC waveform coded. The removed high frequencies are coded parametrically using SBR at low additional bitrate (typically at 3 kb/s per audio channel). As a result, the total bitrate can be reduced significantly compared to plain AAC waveform coding across the full spectral band of the audio signal.
The transmitted SBR parameters describe the way the higher frequency bands are generated from the AAC decoded low band output. This generation process of the high frequency bands comprises a copy-and-paste or copy-up process of patches from the lowband signal to the high frequency bands. In HE-AAC a patch describes a group of adjacent subbands that are copied-up to higher frequencies in order to recreate high frequency content that was not AAC coded. Typically 2-3 patches are applied dependent on the coding bitrate conditions. Usually the patch parameters do not change over time for one coding bitrate condition. However the MPEG standard allows changing the patch parameters over time. The spectral envelopes of the artificially generated higher frequency bands are modified based on envelope parameters which are transmitted within the encoded bitstream. As a result of the copy-up process and the envelope adjustment, the characteristics of the original audio signal may be perceptually maintained.
SBR coding may use other SBR parameters in order to further adjust the signal in the extended frequency range, i.e. to adjust the high-band signal, by noise and/or tone addition/removal.
The present document provides means to evaluate if a PCM audio signal has been coded (encoded and decoded) using parametric frequency extension audio coding such as MPEG SBR technology (e.g. using HE-AAC). In other words, the present document provides means for analyzing a given audio signal in the uncompressed domain and for determining if the given audio signal had been previously submitted to parametric frequency extension audio coding. In yet other words, given a (decoded) audio signal (e.g. in PCM format), it may be desirable to know whether or not the audio signal had previously been encoded using a certain encoding/decoding scheme. In particular, it may be desirable to know whether or not the high-frequency spectral components of the audio signal were generated by a spectral bandwidth replication process. In addition, it may be desirable to know if a stereo signal was created based on a transmitted mono signal or if certain time/frequency regions of a stereo signal originate from time/frequency data of the same mono signal.
It should be noted that even though the methods outlined in the present document are described in the context of audio coding, they are applicable to any form of audio processing that incorporates duplication of time/frequency data. In particular, the methods may be applied in the context of blind SBR which is a special case in audio coding where no SBR parameters are transmitted.
A possible use case may be the protection of SBR related intellectual property rights, e.g. the monitoring of unauthorized usage of MPEG SBR technology or any other new parametric frequency extension coding tool fundamentally based on SBR e.g., Enhanced SBR (eSBR) in MPEG-D Universal Speech and Audio Codec (USAC). Furthermore, trans-coding and/or re-encoding may be improved when no more information other than the (decoded) PCM audio signal is available. By way of example, if it is known that the high-frequency spectral components of the decoded PCM audio signal have been generated by a bandwidth extension process, then this information could be used when re-encoding the audio signal. In particular, the parameters (e.g. the cross-over frequency and patch parameters) of the re-encoder could be set such that the high-frequency spectral components are SBR encoded, while the lowband signal is waveform encoded. This would result in bit-rate savings compared to plain waveform coding and higher quality bandwidth extension. Furthermore, knowledge regarding the encoding history of a (decoded) audio signal could be used for quality assurance of high bit-rate waveform encoded (e.g., AAC or Dolby Digital) content. This could be achieved by making sure that SBR coding or some other parametric coding scheme, which is not a transparent coding method, was not applied to the (decoded) audio signal in the past. In addition, the knowledge regarding the encoding history could be the basis for a sound quality assessment of the (decoded) audio signal, e.g. by taking into account the number and size of SBR patches detected within the (decoded) audio signal.
As such, the present document relates to the detection of parametric audio coding schemes in PCM encoded waveforms. The detection may be carried out by the analysis of repetitive patterns across frequency and/or audio channels. Identified parametric coding schemes may be MPEG Spectral Band Replication (SBR) in HE-AACv1 or v2, Parametric Stereo (PS) in HE-AACv2, Spectral Extension (SPX) in Dolby Digital Plus and Coupling in Dolby Digital or Dolby Digital Plus. Since the analysis may be based on signal phase information, the proposed methods are robust against magnitude modifications as typically applied in parametric audio coding. In SBR coding schemes high frequency content is generated in the audio decoder by copying low frequency subbands into higher frequency regions and by adjusting the energy envelope in a perceptual sense. In parametric spatial audio coding schemes (e.g. PS, Coupling) data in multiple audio channels may be generated from transmitted data relating to only a single audio channel. The duplication of data may be traced back robustly from PCM waveforms by analyzing phase information in frequency subbands.
According to an aspect, a method for detecting frequency extension coding in the coding history of an audio signal, e.g. a time domain audio signal, is described. In other words, the method described in the present document may be applied to a time domain audio signal (e.g. a pulse code modulated audio signal). The method may determine if the (time domain) audio signal had been submitted to a frequency extension encoding/decoding scheme in the past. Examples for such frequency extension coding/decoding schemes are enabled in HE-AAC and DD+ codecs.
The method may comprise transforming the time domain audio signal into a frequency domain, thereby generating a plurality of subband signals in a corresponding plurality of subbands. Alternatively, the plurality of subband signals may be provided, i.e. the method may obtain the plurality of subband signals without having to apply the transform. The plurality of subbands may comprise low and high frequency subbands. For this purpose, the method may apply a time domain to frequency domain transformation typically employed in a sound encoder, such as a quadrature mirror filter (QMF) bank, a modified discrete cosine transform, and/or a fast Fourier transform. As a result of such transformation, the plurality of subband signals may be obtained, wherein each subband signal may correspond to a different excerpt of the frequency spectrum of the audio signal, i.e. to a different subband. In particular, the subband signals may be attributed to low frequency subbands or alternatively high frequency subbands. Subband signals of the plurality of subband signals in a low frequency subband may comprise or may correspond to frequencies at or below a cross-over frequency, whereas subband signals of the plurality of subband signals in a high frequency subband may comprise or may correspond to frequencies above the cross-over frequency. In other words, the cross-over frequency may be a frequency defined within a frequency extension coder, whereas the frequency components of the audio signal above the cross-over frequency are generated from the frequency components of the audio signal at or below the cross-over frequency.
As such, the plurality of subband signals may be generated using a filter bank comprising a plurality of filters. For the correct identification of the patch parameters of the frequency extension scheme, the filter bank may have the same frequency characteristics (e.g. same number of channels, same center frequencies and bandwidths) as the filter bank used in the decoder of the frequency extension coder (e.g. 64 oddly stacked filters for HE-AAC and 256 oddly stacked filters for DD+). For enhanced robustness of the patch analysis it may be beneficial to minimize the leakage into adjacent bands by increasing the stop band attenuation. This can be accomplished e.g. with a higher filter order compared to the original filter bank (e.g. twice the filter order) used in the decoder. In other words, in order to ensure a high degree of frequency selectivity of the filter bank, each filter of the filter bank may have a roll-off which exceeds a predetermined roll-off threshold for frequencies lying within a stopband of the respective filter. By way of example, instead of using filters having a stop band attenuation of about 60 dB (as is the case for the filters used in HE-AAC), the stop band attenuation of the filters used for detecting audio extension coding may be increased to 70 or 80 dB, thereby increasing the detection performance. This means that the roll-off threshold may correspond to 70 or 80 dB attenuation. As such, it may be ensured that the filter bank is sufficiently selective in order to isolate different frequency components of the audio signal within different subband signals. A high degree of selectivity may be achieved by using filters which comprise a minimum number of filter coefficients. By way of example, the filters of the plurality of filters may comprise a number M of filter coefficients, wherein M may be greater than 640.
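By way of illustration only, the generation of complex subband signals and their attribution to low and high frequency subbands may be sketched as follows. The sketch uses a Hann-windowed FFT as a simple stand-in for a QMF bank (it is evenly stacked and far less selective than the filter banks discussed above); the function names `stft_subbands` and `split_bands`, the 50 % frame overlap and the 6 kHz cross-over frequency are assumptions of this sketch.

```python
import numpy as np

def stft_subbands(signal, num_bands=64):
    """Return complex subband signals of shape (num_bands, frames).

    A Hann-windowed FFT is used here as a simple stand-in for a QMF bank.
    """
    size = 2 * num_bands                                   # FFT length
    window = np.hanning(size)
    starts = range(0, len(signal) - size + 1, num_bands)   # 50 % overlap
    frames = [np.fft.rfft(window * signal[s:s + size])[:num_bands] for s in starts]
    return np.array(frames).T

def split_bands(subbands, sample_rate, crossover_hz):
    """Split into low (center frequency <= cross-over) and high frequency subbands."""
    num_bands = subbands.shape[0]
    centers = np.arange(num_bands) * sample_rate / (2 * num_bands)
    low = centers <= crossover_hz
    return subbands[low], subbands[~low]

# Usage: a 3 kHz tone sampled at 48 kHz, assumed cross-over frequency of 6 kHz
sample_rate = 48000
t = np.arange(sample_rate) / sample_rate
subbands = stft_subbands(np.sin(2 * np.pi * 3000 * t))
low_bands, high_bands = split_bands(subbands, sample_rate, crossover_hz=6000)
strongest = int(np.argmax(np.mean(np.abs(subbands) ** 2, axis=1)))
```

For a detector intended to identify patch parameters, the stand-in transform would be replaced by a filter bank matching the decoder under test, as outlined above.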
It should be noted that the audio signal may comprise a plurality of audio channels, e.g. the audio signal may be a stereo audio signal or a multi-channel audio signal such as a 5.1 or 7.1 audio signal. The method may be applied to one or more of the audio channels. Alternatively or in addition, the method may comprise the step of downmixing the plurality of audio channels to determine a downmixed time domain audio signal. As such, the method may be applied to the downmixed time domain audio signal. In particular, the plurality of subband signals may be generated from the downmixed time domain audio signal.
The method may comprise determining a maximum frequency of the audio signal. In other words, the method may comprise the step of determining the bandwidth of the time domain audio signal. The maximum frequency of the audio signal may be determined by analyzing a power spectrum of the audio signal in the frequency domain. The maximum frequency may be determined such that for all frequencies greater than the maximum frequency, the power spectrum is below a power threshold. As a consequence of the determination of the bandwidth of the audio signal, the method for detecting coding history may be limited to the frequency spectrum of the audio signal up to the maximum frequency. As such, the plurality of subband signals may only comprise frequencies at or below the maximum frequency.
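The bandwidth determination described above may, for example, be sketched as follows. The power threshold of -60 dB relative to the spectral peak, the Hann window and the function name `max_frequency` are assumptions of this sketch and are not taken from the present document.

```python
import numpy as np

def max_frequency(signal, sample_rate, threshold_db=-60.0):
    """Highest frequency whose power exceeds threshold_db relative to the spectral peak."""
    windowed = signal * np.hanning(len(signal))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    power_db = 10.0 * np.log10(power / power.max() + 1e-20)
    above = np.nonzero(power_db > threshold_db)[0]
    return float(freqs[above[-1]]) if above.size else 0.0

# Usage: a signal containing 1 kHz and 5 kHz tones, sampled at 48 kHz
sample_rate = 48000
t = np.arange(sample_rate) / sample_rate
x = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 5000 * t)
f_max = max_frequency(x, sample_rate)   # close to 5 kHz
```

Subbands lying entirely above the returned frequency may then be excluded from the subsequent analysis.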
The method may comprise determining a degree of relationship between subband signals in the low frequency subbands and subband signals in the high frequency subbands. The degree of relationship may be determined based on the plurality of subband signals. By way of example, the degree of relationship may indicate a similarity between a group of subband signals in the low frequency subbands and a group of subband signals in the high frequency subbands. Such a degree of relationship may be determined through analysis of the audio signal and/or through use of a probabilistic model derived from a training set of audio signals with a frequency extension coding history.
It should be noted that the plurality of subband signals may be complex-valued, i.e. the plurality of subband signals may correspond to a plurality of complex subband signals. As such, the plurality of subband signals may comprise a corresponding plurality of phase signals and/or a corresponding plurality of magnitude signals, respectively. In such cases, the degree of relationship may be determined based on the plurality of phase signals. In addition, the degree of relationship may not be determined based on the plurality of magnitude signals. It has been found that for parametric coding schemes it is beneficial to analyze phase signals. Furthermore, complex waveform signals give useful information. In particular the information gained from complex and phase data may be used in combination to increase robustness of the detection scheme. This is notably the case where the parametric coding scheme involves a copy-up process of magnitude data along frequency (such as in a modulation spectrum codec).
Furthermore, the step of determining a degree of relationship may comprise determining a group of subband signals in the high frequency subbands which has been generated from a group of subband signals in the low frequency subbands. Such a group of subband signals may comprise subband signals from successive subbands, i.e. directly adjacent subbands.
The method may comprise determining frequency extension coding history if the degree of relationship is greater than a relationship threshold. The relationship threshold may be determined experimentally. In particular, the relationship threshold may be determined from a set of audio signals with a frequency extension coding history and/or a further set of audio signals with no frequency extension coding history.
The step of determining a degree of relationship may comprise determining a set of cross-correlation values between the pluralities of subband signals. A correlation value between a first and a second subband signal may be determined as an average over time of products of corresponding samples of the first and second subband signals at a pre-determined time lag. The pre-determined time lag may be zero. In other words, corresponding samples of the first and second subband signals at a given time instant (and at the pre-determined time lag) may be multiplied, thereby yielding a multiplication result at the given time instant. The multiplication results may be averaged over a certain time interval, thereby yielding an averaged multiplication result which may be used for determining a cross-correlation value.
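A correlation value of the kind described above may be sketched as follows for complex subband signals. The complex conjugate in the product and the normalization by the subband energies are assumptions of this sketch; the normalization makes the value insensitive to a gain applied to one of the subband signals.

```python
import numpy as np

def cross_correlation(a, b, lag=0):
    """Normalized magnitude of the time average of a[n] * conj(b[n + lag])."""
    a = np.asarray(a)
    b = np.asarray(b)
    if lag > 0:
        a, b = a[:-lag], b[lag:]
    elif lag < 0:
        a, b = a[-lag:], b[:lag]
    num = np.abs(np.mean(a * np.conj(b)))
    den = np.sqrt(np.mean(np.abs(a) ** 2) * np.mean(np.abs(b) ** 2))
    return float(num / den) if den > 0 else 0.0

# Usage: a scaled copy correlates strongly, an independent signal does not
rng = np.random.default_rng(0)
src = rng.standard_normal(4000) + 1j * rng.standard_normal(4000)
other = rng.standard_normal(4000) + 1j * rng.standard_normal(4000)
c_copy = cross_correlation(src, 0.3 * src)
c_other = cross_correlation(src, other)
```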
It should be noted that in case of multi-channel signals (e.g. stereo or 5.1/7.1 signals), the multi-channel signal may be downmixed and the set of cross-correlation values may be determined on the downmixed audio signal. Alternatively, different sets of cross-correlation values may be determined for some or all channels of the multi-channel signal. The different sets of cross-correlation values may be averaged to determine an average set of cross-correlation values which may be used for the detection of copy-up patches. In particular, the plurality of subband signals may comprise K subband signals, K>0 (e.g. K>1, K smaller than or equal to 64). The parameter K may be equal to the number of channels as used in the decoder of the frequency extension codec to generate the missing high frequency subbands. For the mere detection of spectral extension 64 bands may be sufficient (frequency patches are typically wider than the bandwidths in the 64 channels case). For correct patch identification of SPX in DD+ an increased number K of subbands may be used (e.g. K=256). As such, the set of cross-correlation values may comprise K(K−1)/2 cross-correlation values corresponding to all combinations of different subband signals from the plurality of subband signals. The step of determining frequency extension coding history in the audio signal may comprise determining that at least one maximum cross-correlation value from the set of cross-correlation values exceeds the relationship threshold.
It should be noted that the analysis methods outlined in the present document may be performed in a time dependent manner. As indicated above, frequency extension codecs typically use time-independent patch parameters. However, the frequency extension codecs may be configured to change patch parameters over time. This may be taken into account by analyzing windows of the audio signal. The windows of the audio signals may have a predetermined length (e.g. 10-20 seconds or shorter). In case of patch parameters which do not change over time, the robustness of the analysis methods described in the present document may be increased by averaging the set of cross-correlation values obtained for different windows of the audio signal. In order to decrease the complexity of the analysis methods, the different windows of the audio signal (i.e. different segments of the audio signal) may be averaged prior to determining the set of cross-correlation values based on the averaged windows of the audio signal.
The set of cross-correlation values may be arranged in a symmetrical K×K correlation matrix. The main diagonal of the correlation matrix may have arbitrary values, e.g. values corresponding to zero or values corresponding to auto-correlation values for the plurality of subband signals. The correlation matrix may be considered as an image from which particular structures or patterns may be determined. These patterns may provide an indication on the degree of relationship between the pluralities of subband signals. In view of the fact that the correlation matrix is symmetrical, only one “triangle” of the correlation matrix (either below or above the main diagonal) may need to be analyzed. As such, the method steps described in the present document may only be applied to one such “triangle” of the correlation matrix.
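The arrangement of the cross-correlation values in a symmetrical K×K matrix may be sketched as follows. In line with the phase-based analysis outlined above, each subband sample is reduced to a unit-magnitude phasor before correlation; this reduction, the zeroed main diagonal and the function name `correlation_matrix` are assumptions of this sketch.

```python
import numpy as np

def correlation_matrix(subbands):
    """subbands: complex array (K, T). Returns a symmetric K x K matrix, main diagonal set to zero."""
    phasors = subbands / np.maximum(np.abs(subbands), 1e-12)   # keep the phase only
    corr = np.abs(phasors @ phasors.conj().T) / subbands.shape[1]
    np.fill_diagonal(corr, 0.0)
    return corr

# Usage: subbands 4..7 are gain-adjusted copies of subbands 0..3
rng = np.random.default_rng(1)
low = rng.standard_normal((4, 2000)) + 1j * rng.standard_normal((4, 2000))
gains = np.array([[0.5], [0.2], [0.8], [0.1]])
subbands = np.vstack([low, gains * low])
corr = correlation_matrix(subbands)
upper = np.triu(corr, k=1)                 # one triangle is sufficient
row, col = np.unravel_index(np.argmax(upper), upper.shape)
```

In this example the largest values of the upper triangle lie on the diagonal with offset 4, i.e. at the positions pairing each copied subband with its source.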
As indicated above, the correlation matrix may be considered as an image comprising patterns which indicate a relationship between low frequency subbands and high frequency subbands. The patterns to be detected may be diagonals of locally increased correlation parallel to the main diagonal of the correlation matrix. Line enhancement schemes may be applied to the correlation matrix (or a tilted version of the correlation matrix, wherein the correlation matrix may be tilted such that the diagonal structures turn into vertical or horizontal structures) in order to emphasize one or more such diagonals of local maximum cross-correlation values in the correlation matrix. An example line enhancement scheme may comprise convolving the correlation matrix with an enhancement matrix
thereby yielding an enhanced correlation matrix. If line enhancement or any other pattern enhancement technique is applied, the step of determining frequency extension coding history may comprise determining that at least one maximum cross-correlation value from the enhanced correlation matrix, excluding the main diagonal, exceeds the relationship threshold. In other words, the determination of the degree of relationship may be based on the enhanced correlation matrix (and the enhanced set of cross-correlation values).
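A line enhancement of this kind may be sketched as follows. The enhancement matrix is not reproduced in the present text, so the 3×3 kernel used here (a common diagonal line detection mask) is purely hypothetical, as are the zero padding at the matrix borders and the function name `enhance_lines`.

```python
import numpy as np

# Hypothetical kernel: it emphasizes structures parallel to the main diagonal.
ENHANCE = np.array([[ 2.0, -1.0, -1.0],
                    [-1.0,  2.0, -1.0],
                    [-1.0, -1.0,  2.0]]) / 6.0

def enhance_lines(corr, kernel=ENHANCE):
    """Zero-padded, same-size 2-D convolution of corr with a square, odd-sized kernel."""
    flipped = kernel[::-1, ::-1]            # flip the kernel for a true convolution
    pad = kernel.shape[0] // 2
    padded = np.pad(corr, pad)
    out = np.zeros(corr.shape, dtype=float)
    for dr in range(kernel.shape[0]):
        for dc in range(kernel.shape[1]):
            out += flipped[dr, dc] * padded[dr:dr + corr.shape[0], dc:dc + corr.shape[1]]
    return out

# Usage: uniform background of 0.2 with a diagonal of 0.6 at offset 4
K = 10
corr = np.full((K, K), 0.2)
for i in range(K - 4):
    corr[i, i + 4] = corr[i + 4, i] = 0.6
enhanced = enhance_lines(corr)
```

With this kernel a uniform background is mapped to zero in the interior of the matrix, whereas entries on the diagonal structure retain a positive value, so that a single threshold separates the two more easily.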
The method may be configured to determine particular parameters of the frequency extension coding scheme which had been applied to the time domain audio signal. Such parameters may e.g. be parameters relating to the subband copy-up process of the frequency extension coding scheme. In particular, it may be determined which subband signals in the low frequency subbands (the source subbands) had been copied up to subband signals in the high frequency subbands (the target subbands). This information may be referred to as patching information and it may be determined from diagonals of local maximum cross-correlation values within the correlation matrix.
As such, the method may comprise analyzing the correlation matrix to detect one or more diagonals of local maximum cross-correlation values. In order to detect such one or more diagonals, one or more of the following criteria may be applied: A diagonal of local maximum cross-correlation values may not lie on the main diagonal of the correlation matrix; and/or a diagonal of local maximum cross-correlation values may or should comprise more than one local maximum cross-correlation values, wherein each of the more than one local maximum cross-correlation values exceeds a minimum correlation threshold. The minimum correlation threshold is typically smaller than the relationship threshold.
A diagonal may be detected if the more than one local maximum cross-correlation values are arranged in a diagonal manner parallel to the main diagonal of the correlation matrix; and/or if for each of the more than one local maximum cross-correlation values in a given row of the correlation matrix, a cross-correlation value in the same row and a directly adjacent left side column is at or below the minimum correlation threshold and/or if a cross-correlation value in the same row and a directly adjacent right side column is at or below the minimum correlation threshold.
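The criteria listed above may be sketched as follows for the upper triangle of the correlation matrix. The sketch requires both the left and the right neighbor to be at or below the minimum correlation threshold (one reading of the "and/or" above), does not require the points of a diagonal to be contiguous, and assumes a main diagonal of zeros; the threshold value 0.5 and the function name `find_diagonals` are likewise assumptions.

```python
import numpy as np

def find_diagonals(corr, min_corr=0.5, min_points=2):
    """Return {offset: [(row, col), ...]} for diagonals of isolated local maxima."""
    K = corr.shape[0]
    found = {}
    for offset in range(1, K):                  # the main diagonal (offset 0) is excluded
        points = []
        for row in range(K - offset):
            col = row + offset
            if corr[row, col] <= min_corr:
                continue
            left = corr[row, col - 1]
            right = corr[row, col + 1] if col + 1 < K else 0.0
            if left <= min_corr and right <= min_corr:
                points.append((row, col))
        if len(points) >= min_points:           # more than one local maximum is required
            found[offset] = points
    return found

# Usage: subbands 0..3 copied up to subbands 4..7
corr = np.zeros((8, 8))
for i in range(4):
    corr[i, i + 4] = corr[i + 4, i] = 0.9
found = find_diagonals(corr)
```

With rows indicating source subbands and columns indicating target subbands, each returned pair (row, col) is a candidate source/target pair of a copy-up patch.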
As outlined above, the analysis of the correlation matrix may be limited to only one “triangle” of the correlation matrix. It may occur that more than one diagonal of local maximum cross-correlation values are detected either above or below the main diagonal. This may be an indication that a plurality of copy-up patches had been applied within the frequency extension coding scheme. On the other hand, if more than two diagonals of local maximum cross-correlation values are detected, at least one of the more than two diagonals may indicate correlations between copy-up patches. Such diagonals do not indicate a copy-up patch and should be identified. Such inter-patch correlations may be employed to improve robustness of the detection scheme.
The correlation matrix may be arranged such that a row of the correlation matrix indicates a source subband and a column of the correlation matrix indicates a target subband. It should be noted that the arrangement with columns of the correlation matrix indicating the source subbands and rows of the correlation matrix indicating the target subbands is equally possible. In this case, the method may be applied by exchanging “rows” and “columns”.
In order to isolate appropriate copy-up patches, the method may comprise detecting at least two redundant diagonals having local maximum cross-correlation values for the same source subband of the correlation matrix. The diagonal of the at least two redundant diagonals having the respective lowest target subbands may be identified as an authentic copy-up patch from a plurality of source subbands to a plurality of target subbands. The other diagonal(s) may indicate a correlation between different copy-up patches.
Having identified the copy-up diagonal(s), the pairs of source and target subbands of the diagonal indicate the low frequency subbands which have been copied up to high frequency subbands.
It may be observed that the edges of the copy-up diagonals (i.e. their start and/or end points) have a reduced maximum cross-correlation value with regards to the other correlation points of the diagonal. This may be due to the fact that the transform which was used to determine the plurality of subband signals has a different frequency resolution than the transform which was used within the frequency extension coding scheme applied to the time domain audio signal. As such, the detection of “weak” edges of the diagonal may indicate a mismatch of the filter bank characteristics (e.g. a mismatch of the number of subbands, a mismatch of the center frequencies, and/or a mismatch of the bandwidth of the subbands) and therefore may provide information on the type of frequency extension coding scheme which had been applied to the time domain audio signal.
In order to exploit the above mentioned observation, the method may comprise the step of detecting that local maximum cross-correlation values of a detected diagonal at a start and/or an end of the detected diagonal are below a blurring threshold. The blurring threshold is typically higher than the minimum correlation threshold. The method may proceed in comparing parameters of the transform step with parameters of transform steps used for a plurality of frequency extension coding schemes. In particular, the transformation orders (i.e. the number of subbands) may be compared. Based on the comparing step the frequency extension coding scheme, which has been applied to the audio signal, may be determined from the plurality of frequency extension coding schemes. By way of example, when using a filter bank with a high number of subbands (or channels) and if a patch border does not fall exactly on the grid of the filter bank used in HE-AAC, it can be concluded that the frequency extension coding scheme is not HE-AAC.
The correlation matrix may be analyzed, in order to detect a particular decoding mode applied by the frequency extension coding scheme. This applies e.g. to HE-AAC which allows for low power (LP) or High Quality (HQ) decoding. For this purpose, various correlation thresholds may be defined. In particular, it may be determined that the maximum cross-correlation value from the set of cross-correlation values is either below or above a decoding mode threshold, thereby detecting a decoding mode of a frequency extension coding scheme applied to the audio signal. The decoding mode threshold may be greater than the minimum correlation threshold. Furthermore, the decoding mode threshold may be greater than the relationship threshold. In the case of LP or HQ decoding, LP decoding may be detected if the maximum cross-correlation value is below the decoding mode threshold (but above the relationship threshold). HQ decoding may be detected if the maximum cross-correlation value is above the decoding mode threshold.
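The threshold logic for the decoding mode may be sketched as follows. The numerical threshold values and the function name `classify_decoding` are hypothetical; in practice the thresholds would be determined experimentally.

```python
def classify_decoding(max_corr, relationship_thr=0.5, decoding_mode_thr=0.9):
    """Map the maximum cross-correlation value to a detection result."""
    if max_corr <= relationship_thr:
        return "none"                       # no frequency extension coding detected
    if max_corr > decoding_mode_thr:
        return "HQ"                         # high quality decoding
    return "LP"                             # low power decoding

# Usage with three example maximum cross-correlation values
results = [classify_decoding(v) for v in (0.2, 0.7, 0.95)]
```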
As indicated above, the degree of relationship between subband signals in low frequency subbands and subband signals in high frequency subbands may involve the usage of a probabilistic model. As such, the method may comprise the step of providing a probabilistic model determined from a set of training vectors derived from training audio signals with a frequency extension coding history. The probabilistic model may describe a probabilistic relationship between vectors in a vector space spanned by the plurality of high frequency subbands and the low frequency subbands. Assuming that the plurality of subbands comprises K subbands, the vector space may have a dimension of K. Alternatively or in addition, the probabilistic model may describe a probabilistic relationship between vectors in a vector space spanned by the plurality of subbands and the low frequency subbands. Assuming that the plurality of subbands comprises K subbands of which Kl are low frequency subbands, the vector space may have a dimension of K+Kl. In the following the latter probabilistic model is described in further detail. However, the method is equally applicable for the first probabilistic model.
The probabilistic model may be a Gaussian Mixture Model. In particular, the probabilistic model may comprise a plurality of mixture components, each mixture component having a mean vector μ in the vector space and a covariance matrix C in the vector space. The mean vector μi of an ith mixture component may represent a centroid of a cluster in the vector space; and the covariance matrix Ci of the ith mixture component may represent a correlation between the different dimensions in the vector space. The mean vectors μi and the covariance matrices Ci, i.e. the parameters of the probabilistic model, may be determined using a set of training vectors in the vector space, wherein the training vectors may be determined from a set of training audio signals with a frequency extension coding history.
The method may comprise the step of providing an estimate of the plurality of subband signals given the subband signals in the low frequency subband. The estimate may be determined based on the probabilistic model. In particular, the estimate may be determined based on the mean vectors μi and the covariance matrices Ci of the probabilistic model. Even more particularly, the estimate may be determined as
$$E[y|x] = \sum_{i=1}^{Q} h_i(x)\left[\mu_i^{y} + C_i^{yx}\left(C_i^{xx}\right)^{-1}\left(x - \mu_i^{x}\right)\right]$$
with E[y|x] being the estimate of the plurality of subband signals y given the subband signals x in the low frequency subbands, with hi(x) indicating a relevance of the ith mixture component of the Gaussian Mixture Model given the subband signals x, with μiy being a component of the mean vector μi corresponding to the subspace of the plurality of subbands, with μix being a component of the mean vector μi corresponding to the subspace of the low frequency subbands, with Q being the number of components of the Gaussian Mixture Model, and with Ciyx and Cixx being sub-matrices from the covariance matrix Ci.
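The relevance indicator hi(x) may be determined as the probability that subband signals x in the low frequency subbands fall within the ith mixture component of the Gaussian Mixture Model, i.e. as
$$h_i(x) = \frac{\alpha_i\,\mathcal{N}\left(x;\,\mu_i^{x},\,C_i^{xx}\right)}{\sum_{j=1}^{Q}\alpha_j\,\mathcal{N}\left(x;\,\mu_j^{x},\,C_j^{xx}\right)}$$
with αi being the weight of the ith mixture component and with N(x; μ, C) being the multivariate Gaussian density with mean μ and covariance C evaluated at x.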
Having provided an estimate, a degree of relationship may be determined based on an estimation error derived from the estimate of the plurality of subband signals and the plurality of subband signals. The estimation error may be a mean square error.
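The estimate and the estimation error may be sketched as follows using the standard conditional expectation of a Gaussian Mixture Model over joint vectors [x, y]. The mixture weights, the ordering of the dimensions (x first), and the function names `gaussian_pdf` and `gmm_estimate` are assumptions of this sketch; the model parameters in the usage example are arbitrary and not trained.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density evaluated at x."""
    d = x - mean
    norm = np.sqrt((2 * np.pi) ** len(mean) * np.linalg.det(cov))
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm)

def gmm_estimate(x, weights, means, covs, n_x):
    """E[y|x] for a GMM over joint vectors [x, y]; the first n_x dimensions belong to x."""
    likes = np.array([w * gaussian_pdf(x, m[:n_x], C[:n_x, :n_x])
                      for w, m, C in zip(weights, means, covs)])
    h = likes / likes.sum()                                   # relevance h_i(x)
    est = np.zeros(len(means[0]) - n_x)
    for h_i, m, C in zip(h, means, covs):
        gain = C[n_x:, :n_x] @ np.linalg.inv(C[:n_x, :n_x])   # C_i^yx (C_i^xx)^-1
        est += h_i * (m[n_x:] + gain @ (x - m[:n_x]))
    return est

# Usage: a single mixture component, one x dimension and one y dimension
weights = [1.0]
means = [np.array([0.0, 1.0])]
covs = [np.array([[1.0, 0.8], [0.8, 1.0]])]
estimate = gmm_estimate(np.array([2.0]), weights, means, covs, n_x=1)   # 1 + 0.8 * 2
observed = np.array([2.5])
mse = float(np.mean((observed - estimate) ** 2))
```

A small mean square error between the estimate and the observed subband signals would indicate that the observed signals are well explained by a model trained on signals with a frequency extension coding history.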
The audio signal may be a multi-channel signal, e.g. comprising a first and a second channel. The first and second channels may be left and right channels, respectively. In this case, it may be desirable to determine particular parametric encoding schemes applied on the multi-channel signals, such as MPEG parametric stereo encoding or coupling as used by DD(+) (or MPEG intensity stereo). This information may be detected from the plurality of subband signals of the first and second channels. In order to determine the plurality of subband signals of the first and second channels, the method may comprise transforming the first and the second channels into the frequency domain, thereby generating a plurality of first subband signals and a plurality of second subband signals. The first and second subband signals may be complex-valued and may comprise first and second phase signals, respectively. Consequently, a plurality of phase difference subband signals may be determined as the difference of corresponding first and second phase signals.
The method may proceed in determining a plurality of phase difference values, wherein each phase difference value may be determined as an average over time of samples of the corresponding phase difference subband signal. Parametric stereo encoding in the coding history of the audio signal may be determined by detecting a periodic structure within the plurality of phase difference values. In particular, the periodic structure may comprise an oscillation of phase difference values of adjacent subbands between positive and negative phase difference values, wherein a magnitude of the oscillating phase difference values exceeds an oscillation threshold.
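The detection of an oscillating phase difference structure may be sketched as follows. The wrapping of the phase difference to (-π, π], the oscillation threshold of 0.1 rad, the requirement that at least 80 % of adjacent subband pairs alternate in sign, and the function names are assumptions of this sketch.

```python
import numpy as np

def phase_difference_values(first, second):
    """first, second: complex arrays (K, T). Time-averaged phase difference per subband."""
    diff = np.angle(first * np.conj(second))        # phase(first) - phase(second), wrapped
    return diff.mean(axis=1)

def has_oscillation(values, osc_thr=0.1, min_fraction=0.8):
    """True if adjacent subbands alternate in sign with magnitudes above osc_thr."""
    strong = np.abs(values) > osc_thr
    flips = (np.sign(values[:-1]) != np.sign(values[1:])) & strong[:-1] & strong[1:]
    return bool(flips.mean() >= min_fraction)

# Usage: the second channel is the first one with a phase offset alternating per subband
rng = np.random.default_rng(2)
left = rng.standard_normal((8, 500)) + 1j * rng.standard_normal((8, 500))
offsets = 0.5 * (-1.0) ** np.arange(8)
right = left * np.exp(-1j * offsets)[:, None]
ps_like = has_oscillation(phase_difference_values(left, right))
identical = has_oscillation(phase_difference_values(left, left))
```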
In order to detect coupling of the first and second channel or coupling between multiple channels in the case of general multi-channel signals, the method may comprise the step of determining, for each phase difference subband signal, a fraction of samples having a phase difference smaller than a phase difference threshold. Coupling of the first and second channel in the coding history of the audio signal may be determined when detecting that the fraction exceeds a fraction threshold, in particular for subband signals in the high frequency subbands.
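The fraction-based coupling test may be sketched as follows. The phase difference threshold of 0.05 rad, the fraction threshold of 0.9, the requirement that all inspected subbands exceed the fraction threshold, and the function names are assumptions of this sketch.

```python
import numpy as np

def coupled_fraction(first, second, phase_thr=0.05):
    """Per-subband fraction of samples with an absolute phase difference below phase_thr."""
    diff = np.abs(np.angle(first * np.conj(second)))
    return (diff < phase_thr).mean(axis=1)

def detect_coupling(first, second, bands, phase_thr=0.05, fraction_thr=0.9):
    """True if every subband selected by `bands` exceeds the fraction threshold."""
    fraction = coupled_fraction(first, second, phase_thr)
    return bool(np.all(fraction[bands] > fraction_thr))

# Usage: subbands 0..3 are independent per channel, subbands 4..7 share one signal
rng = np.random.default_rng(3)
left = rng.standard_normal((8, 1000)) + 1j * rng.standard_normal((8, 1000))
right = rng.standard_normal((8, 1000)) + 1j * rng.standard_normal((8, 1000))
right[4:] = 0.7 * left[4:]                          # coupled high frequency subbands
high_coupled = detect_coupling(left, right, slice(4, 8))
low_coupled = detect_coupling(left, right, slice(0, 4))
```

Since only the phase difference is evaluated, a different gain in each channel does not affect the result.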
According to another aspect, a method for detecting the use of a parametric audio coding tool (e.g. parametric stereo coding or coupling) in the coding history of an audio signal is described. The audio signal may be a multi-channel signal comprising a first and a second channel, e.g. comprising a left and a right channel. The method may comprise the step of providing a plurality of first subband signals and a plurality of second subband signals. The plurality of first subband signals may correspond to a time/frequency domain representation of the first channel of the multi-channel signal. The plurality of second subband signals may correspond to a time/frequency domain representation of the second channel of the multi-channel signal. As such, the plurality of first and second subband signals may have been generated using a time domain to frequency domain transform (e.g. a QMF). The plurality of first and second subband signals may be complex-valued and may comprise a plurality of first and second phase signals, respectively.
The method may comprise the step of determining a plurality of phase difference subband signals as the difference of corresponding first and second phase signals from the plurality of first and second phase signals. The use of a parametric audio coding tool in the coding history of the audio signal may be detected from the plurality of phase difference subband signals.
In particular, the method may comprise the step of determining a plurality of phase difference values, wherein each phase difference value may be determined as an average over time of samples of the corresponding phase difference subband signal. Parametric stereo encoding in the coding history of the audio signal may be detected by detecting a periodic structure within the plurality of phase difference values.
Alternatively or in addition, the method may comprise the step of determining, for each phase difference subband signal, a fraction of samples having a phase difference smaller than a phase difference threshold. A coupling of the first and second channel in the coding history of the audio signal may be detected by detecting that the fraction exceeds a fraction threshold for subband signals at frequencies above a cross-over frequency (also referred to as the coupling start frequency in the context of coupling), e.g. for the subband signals in the high frequency subbands.
According to a further aspect, a software program is described, which is adapted for execution on a processor and for performing the method steps outlined in the present document when carried out on a computing device.
According to another aspect, a storage medium is described, which comprises a software program adapted for execution on a processor and for performing the method steps outlined in the present document when carried out on a computing device.
According to another aspect, a computer program product is described which comprises executable instructions for performing the method outlined in the present document when executed on a computer.
It should be noted that the methods and systems, including their preferred embodiments, as outlined in the present document may be used stand-alone or in combination with the other methods and systems disclosed in this document. Furthermore, all aspects of the methods and systems outlined in the present document may be arbitrarily combined. In particular, the features of the claims may be combined with one another in an arbitrary manner.
The invention is explained below in an exemplary manner with reference to the accompanying drawings, wherein
FIGS. 1a-1f illustrate an example correlation based analysis using magnitude, complex and/or phase data;
FIGS. 2a, 2b, 2c and 2d show example maximum cross-correlation values and probability density functions based on complex and phase-only data;
FIGS. 4a and 4b illustrate a comparison between example similarity matrices determined using different analysis filter banks;
FIGS. 6a, 6b and 6c show example probability density functions determined using different analysis filter banks;
FIGS. 10a and 10b illustrate example phase difference graphs used for parametric stereo and coupling detection.
As has been outlined above, in MPEG SBR encoding an audio signal is waveform encoded at a reduced sample-rate and bandwidth. The missing higher frequencies are reconstructed in the decoder by copying low frequency parts to high frequency parts using transmitted side information. The transmitted side information (e.g. spectral envelope parameters, noise parameters, tone addition/removal parameters) is applied to the patches from the lowband signal, wherein the patches have been copied-up or transposed to higher frequencies. As a result of this copy-up process, there should be correlations between certain spectral portions of the lowband signal and copied-up spectral portions of the highband signal. These correlations could be the basis for detecting spectral band replication based encoding within a decoded audio signal.
The correlation between spectral portions of the lowband signal and spectral portions of the highband signal may have been reduced or removed by the application of the side information, i.e. the SBR parameters, onto the copied-up patches. However, it has been observed that the application of SBR parameters onto the copied-up patches does not significantly affect the phase characteristics of the copied-up patches (i.e. the phases of the complex valued subband coefficients). In other words, the phase characteristics of copied-up low frequency bands are largely preserved in the higher frequency bands. The extent of preservation typically depends on the bitrate of the encoded signal and on the characteristics of the encoded audio signal. As such, the correlation of phase data in the spectral portions of the (decoded) audio signal can be used to trace back the frequency patching operations performed in the context of SBR encoding.
In the following, several correlation based analysis methods of PCM waveforms are described. These methods may be used to detect remnants of audio coding employing parametric frequency extension tools such as SBR in MPEG HE-AAC or SPX in Dolby Digital Plus (DD+). In addition, particular parameters, specifically the patching information of the frequency extension process, may be extracted. This information may be useful for an efficient re-encoding. Moreover, additional measures are described that indicate the presence of MPEG Parametric Stereo (PS) as used in HE-AACv2 and the presence of Coupling as used in DD(+).
It should be noted that the basic principle of bandwidth extension as used in DD+ is similar to MPEG SBR. Consequently, the analysis techniques outlined in this document in the context of MPEG SBR encoded audio signals are equally applicable to audio signals which had previously been DD+ encoded. This means that even though the analysis methods are outlined in the context of HE-AAC, the methods are also applicable to other bandwidth extension based encoders such as DD+.
The audio signal analysis methods should be able to operate for the various operation modes of the audio encoders/decoders. Furthermore, the analysis methods should be able to distinguish between these different operation modes. By way of example, HE-AAC codecs make use of two different HE-AAC decoding modes: High Quality (HQ) and Low Power (LP) decoding. In the LP mode, the decoder complexity is reduced by using a real valued critically sampled filter bank compared to a complex oversampled filter bank used in the HQ mode. Usually small inaudible aliasing products may be present in audio signals which have been decoded using the LP mode. These aliasing products may affect the audio quality and it is therefore desirable to detect the decoding mode which has been used to decode the analyzed PCM audio signal. In a similar manner, different decoding modes or complexity modes should also be identified in other frequency extension codecs such as USAC based on SBR.
For HE-AACv2, which applies PS (parametric stereo), the decoder typically uses the HQ mode. PS enables an improved audio quality at low bitrates such as 20-32 kb/s, however, it cannot usually compete with the stereo quality of HE-AACv1 at higher bitrates such as 64 kb/s. HE-AACv1 is most efficient at bitrates between 32 and 96 kb/s, however, it is not transparent for higher bitrates. In other words, PS (HE-AACv2) at 64 kb/s typically provides a worse audio quality than HE-AACv1 at 64 kb/s. On the other hand, PS at 32 kb/s will usually be only slightly worse than HE-AACv1 at 64 kb/s but much better than HE-AACv1 at 32 kb/s. Therefore knowledge about the actual coding conditions may be a useful indicator to provide a rough audio quality assessment of the (decoded) audio signal.
Coupling as used e.g. in Dolby Digital (DD) and DD+ exploits the phase insensitivity of human hearing at high frequencies. Conceptually, coupling is related to the MPEG Intensity Stereo (IS) tool, where only a single audio channel (or the coefficients related to the scale factor band of only one audio channel) is transmitted in the bitstream along with inter-channel level difference parameters. Due to time/frequency sharing of these parameters, the bitrate of the encoded bitstream can be reduced significantly, especially for multi-channel audio. As such, the frequency bins of the reconstructed audio channels are correlated for shared side level information, and this information could be used in order to detect an audio codec making use of coupling.
In a first approach, the (decoded) audio signal, e.g. the PCM waveform signal, may be transformed into the time/frequency domain using an analysis filter bank. In an embodiment, the analysis filter bank is the same analysis filter bank as used in an HE-AAC encoder. By way of example, a 64 band complex valued filter bank (which is oversampled by a factor of two) may be used to transform the audio signal into the time/frequency domain. In case of a multi-channel audio signal, the plurality of channels may be downmixed prior to the filter bank analysis, in order to yield a downmixed audio signal. As such, the filter bank analysis (e.g. using a QMF filter bank) may be performed on the downmixed audio signal. Alternatively, the filter bank analysis may be performed on some or all of the plurality of channels.
As a result of the filter bank analysis, a plurality of complex subband signals is obtained for the plurality of filter bank subbands. This plurality of complex subband signals may be the basis for the analysis of the audio signal. In particular, the phase angles of the plurality of complex subband signals or the plurality of complex QMF bins may be determined.
Furthermore, the bandwidth of the audio signal may be determined from the plurality of complex subband signals using power spectrum analysis. By way of example, the average energy within each subband may be determined. Subsequently, the cutoff subband may be determined as the subband for which all subbands at higher frequencies have an average energy below a pre-determined energy threshold value. This will provide a measure of the bandwidth of the audio signal. Furthermore, the analysis of the correlation between the subbands of the audio signal may be limited to the cutoff subband and the subbands below it (as will be described below).
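The determination of the cutoff subband may be sketched as follows. The sketch assumes that the subband signals are available as lists of complex samples ordered from low to high frequency, and it interprets the cutoff subband as the highest subband whose average energy still reaches the threshold; the function names and the return value of -1 for a silent signal are choices made for the example.

```python
def average_energies(subbands):
    """Average energy (mean squared magnitude) per subband."""
    return [sum(abs(s) ** 2 for s in band) / len(band) for band in subbands]

def cutoff_subband(subbands, energy_threshold):
    """Index of the highest subband whose average energy reaches the threshold.

    All subbands above the returned index have an average energy below the
    threshold. Returns -1 if no subband reaches the threshold.
    """
    energies = average_energies(subbands)
    for k in range(len(energies) - 1, -1, -1):
        if energies[k] >= energy_threshold:
            return k
    return -1

# Four subbands with decreasing energy; the two highest are essentially empty.
example = [[1 + 0j] * 4, [0.5 + 0.5j] * 4, [0.001 + 0j] * 4, [0j] * 4]
cutoff = cutoff_subband(example, energy_threshold=1e-3)
```

Here the average energies are 1, 0.5, 1e-6 and 0, so that `cutoff` corresponds to the second subband (index 1).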
In addition, the cross-correlation at zero lag between all QMF bands over the analysis time range may be determined, thereby providing a self-similarity matrix. In other words, the cross-correlation (at a time lag of zero) between all pairs of subband signals may be determined. This results in a symmetrical self-similarity matrix, e.g. in a 64×64 matrix in case of 64 QMF bands. This self-similarity matrix may be used to detect repeating structures in the frequency-domain. In particular, a maximum correlation value (or a plurality of maximum correlation values) within the self-similarity matrix may be used to detect spectral band replication within the audio signal. For the determination of the one or more maximum correlation values, auto-correlation values within the main diagonal should be excluded (as the auto-correlation values do not provide an indication of the correlation between different subbands). Furthermore, the determination of the maximum value could be limited to the limits of the previously determined audio bandwidth, i.e. the determination of the self-similarity matrix may be limited to the cutoff subband and the subbands at lower frequencies.
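A self-similarity matrix of this kind may be sketched as follows. The sketch is based on a number of assumptions which are not taken from the present document: the cross-correlation at zero lag is computed as a normalized (Pearson) correlation coefficient of the phase angle sequences, the circular nature of the phase is ignored for simplicity, and the function names are hypothetical.

```python
import cmath
import math
import random

def correlation(u, v):
    """Normalized cross-correlation at zero lag (Pearson coefficient)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den if den > 0 else 0.0

def self_similarity(subbands):
    """Symmetric matrix of correlations between the phase angles of all subband pairs."""
    phases = [[cmath.phase(s) for s in band] for band in subbands]
    k = len(phases)
    c = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            c[i][j] = c[j][i] = correlation(phases[i], phases[j])
    return c

def max_off_diagonal(c, cutoff):
    """Largest value above the main diagonal, restricted to subbands 0..cutoff."""
    best = (float("-inf"), -1, -1)
    for i in range(cutoff + 1):
        for j in range(i + 1, cutoff + 1):
            if c[i][j] > best[0]:
                best = (c[i][j], i, j)
    return best

# Synthetic data: subband 3 is a copy of subband 0 with a modified magnitude.
rng = random.Random(0)
bands = [[cmath.rect(rng.uniform(0.5, 1.5), rng.uniform(-math.pi, math.pi))
          for _ in range(200)] for _ in range(4)]
bands[3] = [0.3 * s for s in bands[0]]
matrix = self_similarity(bands)
peak, source, target = max_off_diagonal(matrix, cutoff=3)
```

In this synthetic example the maximum is found for the subband pair (0, 3), because scaling the magnitude of the copied subband leaves its phase angles unchanged. Comparing `peak` with a correlation threshold would then yield the classification decision.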
It should be noted that in case of multi-channel audio signals, the above procedure can be applied to all channels of the multi-channel audio signal independently. In this case, a self-similarity matrix could be determined for each channel of the multi-channel signal. The maximum correlation value across all audio channels could be taken as an indicator for the presence of SBR based encoding within the multi-channel audio signal. In particular, if the maximum cross-correlation value exceeds a pre-determined correlation threshold, the waveform signal may be classified as coded by a frequency extension tool.
It should be noted that the above procedure may also be based on the complex or the magnitude QMF data (as opposed to the phase angle QMF data). However, since in frequency extension coding the magnitude envelopes of the patched lowband signals are modified in accordance with the original high frequency data, a reduced correlation may be expected when basing the analysis on magnitude data.
Lines of high correlation which run parallel to the main diagonal indicate a high degree of correlation or similarity between QMF bands and therefore potentially indicate frequency patches. The presence of these lines implies that a frequency extension tool has been applied to the (decoded) audio signal.
It can be seen that the dynamic range for a phase based analysis (middle row of
The self-similarity matrices 120, 121 in
For further evaluation of the above described analysis method, the maximum cross-correlation value derived from the self-similarity matrices 110, 111, 120, 121 has been plotted for 160 music files and 13 different coding conditions. The 13 different coding conditions comprise coders with and without parametric frequency extension (SBR/SPX) tools as listed in Table 1.
Table 1 shows the different coding conditions which have been analyzed. It has been observed that copy-up patches and thus frequency extension based coding can be detected with a reasonable degree of certainty. This can also be seen in
The analysis results shown in
The robustness of the correlation based analysis method may be improved by various measures, such as the selection of an appropriate analysis filter bank. Leakage from (modified) adjacent QMF bands may change the original low frequency band phase characteristics. This may have an impact on the degree of correlation which may be determined between the phases of different QMF bands. As such, it may be beneficial to select an analysis filter bank which provides for a sharp frequency separation. The frequency separation of the analysis filter bank may be sharpened by designing the modulated analysis filter banks using prototype filters with an increased length. In an example, a prototype filter with 1280 samples length (compared to 640 samples length of the filter used for the results of
FIGS. 4a and 4b illustrate the self-similarity matrices 400 and 410 which have been determined based on phase-only data of the QMF subbands. For the matrix 400 the shorter filter 301 has been used, whereas for the matrix 410 the longer filter 302 has been used. A first frequency patch 401 is indicated by the diagonal line starting at QMF band 3 (x-axis) and covers target QMF bands from band index 20 to 35 (y-axis). For the more selective filter used for matrix 410, a second frequency patch 412 becomes visible starting at QMF band number 8. This second frequency patch 412 is not identified in matrix 400 derived using the original filter 301.
It should be noted that the presence of the second patch 412 can be deduced from the diagonal line 403 starting at QMF band 25 on the x-axis. However, since the band 25 is a target QMF band of the first patch, the diagonal line 403 indicates the inter-patch similarity for QMF source bands that are employed in both patches. It should be further noted that QMF source band regions may overlap, but target QMF band regions may not. This means that QMF source bands may be patched to a plurality of target QMF bands, however, typically every target QMF band has a unique corresponding QMF source band. It can also be observed that by using highly separating analysis filter banks 302, the similarity indicating lines 401, 412 of
The highly selective prototype filter 302 has been evaluated for phase-only data and complex data based analysis as shown in
The probability density functions 600 and 610 corresponding to the maximum correlation values determined based on complex data and based on phase-only data are illustrated in
As such, the use of highly selective analysis filter banks may improve the robustness of the similarity matrix based frequency extension detection schemes. Alternatively or in addition, line enhancement schemes may be applied in order to more clearly isolate the diagonal structures (i.e. the indicators for frequency patches) within the similarity matrix. An example line enhancement scheme may apply an enhancement matrix h to the similarity matrix C, e.g.
wherein a line enhanced similarity matrix may be determined by convolving the similarity matrix C with the enhancement matrix h. The maximum value of the line enhanced similarity matrix may be taken as an indicator of the presence of frequency extension within the audio signal.
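The convolution step may be sketched as follows. The enhancement matrix used in the sketch is purely an assumption made for illustration: a 3×3 kernel with weights of 1/3 on its main diagonal, which averages along the direction of the main diagonal and thereby favors diagonal lines over isolated peaks. The function names are hypothetical as well.

```python
def convolve2d_same(c, h):
    """2-D convolution of c with kernel h, zero padded, output the same size as c."""
    rows, cols = len(c), len(c[0])
    rr, rc = len(h) // 2, len(h[0]) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for a in range(len(h)):
                for b in range(len(h[0])):
                    y, x = i - (a - rr), j - (b - rc)
                    if 0 <= y < rows and 0 <= x < cols:
                        acc += h[a][b] * c[y][x]
            out[i][j] = acc
    return out

def argmax2d(m):
    """Position (row, column) of the largest entry."""
    return max(((i, j) for i in range(len(m)) for j in range(len(m[0]))),
               key=lambda p: m[p[0]][p[1]])

# Assumed enhancement kernel (not the matrix h of the present document).
h_assumed = [[1 / 3, 0.0, 0.0], [0.0, 1 / 3, 0.0], [0.0, 0.0, 1 / 3]]

# Toy matrix: a diagonal line of value 0.6 and an isolated peak of value 0.7.
C = [[0.0] * 6 for _ in range(6)]
for k in range(3):
    C[k][k + 3] = 0.6
C[4][1] = 0.7

enhanced = convolve2d_same(C, h_assumed)
raw_peak = argmax2d(C)
enhanced_peak = argmax2d(enhanced)
```

In this toy example the maximum of the raw matrix lies on the isolated peak, whereas the maximum of the line enhanced matrix lies on the diagonal line. Since the assumed kernel is symmetric under a rotation by 180 degrees, convolution and correlation yield the same result here.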
The self-similarity matrices comprising the cross-correlation coefficients between subbands may be used to determine frequency extension parameters, i.e. parameters that were used for the frequency extension when encoding the audio signal. The extraction of particular frequency patching parameters may be based on line detection schemes in the self-similarity matrix. In particular, the lowbands which have been patched to highbands may be determined. This correspondence information may be useful for re-encoding, as the same or a similar correspondence between lowbands and highbands could be used.
Considering the self-similarity matrix (e.g. matrix 410) as a grey level image, any line detection method (e.g., edge detection followed by Hough Transforms) known from image processing may be applied. For illustrative purposes, an example method has been implemented for evaluation as shown in
In order to design an appropriate line detection scheme, codec specific information could be used in order to make the analysis method more robust. For instance, it may be assumed that lower frequency bands are used to patch higher frequency bands and not vice versa. Furthermore, it may be assumed that a patched QMF band may originate from only one source band (i.e. it may be assumed that patches do not overlap). On the other hand, the same QMF source band may be used in a plurality of patches. This may lead to increased correlation between patched highbands (as e.g. the diagonal 403 in
Using some or all of the above assumptions, an example line detection scheme may apply any of the following steps:
Using the above approach (or similar line detection schemes) patch detection may be performed. In particular, the above approach has been evaluated for HE-AAC coding (coding conditions 1-8) listed in Table 1. The detection performance may be determined as a percentage of audio files for which all patch parameters have been identified correctly. It has been observed that phase-only data based analysis yields significantly better detection results for non-re-encoded HE-AAC (coding conditions 1-5) than complex data based analysis. For these coding conditions, the patching parameters (notably the mapping between source and target bands) can be determined with a high degree of reliability. As such, the estimated patching parameters may be used when re-encoding the audio signal, thereby avoiding or reducing further signal degradation due to the re-encoding process.
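For illustration only, a simple diagonal scan which uses the codec specific assumptions mentioned above may be sketched as follows. This is not the line detection scheme evaluated in the present document but a simplified stand-in. The function names, the threshold and the minimum run length are assumptions. A run on the diagonal with offset d is interpreted as a candidate patch from source bands s to target bands s+d.

```python
def find_diagonal_runs(c, threshold, min_len):
    """Runs of above-threshold values parallel to the main diagonal.

    Returns tuples (source_start, target_start, length) with target > source.
    """
    n = len(c)
    runs = []
    for d in range(1, n):
        start = None
        for s in range(n - d + 1):  # one extra step flushes a run at the border
            inside = s < n - d and c[s][s + d] > threshold
            if inside and start is None:
                start = s
            elif not inside and start is not None:
                if s - start >= min_len:
                    runs.append((start, start + d, s - start))
                start = None
    return runs

def select_patches(runs):
    """Keep runs whose target bands are unclaimed and whose source bands are
    not themselves patched bands (assumption: sources lie in the lowband)."""
    claimed_targets = set()
    patches = []
    for src, tgt, length in sorted(runs):
        sources = set(range(src, src + length))
        targets = set(range(tgt, tgt + length))
        if claimed_targets.isdisjoint(sources | targets):
            patches.append((src, tgt, length))
            claimed_targets |= targets
    return patches

# Toy similarity matrix with two patches and one inter-patch similarity line.
n = 12
sim = [[0.0] * n for _ in range(n)]

def mark(src, tgt, length, value):
    for k in range(length):
        sim[src + k][tgt + k] = sim[tgt + k][src + k] = value

mark(1, 5, 4, 0.8)  # patch 1: bands 1-4 copied to bands 5-8
mark(2, 9, 3, 0.7)  # patch 2: bands 2-4 copied to bands 9-11
mark(6, 9, 3, 0.6)  # similarity between the two patched regions
patches = select_patches(find_diagonal_runs(sim, threshold=0.5, min_len=3))
```

In this toy example the two patches are retained, while the inter-patch similarity line is discarded because its apparent source bands are already target bands of the first patch.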
The patch parameter detection rate decreases for LP-SBR decoded signals compared to HQ-SBR decoded signals. For AAC re-encoded signals (coding conditions 6-8), the detection rates decrease significantly for both methods (phase-only data based and complex data based) to a low level. This has been analyzed in further detail. For condition 6 the similarity matrix 800 is shown in
As has already been indicated above, the methods described in the present document may be applied to various frequency extension schemes including SPX encoding. As such, a similarity matrix may be determined based on an analysis filter bank resolution which does not necessarily correspond to the filter bank resolution used within the frequency band scheme which has been applied to the audio signal. This is illustrated in
Overall, it may be stated that more accurate analysis results (both with respect to the actual detection of frequency extension coding, and with respect to the determination of patch parameters) may be achieved when using analysis filter banks with increased frequency resolution, e.g. a frequency resolution which is equal to or higher than the frequency resolution of the filter bank used for frequency extension coding.
As pointed out above, DD+ coding uses a different frequency resolution for frequency extension than HE-AAC. It has been indicated that when using a frequency resolution for the frequency extension detection which differs from the frequency resolution which had actually been used for the frequency extension, the patch borders, i.e. the lowest and/or highest bands of a patch, may be blurred. This information may be used to determine information about the coding system which was applied to the audio signal. In other words, by evaluating the frequency patch borders, the coding scheme may be determined. By way of example, if the patch borders do not fall exactly on the 64 QMF band grid used for determining the similarity matrix, it may be concluded that the coding scheme is not HE-AAC.
It may further be desirable to provide measures for detecting the use of Parametric Stereo (PS) encoding in HE-AACv2 and the use of Coupling in DD/DD+. PS is only relevant for stereo content, while Coupling is applied in stereo and multi-channel audio. In the case of both tools, only data according to a single channel is transmitted within the bitstream along with a small amount of side information which is used in the decoder in order to generate the other channels (i.e. the second stereo channel or the multi-channels) from the transmitted channel. While PS is active over the whole audio bandwidth, Coupling is only applied at higher frequencies. Coupling is related to the concept of Intensity Stereo (IS) coding and can be detected from inter-channel correlation analysis or by comparing the phase information in the left and right channels. PS maintains the inter channel correlation characteristics of the original signal by means of a decorrelation scheme, therefore the phase relation between the left and right channels in PS is complex. However, PS decorrelation leaves a characteristic fingerprint in the average inter-channel phase difference as shown in
An example method for detecting the use of PS encoding may apply any of the following steps:
An example method for detecting the use of coupling (in the case of stereo content) may apply any of the following steps:
As has been outlined above, a spectral bandwidth replication method generates high frequency coefficients based on information in the low frequency coefficients. This implies that the bandwidth replication method introduces a specific relationship or correlation between low and high frequency coefficients. In the following, a further approach for detecting that a (decoded) audio signal has been submitted to spectral bandwidth replication is described. In this approach, a probabilistic model is built that captures the specific relationship between low- and high-frequency coefficients.
In order to capture the relationship between low- and high-frequency coefficients, a training dataset comprising N spectral lowband vectors {x1, x2 . . . xN} may be created. The lowband vectors {x1, x2 . . . xN} are spectral vectors which may be computed from audio signals which have a predetermined maximum frequency Fnarrow (e.g. 8 kHz). That is, {x1, x2 . . . xN} are spectral vectors computed from audio at a sampling rate of e.g. 16 kHz. The lowband vectors may be determined based on the low frequency bands of e.g. HE-AAC or MPEG SBR encoded audio signals, i.e. of audio signals which have a frequency extension coding history.
Furthermore, bandwidth extended versions of these N spectral vectors {x1, x2 . . . xN} may be determined using a bandwidth replication method (e.g., MPEG SBR). The bandwidth extended versions of the vectors {x1, x2 . . . xN} may be referred to as {y1, y2 . . . yN}. The maximum frequency content in {y1, y2 . . . yN} may be a predetermined maximum frequency Fwide (e.g. 16 kHz). This implies that the frequency coefficients between Fnarrow (e.g. 8 kHz) and Fwide (e.g. 16 kHz) are generated based on {x1, x2 . . . xN}.
Given this training data set, a joint density of a set of the vectors {z1, z2 . . . zN} where zj={xj yj} (i.e. a concatenation of the narrow band spectral vector and wide band spectral vector) may be determined as:
with n being the dimensionality of the vectors zi. Q is the number of components in the Gaussian Mixture Model (GMM) used to approximate the joint density p(z|λ), μi is the mean of the ith mixture component and Ci is the covariance of the ith mixture component in the GMM.
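The equation referred to as equation (1) is not reproduced in the present text. A Gaussian mixture density which is consistent with the symbols defined above would take the following standard form, where the mixture weights α_i (non-negative and summing to one) are notation assumed for this sketch and λ denotes the set of all model parameters:

```latex
p(z \mid \lambda) \;=\; \sum_{i=1}^{Q} \alpha_i \,
  \frac{1}{(2\pi)^{n/2}\,\lvert C_i \rvert^{1/2}}
  \exp\!\Big( -\tfrac{1}{2}\,(z-\mu_i)^{T} C_i^{-1} (z-\mu_i) \Big)
  \qquad (1)
```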
Note that the covariance matrix of z (i.e. Ci) can be written as
where Cixx refers to the covariance matrix of the lowband spectral vector, Ciyy refers to the covariance matrix of the wideband spectral vector, and Cixy refers to the cross-covariance matrix between lowband and wideband spectral vector.
Similarly, the mean vector of z (μi) can be written as
where μix is the mean of the lowband spectral vector of the ith mixture component and μiy is the mean of the wideband spectral vector of the ith mixture component.
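The two partitions are not reproduced in the present text. In standard block notation, and using superscripts for the labels x and y, they would presumably read as follows, with the lower left block being the transpose of the cross-covariance matrix:

```latex
C_i = \begin{bmatrix} C_i^{xx} & C_i^{xy} \\ C_i^{yx} & C_i^{yy} \end{bmatrix},
\qquad
C_i^{yx} = \big(C_i^{xy}\big)^{T},
\qquad
\mu_i = \begin{bmatrix} \mu_i^{x} \\ \mu_i^{y} \end{bmatrix}
```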
Based on the joint density, i.e. based on the determined mean vectors μi and covariance matrices Ci, a function F(x) may be defined that maps the lowband spectral vectors (xi) to wideband spectral vectors (yi). In the present example, F(x) is chosen such that it minimizes the mean squared error between the original wideband spectral vector and the reconstructed spectral vector. Under this assumption, F(x) may be determined as
Here E[y|x] refers to the conditional expectation of y given the observed lowband spectral vector x. The term hi(x) refers to the probability that the observed lowband spectral vector x is generated from the ith mixture component of the estimated GMM (see equation (1)).
The term hi(x) can be computed as follows
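The expressions for F(x) and hi(x) are not reproduced in the present text. For a joint Gaussian mixture model, the minimum mean squared error estimator and the component posterior probabilities take the following well known form. The mixture weights α_i and the symbol N(·; μ, C) for a Gaussian density with mean μ and covariance C are notation assumed for this sketch:

```latex
F(x) \;=\; E[\,y \mid x\,] \;=\; \sum_{i=1}^{Q} h_i(x)\,
  \Big[ \mu_i^{y} + C_i^{yx}\,\big(C_i^{xx}\big)^{-1}\,\big(x - \mu_i^{x}\big) \Big]
  \qquad (2)

h_i(x) \;=\; \frac{\alpha_i \,\mathcal{N}\!\big(x;\, \mu_i^{x},\, C_i^{xx}\big)}
                  {\sum_{j=1}^{Q} \alpha_j \,\mathcal{N}\!\big(x;\, \mu_j^{x},\, C_j^{xx}\big)},
\qquad C_i^{yx} = \big(C_i^{xy}\big)^{T}
```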
Using the above described statistical model, an SBR detection scheme may be described as follows. Based on equations (1) and (2) the relationship between low and high frequency components may be captured using a training data set comprising lowband spectral vectors and their corresponding wideband spectral vectors.
Given a novel wideband spectral vector (u) which is determined from a novel (decoded) audio signal, the statistical model may be used to determine whether the high frequency spectral components of the (decoded) audio signal were generated based on a bandwidth replication method. The following steps may be performed in order to detect whether bandwidth replication was performed:
The input wideband spectral vector (u) may be split into two parts u=[ux uhi], wherein ux corresponds to the lowband spectral vector, and uhi corresponds to the high frequency part of the spectrum of the audio signal which may or may not have been created by a bandwidth replication method.
By using the probabilistic model and in particular by using equation (2) a wideband vector F(ux) may be estimated based on ux. The prediction error ∥u−F(ux)∥ would be small if the high frequency components were generated according to the probabilistic model in equation (1). Otherwise, the prediction error would be large indicating that the high frequency components were not generated by a bandwidth replication method. Consequently, by comparing the prediction error ∥u−F(ux)∥ with a suitable error threshold, it may be detected whether SBR was performed on the input vector “u”, i.e. whether the (decoded) audio signal had been submitted to SBR processing.
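The prediction and the threshold comparison may be sketched as follows. To keep the example short, several simplifying assumptions are made: the lowband and the highband are each represented by a single scalar feature (so that all covariance matrices reduce to scalars), only the highband part is predicted, the mixture parameters are taken as given rather than trained, and all names and numeric values are hypothetical.

```python
import math

def gaussian(x, mean, var):
    """Scalar Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def predict_highband(x, components):
    """Conditional expectation of the highband feature given the lowband feature x.

    Each component is a dict with keys weight, mean_x, mean_y, var_x, cov_xy.
    """
    likelihoods = [c["weight"] * gaussian(x, c["mean_x"], c["var_x"]) for c in components]
    total = sum(likelihoods)
    estimate = 0.0
    for c, lik in zip(components, likelihoods):
        h = lik / total  # probability that x stems from this component
        estimate += h * (c["mean_y"] + c["cov_xy"] / c["var_x"] * (x - c["mean_x"]))
    return estimate

def replication_detected(u_low, u_high, components, error_threshold):
    """True if the observed highband feature is close to the model prediction."""
    return abs(u_high - predict_highband(u_low, components)) < error_threshold

# Hypothetical two-component model (values chosen for illustration only).
model = [
    {"weight": 0.5, "mean_x": -1.0, "mean_y": -0.5, "var_x": 0.25, "cov_xy": 0.2},
    {"weight": 0.5, "mean_x": 2.0, "mean_y": 1.0, "var_x": 0.25, "cov_xy": 0.2},
]
prediction = predict_highband(-1.0, model)
```

For a lowband feature of -1.0 the first component dominates and the prediction is very close to -0.5. An observed highband feature of -0.45 would then be classified as replicated for an error threshold of 0.1, whereas an observed value of 1.5 would not. In practice, the features would be vectors and the error threshold would have to be calibrated on training data.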
It should be noted that the above statistical model may alternatively be determined using the lowband vectors {x1, x2 . . . xN} and the corresponding highband vectors {y1, y2 . . . yN}, wherein the highband vectors {y1, y2 . . . yN} have been determined from {x1, x2 . . . xN} using a bandwidth replication method (e.g., MPEG SBR). This means that the vectors {y1, y2 . . . yN} only comprise the highband components which were generated using the bandwidth replication method and not the lowband components from which the highband components are generated. The set of the vectors {z1, z2 . . . zN}, where zj={xj yj}, is determined as a concatenation of the low band spectral vector and the high band spectral vector. By doing this, the dimension of the Gaussian Mixture Model (GMM) can be reduced, thereby reducing the overall complexity. It should be noted that the equations described above are also applicable to the case with {y1, y2 . . . yN} being the highband vectors.
In the present document, methods and systems for analyzing a (decoded) audio signal have been described. The methods and systems may be used to determine if the audio signal had been submitted to a frequency extension based codec, such as HE-AAC or DD+. Furthermore, the methods and systems may be used to detect specific parameters which were used by the frequency extension based codec, such as corresponding pairs of low frequency subbands and high frequency subbands, decoding modes (LP or HQ decoding), the use of parametric stereo encoding, the use of coupling, etc. The described methods and systems are adapted to determine the above mentioned information from the (decoded) audio signal alone, i.e. without any further information regarding the history of the (decoded) audio signal (e.g. a PCM audio signal).
The methods and systems described in the present document may be implemented as software, firmware and/or hardware. Certain components may e.g. be implemented as software running on a digital signal processor or microprocessor. Other components may e.g. be implemented as hardware and/or as application specific integrated circuits.
This application claims priority to U.S. Patent Provisional Application No. 61/488,122, filed 19 May 2011, hereby incorporated by reference in its entirety.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US2012/035785 | 4/30/2012 | WO | 00 | 11/6/2013 |
Number | Date | Country | |
---|---|---|---|
61488122 | May 2011 | US |