This disclosure relates to signal processing.
The development of digital encoding and decoding processes for audio and video data continues to have a significant effect on the delivery of entertainment content. Despite the increased capacity of memory devices and widely available data delivery at increasingly high bandwidths, there is continued pressure to minimize the amount of data to be stored and/or transmitted. Audio and video data are often delivered together, and the bandwidth for audio data is often constrained by the requirements of the video portion.
Accordingly, audio data are often encoded at high compression factors, sometimes at compression factors of 30:1 or higher. Because signal distortion increases with the amount of applied compression, trade-offs may be made between the fidelity of the decoded audio data and the efficiency of storing and/or transmitting the encoded data.
Moreover, it is desirable to reduce the complexity of the encoding and decoding algorithms. Encoding additional data regarding the encoding process can simplify the decoding process, but at the cost of storing and/or transmitting additional encoded data. Although existing audio encoding and decoding methods are generally satisfactory, improved methods would be desirable.
Some aspects of the subject matter described in this disclosure can be implemented in audio processing methods. Some such methods may involve receiving audio data corresponding to a plurality of audio channels. The audio data may include a frequency domain representation corresponding to filterbank coefficients of an audio encoding or processing system. The method may involve applying a decorrelation process to at least some of the audio data. In some implementations, the decorrelation process may be performed with the same filterbank coefficients used by the audio encoding or processing system.
In some implementations, the decorrelation process may be performed without converting coefficients of the frequency domain representation to another frequency domain or time domain representation. The frequency domain representation may be the result of applying a perfect reconstruction, critically-sampled filterbank. The decorrelation process may involve generating reverb signals or decorrelation signals by applying linear filters to at least a portion of the frequency domain representation. The frequency domain representation may be a result of applying a modified discrete sine transform, a modified discrete cosine transform or a lapped orthogonal transform to audio data in a time domain. The decorrelation process may involve applying a decorrelation algorithm that operates entirely on real-valued coefficients.
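By way of illustration only, the following Python sketch shows one way a linear filter might be applied directly to real-valued transform coefficients without converting them to another representation. It assumes the coefficients are arranged as a (blocks x bins) array and runs a first-order all-pass filter along the block axis of each bin. The function name `allpass_decorrelate`, the array layout and the pole value are hypothetical choices made for this example and are not specified by this disclosure.

```python
import numpy as np

def allpass_decorrelate(coeffs, pole=0.5):
    """Filter real-valued coefficients along the block (time) axis.

    coeffs: real array of shape (num_blocks, num_bins), e.g. MDCT coefficients.
    Each bin is filtered independently by the first-order all-pass
        y[n] = -pole * x[n] + x[n-1] + pole * y[n-1]
    so the output remains real-valued.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    out = np.zeros_like(coeffs)
    prev_x = np.zeros(coeffs.shape[1])  # x[n-1] for every bin
    prev_y = np.zeros(coeffs.shape[1])  # y[n-1] for every bin
    for n in range(coeffs.shape[0]):
        out[n] = -pole * coeffs[n] + prev_x + pole * prev_y
        prev_x = coeffs[n]
        prev_y = out[n]
    return out
```

Because the filter in this sketch is all-pass with a real coefficient, the energy in each bin is approximately preserved while the temporal structure across blocks is altered.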
According to some implementations, the decorrelation process may involve selective or signal-adaptive decorrelation of specific channels. Alternatively, or additionally, the decorrelation process may involve selective or signal-adaptive decorrelation of specific frequency bands. The decorrelation process may involve applying a decorrelation filter to a portion of the received audio data to produce filtered audio data. The decorrelation process may involve using a non-hierarchal mixer to combine a direct portion of the received audio data with the filtered audio data according to spatial parameters.
In some implementations, decorrelation information may be received, either with the audio data or otherwise. The decorrelation process may involve decorrelating at least some of the audio data according to the received decorrelation information. The received decorrelation information may include correlation coefficients between individual discrete channels and a coupling channel, correlation coefficients between individual discrete channels, explicit tonality information and/or transient information.
The method may involve determining decorrelation information based on received audio data. The decorrelation process may involve decorrelating at least some of the audio data according to determined decorrelation information. The method may involve receiving decorrelation information encoded with the audio data. The decorrelation process may involve decorrelating at least some of the audio data according to at least one of the received decorrelation information or the determined decorrelation information.
According to some implementations, the audio encoding or processing system may be a legacy audio encoding or processing system. The method may involve receiving control mechanism elements in a bitstream produced by the legacy audio encoding or processing system. The decorrelation process may be based, at least in part, on the control mechanism elements.
In some implementations, an apparatus may include an interface and a logic system configured for receiving, via the interface, audio data corresponding to a plurality of audio channels. The audio data may include a frequency domain representation corresponding to filterbank coefficients of an audio encoding or processing system. The logic system may be configured for applying a decorrelation process to at least some of the audio data. In some implementations, the decorrelation process may be performed with the same filterbank coefficients used by the audio encoding or processing system. The logic system may include at least one of a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
In some implementations, the decorrelation process may be performed without converting coefficients of the frequency domain representation to another frequency domain or time domain representation. The frequency domain representation may be the result of applying a critically-sampled filterbank. The decorrelation process may involve generating reverb signals or decorrelation signals by applying linear filters to at least a portion of the frequency domain representation. The frequency domain representation may be the result of applying a modified discrete sine transform, a modified discrete cosine transform or a lapped orthogonal transform to audio data in a time domain. The decorrelation process may involve applying a decorrelation algorithm that operates entirely on real-valued coefficients.
The decorrelation process may involve selective or signal-adaptive decorrelation of specific channels. The decorrelation process may involve selective or signal-adaptive decorrelation of specific frequency bands. The decorrelation process may involve applying a decorrelation filter to a portion of the received audio data to produce filtered audio data. In some implementations, the decorrelation process may involve using a non-hierarchal mixer to combine the portion of the received audio data with the filtered audio data according to spatial parameters.
The apparatus may include a memory device. In some implementations, the interface may be an interface between the logic system and the memory device. Alternatively, the interface may be a network interface.
The audio encoding or processing system may be a legacy audio encoding or processing system. In some implementations, the logic system may be further configured for receiving, via the interface, control mechanism elements in a bitstream produced by the legacy audio encoding or processing system. The decorrelation process may be based, at least in part, on the control mechanism elements.
Some aspects of this disclosure may be implemented in a non-transitory medium having software stored thereon. The software may include instructions for controlling an apparatus to receive audio data corresponding to a plurality of audio channels. The audio data may include a frequency domain representation corresponding to filterbank coefficients of an audio encoding or processing system. The software may include instructions for controlling the apparatus to apply a decorrelation process to at least some of the audio data. In some implementations, the decorrelation process may be performed with the same filterbank coefficients used by the audio encoding or processing system.
In some implementations, the decorrelation process may be performed without converting coefficients of the frequency domain representation to another frequency domain or time domain representation. The frequency domain representation may be the result of applying a critically-sampled filterbank. The decorrelation process may involve generating reverb signals or decorrelation signals by applying linear filters to at least a portion of the frequency domain representation. The frequency domain representation may be a result of applying a modified discrete sine transform, a modified discrete cosine transform or a lapped orthogonal transform to audio data in a time domain. The decorrelation process may involve applying a decorrelation algorithm that operates entirely on real-valued coefficients.
Some methods may involve receiving audio data corresponding to a plurality of audio channels and determining audio characteristics of the audio data. The audio characteristics may include transient information. The methods may involve determining an amount of decorrelation for the audio data based, at least in part, on the audio characteristics and processing the audio data according to a determined amount of decorrelation.
In some instances, no explicit transient information may be received with the audio data. In some implementations, the process of determining transient information may involve detecting a soft transient event.
The process of determining transient information may involve evaluating a likelihood and/or a severity of a transient event. The process of determining transient information may involve evaluating a temporal power variation in the audio data.
The process of determining the audio characteristics may involve receiving explicit transient information with the audio data. The explicit transient information may include at least one of a transient control value corresponding to a definite transient event, a transient control value corresponding to a definite non-transient event or an intermediate transient control value. The explicit transient information may include an intermediate transient control value or a transient control value corresponding to a definite transient event. The transient control value may be subject to an exponential decay function.
The explicit transient information may indicate a definite transient event. Processing the audio data may involve temporarily halting or slowing a decorrelation process. The explicit transient information may include a transient control value corresponding to a definite non-transient event or an intermediate transient value. The process of determining transient information may involve detecting a soft transient event. The process of detecting a soft transient event may involve evaluating at least one of a likelihood or a severity of a transient event.
The determined transient information may be a determined transient control value corresponding to the soft transient event. The method may involve combining the determined transient control value with the received transient control value to obtain a new transient control value. The process of combining the determined transient control value and the received transient control value may involve determining the maximum of the determined transient control value and the received transient control value.
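The combining step described above may be sketched as follows. This is a minimal example that assumes both transient control values are scalars in the range 0 (definite non-transient event) to 1 (definite transient event); that numeric range and the function name `combine_transient_control` are assumptions made for illustration.

```python
def combine_transient_control(determined, received):
    """Return the new transient control value as the maximum of the two inputs.

    determined: value derived by the decoder, e.g. from a soft transient event.
    received:   explicit value received with the audio data.
    Both are assumed (for this sketch) to lie in [0.0, 1.0].
    """
    for value in (determined, received):
        if not 0.0 <= value <= 1.0:
            raise ValueError("transient control values are assumed to lie in [0, 1]")
    return max(determined, received)
```

Taking the maximum means that neither source of transient information can lower the control value indicated by the other.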
The process of detecting a soft transient event may involve detecting a temporal power variation of the audio data. Detecting the temporal power variation may involve determining a variation in a logarithmic power average. The logarithmic power average may be a frequency-band-weighted logarithmic power average. Determining the variation in the logarithmic power average may involve determining a temporal asymmetric power differential. The asymmetric power differential may emphasize increasing power and may de-emphasize decreasing power. The method may involve determining a raw transient measure based on the asymmetric power differential. Determining the raw transient measure may involve calculating a likelihood function of transient events based on an assumption that the temporal asymmetric power differential is distributed according to a Gaussian distribution. The method may involve determining a transient control value based on the raw transient measure. The method may involve applying an exponential decay function to the transient control value.
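The sequence of steps in the preceding paragraph may be sketched in Python as shown below. The sketch is illustrative only: the specific likelihood formula, the scaling applied to decreasing power, the constants `sigma`, `down_scale` and `decay`, and the function name `soft_transient_control` are all assumptions chosen for this example rather than values given in this disclosure.

```python
import numpy as np

def soft_transient_control(band_power, band_weights,
                           sigma=3.0, down_scale=0.25, decay=0.5):
    """Derive a per-block transient control value in [0, 1].

    band_power:   array (num_blocks, num_bands) of per-band power.
    band_weights: array (num_bands,) of relative band weights.
    """
    band_power = np.asarray(band_power, dtype=float)
    w = np.asarray(band_weights, dtype=float)
    w = w / w.sum()
    # Frequency-band-weighted logarithmic power average (in dB) per block.
    log_power = 10.0 * np.log10(band_power + 1e-12) @ w
    # Temporal differential; the first block has no predecessor, so it is 0.
    diff = np.diff(log_power, prepend=log_power[0])
    # Asymmetric differential: keep increases, shrink decreases (assumed rule).
    asym = np.where(diff >= 0.0, diff, down_scale * diff)
    # Raw transient measure from an assumed zero-mean Gaussian model of asym.
    raw = 1.0 - np.exp(-0.5 * (asym / sigma) ** 2)
    # Exponential decay: hold the value after an event and let it fall off.
    control = np.zeros_like(raw)
    prev = 0.0
    for n, r in enumerate(raw):
        prev = max(r, decay * prev)
        control[n] = prev
    return control
```

In this sketch a 20 dB rise in power drives the control value close to 1, an equal fall in power produces a smaller value, and the value decays by the factor `decay` on each subsequent block in which no new event is detected.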
Some methods may involve applying a decorrelation filter to a portion of the audio data to produce filtered audio data, and mixing the filtered audio data with a portion of the received audio data according to a mixing ratio. The process of determining the amount of decorrelation may involve modifying the mixing ratio based, at least in part, on the transient control value.
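One possible way of modifying a mixing ratio with a transient control value is sketched below. The example assumes a power-preserving mix in which `alpha` weights the direct signal and the square root of (1 - alpha squared) weights the filtered signal, and it assumes that the transient control value moves `alpha` linearly toward 1. Both the mixing law and the function name `mix_with_transient_control` are hypothetical.

```python
import numpy as np

def mix_with_transient_control(direct, filtered, alpha, transient_control):
    """Mix direct and filtered audio data, reducing decorrelation on transients.

    alpha:             nominal weight of the direct signal, in [0, 1].
    transient_control: 0.0 (no transient) to 1.0 (definite transient).
    """
    # Assumed rule: push the direct weight toward 1 as the control value rises.
    alpha_mod = alpha + transient_control * (1.0 - alpha)
    # Weight of the filtered signal chosen so the squared weights sum to 1.
    beta = np.sqrt(max(0.0, 1.0 - alpha_mod ** 2))
    return alpha_mod * np.asarray(direct, dtype=float) \
        + beta * np.asarray(filtered, dtype=float)
```

With a transient control value of 1 the output of this sketch equals the direct signal, which corresponds to temporarily halting decorrelation.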
Some methods may involve applying a decorrelation filter to a portion of the audio data to produce filtered audio data. Determining the amount of decorrelation for the audio data may involve attenuating an input to the decorrelation filter based on the transient information. The process of determining an amount of decorrelation for the audio data may involve reducing an amount of decorrelation in response to detecting a soft transient event.
Processing the audio data may involve applying a decorrelation filter to a portion of the audio data, to produce filtered audio data, and mixing the filtered audio data with a portion of the received audio data according to a mixing ratio. The process of reducing the amount of decorrelation may involve modifying the mixing ratio.
Processing the audio data may involve applying a decorrelation filter to a portion of the audio data to produce filtered audio data, estimating a gain to be applied to the filtered audio data, applying the gain to the filtered audio data and mixing the filtered audio data with a portion of the received audio data.
The estimating process may involve matching a power of the filtered audio data with a power of the received audio data. In some implementations, the processes of estimating and applying the gain may be performed by a bank of duckers. The bank of duckers may include buffers. A fixed delay may be applied to the filtered audio data and the same delay may be applied to the buffers.
At least one of a power estimation smoothing window for the duckers or the gain to be applied to the filtered audio data may be based, at least in part, on determined transient information. In some implementations, a shorter smoothing window may be applied when a transient event is relatively more likely or a relatively stronger transient event is detected, and a longer smoothing window may be applied when a transient event is relatively less likely, a relatively weaker transient event is detected or no transient event is detected.
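A single ducker of the kind described above may be sketched as follows. The example is a simplification under stated assumptions: it uses one wideband ducker rather than a bank, it omits the buffers and fixed delay, it models the smoothing window as a one-pole smoother whose coefficient is interpolated between a `slow` and a `fast` value, and it limits the gain to at most 1. The function name `ducker_gains` and all constants are hypothetical.

```python
import numpy as np

def ducker_gains(direct, filtered, transient_control,
                 slow=0.9, fast=0.3, eps=1e-12):
    """Estimate a per-block gain that keeps filtered power at or below direct power.

    direct, filtered:  arrays of shape (num_blocks, num_bins).
    transient_control: array (num_blocks,) with values in [0, 1].
    """
    direct = np.asarray(direct, dtype=float)
    filtered = np.asarray(filtered, dtype=float)
    transient_control = np.asarray(transient_control, dtype=float)
    p_dir = 0.0  # smoothed power of the direct signal
    p_fil = 0.0  # smoothed power of the filtered signal
    gains = np.empty(direct.shape[0])
    for n in range(direct.shape[0]):
        c = transient_control[n]
        # Higher control value -> smaller coefficient -> shorter smoothing window.
        coeff = (1.0 - c) * slow + c * fast
        p_dir = coeff * p_dir + (1.0 - coeff) * np.sum(direct[n] ** 2)
        p_fil = coeff * p_fil + (1.0 - coeff) * np.sum(filtered[n] ** 2)
        # Match filtered power to direct power, but never amplify.
        gains[n] = min(1.0, np.sqrt((p_dir + eps) / (p_fil + eps)))
    return gains
```

The returned gains would be multiplied with the filtered audio data before mixing. When the direct signal falls silent while the filter output is still ringing, the gain in this sketch drops toward zero.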
Some methods may involve applying a decorrelation filter to a portion of the audio data to produce filtered audio data, estimating a ducker gain to be applied to the filtered audio data, applying the ducker gain to the filtered audio data and mixing the filtered audio data with a portion of the received audio data according to a mixing ratio. The process of determining the amount of decorrelation may involve modifying the mixing ratio based on at least one of the transient information or the ducker gain.
The process of determining the audio characteristics may involve determining at least one of a channel being block switched, a channel being out of coupling or channel coupling not being in use. Determining an amount of decorrelation for the audio data may involve determining that a decorrelation process should be slowed or temporarily halted.
Processing the audio data may involve a decorrelation filter dithering process. The method may involve determining, based at least in part on the transient information, that the decorrelation filter dithering process should be modified or temporarily halted. According to some methods, it may be determined that the decorrelation filter dithering process will be modified by changing a maximum stride value for dithering poles of the decorrelation filter.
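The following sketch illustrates, under assumptions, how a maximum stride value might limit the dithering of filter poles and how transient information might change that limit. It assumes that each pole moves by a random step no longer than the maximum stride, that the stride is scaled by (1 - transient control value), and that poles are kept inside a circle of radius `max_radius`. None of these rules, nor the function name `dither_poles`, is taken from this disclosure.

```python
import numpy as np

def dither_poles(poles, max_stride, transient_control, rng, max_radius=0.95):
    """Move each pole by a random step bounded by an adjusted maximum stride.

    poles:             complex array of current pole locations.
    max_stride:        largest allowed step length when no transient is present.
    transient_control: 0.0 (no transient) to 1.0 (definite transient).
    rng:               a numpy.random.Generator instance.
    """
    poles = np.asarray(poles, dtype=complex)
    # Assumed rule: a transient shrinks the stride; a value of 1 halts dithering.
    stride = max_stride * (1.0 - transient_control)
    step_len = rng.uniform(0.0, stride, size=poles.shape)
    step_dir = rng.uniform(0.0, 2.0 * np.pi, size=poles.shape)
    moved = poles + step_len * np.exp(1j * step_dir)
    # Pull any pole that left the allowed region back onto its boundary.
    radius = np.maximum(np.abs(moved), 1e-12)
    return moved * np.minimum(1.0, max_radius / radius)
```

For example, calling `dither_poles(poles, 0.1, 1.0, np.random.default_rng(0))` leaves the poles where they are, which corresponds to temporarily halting the dithering process.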
According to some implementations, an apparatus may include an interface and a logic system. The logic system may be configured for receiving, from the interface, audio data corresponding to a plurality of audio channels and for determining audio characteristics of the audio data. The audio characteristics may include transient information. The logic system may be configured for determining an amount of decorrelation for the audio data based, at least in part, on the audio characteristics and for processing the audio data according to a determined amount of decorrelation.
In some implementations, no explicit transient information may be received with the audio data. The process of determining transient information may involve detecting a soft transient event. The process of determining transient information may involve evaluating at least one of a likelihood or a severity of a transient event. The process of determining transient information may involve evaluating a temporal power variation in the audio data.
In some implementations, determining the audio characteristics may involve receiving explicit transient information with the audio data. The explicit transient information may indicate at least one of a transient control value corresponding to a definite transient event, a transient control value corresponding to a definite non-transient event or an intermediate transient control value. The explicit transient information may include an intermediate transient control value or a transient control value corresponding to a definite transient event. The transient control value may be subject to an exponential decay function.
If the explicit transient information indicates a definite transient event, processing the audio data may involve temporarily slowing or halting a decorrelation process. If the explicit transient information includes a transient control value corresponding to a definite non-transient event or an intermediate transient value, the process of determining transient information may involve detecting a soft transient event. The determined transient information may be a determined transient control value corresponding to the soft transient event.
The logic system may be further configured for combining the determined transient control value with the received transient control value to obtain a new transient control value. In some implementations, the process of combining the determined transient control value and the received transient control value may involve determining the maximum of the determined transient control value and the received transient control value.
The process of detecting a soft transient event may involve evaluating at least one of a likelihood or a severity of a transient event. The process of detecting a soft transient event may involve detecting a temporal power variation of the audio data.
In some implementations, the logic system may be further configured for applying a decorrelation filter to a portion of the audio data to produce filtered audio data and mixing the filtered audio data with a portion of the received audio data according to a mixing ratio. The process of determining the amount of decorrelation may involve modifying the mixing ratio based, at least in part, on the transient information.
The process of determining an amount of decorrelation for the audio data may involve reducing an amount of decorrelation in response to detecting the soft transient event. Processing the audio data may involve applying a decorrelation filter to a portion of the audio data, to produce filtered audio data, and mixing the filtered audio data with a portion of the received audio data according to a mixing ratio. The process of reducing the amount of decorrelation may involve modifying the mixing ratio.
Processing the audio data may involve applying a decorrelation filter to a portion of the audio data to produce filtered audio data, estimating a gain to be applied to the filtered audio data, applying the gain to the filtered audio data and mixing the filtered audio data with a portion of the received audio data. The estimating process may involve matching a power of the filtered audio data with a power of the received audio data. The logic system may include a bank of duckers configured to perform the processes of estimating and applying the gain.
Some aspects of this disclosure may be implemented in a non-transitory medium having software stored thereon. The software may include instructions to control an apparatus for receiving audio data corresponding to a plurality of audio channels and for determining audio characteristics of the audio data. In some implementations, the audio characteristics may include transient information. The software may include instructions to control the apparatus for determining an amount of decorrelation for the audio data based, at least in part, on the audio characteristics and for processing the audio data according to a determined amount of decorrelation.
In some instances, no explicit transient information may be received with the audio data. The process of determining transient information may involve detecting a soft transient event. The process of determining transient information may involve evaluating at least one of a likelihood or a severity of a transient event. The process of determining transient information may involve evaluating a temporal power variation in the audio data.
However, in some implementations determining the audio characteristics may involve receiving explicit transient information with the audio data. The explicit transient information may include a transient control value corresponding to a definite transient event, a transient control value corresponding to a definite non-transient event and/or an intermediate transient control value. If the explicit transient information indicates a transient event, processing the audio data may involve temporarily halting or slowing a decorrelation process.
If the explicit transient information includes a transient control value corresponding to a definite non-transient event or an intermediate transient value, the process of determining transient information may involve detecting a soft transient event. The determined transient information may be a determined transient control value corresponding to the soft transient event. The process of determining transient information may involve combining the determined transient control value with the received transient control value to obtain a new transient control value. The process of combining the determined transient control value and the received transient control value may involve determining the maximum of the determined transient control value and the received transient control value.
The process of detecting a soft transient event may involve evaluating at least one of a likelihood or a severity of a transient event. The process of detecting a soft transient event may involve detecting a temporal power variation of the audio data.
The software may include instructions for controlling the apparatus to apply a decorrelation filter to a portion of the audio data to produce filtered audio data and to mix the filtered audio data with a portion of the received audio data according to a mixing ratio. The process of determining the amount of decorrelation may involve modifying the mixing ratio based, at least in part, on the transient information. The process of determining an amount of decorrelation for the audio data may involve reducing an amount of decorrelation in response to detecting the soft transient event.
Processing the audio data may involve applying a decorrelation filter to a portion of the audio data, to produce filtered audio data, and mixing the filtered audio data with a portion of the received audio data according to a mixing ratio. The process of reducing the amount of decorrelation may involve modifying the mixing ratio.
Processing the audio data may involve applying a decorrelation filter to a portion of the audio data to produce filtered audio data, estimating a gain to be applied to the filtered audio data, applying the gain to the filtered audio data and mixing the filtered audio data with a portion of the received audio data. The estimating process may involve matching a power of the filtered audio data with a power of the received audio data.
Some methods may involve receiving audio data corresponding to a plurality of audio channels and determining audio characteristics of the audio data. The audio characteristics may include transient information. The transient information may include an intermediate transient control value indicating a transient value between a definite transient event and a definite non-transient event. Such methods also may involve forming encoded audio data frames that include encoded transient information.
The encoded transient information may include one or more control flags. The method may involve coupling at least a portion of two or more channels of the audio data into at least one coupling channel. The control flags may include at least one of a channel block switch flag, a channel out-of-coupling flag or a coupling-in-use flag. The method may involve determining a combination of one or more of the control flags to form encoded transient information that indicates at least one of a definite transient event, a definite non-transient event, a likelihood of a transient event or a severity of a transient event.
The process of determining transient information may involve evaluating at least one of a likelihood or a severity of a transient event. The encoded transient information may indicate at least one of a definite transient event, a definite non-transient event, the likelihood of a transient event or the severity of a transient event. The process of determining transient information may involve evaluating a temporal power variation in the audio data.
The encoded transient information may include a transient control value corresponding to a transient event. The transient control value may be subject to an exponential decay function. The transient information may indicate that a decorrelation process should be temporarily slowed or halted.
The transient information may indicate that a mixing ratio of a decorrelation process should be modified. For example, the transient information may indicate that an amount of decorrelation in a decorrelation process should be temporarily reduced.
Some methods may involve receiving audio data corresponding to a plurality of audio channels and determining audio characteristics of the audio data. The audio characteristics may include spatial parameter data. The methods may involve determining at least two decorrelation filtering processes for the audio data based, at least in part, on the audio characteristics. The decorrelation filtering processes may cause a specific inter-decorrelation signal coherence (“IDC”) between channel-specific decorrelation signals for at least one pair of channels. The decorrelation filtering processes may involve applying a decorrelation filter to at least a portion of the audio data to produce filtered audio data. The channel-specific decorrelation signals may be produced by performing operations on the filtered audio data.
The methods may involve applying the decorrelation filtering processes to at least a portion of the audio data to produce the channel-specific decorrelation signals, determining mixing parameters based, at least in part, on the audio characteristics and mixing the channel-specific decorrelation signals with a direct portion of the audio data according to the mixing parameters. The direct portion may correspond to the portion to which the decorrelation filter is applied.
The method also may involve receiving information regarding a number of output channels. The process of determining at least two decorrelation filtering processes for the audio data may be based, at least in part, on the number of output channels. The receiving process may involve receiving audio data corresponding to N input audio channels. The method may involve determining that the audio data for N input audio channels will be downmixed or upmixed to audio data for K output audio channels and producing decorrelated audio data corresponding to the K output audio channels.
The method may involve downmixing or upmixing the audio data for N input audio channels to audio data for M intermediate audio channels, producing decorrelated audio data for the M intermediate audio channels and downmixing or upmixing the decorrelated audio data for the M intermediate audio channels to decorrelated audio data for K output audio channels. Determining the two decorrelation filtering processes for the audio data may be based, at least in part, on the number M of intermediate audio channels. The decorrelation filtering processes may be determined based, at least in part, on N-to-K, M-to-K or N-to-M mixing equations.
The method also may involve controlling inter-channel coherence (“ICC”) between a plurality of audio channel pairs. The process of controlling ICC may involve at least one of receiving an ICC value or determining an ICC value based, at least in part, on the spatial parameter data.
The process of controlling ICC may involve at least one of receiving a set of ICC values or determining the set of ICC values based, at least in part, on the spatial parameter data. The method also may involve determining a set of IDC values based, at least in part, on the set of ICC values and synthesizing a set of channel-specific decorrelation signals that corresponds with the set of IDC values by performing operations on the filtered audio data.
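The relationship between an ICC value and an IDC value can be illustrated under a simple assumed mixing model. Suppose each output channel i is formed as alpha_i times a common unit-power direct signal plus sqrt(1 - alpha_i^2) times a unit-power decorrelation signal that is uncorrelated with the direct signal. The coherence between channels i and j is then alpha_i * alpha_j + sqrt(1 - alpha_i^2) * sqrt(1 - alpha_j^2) * IDC_ij, which can be solved for IDC_ij. The model, the symbol names and the function name `idc_from_icc` are assumptions made for this sketch.

```python
import numpy as np

def idc_from_icc(icc, alpha_i, alpha_j):
    """Solve the assumed mixing model for the inter-decorrelation signal coherence.

    icc:              target coherence between output channels i and j.
    alpha_i, alpha_j: direct-signal weights of the two channels, each in [0, 1).
    """
    denom = np.sqrt(1.0 - alpha_i ** 2) * np.sqrt(1.0 - alpha_j ** 2)
    if denom <= 0.0:
        raise ValueError("a channel with alpha == 1 carries no decorrelation signal")
    idc = (icc - alpha_i * alpha_j) / denom
    # A coherence is bounded, so clip targets that the model cannot reach.
    return float(np.clip(idc, -1.0, 1.0))
```

Under this model, two channels that each use alpha = 0.6 require uncorrelated decorrelation signals (IDC of 0) to reach an ICC of 0.36, and identical decorrelation signals (IDC of 1) to reach an ICC of 1.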
The method also may involve a process of conversion between a first representation of the spatial parameter data and a second representation of the spatial parameter data. The first representation of the spatial parameter data may include a representation of coherence between individual discrete channels and a coupling channel. The second representation of the spatial parameter data may include a representation of coherence between the individual discrete channels.
The process of applying the decorrelation filtering processes to at least a portion of the audio data may involve applying the same decorrelation filter to audio data for a plurality of channels to produce the filtered audio data and multiplying the filtered audio data corresponding to a left channel or a right channel by −1. The method also may involve reversing a polarity of filtered audio data corresponding to a left surround channel with reference to the filtered audio data corresponding to the left channel and reversing a polarity of filtered audio data corresponding to a right surround channel with reference to the filtered audio data corresponding to the right channel.
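The polarity pattern described above may be sketched as follows. The example assumes that the right channel is the one multiplied by -1 and that the filtered audio data for each channel has already been produced by the same decorrelation filter. The channel labels and the function name `channel_decorrelation_signals` are hypothetical.

```python
import numpy as np

# Signs per channel: R is inverted relative to L, Ls is inverted relative
# to L, and Rs is inverted relative to R (so Rs ends up with the sign of L).
SIGNS = {"L": 1.0, "R": -1.0, "Ls": -1.0, "Rs": 1.0}

def channel_decorrelation_signals(filtered):
    """Apply the per-channel polarity to already-filtered audio data.

    filtered: dict mapping "L", "R", "Ls", "Rs" to arrays of filtered data.
    """
    return {ch: SIGNS[ch] * np.asarray(filtered[ch], dtype=float) for ch in SIGNS}
```

When the channels carry the same filtered content, this arrangement makes the decorrelation signals of each left/right pair and each front/surround pair on the same side exactly out of phase, while requiring only one decorrelation filter.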
The process of applying the decorrelation filtering processes to at least a portion of the audio data may involve applying a first decorrelation filter to audio data for a first and second channel to produce first channel filtered data and second channel filtered data and applying a second decorrelation filter to audio data for a third and fourth channel to produce third channel filtered data and fourth channel filtered data. The first channel may be a left channel, the second channel may be a right channel, the third channel may be a left surround channel and the fourth channel may be a right surround channel. The method also may involve reversing a polarity of the first channel filtered data relative to the second channel filtered data and reversing a polarity of the third channel filtered data relative to the fourth channel filtered data. The processes of determining at least two decorrelation filtering processes for the audio data may involve either determining that a different decorrelation filter will be applied to audio data for a center channel or determining that a decorrelation filter will not be applied to the audio data for the center channel.
The method also may involve receiving channel-specific scaling factors and a coupling channel signal corresponding to a plurality of coupled channels. The applying process may involve applying at least one of the decorrelation filtering processes to the coupling channel to generate channel-specific filtered audio data and applying the channel-specific scaling factors to the channel-specific filtered audio data to produce the channel-specific decorrelation signals.
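As a minimal sketch of the coupling-channel case described above, the following fragment filters the coupling channel once and then applies a channel-specific scaling factor per coupled channel. The names `channel_decorrelation_signals` and `one_sample_delay` are hypothetical, and the one-sample delay is only a placeholder standing in for whatever decorrelation filter an implementation actually uses.

```python
def channel_decorrelation_signals(coupling, scale_factors, decorrelate):
    """Filter the coupling channel, then scale the result per channel."""
    filtered = decorrelate(coupling)
    return {ch: [g * x for x in filtered] for ch, g in scale_factors.items()}


def one_sample_delay(x):
    # Placeholder filter for illustration; NOT the disclosure's filter.
    return [0.0] + list(x[:-1])


out = channel_decorrelation_signals(
    [1.0, 2.0, 4.0],          # coupling channel coefficients (made up)
    {"L": 0.5, "R": 2.0},     # channel-specific scaling factors (made up)
    one_sample_delay,
)
```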
The method also may involve determining decorrelation signal synthesizing parameters based, at least in part, on the spatial parameter data. The decorrelation signal synthesizing parameters may be output-channel-specific decorrelation signal synthesizing parameters. The method also may involve receiving a coupling channel signal corresponding to a plurality of coupled channels and channel-specific scaling factors. At least one of the processes of determining at least two decorrelation filtering processes for the audio data and applying the decorrelation filtering processes to a portion of the audio data may involve generating a set of seed decorrelation signals by applying a set of decorrelation filters to the coupling channel signal, sending the seed decorrelation signals to a synthesizer, applying the output-channel-specific decorrelation signal synthesizing parameters to the seed decorrelation signals received by the synthesizer to produce channel-specific synthesized decorrelation signals, multiplying the channel-specific synthesized decorrelation signals with channel-specific scaling factors appropriate for each channel to produce scaled channel-specific synthesized decorrelation signals and outputting the scaled channel-specific synthesized decorrelation signals to a direct signal and decorrelation signal mixer.
The method also may involve receiving channel-specific scaling factors. At least one of the processes of determining at least two decorrelation filtering processes for the audio data and applying the decorrelation filtering processes to a portion of the audio data may involve: generating a set of channel-specific seed decorrelation signals by applying a set of decorrelation filters to the audio data; sending the channel-specific seed decorrelation signals to a synthesizer; determining a set of channel-pair-specific level adjusting parameters based, at least in part, on the channel-specific scaling factors; applying the output-channel-specific decorrelation signal synthesizing parameters and the channel-pair-specific level adjusting parameters to the channel-specific seed decorrelation signals received by the synthesizer to produce channel-specific synthesized decorrelation signals; and outputting the channel-specific synthesized decorrelation signals to a direct signal and decorrelation signal mixer.
Determining the output-channel-specific decorrelation signal synthesizing parameters may involve determining a set of IDC values based, at least in part, on the spatial parameter data and determining output-channel-specific decorrelation signal synthesizing parameters that correspond with the set of IDC values. The set of IDC values may be determined, at least in part, according to a coherence between individual discrete channels and a coupling channel and a coherence between pairs of individual discrete channels.
The mixing process may involve using a non-hierarchal mixer to combine the channel-specific decorrelation signals with the direct portion of the audio data. Determining the audio characteristics may involve receiving explicit audio characteristic information with the audio data. Determining the audio characteristics may involve determining audio characteristic information based on one or more attributes of the audio data. The spatial parameter data may include a representation of coherence between individual discrete channels and a coupling channel and/or a representation of coherence between pairs of individual discrete channels. The audio characteristics may include at least one of tonality information or transient information.
Determining the mixing parameters may be based, at least in part, on the spatial parameter data. The method also may involve providing the mixing parameters to a direct signal and decorrelation signal mixer. The mixing parameters may be output-channel-specific mixing parameters. The method also may involve determining modified output-channel-specific mixing parameters based, at least in part, on the output-channel-specific mixing parameters and transient control information.
According to some implementations, an apparatus may include an interface and a logic system configured for receiving audio data corresponding to a plurality of audio channels and determining audio characteristics of the audio data. The audio characteristics may include spatial parameter data. The logic system may be configured for determining at least two decorrelation filtering processes for the audio data based, at least in part, on the audio characteristics. The decorrelation filtering processes may cause a specific IDC between channel-specific decorrelation signals for at least one pair of channels. The decorrelation filtering processes may involve applying a decorrelation filter to at least a portion of the audio data to produce filtered audio data. The channel-specific decorrelation signals may be produced by performing operations on the filtered audio data.
The logic system may be configured for: applying the decorrelation filtering processes to at least a portion of the audio data to produce the channel-specific decorrelation signals; determining mixing parameters based, at least in part, on the audio characteristics; and mixing the channel-specific decorrelation signals with a direct portion of the audio data according to the mixing parameters. The direct portion may correspond to the portion to which the decorrelation filter is applied.
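The mixing step may be illustrated with the following sketch, which combines a direct signal with a channel-specific decorrelation signal using a single mixing parameter `alpha`. The power-preserving weight `sqrt(1 - alpha**2)` on the decorrelation signal is an assumption chosen for the example; this summary does not specify the form of the mixing parameters, and the function name `mix_direct_and_decorrelated` is hypothetical.

```python
import math


def mix_direct_and_decorrelated(direct, decorr, alpha):
    """Return alpha * direct + sqrt(1 - alpha^2) * decorr, element-wise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    beta = math.sqrt(1.0 - alpha * alpha)  # assumed power-preserving weight
    return [alpha * x + beta * d for x, d in zip(direct, decorr)]


mixed = mix_direct_and_decorrelated([1.0, 0.0], [0.0, 1.0], 0.6)
```

Under this assumption, `alpha = 1` passes the direct signal unchanged and `alpha = 0` outputs only the decorrelation signal.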
The receiving process may involve receiving information regarding a number of output channels. The process of determining at least two decorrelation filtering processes for the audio data may be based, at least in part, on the number of output channels. For example, the receiving process may involve receiving audio data corresponding to N input audio channels and the logic system may be configured for: determining that the audio data for N input audio channels will be downmixed or upmixed to audio data for K output audio channels and producing decorrelated audio data corresponding to the K output audio channels.
The logic system may be further configured for: downmixing or upmixing the audio data for N input audio channels to audio data for M intermediate audio channels; producing decorrelated audio data for the M intermediate audio channels; and downmixing or upmixing the decorrelated audio data for the M intermediate audio channels to decorrelated audio data for K output audio channels.
The decorrelation filtering processes may be determined based, at least in part, on N-to-K mixing equations. Determining the two decorrelation filtering processes for the audio data may be based, at least in part, on the number M of intermediate audio channels. The decorrelation filtering processes may be determined based, at least in part, on M-to-K or N-to-M mixing equations.
The logic system may be further configured for controlling ICC between a plurality of audio channel pairs. The process of controlling ICC may involve at least one of receiving an ICC value or determining an ICC value based, at least in part, on the spatial parameter data. The logic system may be further configured for determining a set of IDC values based, at least in part, on the set of ICC values and synthesizing a set of channel-specific decorrelation signals that corresponds with the set of IDC values by performing operations on the filtered audio data.
The logic system may be further configured for a process of conversion between a first representation of the spatial parameter data and a second representation of the spatial parameter data. The first representation of the spatial parameter data may include a representation of coherence between individual discrete channels and a coupling channel. The second representation of the spatial parameter data may include a representation of coherence between the individual discrete channels.
The process of applying the decorrelation filtering processes to at least a portion of the audio data may involve applying the same decorrelation filter to audio data for a plurality of channels to produce the filtered audio data and multiplying the filtered audio data corresponding to a left channel or a right channel by −1. The logic system may be further configured for reversing a polarity of filtered audio data corresponding to a left surround channel with reference to the filtered audio data corresponding to the left-side channel and reversing a polarity of filtered audio data corresponding to a right surround channel with reference to the filtered audio data corresponding to the right-side channel.
The process of applying the decorrelation filtering processes to at least a portion of the audio data may involve applying a first decorrelation filter to audio data for a first and second channel to produce first channel filtered data and second channel filtered data, and applying a second decorrelation filter to audio data for a third and fourth channel to produce third channel filtered data and fourth channel filtered data. The first channel may be a left-side channel, the second channel may be a right-side channel, the third channel may be a left surround channel and the fourth channel may be a right surround channel.
The logic system may be further configured for reversing a polarity of the first channel filtered data relative to the second channel filtered data and reversing a polarity of the third channel filtered data relative to the fourth channel filtered data. The processes of determining at least two decorrelation filtering processes for the audio data may involve either determining that a different decorrelation filter will be applied to audio data for a center channel or determining that a decorrelation filter will not be applied to the audio data for the center channel.
The logic system may be further configured for receiving, from the interface, channel-specific scaling factors and a coupling channel signal corresponding to a plurality of coupled channels. The applying process may involve applying at least one of the decorrelation filtering processes to the coupling channel to generate channel-specific filtered audio data and applying the channel-specific scaling factors to the channel-specific filtered audio data to produce the channel-specific decorrelation signals.
The logic system may be further configured for determining decorrelation signal synthesizing parameters based, at least in part, on the spatial parameter data. The decorrelation signal synthesizing parameters may be output-channel-specific decorrelation signal synthesizing parameters. The logic system may be further configured for receiving, from the interface, a coupling channel signal corresponding to a plurality of coupled channels and channel-specific scaling factors.
At least one of the processes of determining at least two decorrelation filtering processes for the audio data and applying the decorrelation filtering processes to a portion of the audio data may involve: generating a set of seed decorrelation signals by applying a set of decorrelation filters to the coupling channel signal; sending the seed decorrelation signals to a synthesizer; applying the output-channel-specific decorrelation signal synthesizing parameters to the seed decorrelation signals received by the synthesizer to produce channel-specific synthesized decorrelation signals; multiplying the channel-specific synthesized decorrelation signals with channel-specific scaling factors appropriate for each channel to produce scaled channel-specific synthesized decorrelation signals; and outputting the scaled channel-specific synthesized decorrelation signals to a direct signal and decorrelation signal mixer.
At least one of the processes of determining at least two decorrelation filtering processes for the audio data and applying the decorrelation filtering processes to a portion of the audio data may involve: generating a set of channel-specific seed decorrelation signals by applying a set of channel-specific decorrelation filters to the audio data; sending the channel-specific seed decorrelation signals to a synthesizer; determining channel-pair-specific level adjusting parameters based, at least in part, on the channel-specific scaling factors; applying the output-channel-specific decorrelation signal synthesizing parameters and the channel-pair-specific level adjusting parameters to the channel-specific seed decorrelation signals received by the synthesizer to produce channel-specific synthesized decorrelation signals; and outputting the channel-specific synthesized decorrelation signals to a direct signal and decorrelation signal mixer.
Determining the output-channel-specific decorrelation signal synthesizing parameters may involve determining a set of IDC values based, at least in part, on the spatial parameter data and determining output-channel-specific decorrelation signal synthesizing parameters that correspond with the set of IDC values. The set of IDC values may be determined, at least in part, according to a coherence between individual discrete channels and a coupling channel and a coherence between pairs of individual discrete channels.
The mixing process may involve using a non-hierarchal mixer to combine the channel-specific decorrelation signals with the direct portion of the audio data. Determining the audio characteristics may involve receiving explicit audio characteristic information with the audio data. Determining the audio characteristics may involve determining audio characteristic information based on one or more attributes of the audio data. The audio characteristics may include tonality information and/or transient information.
The spatial parameter data may include a representation of coherence between individual discrete channels and a coupling channel and/or a representation of coherence between pairs of individual discrete channels. Determining the mixing parameters may be based, at least in part, on the spatial parameter data.
The logic system may be further configured for providing the mixing parameters to a direct signal and decorrelation signal mixer. The mixing parameters may be output-channel-specific mixing parameters. The logic system may be further configured for determining modified output-channel-specific mixing parameters based, at least in part, on the output-channel-specific mixing parameters and transient control information.
The apparatus may include a memory device. The interface may be an interface between the logic system and the memory device. Alternatively, the interface may be a network interface.
Some aspects of this disclosure may be implemented in a non-transitory medium having software stored thereon. The software may include instructions to control an apparatus for receiving audio data corresponding to a plurality of audio channels and for determining audio characteristics of the audio data. The audio characteristics may include spatial parameter data. The software may include instructions to control the apparatus for determining at least two decorrelation filtering processes for the audio data based, at least in part, on the audio characteristics. The decorrelation filtering processes may cause a specific IDC between channel-specific decorrelation signals for at least one pair of channels. The decorrelation filtering processes may involve applying a decorrelation filter to at least a portion of the audio data to produce filtered audio data. The channel-specific decorrelation signals may be produced by performing operations on the filtered audio data.
The software may include instructions to control the apparatus for applying the decorrelation filtering processes to at least a portion of the audio data to produce the channel-specific decorrelation signals; determining mixing parameters based, at least in part, on the audio characteristics; and mixing the channel-specific decorrelation signals with a direct portion of the audio data according to the mixing parameters. The direct portion may correspond to the portion to which the decorrelation filter is applied.
The software may include instructions for controlling the apparatus to receive information regarding a number of output channels. The process of determining at least two decorrelation filtering processes for the audio data may be based, at least in part, on the number of output channels. For example, the receiving process may involve receiving audio data corresponding to N input audio channels. The software may include instructions for controlling the apparatus to determine that the audio data for N input audio channels will be downmixed or upmixed to audio data for K output audio channels and to produce decorrelated audio data corresponding to the K output audio channels.
The software may include instructions for controlling the apparatus to: downmix or upmix the audio data for N input audio channels to audio data for M intermediate audio channels; produce decorrelated audio data for the M intermediate audio channels; and downmix or upmix the decorrelated audio data for the M intermediate audio channels to decorrelated audio data for K output audio channels.
Determining the two decorrelation filtering processes for the audio data may be based, at least in part, on the number M of intermediate audio channels. The decorrelation filtering processes may be determined based, at least in part, on N-to-K, M-to-K or N-to-M mixing equations.
The software may include instructions for controlling the apparatus to perform a process of controlling ICC between a plurality of audio channel pairs. The process of controlling ICC may involve receiving an ICC value and/or determining an ICC value based, at least in part, on the spatial parameter data. The process of controlling ICC may involve at least one of receiving a set of ICC values or determining the set of ICC values based, at least in part, on the spatial parameter data. The software may include instructions for controlling the apparatus to perform processes of determining a set of IDC values based, at least in part, on the set of ICC values and synthesizing a set of channel-specific decorrelation signals that corresponds with the set of IDC values by performing operations on the filtered audio data.
The process of applying the decorrelation filtering processes to at least a portion of the audio data may involve applying the same decorrelation filter to audio data for a plurality of channels to produce the filtered audio data and multiplying the filtered audio data corresponding to a left channel or a right channel by −1. The software may include instructions for controlling the apparatus to perform processes of reversing a polarity of filtered audio data corresponding to a left surround channel with reference to the filtered audio data corresponding to the left-side channel and reversing a polarity of filtered audio data corresponding to a right surround channel with reference to the filtered audio data corresponding to the right-side channel.
The process of applying the decorrelation filter to a portion of the audio data may involve applying a first decorrelation filter to audio data for a first and second channel to produce first channel filtered data and second channel filtered data and applying a second decorrelation filter to audio data for a third and fourth channel to produce third channel filtered data and fourth channel filtered data. The first channel may be a left-side channel, the second channel may be a right-side channel, the third channel may be a left surround channel and the fourth channel may be a right surround channel.
The software may include instructions for controlling the apparatus to perform processes of reversing a polarity of the first channel filtered data relative to the second channel filtered data and reversing a polarity of the third channel filtered data relative to the fourth channel filtered data. The processes of determining at least two decorrelation filtering processes for the audio data may involve either determining that a different decorrelation filter will be applied to audio data for a center channel or determining that a decorrelation filter will not be applied to the audio data for the center channel.
The software may include instructions for controlling the apparatus to receive channel-specific scaling factors and a coupling channel signal corresponding to a plurality of coupled channels. The applying process may involve applying at least one of the decorrelation filtering processes to the coupling channel to generate channel-specific filtered audio data and applying the channel-specific scaling factors to the channel-specific filtered audio data to produce the channel-specific decorrelation signals.
The software may include instructions for controlling the apparatus to determine decorrelation signal synthesizing parameters based, at least in part, on the spatial parameter data. The decorrelation signal synthesizing parameters may be output-channel-specific decorrelation signal synthesizing parameters. The software may include instructions for controlling the apparatus to receive a coupling channel signal corresponding to a plurality of coupled channels and channel-specific scaling factors. At least one of the processes of determining at least two decorrelation filtering processes for the audio data and applying the decorrelation filtering processes to a portion of the audio data may involve: generating a set of seed decorrelation signals by applying a set of decorrelation filters to the coupling channel signal; sending the seed decorrelation signals to a synthesizer; applying the output-channel-specific decorrelation signal synthesizing parameters to the seed decorrelation signals received by the synthesizer to produce channel-specific synthesized decorrelation signals; multiplying the channel-specific synthesized decorrelation signals with channel-specific scaling factors appropriate for each channel to produce scaled channel-specific synthesized decorrelation signals; and outputting the scaled channel-specific synthesized decorrelation signals to a direct signal and decorrelation signal mixer.
The software may include instructions for controlling the apparatus to receive a coupling channel signal corresponding to a plurality of coupled channels and channel-specific scaling factors. At least one of the processes of determining at least two decorrelation filtering processes for the audio data and applying the decorrelation filtering processes to a portion of the audio data may involve: generating a set of channel-specific seed decorrelation signals by applying a set of channel-specific decorrelation filters to the audio data; sending the channel-specific seed decorrelation signals to a synthesizer; determining channel-pair-specific level adjusting parameters based, at least in part, on the channel-specific scaling factors; applying the output-channel-specific decorrelation signal synthesizing parameters and the channel-pair-specific level adjusting parameters to the channel-specific seed decorrelation signals received by the synthesizer to produce channel-specific synthesized decorrelation signals; and outputting the channel-specific synthesized decorrelation signals to a direct signal and decorrelation signal mixer.
Determining the output-channel-specific decorrelation signal synthesizing parameters may involve determining a set of IDC values based, at least in part, on the spatial parameter data and determining output-channel-specific decorrelation signal synthesizing parameters that correspond with the set of IDC values. The set of IDC values may be determined, at least in part, according to a coherence between individual discrete channels and a coupling channel and a coherence between pairs of individual discrete channels.
In some implementations, a method may involve: receiving audio data comprising a first set of frequency coefficients and a second set of frequency coefficients; estimating, based on at least part of the first set of frequency coefficients, spatial parameters for at least part of the second set of frequency coefficients; and applying the estimated spatial parameters to the second set of frequency coefficients to generate a modified second set of frequency coefficients. The first set of frequency coefficients may correspond to a first frequency range and the second set of frequency coefficients may correspond to a second frequency range. The first frequency range may be below the second frequency range.
The audio data may include data corresponding to individual channels and a coupled channel. The first frequency range may correspond to an individual channel frequency range and the second frequency range may correspond to a coupled channel frequency range. The applying process may involve applying the estimated spatial parameters on a per-channel basis.
The audio data may include frequency coefficients in the first frequency range for two or more channels. The estimating process may involve calculating combined frequency coefficients of a composite coupling channel based on frequency coefficients of the two or more channels and computing, for at least a first channel, cross-correlation coefficients between frequency coefficients of the first channel and the combined frequency coefficients. The combined frequency coefficients may correspond to the first frequency range.
The cross-correlation coefficients may be normalized cross-correlation coefficients. The first set of frequency coefficients may include audio data for a plurality of channels. The estimating process may involve estimating normalized cross-correlation coefficients for multiple channels of the plurality of channels. The estimating process may involve dividing at least part of the first frequency range into first frequency range bands and computing a normalized cross-correlation coefficient for each first frequency range band.
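The estimation steps described above may be sketched as follows. The sketch assumes the composite coupling channel is formed by summing the channels' coefficients bin by bin, and that bands are given as a list of bin-index edges; both choices, along with all function names, are hypothetical and are used only for illustration.

```python
import math


def composite_coupling(channels):
    """Combine per-channel coefficients bin by bin (assumed: plain sum)."""
    return [sum(vals) for vals in zip(*channels)]


def normalized_xcorr(x, y):
    """Normalized cross-correlation of two real-valued sequences."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den > 0.0 else 0.0  # guard against silent bands


def per_band_xcorr(channel, composite, band_edges):
    """One normalized coefficient per band [edges[i], edges[i+1])."""
    return [
        normalized_xcorr(channel[lo:hi], composite[lo:hi])
        for lo, hi in zip(band_edges[:-1], band_edges[1:])
    ]


left = [1.0, 2.0, 3.0, 4.0]
right = [1.0, 2.0, -3.0, -4.0]
comp = composite_coupling([left, right])      # [2.0, 4.0, 0.0, 0.0]
cc_left = per_band_xcorr(left, comp, [0, 2, 4])
```

All operations in this sketch use real-valued coefficients only.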
In some implementations, the estimating process may involve averaging the normalized cross-correlation coefficients across all of the first frequency range bands of a channel and applying a scaling factor to the average of the normalized cross-correlation coefficients to obtain the estimated spatial parameters for the channel. The process of averaging the normalized cross-correlation coefficients may involve averaging across a time segment of a channel. The scaling factor may decrease with increasing frequency.
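As a sketch of the averaging and scaling described above, the following fragment averages the per-band normalized cross-correlation coefficients of one channel and applies one scaling factor per band of the second frequency range. The function name and the particular scaling values are hypothetical; the values are chosen only so that they decrease with increasing frequency.

```python
def estimate_spatial_parameters(band_xcorr, scale_factors):
    """Average first-range coefficients, then scale per second-range band."""
    mean_cc = sum(band_xcorr) / len(band_xcorr)
    return [s * mean_cc for s in scale_factors]


# Made-up inputs: three first-range bands, three second-range bands.
estimates = estimate_spatial_parameters([0.8, 0.6, 1.0], [0.9, 0.8, 0.7])
```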
The method may involve the addition of noise to model the variance of the estimated spatial parameters. The variance of added noise may be based, at least in part, on the variance in the normalized cross-correlation coefficients. The variance of added noise may be dependent, at least in part, on a prediction of the spatial parameter across bands, the dependence of the variance on the prediction being based on empirical data.
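The noise step may be illustrated with the sketch below, which perturbs each estimated spatial parameter with zero-mean Gaussian noise. Setting the noise standard deviation equal to the population standard deviation of the normalized cross-correlation coefficients is an assumption made for the example, as are the Gaussian distribution and the function name; the disclosure states only that the noise variance may be based, at least in part, on that variance.

```python
import random
import statistics


def add_parameter_noise(estimates, band_xcorr, rng):
    """Add zero-mean Gaussian noise to each estimated spatial parameter."""
    # Assumed mapping: noise sigma equals the spread of the coefficients.
    sigma = statistics.pstdev(band_xcorr)
    return [a + rng.gauss(0.0, sigma) for a in estimates]


rng = random.Random(0)  # seeded so the example is repeatable
noisy = add_parameter_noise([0.72, 0.64, 0.56], [0.8, 0.6, 1.0], rng)
unchanged = add_parameter_noise([0.5, 0.4], [0.7, 0.7], rng)
```

When the coefficients do not vary across bands, the computed standard deviation is zero and the estimates pass through unchanged.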
The method may involve receiving or determining tonality information regarding the second set of frequency coefficients. The applied noise may vary according to the tonality information.
The method may involve measuring per-band energy ratios between bands of the first set of frequency coefficients and bands of the second set of frequency coefficients. The estimated spatial parameters may vary according to the per-band energy ratios. In some implementations, the estimated spatial parameters may vary according to temporal changes of input audio signals. The estimating process may involve operations only on real-valued frequency coefficients.
The process of applying the estimated spatial parameters to the second set of frequency coefficients may be part of a decorrelation process. In some implementations, the decorrelation process may involve generating a reverb signal or a decorrelation signal and applying it to the second set of frequency coefficients. The decorrelation process may involve applying a decorrelation algorithm that operates entirely on real-valued coefficients. The decorrelation process may involve selective or signal-adaptive decorrelation of specific channels. The decorrelation process may involve selective or signal-adaptive decorrelation of specific frequency bands. In some implementations, the first and second sets of frequency coefficients may be results of applying a modified discrete sine transform, a modified discrete cosine transform or a lapped orthogonal transform to audio data in a time domain.
The estimating process may be based, at least in part, on estimation theory. For example, the estimating process may be based, at least in part, on at least one of a maximum likelihood method, a Bayes estimator, a method of moments estimator, a minimum mean squared error estimator or a minimum variance unbiased estimator.
In some implementations, the audio data may be received in a bitstream encoded according to a legacy encoding process. The legacy encoding process may, for example, be a process of the AC-3 audio codec or the Enhanced AC-3 audio codec. Applying the spatial parameters may yield a more spatially accurate audio reproduction than that obtained by decoding the bitstream according to a legacy decoding process that corresponds with the legacy encoding process.
Some implementations involve apparatus that includes an interface and a logic system. The logic system may be configured for: receiving audio data comprising a first set of frequency coefficients and a second set of frequency coefficients; estimating, based on at least part of the first set of frequency coefficients, spatial parameters for at least part of the second set of frequency coefficients; and applying the estimated spatial parameters to the second set of frequency coefficients to generate a modified second set of frequency coefficients.
The apparatus may include a memory device. The interface may be an interface between the logic system and the memory device. Alternatively, the interface may be a network interface.
The first set of frequency coefficients may correspond to a first frequency range and the second set of frequency coefficients may correspond to a second frequency range. The first frequency range may be below the second frequency range. The audio data may include data corresponding to individual channels and a coupled channel. The first frequency range may correspond to an individual channel frequency range and the second frequency range may correspond to a coupled channel frequency range.
The applying process may involve applying the estimated spatial parameters on a per-channel basis. The audio data may include frequency coefficients in the first frequency range for two or more channels. The estimating process may involve calculating combined frequency coefficients of a composite coupling channel based on frequency coefficients of the two or more channels and computing, for at least a first channel, cross-correlation coefficients between frequency coefficients of the first channel and the combined frequency coefficients.
The combined frequency coefficients may correspond to the first frequency range. The cross-correlation coefficients may be normalized cross-correlation coefficients. The first set of frequency coefficients may include audio data for a plurality of channels. The estimating process may involve estimating normalized cross-correlation coefficients for multiple channels of the plurality of channels.
The estimating process may involve dividing the second frequency range into second frequency range bands and computing a normalized cross-correlation coefficient for each second frequency range band. The estimating process may involve dividing the first frequency range into first frequency range bands, averaging the normalized cross-correlation coefficients across all of the first frequency range bands and applying a scaling factor to the average of the normalized cross-correlation coefficients to obtain the estimated spatial parameters.
The process of averaging the normalized cross-correlation coefficients may involve averaging across a time segment of a channel. The logic system may be further configured for the addition of noise to the modified second set of frequency coefficients. The noise may be added to model a variance of the estimated spatial parameters. The variance of noise added by the logic system may be based, at least in part, on a variance in the normalized cross-correlation coefficients. The logic system may be further configured for receiving or determining tonality information regarding the second set of frequency coefficients and varying the applied noise according to the tonality information.
In some implementations, the audio data may be received in a bitstream encoded according to a legacy encoding process. For example, the legacy encoding process may be a process of the AC-3 audio codec or the Enhanced AC-3 audio codec.
Some aspects of this disclosure may be implemented in a non-transitory medium having software stored thereon. The software may include instructions to control an apparatus for: receiving audio data comprising a first set of frequency coefficients and a second set of frequency coefficients; estimating, based on at least part of the first set of frequency coefficients, spatial parameters for at least part of the second set of frequency coefficients; and applying the estimated spatial parameters to the second set of frequency coefficients to generate a modified second set of frequency coefficients.
The first set of frequency coefficients may correspond to a first frequency range and the second set of frequency coefficients may correspond to a second frequency range. The audio data may include data corresponding to individual channels and a coupled channel. The first frequency range may correspond to an individual channel frequency range and the second frequency range may correspond to a coupled channel frequency range. The first frequency range may be below the second frequency range.
The applying process may involve applying the estimated spatial parameters on a per-channel basis. The audio data may include frequency coefficients in the first frequency range for two or more channels. The estimating process may involve calculating combined frequency coefficients of a composite coupling channel based on frequency coefficients of the two or more channels and computing, for at least a first channel, cross-correlation coefficients between frequency coefficients of the first channel and the combined frequency coefficients.
The combined frequency coefficients may correspond to the first frequency range. The cross-correlation coefficients may be normalized cross-correlation coefficients. The first set of frequency coefficients may include audio data for a plurality of channels. The estimating process may involve estimating normalized cross-correlation coefficients for multiple channels of the plurality of channels. The estimating process may involve dividing the second frequency range into second frequency range bands and computing a normalized cross-correlation coefficient for each second frequency range band.
The estimating process may involve: dividing the first frequency range into first frequency range bands; averaging the normalized cross-correlation coefficients across all of the first frequency range bands; and applying a scaling factor to the average of the normalized cross-correlation coefficients to obtain the estimated spatial parameters. The process of averaging the normalized cross-correlation coefficients may involve averaging across a time segment of a channel.
The software also may include instructions for controlling the decoding apparatus to add noise to the modified second set of frequency coefficients in order to model a variance of the estimated spatial parameters. A variance of added noise may be based, at least in part, on a variance in the normalized cross-correlation coefficients. The software also may include instructions for controlling the decoding apparatus to receive or determine tonality information regarding the second set of frequency coefficients. The applied noise may vary according to the tonality information.
In some implementations, the audio data may be received in a bitstream encoded according to a legacy encoding process. For example, the legacy encoding process may be a process of the AC-3 audio codec or the Enhanced AC-3 audio codec.
According to some implementations, a method may involve: receiving audio data corresponding to a plurality of audio channels; determining audio characteristics of the audio data; determining decorrelation filter parameters for the audio data based, at least in part, on the audio characteristics; forming a decorrelation filter according to the decorrelation filter parameters; and applying the decorrelation filter to at least some of the audio data. For example, the audio characteristics may include tonality information and/or transient information.
Determining the audio characteristics may involve receiving explicit tonality information or transient information with the audio data. Determining the audio characteristics may involve determining tonality information or transient information based on one or more attributes of the audio data.
In some implementations, the decorrelation filter may include a linear filter with at least one delay element. The decorrelation filter may include an all-pass filter.
The decorrelation filter parameters may include dithering parameters or randomly selected pole locations for at least one pole of the all-pass filter. For example, the dithering parameters or pole locations may involve a maximum stride value for pole movement. The maximum stride value may be substantially zero for highly tonal signals of the audio data. The dithering parameters or pole locations may be bounded by constraint areas within which pole movements are constrained. In some implementations, the constraint areas may be circles or annuli. In some implementations, the constraint areas may be fixed. In some implementations, different channels of the audio data may share the same constraint areas.
According to some implementations, the poles may be dithered independently for each channel. In some implementations, motions of the poles may not be bounded by constraint areas. In some implementations, the poles may maintain a substantially consistent spatial or angular relationship relative to one another. According to some implementations, a distance from a pole to a center of a z-plane circle may be a function of audio data frequency.
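The pole dithering described above may be sketched as a bounded random walk in the z-plane. The following example assumes a circular constraint area lying inside the unit circle, a uniformly random step, and a hypothetical linear mapping from a tonality measure in [0, 1] to the maximum stride. None of these specific choices is mandated by this disclosure.

```python
import cmath
import math
import random


def stride_for_tonality(tonality, base_stride):
    """Hypothetical mapping: the stride shrinks to zero as tonality nears 1."""
    return base_stride * max(0.0, 1.0 - tonality)


def dither_pole(pole, center, radius, max_stride, rng):
    """Move a complex pole by at most max_stride within a constraint circle.

    The pole is assumed to start inside the circle (center, radius).
    """
    step = cmath.rect(rng.uniform(0.0, max_stride),
                      rng.uniform(0.0, 2.0 * math.pi))
    candidate = pole + step
    offset = candidate - center
    if abs(offset) > radius:
        # Project back onto the boundary of the constraint circle.
        candidate = center + offset * (radius / abs(offset))
    return candidate
```

Because projection onto a disk never lengthens a step that starts inside the disk, each move in this sketch remains within the maximum stride. With a tonality of 1 the stride is zero and the pole stays fixed.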
In some implementations, an apparatus may include an interface and a logic system. In some implementations, the logic system may include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic and/or discrete hardware components.
The logic system may be configured for receiving, from the interface, audio data corresponding to a plurality of audio channels and determining audio characteristics of the audio data. In some implementations, the audio characteristics may include tonality information and/or transient information. The logic system may be configured for determining decorrelation filter parameters for the audio data based, at least in part, on the audio characteristics, forming a decorrelation filter according to the decorrelation filter parameters and applying the decorrelation filter to at least some of the audio data.
The decorrelation filter may include a linear filter with at least one delay element. The decorrelation filter parameters may include dithering parameters or randomly selected pole locations for at least one pole of the decorrelation filter. The dithering parameters or pole locations may be bounded by constraint areas within which pole movements are constrained. The dithering parameters or pole locations may be determined with reference to a maximum stride value for pole movement. The maximum stride value may be substantially zero for highly tonal signals of the audio data.
The apparatus may include a memory device. The interface may be an interface between the logic system and the memory device. Alternatively, the interface may be a network interface.
Some aspects of this disclosure may be implemented in a non-transitory medium having software stored thereon. The software may include instructions for controlling an apparatus to: receive audio data corresponding to a plurality of audio channels; determine audio characteristics of the audio data, the audio characteristics comprising at least one of tonality information or transient information; determine decorrelation filter parameters for the audio data based, at least in part, on the audio characteristics; form a decorrelation filter according to the decorrelation filter parameters; and apply the decorrelation filter to at least some of the audio data. The decorrelation filter may include a linear filter with at least one delay element.
The decorrelation filter parameters may include dithering parameters or randomly selected pole locations for at least one pole of the decorrelation filter. The dithering parameters or pole locations may be bounded by constraint areas within which pole movements are constrained. The dithering parameters or pole locations may be determined with reference to a maximum stride value for pole movement. The maximum stride value may be substantially zero for highly tonal signals of the audio data.
According to some implementations, a method may involve: receiving audio data corresponding to a plurality of audio channels; determining decorrelation filter control information corresponding to a maximum pole displacement of a decorrelation filter; determining decorrelation filter parameters for the audio data based, at least in part, on the decorrelation filter control information; forming the decorrelation filter according to the decorrelation filter parameters; and applying the decorrelation filter to at least some of the audio data.
The audio data may be in the time domain or the frequency domain. Determining the decorrelation filter control information may involve receiving an express indication of the maximum pole displacement.
Determining the decorrelation filter control information may involve determining audio characteristic information and determining the maximum pole displacement based, at least in part, on the audio characteristic information. In some implementations, the audio characteristic information may include at least one of tonality information or transient information.
Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Like reference numbers and designations in the various drawings indicate like elements.
The following description is directed to certain implementations for the purposes of describing some innovative aspects of this disclosure, as well as examples of contexts in which these innovative aspects may be implemented. However, the teachings herein can be applied in various different ways. Although the examples provided in this application are primarily described in terms of the AC-3 audio codec, and the Enhanced AC-3 audio codec (also known as E-AC-3), the concepts provided herein apply to other audio codecs, including but not limited to MPEG-2 AAC and MPEG-4 AAC. Moreover, the described implementations may be embodied in various audio processing devices, including but not limited to encoders and/or decoders, which may be included in mobile telephones, smartphones, desktop computers, hand-held or portable computers, netbooks, notebooks, smartbooks, tablets, stereo systems, televisions, DVD players, digital recording devices and a variety of other devices. Accordingly, the teachings of this disclosure are not intended to be limited to the implementations shown in the figures and/or described herein, but instead have wide applicability.
Some audio codecs, including the AC-3 and E-AC-3 audio codecs (proprietary implementations of which are licensed as “Dolby Digital” and “Dolby Digital Plus”), employ some form of channel coupling to exploit redundancies between channels, encode data more efficiently and reduce the coding bit-rate. For example, with the AC-3 and E-AC-3 codecs, in a coupling channel frequency range beyond a specific “coupling-begin frequency,” the modified discrete cosine transform (MDCT) coefficients of the discrete channels (also referred to herein as “individual channels”) are downmixed to a mono channel, which may be referred to herein as a “composite channel” or a “coupling channel.” Some codecs may form two or more coupling channels.
The AC-3 and E-AC-3 decoders upmix the mono signal of the coupling channel into the discrete channels using scale factors based on coupling coordinates sent in the bitstream. In this manner, the decoder restores a high frequency envelope, but not the phase, of the audio data in the coupling channel frequency range of each channel.
As shown in
Various implementations described herein may mitigate these effects, at least in part. Some such implementations involve novel audio encoding and/or decoding tools. Such implementations may be configured to restore phase diversity of the output channels in frequency regions encoded by channel coupling. In accordance with various implementations, a decorrelated signal may be synthesized from the decoded spectral coefficients in the coupling channel frequency range of each output channel.
However, many other types of audio processing devices and methods are described herein.
In this example, the audio data elements 220a through 220n correspond to a plurality of audio channels 1 through N. Here, the audio data elements 220a through 220n include frequency domain representations corresponding to filterbank coefficients of an audio encoding or processing system, which may be a legacy audio encoding or processing system. However, in alternative implementations, the audio data elements 220a through 220n may correspond to a plurality of frequency bands 1 through N.
In this implementation, all of the audio data elements 220a through 220n are received by both the switch 203 and the decorrelator 205. Here, all of the audio data elements 220a through 220n are processed by the decorrelator 205 to produce decorrelated audio data elements 230a through 230n. Moreover, all of the decorrelated audio data elements 230a through 230n are received by the switch 203.
However, not all of the decorrelated audio data elements 230a through 230n are received by the inverse transform module 255 and converted to time domain audio data 260. Instead, the switch 203 selects which of the decorrelated audio data elements 230a through 230n will be received by the inverse transform module 255. In this example the switch 203 selects, according to the channel, which of the audio data elements 230a through 230n will be received by the inverse transform module 255. Here, for example, the audio data element 230a is received by the inverse transform module 255, whereas the audio data element 230n is not. Instead, the switch 203 sends the audio data element 220n, which has not been processed by the decorrelator 205, to the inverse transform module 255.
In some implementations, the switch 203 may determine whether to send a direct audio data element 220 or a decorrelated audio data element 230 to the inverse transform module 255 according to predetermined settings corresponding to the channels 1 through N. Alternatively, or additionally, the switch 203 may determine whether to send an audio data element 220 or a decorrelated audio data element 230 to the inverse transform module 255 according to channel-specific components of the selection information 207, which may be generated or stored locally, or received with the audio data 220. Accordingly, the audio processing system 200 may provide selective decorrelation of specific audio channels.
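The per-channel selection performed by the switch 203 may be illustrated with a short sketch. The function name and the boolean flag list are hypothetical stand-ins for the channel-specific components of the selection information 207.

```python
def select_outputs(direct_elements, decorrelated_elements, use_decorrelated):
    """Choose, per channel, the direct (220) or decorrelated (230) element.

    use_decorrelated holds one boolean per channel; it is a hypothetical
    stand-in for channel-specific selection information.
    """
    if not (len(direct_elements) == len(decorrelated_elements)
            == len(use_decorrelated)):
        raise ValueError("one entry per channel is required in every list")
    return [
        decorrelated if flag else direct
        for direct, decorrelated, flag in zip(
            direct_elements, decorrelated_elements, use_decorrelated)
    ]
```

For instance, flags of `[True, False]` would forward the decorrelated element for the first channel and the direct element for the second.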
Alternatively, or additionally, the switch 203 may determine whether to send a direct audio data element 220 or a decorrelated audio data element 230 to the inverse transform module 255 according to changes in the audio data 220. For example, the switch 203 may determine which, if any, of the decorrelated audio data elements 230 are sent to the inverse transform module 255 according to signal-adaptive components of the selection information 207, which may indicate transients or tonality changes in the audio data 220. In alternative implementations, the switch 203 may receive such signal-adaptive information from the decorrelator 205. In yet other implementations, the switch 203 may be configured to determine changes in the audio data, such as transients or tonality changes. Accordingly, the audio processing system 200 may provide signal-adaptive decorrelation of specific audio channels.
As noted above, in some implementations the audio data elements 220a through 220n may correspond to a plurality of frequency bands 1 through N. In some such implementations, the switch 203 may determine whether to send an audio data element 220 or a decorrelated audio data element 230 to the inverse transform module 255 according to predetermined settings corresponding to the frequency bands and/or according to received selection information 207. Accordingly, the audio processing system 200 may provide selective decorrelation of specific frequency bands.
Alternatively, or additionally, the switch 203 may determine whether to send a direct audio data element 220 or a decorrelated audio data element 230 to the inverse transform module 255 according to changes in the audio data 220, which may be indicated by the selection information 207 or by information received from the decorrelator 205. In some implementations, the switch 203 may be configured to determine changes in the audio data. Therefore, the audio processing system 200 may provide signal-adaptive decorrelation of specific frequency bands.
Referring again to
The decorrelation process may involve applying a decorrelation filter to a portion of the received audio data elements 220a through 220n to produce filtered audio data elements. The decorrelation process may involve using a non-hierarchal mixer to combine a direct portion of the received audio data (to which no decorrelation filter has been applied) with the filtered audio data according to spatial parameters. For example, a direct portion of the audio data element 220a may be mixed with a filtered portion of the audio data element 220a in an output-channel-specific manner. Some implementations may include an output-channel-specific combiner (e.g., a linear combiner) of decorrelation or reverb signals. Various examples are described below.
In some implementations, the spatial parameters may be determined by audio processing system 200 pursuant to analysis of the received audio data 220. Alternatively, or additionally, the spatial parameters may be received in a bitstream, along with the audio data 220 as part or all of the decorrelation information 240. In some implementations the decorrelation information 240 may include correlation coefficients between individual discrete channels and a coupling channel, correlation coefficients between individual discrete channels, explicit tonality information and/or transient information. The decorrelation process may involve decorrelating at least a portion of the audio data 220 based, at least in part, on the decorrelation information 240. Some implementations may be configured to use both locally determined and received spatial parameters and/or other decorrelation information. Various examples are described below.
The decorrelator 205 applies a decorrelation process to at least a portion of the audio data elements 220a through 220n. For example, the decorrelation process may involve generating reverb signals or decorrelation signals by applying linear filters to at least a portion of the audio data elements 220a through 220n. The decorrelation process may be performed, at least in part, according to decorrelation information 240 received by the decorrelator 205. For example, the decorrelation information 240 may be received in a bitstream along with the frequency domain representations of the audio data elements 220a through 220n. Alternatively, or additionally, at least some decorrelation information may be determined locally, e.g., by the decorrelator 205.
The inverse transform module 255 applies an inverse transform to produce the time domain audio data 260. In this example, the inverse transform module 255 applies an inverse transform equivalent to a perfect reconstruction, critically-sampled filterbank. The perfect reconstruction, critically-sampled filterbank may correspond to that applied to audio data in the time domain (e.g., by an encoding device) to produce the frequency domain representations of the audio data elements 220a through 220n.
The upmixer 225 also receives coupling coordinates 212 for each channel and coupling channel frequency range. In this implementation, scaling information, in the form of coupling coordinates 212, has been computed in a Dolby Digital or Dolby Digital Plus encoder in an exponent-mantissa form. The upmixer 225 may compute frequency coefficients for each output channel by multiplying the coupling channel frequency coefficients by the coupling coordinates for that channel.
In this implementation, the upmixer 225 outputs decoupled MDCT coefficients of individual channels in the coupling channel frequency range to the decorrelator 205. Accordingly, in this example the audio data 220 that are input to the decorrelator 205 include MDCT coefficients.
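The decoupling performed by the upmixer 225 may be sketched as a per-band scaling of the coupling channel coefficients. This simplified example assumes that the coupling coordinates have already been converted from exponent-mantissa form to linear gains and that one coordinate applies to each band. The names `band_edges` and `coupling_coords` are hypothetical, and codec-specific details are omitted.

```python
def decouple(coupling_coeffs, coupling_coords, band_edges):
    """Upmix coupling-channel coefficients to individual channels.

    coupling_coeffs: coefficients of the coupling channel
    coupling_coords: coupling_coords[ch][band], linear gains (assumption)
    band_edges:      (start, stop) bin index pairs, one per band
    """
    channels = []
    for coords in coupling_coords:
        coeffs = []
        for (start, stop), gain in zip(band_edges, coords):
            # Scale every coefficient in the band by this channel's coordinate.
            coeffs.extend(c * gain for c in coupling_coeffs[start:stop])
        channels.append(coeffs)
    return channels
```

Every output channel in this sketch is a scaled copy of the same coupling channel within each band, so the channels share the same phase there and differ only in level.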
In the example shown in
In this example, decorrelation information 240 is received by the decorrelator 205. The type of decorrelation information 240 received may vary according to the implementation. In some implementations, the decorrelation information 240 may include explicit, decorrelator-specific control information and/or explicit information that may form the basis of such control information. The decorrelation information 240 may, for example, include spatial parameters such as correlation coefficients between individual discrete channels and a coupling channel and/or correlation coefficients between individual discrete channels. Such explicit decorrelation information 240 also may include explicit tonality information and/or transient information. This information may be used to determine, at least in part, decorrelation filter parameters for the decorrelator 205.
However, in alternative implementations, no such explicit decorrelation information 240 is received by the decorrelator 205. According to some such implementations, the decorrelation information 240 may include information from a bitstream of a legacy audio codec. For example, the decorrelation information 240 may include time segmentation information that is available in a bitstream encoded according to the AC-3 audio codec or the E-AC-3 audio codec. The decorrelation information 240 may include coupling-in-use information, block-switching information, exponent information, exponent strategy information, etc. Such information may have been received by an audio processing system in a bitstream along with audio data 210.
In some implementations, the decorrelator 205 (or another element of the audio processing system 200) may determine spatial parameters, tonality information and/or transient information based on one or more attributes of the audio data. For example, the audio processing system 200 may determine spatial parameters for frequencies in the coupling channel frequency range based on the audio data 245a or 245b, outside of the coupling channel frequency range. Alternatively, or additionally, the audio processing system 200 may determine tonality information based on information from a bitstream of a legacy audio codec. Some such implementations will be described below.
In this example, the N-to-M upmixer/downmixer 262 may be configured to upmix or downmix the audio data for N channels to audio data for M channels, according to the mixing information 266. However, in some implementations, the N-to-M upmixer/downmixer 262 may be a pass-through element. In such implementations, N=M. The mixing information 266 may include N-to-M mixing equations. The mixing information 266 may, for example, be received by the audio processing system 200 in a bitstream along with the decorrelation information 240, frequency domain representations corresponding to a coupling channel, etc. In this example, the decorrelation information 240 that is received by the decorrelator 205 indicates that the decorrelator 205 should output M channels of the decorrelated audio data 230 to the switch 203.
The switch 203 may determine, according to the selection information 207, whether the direct audio data from the N-to-M upmixer/downmixer 262 or the decorrelated audio data 230 will be forwarded to the M-to-K upmixer/downmixer 264. The M-to-K upmixer/downmixer 264 may be configured to upmix or downmix the audio data for M channels to audio data for K channels, according to the mixing information 268. In such implementations, the mixing information 268 may include M-to-K mixing equations. For implementations in which N=M, the M-to-K upmixer/downmixer 264 may upmix or downmix the audio data for N channels to audio data for K channels according to the mixing information 268. In such implementations, the mixing information 268 may include N-to-K mixing equations. The mixing information 268 may, for example, be received by the audio processing system 200 in a bitstream along with the decorrelation information 240 and other data.
The N-to-M, M-to-K or N-to-K mixing equations may be upmixing or downmixing equations. The N-to-M, M-to-K or N-to-K mixing equations may be a set of linear combination coefficients that map input audio signals to output audio signals. According to some such implementations, the M-to-K mixing equations may be stereo downmixing equations. For example, the M-to-K upmixer/downmixer 264 may be configured to downmix audio data for 4, 5, 6, or more channels to audio data for 2 channels, according to the M-to-K mixing equations in the mixing information 268. In some such implementations, audio data for a left channel (“L”), a center channel (“C”) and a left surround channel (“Ls”) may be combined, according to the M-to-K mixing equations, into a left stereo output channel Lo. Audio data for a right channel (“R”), the center channel and a right surround channel (“Rs”) may be combined, according to the M-to-K mixing equations, into a right stereo output channel Ro. For example, the M-to-K mixing equations may be as follows:
Lo=L+0.707C+0.707Ls
Ro=R+0.707C+0.707Rs
Alternatively, the M-to-K mixing equations may be as follows:
Lo=L+(−3 dB)*C+att*Ls
Ro=R+(−3 dB)*C+att*Rs,
where att may, for example, represent a value such as −3 dB, −6 dB, −9 dB or zero. For implementations in which N=M, the foregoing equations may be considered N-to-K mixing equations.
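The stereo downmixing equations above may be sketched as follows. The example converts decibel values to linear gains using 10^(dB/20), so that −3 dB is approximately 0.707. It also assumes that an `att_db` of `None` denotes a gain of zero, which is one possible reading of "zero" in the foregoing description. The function and parameter names are hypothetical.

```python
def db_to_gain(db):
    """Convert a level in decibels to a linear amplitude gain."""
    return 10.0 ** (db / 20.0)


def stereo_downmix(L, R, C, Ls, Rs, center_db=-3.0, att_db=-3.0):
    """Downmix L, R, C, Ls and Rs sample lists to the stereo pair (Lo, Ro).

    att_db=None is treated as a surround gain of zero (assumption).
    """
    center_gain = db_to_gain(center_db)
    surround_gain = 0.0 if att_db is None else db_to_gain(att_db)
    Lo = [l + center_gain * c + surround_gain * ls
          for l, c, ls in zip(L, C, Ls)]
    Ro = [r + center_gain * c + surround_gain * rs
          for r, c, rs in zip(R, C, Rs)]
    return Lo, Ro
```

With the default arguments, this sketch closely approximates the first pair of equations, in which the center and surround channels are each weighted by 0.707.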
In this example, the decorrelation information 240 that is received by the decorrelator 205 indicates that the audio data for M channels will subsequently be upmixed or downmixed to K channels. The decorrelator 205 may be configured to use a different decorrelation process, depending on whether the data for M channels will subsequently be upmixed or downmixed to audio data for K channels. Accordingly, the decorrelator 205 may be configured to determine decorrelation filtering processes based, at least in part, on the M-to-K mixing equations. For example, if the M channels will subsequently be downmixed to K channels, different decorrelation filters may be used for channels that will be combined in the subsequent downmix. According to one such example, if the decorrelation information 240 indicates that audio data for L, R, Ls and Rs channels will be downmixed to 2 channels, one decorrelation filter may be used for both the L and the R channels and another decorrelation filter may be used for both the Ls and Rs channels.
In some implementations, M=K. In such implementations, the M-to-K upmixer/downmixer 264 may be a pass-through element.
However, in other implementations, M>K. In such implementations, the M-to-K upmixer/downmixer 264 may function as a downmixer. According to some such implementations, a less computationally intensive method of generating the decorrelated downmix may be used. For example, the decorrelator 205 may be configured to generate the decorrelated audio data 230 only for channels that the switch 203 will send to the inverse transform module 255. For example, if N=6, and M=2, the decorrelator 205 may be configured to generate the decorrelated audio data 230 for only 2 downmixed channels. In the process, the decorrelator 205 may use decorrelation filters for only 2 channels rather than 6, reducing complexity. Corresponding mixing information may be included in the decorrelation information 240, the mixing information 266 and the mixing information 268. Accordingly, the decorrelator 205 may be configured to determine decorrelation filtering processes based, at least in part, on the N-to-M, N-to-K or M-to-K mixing equations.
In this example, audio data 220 are input to the decorrelation signal generator 218 and the mixer 215. The audio data 220 may correspond to a plurality of audio channels. For example, the audio data 220 may include data resulting from channel coupling during an audio encoding process that has been upmixed prior to being received by the decorrelator 205. In some embodiments, the audio data 220 may be in the time domain, whereas in other embodiments the audio data 220 may be in the frequency domain. For example, the audio data 220 may include time sequences of transform coefficients.
The decorrelation signal generator 218 may form one or more decorrelation filters, apply the decorrelation filters to the audio data 220 and provide the resulting decorrelation signals 227 to the mixer 215. In this example, the mixer combines the audio data 220 with the decorrelation signals 227 to produce decorrelated audio data 230.
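One way in which the mixer 215 might combine the audio data 220 with the decorrelation signals 227 is sketched below. The mixing rule shown, a weight `alpha` on the direct signal and a weight sqrt(1 − alpha²) on the decorrelation signal, is an assumption made for illustration. It preserves signal power when the two inputs are uncorrelated and have equal power. The parameter `alpha` is a hypothetical stand-in for a spatial parameter.

```python
import math


def mix_direct_and_decorrelated(direct, decorrelation_signal, alpha):
    """Combine direct and decorrelation signals for one channel.

    alpha: hypothetical spatial parameter in [-1, 1]; 1 passes only the
    direct signal, 0 passes only the decorrelation signal.
    """
    if not -1.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [-1, 1]")
    beta = math.sqrt(max(0.0, 1.0 - alpha * alpha))
    return [alpha * x + beta * d
            for x, d in zip(direct, decorrelation_signal)]
```

A different `alpha` could be used for each output channel, so that the same decorrelation signals yield channel-specific outputs.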
In some embodiments, the decorrelation signal generator 218 may determine decorrelation filter control information for a decorrelation filter. According to some such embodiments, the decorrelation filter control information may correspond to a maximum pole displacement of the decorrelation filter. The decorrelation signal generator 218 may determine decorrelation filter parameters for the audio data 220 based, at least in part, on the decorrelation filter control information.
In some implementations, determining the decorrelation filter control information may involve receiving an express indication of the decorrelation filter control information (for example, an express indication of a maximum pole displacement) with the audio data 220. In alternative implementations, determining the decorrelation filter control information may involve determining audio characteristic information and determining decorrelation filter parameters (such as a maximum pole displacement) based, at least in part, on the audio characteristic information. In some implementations, the audio characteristic information may include spatial information, tonality information and/or transient information.
Some implementations of the decorrelator 205 will now be described in more detail with reference to
In this example, the process 300 begins when a decorrelator receives audio data (block 305). As described above with reference to
In block 310, decorrelation filter control information is determined. The decorrelation filter control information may, for example, be determined according to audio characteristics of the audio data. In some implementations, such as the example shown in
In the embodiment shown in
In some implementations, no explicit spatial information, tonality information or transient information is received by the decorrelator 205. In some such implementations, a transient control module of the decorrelator 205 (or another element of an audio processing system) may be configured to determine transient information based on one or more attributes of the audio data. A spatial parameter module of the decorrelator 205 may be configured to determine spatial parameters based on one or more attributes of the audio data. Some examples are described elsewhere herein.
In block 315 of
In the implementation shown in
In this embodiment, the decorrelation filter 410 includes a fixed delay 415 followed by the time-varying portion 420, which is an all-pass filter in this example. In some embodiments, the decorrelation signal generator 218 may include a bank of all-pass filters. For example, in some embodiments wherein the audio data 220 is in the frequency domain, the decorrelation signal generator 218 may include an all-pass filter for each of a plurality of frequency bins. However, in alternative implementations, the same filter may be applied to each frequency bin. Alternatively, frequency bins may be grouped and the same filter may be applied to each group. For example, the frequency bins may be grouped into frequency bands, may be grouped by channel and/or grouped by frequency band and by channel.
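As a rough illustration of a fixed delay followed by an all-pass portion, the following Python sketch filters the time sequence of coefficients for a single frequency bin. The function names, the first-order section form and the example pole values are assumptions made for illustration only and are not taken from this disclosure.

```python
def allpass_section(x, pole):
    """First-order all-pass section H(z) = (z^-1 - conj(p)) / (1 - p z^-1)."""
    y, x_prev, y_prev = [], 0j, 0j
    for xn in x:
        yn = -pole.conjugate() * xn + x_prev + pole * y_prev
        y.append(yn)
        x_prev, y_prev = xn, yn
    return y


def decorrelation_filter(x, poles, delay):
    """Apply a fixed delay, then a cascade of all-pass sections (one per pole)."""
    delayed = ([0.0] * delay + list(x))[:len(x)]
    out = [complex(v) for v in delayed]
    for p in poles:
        out = allpass_section(out, p)
    return out


# Hypothetical pole set: a complex-conjugate pair plus one real pole, all
# inside the unit circle, so the overall cascade has real coefficients.
poles = [0.5 + 0.3j, 0.5 - 0.3j, 0.4]
impulse = [1.0] + [0.0] * 511
response = decorrelation_filter(impulse, poles, delay=3)
```

Because each section has unit magnitude response, the cascade alters only the phase of its input, which is why an all-pass structure may be suitable for generating decorrelation signals.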
The amount of the fixed delay may be selectable, e.g., by a logic device and/or according to user input. In order to introduce controlled chaos into the decorrelation signals 227, the decorrelation filter control 405 may apply decorrelation filter parameters to control the poles of the all-pass filter(s) so that one or more of the poles move randomly or pseudo-randomly in a constrained region.
Accordingly, the decorrelation filter parameters may include parameters for moving at least one pole of the all-pass filter. Such parameters may include parameters for dithering one or more poles of the all-pass filter. Alternatively, the decorrelation filter parameters may include parameters for selecting a pole location from among a plurality of predetermined pole locations for each pole of the all-pass filter. At a predetermined time interval (for example, once every Dolby Digital Plus block), a new location for each pole of the all-pass filter may be chosen randomly or pseudo-randomly.
Some such implementations will now be described with reference to
In this example, the constraint areas 510a, 510b and 510c are circular. The initial (or “seed”) locations of the poles 505a, 505b and 505c are indicated by the circles in the centers of the constraint areas 510a, 510b and 510c. In the example of
However, other implementations may include more or fewer poles. Alternative implementations also may include constraint areas of different sizes or shapes. Some examples are shown in
In some implementations, different channels of the audio data share the same constraint areas. However, in alternative implementations, channels of the audio data do not share the same constraint areas. Whether or not channels of the audio data share the same constraint areas, the poles may be dithered (or otherwise moved) independently for each audio channel.
A sample trajectory of the pole 505a is indicated by arrows within the constraint area 510a. Each arrow represents a movement or “stride” 520 of the pole 505a. Although not shown in
In some implementations, the movement of a pole may be controlled by changing a maximum stride value. The maximum stride value may correspond to a maximum pole displacement from the most recent pole location. The maximum stride value may define a circle having a radius equal to the maximum stride value.
One such example is shown in
In some implementations, the maximum stride value may be reset after each stride. In other implementations, the maximum stride value may be reset after multiple strides and/or according to changes in the audio data.
The maximum stride value may be determined and/or controlled in various ways. In some implementations, the maximum stride value may be based, at least in part, on one or more attributes of the audio data to which the decorrelation filter will be applied.
For example, the maximum stride value may be based, at least in part, on tonality information and/or transient information. According to some such implementations, the maximum stride value may be at or near zero for highly tonal signals of the audio data (such as audio data for a pitch pipe, a harpsichord, etc.), which causes little or no variation in the poles to occur. In some implementations, the maximum stride value may be at or near zero at the instant of an attack in a transient signal (such as audio data for an explosion, a door slam, etc.). Subsequently (for example, over a time period of a few blocks), the maximum stride value may be ramped to a larger value.
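The behavior described above may be sketched as a simple schedule. In this hypothetical Python example, the linear ramp, the default values of `full_stride` and `ramp_blocks`, and the function name are all assumptions; the disclosure states only that the value may be at or near zero for tonal signals and at a transient attack, and may subsequently be ramped to a larger value.

```python
def max_stride(blocks_since_attack, is_tonal, full_stride=0.05, ramp_blocks=4):
    """Return a maximum stride value for the current block.

    blocks_since_attack: blocks elapsed since the last transient attack,
                         or None if no recent transient was detected.
    """
    if is_tonal:
        return 0.0  # highly tonal audio: little or no pole variation
    if blocks_since_attack is None:
        return full_stride
    # Assumed linear ramp from zero at the attack up to full_stride.
    fraction = min(max(blocks_since_attack, 0) / ramp_blocks, 1.0)
    return full_stride * fraction
```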
In some implementations, tonality and/or transient information may be detected at the decoder, based on one or more attributes of the audio data. For example, tonality and/or transient information may be determined according to one or more attributes of the audio data by a module such as the control information receiver/generator 640, which is described below with reference to
In this implementation, the movement of a pole may be controlled according to dithering parameters. Accordingly, while the movement of a pole may be constrained according to a maximum stride value, the direction and/or extent of the pole movement may include a random or quasi-random component. For example, the movement of a pole may be based, at least in part, on the output of a random number generator or pseudo-random number generator algorithm implemented in software. Such software may be stored on a non-transitory medium and executed by a logic system.
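One possible way to dither a pole within a maximum stride circle and a circular constraint area is sketched below in Python. The uniform sampling of the displacement, the projection of out-of-bounds candidates back onto the constraint boundary, and all names and numeric values are assumptions made for illustration.

```python
import cmath
import math
import random


def dither_pole(pole, center, radius, max_stride, rng):
    """Move a pole by a pseudo-random stride, staying inside the constraint area."""
    # Displacement drawn uniformly from the maximum stride circle.
    step = max_stride * math.sqrt(rng.random())
    angle = 2.0 * math.pi * rng.random()
    candidate = pole + cmath.rect(step, angle)
    # Assumed boundary handling: project back onto the constraint circle.
    offset = candidate - center
    if abs(offset) > radius:
        candidate = center + offset * (radius / abs(offset))
    return candidate


rng = random.Random(0)          # pseudo-random number generator
center = 0.6 + 0.4j             # hypothetical seed location of the pole
pole = center
trajectory = [pole]
for _ in range(200):
    pole = dither_pole(pole, center, radius=0.1, max_stride=0.03, rng=rng)
    trajectory.append(pole)
```

Because the constraint area is convex and contains the current pole location, the projection step never produces a stride longer than the maximum stride value.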
However, in alternative implementations the decorrelation filter parameters may not involve dithering parameters. Instead, pole movement may be restricted to predetermined pole locations. For example, a number of predetermined pole locations may lie within a radius defined by a maximum stride value. A logic system may randomly or pseudo-randomly select one of these predetermined pole locations as the next pole location.
Various other methods may be employed to control pole movement. In some implementations, if a pole is approaching the boundary of a constraint area, the selection of pole movements may be biased towards new pole locations that are closer to the center of the constraint area. For example, if the pole 505a moves towards the boundary of the constraint area 510a, the center of the maximum stride circle 525 may be shifted inwards towards the center of the constraint area 510a, so that the maximum stride circle 525 always lies within the boundary of the constraint area 510a.
In some such implementations, a weight function may be applied in order to create a bias that tends to move a pole location away from a constraint area boundary. For example, predetermined pole locations within the maximum stride circle 525 may not be assigned equal probabilities of being selected as the next pole location. Instead, predetermined pole locations that are closer to the center of the constraint area may be assigned a higher probability than predetermined pole locations that are relatively farther from the center of the constraint area. According to some such implementations, when the pole 505a is close to the boundary of the constraint area 510a, it is more likely that the next pole movement will be towards the center of the constraint area 510a.
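A weighted selection among predetermined pole locations might be sketched as follows. The particular weight function (inversely related to distance from the center of the constraint area), the function name and the example values are hypothetical; any weight function that favors locations nearer the center could be substituted.

```python
import random


def choose_next_pole(current, candidates, center, max_stride, rng):
    """Pick the next pole from predetermined locations within the stride circle."""
    reachable = [c for c in candidates if abs(c - current) <= max_stride]
    if not reachable:
        return current
    # Assumed weight function: closer to the center means higher probability.
    weights = [1.0 / (1e-6 + abs(c - center)) for c in reachable]
    return rng.choices(reachable, weights=weights, k=1)[0]


# Hypothetical example: the pole sits at the outermost predetermined location.
center = complex(0.0)
candidates = [complex(0.0), complex(0.05), complex(0.10)]
rng = random.Random(0)
picks = [choose_next_pole(complex(0.10), candidates, center, 0.06, rng)
         for _ in range(3000)]
```

In this example only the two outer locations are reachable, and the inner of the two is selected roughly twice as often as the outer one.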
In this example, locations of the pole 505b also change, but are controlled such that the pole 505b continues to remain real. Accordingly, locations of the pole 505b are constrained to lie along the diameter 530 of the constraint area 510b. In alternative implementations, however, the pole 505b may be moved to locations that have an imaginary component.
In yet other implementations, the locations of all poles may be constrained to move only along radii. In some such implementations, changes in pole location only increase or decrease the magnitude of the poles, but do not affect their phase. Such implementations may be useful, for example, for imparting a selected reverberation time constant.
Poles for frequency coefficients corresponding to higher frequencies may be relatively closer to the center of the unit circle 515 than poles for frequency coefficients corresponding to lower frequencies. We will use
The poles for the filter used at any other frequency ƒ are obtained in this example by scaling the poles z1, z2 and z1* by a factor a(ƒ)/a(ƒ0), where a(ƒ) is a function that decreases with the audio data frequency ƒ. When ƒ=ƒ0, the scaling factor is equal to 1 and the poles are at the expected locations. According to some such implementations, smaller group delays may be applied to frequency coefficients corresponding to higher frequencies than to frequency coefficients corresponding to lower frequencies. In the embodiment described here, the poles are dithered at one frequency and scaled to obtain pole locations for other frequencies. The frequency ƒ0 could be, for instance, the coupling begin frequency. In alternative implementations, the poles could be separately dithered at each frequency, and the constraint areas (510a, 510b, and 510c) may be substantially closer to the origin at higher frequencies compared to lower frequencies.
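The frequency-dependent scaling of the poles may be sketched in Python as follows. The specific form of a(ƒ) used here, the reference frequency `f_ref` and the example pole values are assumptions; the disclosure requires only that a(ƒ) decrease with frequency.

```python
def a(f, f_ref=8000.0):
    """Hypothetical decreasing function of frequency f (in Hz)."""
    return 1.0 / (1.0 + f / f_ref)


def scale_poles(poles_at_f0, f, f0):
    """Scale poles dithered at frequency f0 to obtain poles for frequency f."""
    factor = a(f) / a(f0)
    return [p * factor for p in poles_at_f0]


# Hypothetical poles z1, z2 and z1* at the reference frequency f0.
poles_f0 = [0.6 + 0.4j, 0.5 + 0.0j, 0.6 - 0.4j]
poles_high = scale_poles(poles_f0, f=12000.0, f0=4000.0)
```

Scaling by a real factor changes only the magnitudes of the poles, so a complex-conjugate pair remains a conjugate pair, and poles for higher frequencies lie closer to the origin.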
According to various implementations described herein, poles 505 may be moveable, but may maintain a substantially consistent spatial or angular relationship relative to one another. In some such implementations, movements of the poles 505 may not be limited according to constraint areas.
In this example, the pole 505b is constrained to move along the real axis. In some such implementations, the poles 505a and 505c also may be moveable towards or away from the center of the unit circle 515, e.g., as described above with reference to
In the examples shown in
Returning now to
Here, the mixer 215 and the decorrelation signal generator 218 also receive various types of decorrelation information. In some implementations, at least some of the decorrelation information may be received in a bitstream along with the audio data elements 220. Alternatively, or additionally, at least some of the decorrelation information may be determined locally, e.g., by other components of the decorrelator 205 or by one or more other components of the audio processing system 200.
In this example, the received decorrelation information includes decorrelation signal generator control information 625. The decorrelation signal generator control information 625 may include decorrelation filter information, gain information, input control information, etc. The decorrelation signal generator 218 produces the decorrelation signals 227 based, at least in part, on the decorrelation signal generator control information 625.
Here, the received decorrelation information also includes transient control information 430. Various examples of how the decorrelator 205 may use and/or generate the transient control information 430 are provided elsewhere in this disclosure.
In this implementation, the mixer 215 includes the synthesizer 605 and the direct signal and decorrelation signal mixer 610. In this example, the synthesizer 605 is an output-channel-specific combiner of decorrelation or reverb signals, such as the decorrelation signals 227 received from the decorrelation signal generator 218. According to some such implementations, the synthesizer 605 may be a linear combiner of the decorrelation or reverb signals. In this example, the decorrelation signals 227 correspond to audio data elements 220 for a plurality of channels, to which one or more decorrelation filters have been applied by the decorrelation signal generator. Accordingly, the decorrelation signals 227 also may be referred to herein as “filtered audio data” or “filtered audio data elements.”
Here, the direct signal and decorrelation signal mixer 610 is an output-channel-specific combiner of the filtered audio data elements with the “direct” audio data elements 220 corresponding to a plurality of channels, to produce the decorrelated audio data 230. Accordingly, the decorrelator 205 may provide channel-specific and non-hierarchical decorrelation of audio data.
In this example, the synthesizer 605 combines the decorrelation signals 227 according to the decorrelation signal synthesizing parameters 615, which also may be referred to herein as “decorrelation signal synthesizing coefficients.” Similarly, the direct signal and decorrelation signal mixer 610 combines the direct and filtered audio data elements according to the mixing coefficients 620. The decorrelation signal synthesizing parameters 615 and the mixing coefficients 620 may be based, at least in part, on the received decorrelation information.
Here, the received decorrelation information includes the spatial parameter information 630, which is channel-specific in this example. In some implementations, the mixer 215 may be configured to determine the decorrelation signal synthesizing parameters 615 and/or the mixing coefficients 620 based, at least in part, on the spatial parameter information 630. In this example, the received decorrelation information also includes downmix/upmix information 635. For example, the downmix/upmix information 635 may indicate how many channels of audio data were combined to produce downmixed audio data, which may correspond to one or more coupling channels in a coupling channel frequency range. The downmix/upmix information 635 also may indicate a number of desired output channels and/or characteristics of the output channels. As described above with reference to
In this implementation, the control information receiver/generator 640 determines the decorrelation signal generator control information 625 and the mixer control information 645 according to the decorrelation information 240 and/or the audio data elements 220 and/or 245. Some examples of the control information receiver/generator 640 and its functionality are described below.
The control information receiver/generator 640 may have different functionality, according to the specific implementation. In this implementation, the control information receiver/generator 640 includes a filter control module 650, a transient control module 655, a mixer control module 660 and a spatial parameter module 665. As with other components of the audio processing system 200, the elements of the control information receiver/generator 640 may be implemented via hardware, firmware, software stored on a non-transitory medium and/or combinations thereof. In some implementations, these components may be implemented by a logic system such as described elsewhere in this disclosure.
The filter control module 650 may, for example, be configured to control the decorrelation signal generator as described above with reference to
In this example, the control information receiver/generator 640 receives the audio data elements 220 and 245, which may include at least a portion of the audio data received by switch 203 and/or the decorrelator 205. The audio data elements 220 are received by the mixer 215 and the decorrelation signal generator 218. In some implementations, the audio data elements 220 may correspond to audio data in a coupling channel frequency range, whereas the audio data elements 245 may correspond to audio data that is in a frequency range outside of the coupling channel frequency range. For example, the audio data elements 245 may correspond to audio data that is in a frequency range above and/or below that of the coupling channel frequency range.
In this implementation, the control information receiver/generator 640 determines the decorrelation signal generator control information 625 and the mixer control information 645 according to the decorrelation information 240, the audio data elements 220 and/or the audio data elements 245. The control information receiver/generator 640 provides the decorrelation signal generator control information 625 and the mixer control information 645 to the decorrelation signal generator 218 and the mixer 215, respectively.
In some implementations, the control information receiver/generator 640 may be configured to determine tonality information and to determine the decorrelation signal generator control information 625 and/or the mixer control information 645 based, at least in part, on the tonality information. For example, the control information receiver/generator 640 may be configured to receive explicit tonality information, such as tonality flags, as part of the decorrelation information 240. The control information receiver/generator 640 may be configured to process the received explicit tonality information and to determine tonality control information.
For example, if the control information receiver/generator 640 determines that the audio data in the coupling channel frequency range is highly tonal, the control information receiver/generator 640 may be configured to provide decorrelation signal generator control information 625 indicating that the maximum stride value should be set to zero or nearly zero, which causes little or no variation in the poles to occur. Subsequently (for example, over a time period of a few blocks), the maximum stride value may be ramped to a larger value. In some implementations, if the control information receiver/generator 640 determines that the audio data in the coupling channel frequency range is highly tonal, the control information receiver/generator 640 may be configured to indicate to the spatial parameter module 665 that a relatively higher degree of smoothing may be applied in calculating various quantities, such as energies used in the estimation of spatial parameters. Other examples of responses to determining highly tonal audio data are provided elsewhere herein.
In some implementations, the control information receiver/generator 640 may be configured to determine tonality information according to one or more attributes of the audio data 220 and/or according to information from a bitstream of a legacy audio codec that is received via the decorrelation information 240, such as exponent information and/or exponent strategy information.
For example, in the bitstream of audio data encoded according to the E-AC-3 audio codec, the exponents for transform coefficients are differentially coded. The sum of absolute exponent differences in a frequency range is a measure of distance travelled along the spectral envelope of the signal in a log-magnitude domain. Signals such as pitch-pipe and harpsichord have a picket-fence spectrum and hence the path along which this distance is measured is characterized by many peaks and valleys. Thus, for such signals the distance travelled along the spectral envelope in the same frequency range is larger than for signals corresponding to, e.g., applause or rain, which have a relatively flat spectrum.
Therefore, in some implementations the control information receiver/generator 640 may be configured to determine a tonality metric based, at least in part, on exponent differences in the coupling channel frequency range. For example, the control information receiver/generator 640 may be configured to determine a tonality metric based on the average absolute exponent difference in the coupling channel frequency range. According to some such implementations, the tonality metric is only calculated when the coupling exponent strategy is shared for all blocks in a frame and does not indicate exponent frequency sharing, in which case it is meaningful to define the exponent difference from one frequency bin to the next. According to some implementations, the tonality metric is only calculated if the E-AC-3 adaptive hybrid transform (“AHT”) flag is set for the coupling channel.
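A tonality metric based on the average absolute exponent difference may be sketched in Python as follows. The function name and the representation of the exponents as a list of integers, one per frequency bin of the coupling channel frequency range, are assumptions made for illustration.

```python
def tonality_metric(exponents):
    """Average absolute difference between exponents of adjacent frequency bins."""
    diffs = [abs(b - a) for a, b in zip(exponents, exponents[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0


# A flat spectral envelope versus a picket-fence envelope (hypothetical values).
flat = tonality_metric([5, 5, 5, 5, 5])
picket = tonality_metric([4, 6, 4, 6, 4])
```

A flat envelope yields a metric of zero, whereas an envelope with many peaks and valleys yields a larger value.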
If the tonality metric is determined as the absolute exponent difference of E-AC-3 audio data, in some implementations the tonality metric may take a value between 0 and 2, because −2, −1, 0, 1, and 2 are the only exponent differences allowed according to E-AC-3. One or more tonality thresholds may be set in order to differentiate tonal and non-tonal signals. For example, some implementations involve setting one threshold for entering a tonality state and another threshold for exiting the tonality state. The threshold for exiting the tonality state may be lower than the threshold for entering the tonality state. Such implementations provide a degree of hysteresis, such that tonality values slightly below the upper threshold will not inadvertently cause a tonality state change. In one example, the threshold for exiting the tonality state is 0.40, whereas the threshold for entering the tonality state is 0.45. However, other implementations may include more or fewer thresholds, and the thresholds may have different values.
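The two-threshold hysteresis described above may be sketched as a small state tracker. The threshold values 0.45 and 0.40 are those of the example given above; the function name and the choice of inclusive versus strict comparisons are assumptions.

```python
def track_tonality_state(metrics, enter=0.45, exit_=0.40):
    """Return the tonality state (True = tonal) after each metric value."""
    state, states = False, []
    for m in metrics:
        if not state and m >= enter:
            state = True       # enter the tonality state
        elif state and m < exit_:
            state = False      # exit the tonality state
        states.append(state)
    return states


states = track_tonality_state([0.30, 0.46, 0.42, 0.39, 0.44])
```

In this example the value 0.42 does not end the tonality state, and the value 0.44 does not start it again, because neither value crosses the relevant threshold.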
In some implementations, the tonality metric calculation may be weighted according to the energy present in the signal. This energy may be derived directly from the exponents. The log energy metric may be inversely proportional to the exponents, because the exponents are represented as negative powers of two in E-AC-3. According to such implementations, those parts of the spectrum that are low in energy will contribute less to the overall tonality metric than those parts of the spectrum that are high in energy. In some implementations, the tonality metric calculation may only be performed on block zero of a frame.
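An energy-weighted variant of the tonality metric may be sketched as follows. The weight used here, 2 raised to the negative sum of the two adjacent exponents, is a hypothetical energy proxy chosen only because larger exponents denote smaller coefficient magnitudes; the disclosure does not specify a particular weighting.

```python
def weighted_tonality_metric(exponents):
    """Average absolute exponent difference, weighted by an energy proxy."""
    num = den = 0.0
    for a, b in zip(exponents, exponents[1:]):
        weight = 2.0 ** (-(a + b))  # assumed proxy: high exponent = low energy
        num += weight * abs(b - a)
        den += weight
    return num / den if den else 0.0


# Hypothetical envelope: a flat high-energy region, then a low-energy
# region with alternating exponents.
exps = [0, 0, 0, 0, 2, 4, 6, 8, 10, 8, 10, 8, 10]
weighted = weighted_tonality_metric(exps)
```

With this weighting, the low-energy fluctuations at the end of the example contribute far less to the result than the flat high-energy region at the beginning.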
In the example shown in
In various implementations of the audio processing system 200, the control information receiver/generator 640 may be configured to determine one or more types of spatial parameters of the audio data 220. In some implementations, at least some such functionality may be provided by the spatial parameter module 665 shown in
Other spatial parameters may be inter-channel correlation coefficients that indicate a correlation between pairs of individual discrete channels. Such parameters may sometimes be referred to herein as reflecting “inter-channel coherence” or “ICC.” In the four-channel example referenced above, there may be six ICC values involved, for the L-R pair, the L-Ls pair, the L-Rs pair, the R-Ls pair, the R-Rs pair and the Ls-Rs pair.
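The six channel pairs of the four-channel example may be enumerated as shown in the following Python sketch. The `icc` function, which computes a normalized zero-lag correlation of two real-valued sequences, is an assumed definition included only to make the example concrete.

```python
from itertools import combinations
import math


def channel_pairs(channels):
    """All unordered pairs of discrete channels."""
    return list(combinations(channels, 2))


def icc(x, y):
    """Assumed ICC estimate: normalized correlation of two real sequences."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den else 0.0


pairs = channel_pairs(["L", "R", "Ls", "Rs"])
```

For N discrete channels there are N(N-1)/2 such pairs, which gives six pairs for four channels.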
In some implementations, the determination of spatial parameters by the control information receiver/generator 640 may involve receiving explicit spatial parameters in a bitstream, e.g., via the decorrelation information 240. Alternatively, or additionally, the control information receiver/generator 640 may be configured to estimate at least some spatial parameters. The control information receiver/generator 640 may be configured to determine mixing parameters based, at least in part, on spatial parameters. Accordingly, in some implementations, functions relating to the determination and processing of spatial parameters may be performed, at least in part, by the mixer control module 660.
Referring first to the left panel of
The right panel of
However, restoring the spatial relationship between individual discrete channels and a coupling channel does not guarantee the restoration of the spatial relationships between the discrete channels (represented by the ICCs). This fact is illustrated in
In the examples shown in
Because the discrete channels are ultimately reproduced and presented to listeners, proper restoration of the spatial relationships between discrete channels (the ICCs) may significantly improve the restoration of spatial characteristics of the audio data. As may be seen by the examples of
In the left panel of
Accordingly, by setting the IDC between spatially adjacent individual channels to −1, the ICC between these channels may be minimized and the spatial relationship between the channels may be closely restored when these channels are dominant. This results in an overall sound image that is perceptually approximate to the sound image of the original audio signal. Such methods may be referred to herein as “sign-flip” methods. In such methods, no knowledge of the actual ICCs is required.
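The effect of the sign-flip may be illustrated numerically under a simplifying assumption that is not stated in this passage: each output channel is formed as alpha times a unit-power direct signal plus the square root of (1 - alpha²) times a unit-power decorrelation signal that is uncorrelated with the direct signal. Under that assumed mixing rule, the resulting ICC between two channels may be predicted as follows.

```python
import math


def predicted_icc(alpha_1, alpha_2, idc):
    """Predicted ICC between two channels under the assumed mixing rule."""
    return (alpha_1 * alpha_2
            + idc * math.sqrt(1.0 - alpha_1 ** 2) * math.sqrt(1.0 - alpha_2 ** 2))


same_sign = predicted_icc(0.6, 0.6, idc=1.0)   # identical decorrelation signals
sign_flip = predicted_icc(0.6, 0.6, idc=-1.0)  # polarity reversed on one channel
```

With identical decorrelation signals the two channels remain fully correlated in this example, whereas reversing the polarity of one decorrelation signal yields the lowest ICC attainable for the given alphas.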
In this example, block 804 involves determining audio characteristics of the audio data. Here, the audio characteristics include spatial parameter data. The spatial parameter data may include alphas, the correlation coefficients between individual audio channels and the coupling channel. Block 804 may involve receiving spatial parameter data, e.g., via the decorrelation information 240 described above with reference to
Here, block 806 involves determining at least two decorrelation filtering processes for the audio data based, at least in part, on the audio characteristics. The decorrelation filtering processes may be channel-specific decorrelation filtering processes. According to some implementations, each of the decorrelation filtering processes determined in block 806 includes a sequence of operations relating to decorrelation.
Applying at least two decorrelation filtering processes determined in block 806 may produce channel-specific decorrelation signals. For example, applying the decorrelation filtering processes determined in block 806 may cause a specific inter-decorrelation signal coherence (“IDC”) between channel-specific decorrelation signals for at least one pair of channels. Some such decorrelation filtering processes may involve applying at least one decorrelation filter to at least a portion of the audio data (e.g., as described below with reference to block 820 of
In some implementations, it may be determined in block 806 that the same decorrelation filter will be used to produce filtered audio data corresponding to all of the channels that will be decorrelated, whereas in other implementations, it may be determined in block 806 that a different decorrelation filter will be used to produce filtered audio data for at least some channels that will be decorrelated. In some implementations, it may be determined in block 806 that audio data corresponding to a center channel will not be decorrelated, whereas in other implementations block 806 may involve determining a different decorrelation filter for audio data of a center channel. Moreover, although in some implementations each of the decorrelation filtering processes determined in block 806 includes a sequence of operations relating to decorrelation, in alternative implementations each of the decorrelation filtering processes determined in block 806 may correspond with a particular stage of an overall decorrelation process. For example, in alternative implementations each of the decorrelation filtering processes determined in block 806 may correspond with a particular operation (or a group of related operations) within a sequence of operations relating to generating a decorrelation signal for at least two channels.
In block 808, the decorrelation filtering processes determined in block 806 will be implemented. For example, block 808 may involve applying a decorrelation filter or filters to at least a portion of the received audio data, to produce filtered audio data. The filtered audio data may, for example, correspond with the decorrelation signals 227 produced by the decorrelation signal generator 218, as described above with reference to
Here, block 810 involves determining mixing parameters based, at least in part, on the audio characteristics. Block 810 may be performed, at least in part, by the mixer control module 660 of the control information receiver/generator 640 (see
In some four-channel examples, block 820 may involve applying a first decorrelation filter to audio data for a first and second channel to produce first channel filtered data and second channel filtered data, and applying a second decorrelation filter to audio data for a third and fourth channel to produce third channel filtered data and fourth channel filtered data. For example, the first channel may be a left channel, the second channel may be a right channel, the third channel may be a left surround channel and the fourth channel may be a right surround channel.
The decorrelation filters may be applied either before or after audio data is upmixed, depending on the particular implementation. In some implementations, for example, a decorrelation filter may be applied to a coupling channel of the audio data. Subsequently, a scaling factor appropriate for each channel may be applied. Some examples are described below with reference to
The process 808a of
In the example shown in
The polarity reversing module 840 provides the decorrelation signals 227, including the sign-flipped decorrelation signals 227, to channel-specific mixers 215a-215d. The channel-specific mixers 215a-215d also receive direct, unfiltered audio data 210 of the coupling channel and output-channel-specific spatial parameter information 630a-630d. Alternatively, or additionally, in some implementations the channel-specific mixers 215a-215d may receive the modified mixing coefficients 890 that are described below with reference to
In this implementation, the channel-specific mixers 215a-215d mix the decorrelation signals 227 with the direct audio data 210 of the coupling channel according to the output-channel-specific spatial parameter information 630a-630d and output the resulting output-channel-specific mixed audio data 845a-845d to the gain control modules 850a-850d. In this example, the gain control modules 850a-850d are configured to apply output-channel-specific gains, also referred to herein as scaling factors, to the output-channel-specific mixed audio data 845a-845d.
An alternative sign-flip method will now be described with reference to
In this example, a channel-specific gain/scaling factor has been applied to the audio data 210a-210d before the audio data 210a-210d are received by the decorrelation signal generators 218a-218d. For example, if the audio data has been encoded according to the AC-3 or E-AC-3 audio codecs, the scaling factors may be coupling coordinates or “cplcoords” that are encoded with the rest of the audio data and received in a bitstream by an audio processing system such as a decoding device. In some implementations, cplcoords also may be the basis for the output-channel-specific scaling factors applied by the gain control modules 850a-850d to the output-channel-specific mixed audio data 845a-845d (see
Accordingly, the decorrelation signal generators 218a-218d output channel-specific decorrelation signals 227a-227d for all channels that will be decorrelated. The decorrelation signals 227a-227d are also referenced as yL, yR, yLS and yRS, respectively, in
The decorrelation signals 227a-227d are received by the polarity reversing module 840. The polarity reversing module 840 is configured to reverse the polarity of decorrelation signals for adjacent channels. In this example, the polarity reversing module 840 is configured to reverse the polarity of decorrelation signals for the right channel and the left surround channel. However, in other implementations, the polarity reversing module 840 may be configured to reverse the polarity of decorrelation signals for other channels. For example, the polarity reversing module 840 may be configured to reverse the polarity of decorrelation signals for the left and right surround channels. Other implementations may involve reversing the polarity of decorrelation signals for yet other channels, depending on the number of channels involved and their spatial relationships.
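The operation of a polarity reversing module of this kind may be sketched in Python as follows. The dictionary representation of the channel-specific decorrelation signals, the channel labels and the function name are assumptions made for illustration.

```python
def reverse_polarity(decorrelation_signals, flipped=("R", "Ls")):
    """Negate the decorrelation signals of the selected channels."""
    return {
        ch: [-v for v in sig] if ch in flipped else list(sig)
        for ch, sig in decorrelation_signals.items()
    }


# Hypothetical channel-specific decorrelation signals yL, yR, yLS and yRS.
signals = {
    "L": [1.0, -2.0],
    "R": [1.0, -2.0],
    "Ls": [0.5, 0.25],
    "Rs": [0.5, 0.25],
}
flipped_signals = reverse_polarity(signals)
```

Passing a different tuple as `flipped` corresponds to implementations in which the polarity is reversed for other channels.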
The polarity reversing module 840 provides the decorrelation signals 227a-227d, including the sign-flipped decorrelation signals 227b and 227c, to channel-specific mixers 215a-215d. Here, the channel-specific mixers 215a-215d also receive direct audio data 210a-210d and output-channel-specific spatial parameter information 630a-630d. In this example, the output-channel-specific spatial parameter information 630a-630d has been modified according to transient data.
In this implementation, the channel-specific mixers 215a-215d mix the decorrelation signals 227 with the direct audio data 210a-210d according to the output-channel-specific spatial parameter information 630a-630d and output the output-channel-specific mixed audio data 845a-845d.
Alternative methods for restoring the spatial relationship between discrete input channels are provided herein. The methods may involve systematically determining synthesizing coefficients to determine how decorrelation or reverb signals will be synthesized. According to some such methods, the optimal IDCs are determined from alphas and target ICCs. Such methods may involve systematically synthesizing a set of channel-specific decorrelation signals according to the IDCs that are determined to be optimal.
An overview of some such systematic methods will now be described with reference to
However, in this example, the decorrelation processes determined in block 806 may involve performing operations on the filtered audio data according to synthesizing coefficients. Some examples are provided below.
Optional block 855 may involve converting from one form of spatial parameters to an equivalent representation. Referring to
In alternative audio processing system implementations, at least some of the functionality of the synthesizing and mixing coefficient generating module 880 may be performed by elements other than the mixer 215. For example, in some alternative implementations, at least some of the functionality of the synthesizing and mixing coefficient generating module 880 may be performed by a control information receiver/generator 640 such as that shown in
In this implementation, block 860 involves determining a desired spatial relationship between output channels in terms of a spatial parameter representation. As shown in
In this example, block 865 involves determining synthesizing coefficients based on the desired spatial relationships. Mixing coefficients may also be determined, based at least in part on the desired spatial relationships. Referring again to
The synthesizing and mixing coefficient generating module 880 may provide the decorrelation signal synthesizing parameters 615 to the synthesizer 605. In some implementations, the decorrelation signal synthesizing parameters 615 may be output-channel-specific. In this example, the synthesizer 605 also receives the decorrelation signals 227, which may be produced by a decorrelation signal generator 218 such as that shown in
In this example, block 820 involves applying one or more decorrelation filters to at least a portion of the received audio data, to produce filtered audio data. The filtered audio data may, for example, correspond with the decorrelation signals 227 produced by the decorrelation signal generator 218, as described above with reference to
Block 870 may involve synthesizing decorrelation signals according to the synthesizing coefficients. In some implementations, block 870 may involve synthesizing decorrelation signals by performing operations on the filtered audio data produced in block 820. As such, the synthesized decorrelation signals may be considered a modified version of the filtered audio data. In the example shown in
The synthesizing and mixing coefficient generating module 880 may provide the mixing coefficients 620 to a mixer transient control module 888. In this implementation, the mixing coefficients 620 are output-channel-specific mixing coefficients. The mixer transient control module 888 may receive transient control information 430. The transient control information 430 may be received along with the audio data or may be determined locally, e.g., by a transient control module such as the transient control module 655 shown in
The direct signal and decorrelation signal mixer 610 may mix the synthesized decorrelation signals 886 with the direct, unfiltered audio data 220. In this example, the audio data 220 includes audio data elements corresponding to N input channels. The direct signal and decorrelation signal mixer 610 mixes the audio data elements and the channel-specific synthesized decorrelation signals 886 on an output-channel-specific basis and outputs decorrelated audio data 230 for N or M output channels, depending on the particular implementation (see, e.g.,
Following are detailed examples of some of the processes of method 851. Although these methods are described, at least in part, with reference to features of the AC-3 and E-AC-3 audio codecs, the methods have wide applicability to many other audio codecs.
The goal of some such methods is to reproduce all ICCs (or a selected set of ICCs) precisely, in order to restore the spatial characteristics of the source audio data that may have been lost due to channel coupling. The functionality of a mixer may be formulated as:
$$y_i = g_i\left[\alpha_i x + \sqrt{1-|\alpha_i|^2}\,D_i(x)\right], \quad \forall i \qquad \text{(Equation 1)}$$
In Equation 1, x represents a coupling channel signal, αi represents the spatial parameter alpha for channel i, gi represents the “cplcoord” (corresponding to a scaling factor) for channel i, yi represents the decorrelated signal and Di(x) represents the decorrelation signal generated from decorrelation filter Di. It is desirable for the output of the decorrelation filter to have the same spectral power distribution as the input audio data, but to be uncorrelated to the input audio data. According to the AC-3 and E-AC-3 audio codecs, cplcoords and alphas are per coupling channel frequency band, while the signals and the filter are per frequency bin. Also, the samples of the signals correspond to the blocks of the filterbank coefficients. These time and frequency indices are omitted here for the sake of simplicity.
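As a minimal sketch of the mixer of Equation 1 for a single channel, the following Python function applies one alpha and one cplcoord to every coefficient. The helper names and the toy signals are assumptions for illustration; in practice alphas and cplcoords vary per band and per block.

```python
import math


def mix_channel(x, d, alpha, g):
    """Equation 1 for one channel: y = g * (alpha * x + sqrt(1 - |alpha|^2) * d).

    x: coupling channel coefficients, d: decorrelation signal D_i(x),
    alpha: spatial parameter, g: cplcoord (scaling factor).
    """
    if abs(alpha) > 1.0:
        raise ValueError("alpha must satisfy |alpha| <= 1")
    beta = math.sqrt(1.0 - abs(alpha) ** 2)
    return [g * (alpha * xi + beta * di) for xi, di in zip(x, d)]


def power(sig):
    """Mean power of a sequence of coefficients."""
    return sum(abs(v) ** 2 for v in sig) / len(sig)


# Toy data: d is orthogonal to x and has the same power as x.
x = [1.0, -1.0, 1.0, -1.0]
d = [1.0, 1.0, -1.0, -1.0]
y = mix_channel(x, d, alpha=0.6, g=2.0)
```

Because the toy decorrelation signal is uncorrelated with x and has the same power, the output power equals g squared times the power of x, regardless of alpha.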
The alpha values represent the correlation between discrete channels of the source audio data and the coupling channel, which may be expressed as follows:
In Equation 2, E represents the expectation value of the term(s) within the curly brackets, x* represents the complex conjugate of x and si represents a discrete signal for the channel i.
The inter-channel coherence or ICC between a pair of decorrelated signals can be derived as follows:
In Equation 3, IDCi1,i2 represents the inter-decorrelation-signal coherence (“IDC”) between Di1(x) and Di2(x). With fixed alphas, the ICC is maximized when IDC is +1 and minimized when IDC is −1. When the ICC of the source audio data is known, the optimal IDC required to replicate it can be solved as:
The ICC between the decorrelated signals may be controlled by selecting decorrelation signals that satisfy the optimal IDC conditions of Equation 4. Some methods of generating such decorrelation signals will be discussed below. Before that discussion, it may be useful to describe the relationships between some of these spatial parameters, particularly that between ICCs and alphas.
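The relationship between ICC, alphas and IDC can be sketched in Python as follows. The sketch assumes real-valued parameters and assumes that each Di(x) is uncorrelated with x and has the same power as x, in which case Equation 1 implies ICC = α1·α2 + sqrt(1−α1²)·sqrt(1−α2²)·IDC. The function names, and the clamping of the result to [−1, 1], are illustrative assumptions.

```python
import math


def icc_from_idc(alpha1, alpha2, idc):
    """ICC of two mixed outputs, given their alphas and the IDC of their
    decorrelation signals (real-valued case, assumptions as stated above)."""
    b1 = math.sqrt(1.0 - alpha1 ** 2)
    b2 = math.sqrt(1.0 - alpha2 ** 2)
    return alpha1 * alpha2 + b1 * b2 * idc


def optimal_idc(alpha1, alpha2, icc):
    """IDC that reproduces a target ICC, obtained by inverting icc_from_idc.

    The result is clamped to [-1, 1] (an assumption of this sketch) because
    some target ICCs cannot be reached with the given alphas.
    """
    denom = math.sqrt(1.0 - alpha1 ** 2) * math.sqrt(1.0 - alpha2 ** 2)
    if denom == 0.0:
        raise ValueError("IDC is undefined when an alpha has magnitude 1")
    idc = (icc - alpha1 * alpha2) / denom
    return max(-1.0, min(1.0, idc))
```

For example, `optimal_idc(0.8, 0.6, 0.5)` returns an IDC that, when passed back to `icc_from_idc` with the same alphas, yields the target ICC of 0.5.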
As noted above with reference to optional block 855 of method 851, some implementations provided herein may involve converting from one form of spatial parameters to an equivalent representation. In some such implementations, optional block 855 may involve converting from alphas to ICCs or vice versa. For example, alphas may be uniquely determined if both the cplcoords (or comparable scaling factors) and ICCs are known.
A coupling channel may be generated as follows:
In Equation 5, si represents the discrete signal for channel i involved in the coupling and gx represents an arbitrary gain adjustment applied on x. By replacing the x term of Equation 2 with the equivalent expression of Equation 5, an alpha for channel i can be expressed as follows:
The power of each discrete channel can be represented by the power of the coupling channel and the power of the corresponding cplcoord as follows:
$$E\{|s_i|^2\} = g_i^2\,E\{|x|^2\}$$
The cross-correlation terms can be substituted as follows:
$$E\{s_i s_j^*\} = g_i g_j\,E\{|x|^2\}\,\mathrm{ICC}_{i,j}$$
Therefore, the alphas may be expressed in this manner:
Based on Equation 5, the power of x may be expressed as follows:
Therefore, the gain adjustment gx may be expressed as follows:
Accordingly, if all cplcoords and ICCs are known, alphas can be computed according to the following expression:
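One way to carry out this computation is sketched below in Python. The sketch follows the derivation above under the assumptions that the cplcoords are real and non-negative, that the ICCs are real, and that ICC[i][i] = 1. Under those assumptions the substitutions give gx = 1/sqrt(Σi Σj gi·gj·ICCi,j) and αi = gx·Σj gj·ICCi,j. The function name is hypothetical.

```python
import math


def alphas_from_cplcoords_and_iccs(g, icc):
    """Compute one alpha per channel from cplcoords `g` (list of floats) and
    an ICC matrix `icc` (list of lists, symmetric, ones on the diagonal)."""
    n = len(g)
    total = sum(g[i] * g[j] * icc[i][j] for i in range(n) for j in range(n))
    if total <= 0.0:
        raise ValueError("cplcoords and ICCs imply a coupling channel with no power")
    gx = 1.0 / math.sqrt(total)  # gain adjustment applied on x
    return [gx * sum(g[j] * icc[i][j] for j in range(n)) for i in range(n)]


# Two fully coherent channels: each is perfectly correlated with the downmix.
coherent = alphas_from_cplcoords_and_iccs([1.0, 1.0], [[1.0, 1.0], [1.0, 1.0]])
# Two mutually uncorrelated, equal-power channels.
independent = alphas_from_cplcoords_and_iccs([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the second case each alpha equals 1/sqrt(2), which matches a direct calculation of the correlation between one of two uncorrelated equal-power signals and their sum.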
As noted above, the ICC between decorrelated signals may be controlled by selecting decorrelation signals that satisfy Equation 4. In the stereo case, a single decorrelation filter may be formed that generates decorrelation signals uncorrelated to the coupling channel signal. The optimal IDC of −1 can be achieved by simply sign-flipping, e.g., according to one of the sign-flip methods described above.
However, the task of controlling ICCs for multichannel cases is more complex. In addition to ensuring that all decorrelation signals are substantially uncorrelated to the coupling channel, the IDCs among the decorrelation signals also should satisfy Equation 4.
In order to generate decorrelation signals with the desired IDCs, a set of mutually uncorrelated “seed” decorrelation signals may first be generated. For example, the decorrelation signals 227 may be generated according to methods described elsewhere herein. Subsequently, the desired decorrelation signals may be synthesized by linearly combining these seeds with proper weights. An overview of some examples is described above with reference to
It may be challenging to generate many high-quality and mutually-uncorrelated (e.g., orthogonal) decorrelation signals from one downmix. Furthermore, calculating the proper combination weights may involve matrix inversion, which could pose challenges in terms of complexity and stability.
Accordingly, in some examples provided herein, an “anchor-and-expand” process may be implemented. In some implementations, some IDCs (and ICCs) may be more significant than others. For example, lateral ICCs may be perceptually more important than diagonal ICCs. In a Dolby 5.1 channel example, the ICCs for the L-R, L-Ls, R-Rs and Ls-Rs channel pairs may be perceptually more important than the ICCs for the L-Rs and R-Ls channel pairs. Front channels may be perceptually more important than rear or surround channels.
In some such implementations, the terms of Equation 4 for the most important IDC can be first satisfied by combining two orthogonal (seed) decorrelation signals to synthesize the decorrelation signals for the two channels involved. Then, using these synthesized decorrelation signals as anchors and adding new seeds, the terms of Equation 4 for the secondary IDCs can be satisfied and the corresponding decorrelation signals can be synthesized. This process may be repeated until the terms of Equation 4 are satisfied for all of the IDCs. Such implementations allow the use of decorrelation signals of higher quality to control relatively more critical ICCs.
In this example, blocks 905-915 involve calculating synthesis parameters to be applied to a set of mutually uncorrelated seed decorrelation signals, Dni(x), that are generated in block 920. In some 5.1 channel implementations, i={1, 2, 3, 4}. If the center channel will be decorrelated, a fifth seed decorrelation signal may be involved. In some implementations, uncorrelated (orthogonal) decorrelation signals, Dni(x), may be generated by inputting the mono downmix signal into several different decorrelation filters. Alternatively, the initial upmixed signals can each be inputted into a unique decorrelation filter. Various examples are provided below.
As noted above, front channels may be perceptually more important than rear or surround channels. Therefore, in method 900, the decorrelation signals for L and R channels are jointly anchored on the first two seeds, then the decorrelation signals for Ls and Rs channels are synthesized using these anchors and the remaining seeds.
In this example, block 905 involves calculating synthesis parameters ρ and ρr for the front L and R channels. Here, ρ and ρr are derived from the L-R IDC as:
Therefore, block 905 also involves calculating the L-R IDC from Equation 4. Accordingly, in this example, ICC information is used to calculate the L-R IDC. Other processes of the method also may use ICC values as input. ICC values may be obtained from the coded bitstream or by estimation at the decoder side, e.g., based on uncoupled lower-frequency or higher-frequency bands, cplcoords, alphas, etc.
The synthesis parameters ρ and ρr may be used to synthesize the decorrelation signals for the L and R channels in block 925. The decorrelation signals for the Ls and Rs channels may be synthesized using the decorrelation signals for the L and R channels as anchors.
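A hedged Python sketch of one way to obtain ρ and ρr for a real-valued L-R IDC follows. It assumes orthogonal, unit-power seeds and the block 925 forms DL = ρ·Dn1 + ρr·Dn2 and DR = ρ·Dn2 + ρr·Dn1, so that the two constraints are ρ² + ρr² = 1 (unit power) and 2·ρ·ρr = IDC (target coherence). The closed form used here is one solution of those constraints and is not necessarily the expression of the original disclosure; the function name is hypothetical.

```python
import math


def anchor_pair_params(idc):
    """Return (rho, rho_r) such that rho^2 + rho_r^2 = 1 and 2*rho*rho_r = idc.

    Real-valued sketch: rho is the larger, non-negative weight and rho_r
    carries the sign of the target IDC.
    """
    if not -1.0 <= idc <= 1.0:
        raise ValueError("IDC must lie in [-1, 1]")
    c = math.sqrt(1.0 - idc * idc)
    rho = math.sqrt(0.5 * (1.0 + c))
    rho_r = math.copysign(math.sqrt(0.5 * (1.0 - c)), idc)
    return rho, rho_r


rho, rho_r = anchor_pair_params(-0.6)
```

With an IDC of 0 the sketch returns ρ = 1 and ρr = 0, so each channel uses its own seed unmodified; with an IDC of −1 the two weights have equal magnitude and opposite sign.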
In some implementations, it may be desirable to control the Ls-Rs ICC. According to method 900, synthesizing intermediate decorrelation signals D′Ls(x) and D′Rs(x) with two of the seed decorrelation signals involves calculating the synthesis parameters σ and σr. Therefore, optional block 910 involves calculating the synthesis parameters σ and σr for the surround channels. It can be derived that the required correlation coefficient between intermediate decorrelation signals D′Ls(x) and D′Rs(x) may be expressed as follows:
The variables σ and σr may be derived from their correlation coefficient:
Therefore, D′Ls(x) and D′Rs(x) can be defined as:
$$D'_{Ls}(x) = \sigma D_{n3}(x) + \sigma_r D_{n4}(x)$$
$$D'_{Rs}(x) = \sigma D_{n4}(x) + \sigma_r D_{n3}(x)$$
However, if the Ls-Rs ICC is not a concern, the correlation coefficient between D′Ls(x) and D′Rs(x) can be set to −1. Accordingly, the two signals can simply be sign-flipped versions of each other constructed by the remaining seed decorrelation signals.
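The required correlation coefficient between the intermediate signals can be sketched in Python for the real-valued case. The sketch assumes the block 925 forms in which DLs = IDC(L,Ls)·DL + sqrt(1−IDC(L,Ls)²)·D′Ls and DRs = IDC(R,Rs)·DR + sqrt(1−IDC(R,Rs)²)·D′Rs, with D′Ls and D′Rs uncorrelated to DL and DR. Expanding the correlation of DLs and DRs and solving for the correlation of D′Ls and D′Rs gives the expression in the code. The function name and the clamping to [−1, 1] are assumptions.

```python
import math


def intermediate_surround_corr(idc_ls_rs, idc_l_ls, idc_r_rs, idc_l_r):
    """Correlation coefficient needed between D'Ls and D'Rs so that the
    synthesized DLs and DRs reach the target Ls-Rs IDC (real-valued case)."""
    denom = math.sqrt(1.0 - idc_l_ls ** 2) * math.sqrt(1.0 - idc_r_rs ** 2)
    if denom == 0.0:
        raise ValueError("surround signal is fully determined by its anchor")
    c = (idc_ls_rs - idc_l_ls * idc_r_rs * idc_l_r) / denom
    return max(-1.0, min(1.0, c))  # clamp: not every target is reachable


c = intermediate_surround_corr(idc_ls_rs=0.2, idc_l_ls=0.5, idc_r_rs=0.5, idc_l_r=-0.4)
```

The resulting coefficient may then be converted to σ and σr in the same manner in which ρ and ρr are obtained from the L-R IDC.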
The center channel may or may not be decorrelated, depending on the particular implementation. Accordingly, block 915's process of calculating synthesis parameters t1 and t2 for the center channel is optional. Synthesis parameters for the center channel may be calculated, for example, if controlling the L-C and R-C ICCs is desirable. If so, a fifth seed, Dn5(x), can be added and the decorrelation signal for the C channel may be expressed as follows:
$$D_C(x) = t_1 D_{n1}(x) + t_2 D_{n2}(x) + \sqrt{1-|t_1|^2-|t_2|^2}\,D_{n5}(x)$$
In order to achieve the desired L-C and R-C ICCs, Equation 4 should be satisfied for the L-C and R-C IDCs:
$$\mathrm{IDC}_{L,C} = \rho\,t_1^* + \rho_r\,t_2^*$$
$$\mathrm{IDC}_{R,C} = \rho_r\,t_1^* + \rho\,t_2^*$$
The asterisks indicate complex conjugates. Accordingly, synthesis parameters t1 and t2 for the center channel may be expressed as follows:
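For real-valued parameters, the two conditions above form a 2×2 linear system in t1 and t2, which can be solved directly. The following Python sketch does so with Cramer's rule; the function name and the singularity tolerance are assumptions for illustration.

```python
import math


def center_params(rho, rho_r, idc_l_c, idc_r_c):
    """Solve IDC_LC = rho*t1 + rho_r*t2 and IDC_RC = rho_r*t1 + rho*t2
    for (t1, t2) in the real-valued case."""
    det = rho * rho - rho_r * rho_r
    if abs(det) < 1e-12:
        # Occurs when |rho| == |rho_r|, i.e. the L-R IDC has magnitude 1.
        raise ValueError("system is singular; t1 and t2 are not unique")
    t1 = (rho * idc_l_c - rho_r * idc_r_c) / det
    t2 = (rho * idc_r_c - rho_r * idc_l_c) / det
    return t1, t2


rho, rho_r = math.sqrt(0.9), -math.sqrt(0.1)  # example values with rho^2 + rho_r^2 = 1
t1, t2 = center_params(rho, rho_r, idc_l_c=0.3, idc_r_c=0.2)
```

A practical implementation would also need to confirm that t1² + t2² does not exceed 1, because the weight of the fifth seed involves the square root of 1 − |t1|² − |t2|².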
In block 920, a set of mutually uncorrelated seed decorrelation signals, Dni(x), i={1, 2, 3, 4}, may be generated. If the center channel will be decorrelated, a fifth seed decorrelation signal may be generated in block 920. These uncorrelated (orthogonal) decorrelation signals, Dni(x), may be generated by inputting the mono downmix signal into several different decorrelation filters.
In this example, block 925 involves applying the above-derived terms to synthesize decorrelation signals, as follows:
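$$D_L(x) = \rho D_{n1}(x) + \rho_r D_{n2}(x)$$
$$D_R(x) = \rho D_{n2}(x) + \rho_r D_{n1}(x)$$
$$D_{Ls}(x) = \mathrm{IDC}_{L,Ls}^*\,\rho D_{n1}(x) + \mathrm{IDC}_{L,Ls}^*\,\rho_r D_{n2}(x) + \sqrt{1-|\mathrm{IDC}_{L,Ls}|^2}\,\sigma D_{n3}(x) + \sqrt{1-|\mathrm{IDC}_{L,Ls}|^2}\,\sigma_r D_{n4}(x)$$
$$D_{Rs}(x) = \mathrm{IDC}_{R,Rs}^*\,\rho D_{n2}(x) + \mathrm{IDC}_{R,Rs}^*\,\rho_r D_{n1}(x) + \sqrt{1-|\mathrm{IDC}_{R,Rs}|^2}\,\sigma D_{n4}(x) + \sqrt{1-|\mathrm{IDC}_{R,Rs}|^2}\,\sigma_r D_{n3}(x)$$
$$D_C(x) = t_1 D_{n1}(x) + t_2 D_{n2}(x) + \sqrt{1-|t_1|^2-|t_2|^2}\,D_{n5}(x)$$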
In this example, the equations for synthesizing decorrelation signals for the Ls and Rs channels (DLs(x) and DRs(x)) are dependent on the equations for synthesizing the decorrelation signals for the L and R channels (DL(x) and DR(x)). In method 900, the decorrelation signals for the L and R channels are jointly anchored to mitigate potential left-right bias due to imperfect decorrelation signals.
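The block 925 synthesis for the L, R, Ls and Rs channels can be sketched in Python for real-valued parameters. The seeds in this sketch are orthonormal unit vectors, which stand in for mutually uncorrelated, equal-power seed decorrelation signals; the helper names and parameter values are assumptions chosen for illustration.

```python
import math


def dot(a, b):
    """Inner product, used here as the (unnormalized) correlation."""
    return sum(p * q for p, q in zip(a, b))


def combine(weights, seeds):
    """Weighted sum of the seed signals, one weight per seed."""
    n = len(seeds[0])
    return [sum(w * s[k] for w, s in zip(weights, seeds)) for k in range(n)]


def synthesize(seeds, rho, rho_r, sigma, sigma_r, idc_l_ls, idc_r_rs):
    """Return (D_L, D_R, D_Ls, D_Rs) built from four seeds [n1, n2, n3, n4]."""
    kl = math.sqrt(1.0 - idc_l_ls ** 2)
    kr = math.sqrt(1.0 - idc_r_rs ** 2)
    d_l = combine([rho, rho_r, 0.0, 0.0], seeds)
    d_r = combine([rho_r, rho, 0.0, 0.0], seeds)
    d_ls = combine([idc_l_ls * rho, idc_l_ls * rho_r, kl * sigma, kl * sigma_r], seeds)
    d_rs = combine([idc_r_rs * rho_r, idc_r_rs * rho, kr * sigma_r, kr * sigma], seeds)
    return d_l, d_r, d_ls, d_rs


# Orthonormal stand-ins for the seeds Dn1..Dn4.
seeds = [[1.0 if k == i else 0.0 for k in range(4)] for i in range(4)]
d_l, d_r, d_ls, d_rs = synthesize(
    seeds, rho=0.8, rho_r=0.6, sigma=1.0, sigma_r=0.0, idc_l_ls=0.5, idc_r_rs=0.25
)
```

With these values the correlation of D_L with D_Ls equals the requested L-Ls IDC of 0.5, the correlation of D_R with D_Rs equals 0.25, and every synthesized signal keeps unit power, which illustrates how the surround signals are anchored on the front pair.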
In the example above, the seed decorrelation signals are generated from the mono downmix signal x in block 920. Alternatively, the seed decorrelation signals can be generated by inputting each initial upmixed signal into a unique decorrelation filter. In this case, the generated seed decorrelation signals would be channel-specific: Dni(gix), i={L, R, Ls, Rs, C}. These channel-specific seed decorrelation signals would generally have different power levels due to the upmixing process. Accordingly, it is desirable to align the power level among these seeds when combining them. To achieve this, the synthesizing equations for block 925 can be modified as follows:
$$D_L(x) = \rho D_{nL}(g_L x) + \rho_r \lambda_{L,R} D_{nR}(g_R x)$$
$$D_R(x) = \rho D_{nR}(g_R x) + \rho_r \lambda_{R,L} D_{nL}(g_L x)$$
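$$D_{Ls}(x) = \mathrm{IDC}_{L,Ls}^*\,\rho \lambda_{Ls,L} D_{nL}(g_L x) + \mathrm{IDC}_{L,Ls}^*\,\rho_r \lambda_{Ls,R} D_{nR}(g_R x) + \sqrt{1-|\mathrm{IDC}_{L,Ls}|^2}\,\sigma D_{nLs}(g_{Ls} x) + \sqrt{1-|\mathrm{IDC}_{L,Ls}|^2}\,\sigma_r \lambda_{Ls,Rs} D_{nRs}(g_{Rs} x)$$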
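$$D_{Rs}(x) = \mathrm{IDC}_{R,Rs}^*\,\rho \lambda_{Rs,R} D_{nR}(g_R x) + \mathrm{IDC}_{R,Rs}^*\,\rho_r \lambda_{Rs,L} D_{nL}(g_L x) + \sqrt{1-|\mathrm{IDC}_{R,Rs}|^2}\,\sigma D_{nRs}(g_{Rs} x) + \sqrt{1-|\mathrm{IDC}_{R,Rs}|^2}\,\sigma_r \lambda_{Rs,Ls} D_{nLs}(g_{Ls} x)$$
$$D_C(x) = t_1 \lambda_{C,L} D_{nL}(g_L x) + t_2 \lambda_{C,R} D_{nR}(g_R x) + \sqrt{1-|t_1|^2-|t_2|^2}\,D_{nC}(g_C x)$$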
In the modified synthesizing equations, all synthesizing parameters remain the same. However, level adjusting parameters λi,j are required to align the power level when using a seed decorrelation signal generated from channel j to synthesize the decorrelation signal for channel i. These channel-pair-specific level adjusting parameters can be computed based on the estimated channel level differences, such as:
Furthermore, since the channel-specific scaling factors are already incorporated into the synthesized decorrelation signals in this case, the mixer equation for block 812 (
$$y_i = \alpha_i g_i x + \sqrt{1-|\alpha_i|^2}\,D_i(x), \quad \forall i$$
As noted elsewhere herein, in some implementations spatial parameters may be received along with audio data. The spatial parameters may, for example, have been encoded with the audio data. The encoded spatial parameters and audio data may be received in a bitstream by an audio processing system such as a decoder, e.g., as described above with reference to
However, in alternative implementations, no encoded spatial parameters (or an incomplete set of spatial parameters) are received by the decorrelator 205. According to some such implementations, the control information receiver/generator 640, described above with reference to
The first set of frequency coefficients may correspond to a first frequency range and the second set of frequency coefficients may correspond to a second frequency range. For example, the first frequency range may correspond to an individual channel frequency range and the second frequency range may correspond to a received coupling channel frequency range. In some implementations, the first frequency range may be below the second frequency range. However, in alternative implementations, the first frequency range may be above the second frequency range.
Referring to
In block 1010 spatial parameters for at least part of the second set of frequency coefficients are estimated. In some implementations, the estimation is based upon one or more aspects of estimation theory. For example, the estimating process may be based, at least in part, on a maximum likelihood method, a Bayes estimator, a method of moments estimator, a minimum mean squared error estimator and/or a minimum variance unbiased estimator.
Some such implementations may involve estimating the joint probability density functions (“PDFs”) of the spatial parameters of the lower frequencies and the higher frequencies. For instance, let us say we have two channels L and R and in each channel we have a low band in the individual channel frequency range and a high band in the coupling channel frequency range. We may thus have an ICC_lo which represents the inter-channel-coherence between the L and R channels in the individual channel frequency range, and an ICC_hi which exists in the coupling channel frequency range.
If we have a large training set of audio signals, we can segment them and for each segment ICC_lo and ICC_hi can be calculated. Thus we may have a large training set of ICC pairs (ICC_lo, ICC_hi). A joint PDF of this pair of parameters may be calculated as histograms and/or modeled via parametric models (for instance, Gaussian Mixture Models). This model could be a time-invariant model that is known at the decoder. Alternatively, the model parameters may be regularly sent to the decoder via the bitstream.
At the decoder, ICC_lo for a particular segment of received audio data may be calculated, e.g., according to how cross-correlation coefficients between individual channels and the composite coupling channel are calculated as described herein. Given this value of the ICC_lo and the model of the joint PDF of the parameters, the decoder may try to estimate what ICC_hi is. One such estimate is the Maximum-likelihood (“ML”) estimate, wherein the decoder may calculate the conditional PDF of ICC_hi given the value of ICC_lo. This conditional PDF is now essentially a positive-real-valued function that can be represented on an x-y axis, the x axis representing the continuum of ICC_hi values and the y axis representing the conditional probability of each such value. The ML estimate may involve choosing as the estimate of ICC_hi that value where this function peaks. On the other hand, the minimum-mean-squared-error (“MMSE”) estimate is the mean of this conditional PDF, which is another valid estimate of ICC_hi. Estimation theory provides many such tools to come up with an estimate of ICC_hi.
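The two estimates described above can be illustrated with a small Python sketch in which the joint PDF is modeled as a histogram. The histogram counts, the bin centers and the function names are made-up values and names for illustration; they are not derived from any training set.

```python
def conditional_pdf(joint_counts, lo_bin):
    """Conditional PDF of ICC_hi given that ICC_lo falls in bin `lo_bin`.

    `joint_counts[lo][hi]` is a histogram of (ICC_lo, ICC_hi) training pairs.
    """
    row = joint_counts[lo_bin]
    total = sum(row)
    if total == 0:
        raise ValueError("no training data for this ICC_lo bin")
    return [count / total for count in row]


def ml_estimate(pdf, hi_centers):
    """Value of ICC_hi at which the conditional PDF peaks."""
    peak = max(range(len(pdf)), key=pdf.__getitem__)
    return hi_centers[peak]


def mmse_estimate(pdf, hi_centers):
    """Mean of the conditional PDF."""
    return sum(p * c for p, c in zip(pdf, hi_centers))


hi_centers = [-0.5, 0.0, 0.5, 1.0]            # hypothetical ICC_hi bin centers
joint_counts = [[4, 3, 2, 1], [1, 2, 5, 2]]   # hypothetical training histogram
pdf = conditional_pdf(joint_counts, lo_bin=1)
icc_hi_ml = ml_estimate(pdf, hi_centers)
icc_hi_mmse = mmse_estimate(pdf, hi_centers)
```

For the second row of this toy histogram, the peak-based estimate is 0.5 while the mean-based estimate is 0.4, which shows that the two estimators can differ when the conditional PDF is asymmetric.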
The above two-parameter example is a very simple case. In some implementations there may be a larger number of channels as well as bands. The spatial parameters may be alphas or ICCs. Moreover, the PDF model may be conditioned on signal type. For example, there may be a different model for transients, a different model for tonal signals, etc.
In this example, the estimation of block 1010 is based at least in part on the first set of frequency coefficients. For example, the first set of frequency coefficients may include audio data for two or more individual channels in a first frequency range that is outside of a received coupling channel frequency range. The estimating process may involve calculating combined frequency coefficients of a composite coupling channel within the first frequency range, based on the frequency coefficients of the two or more channels. The estimating process also may involve computing cross-correlation coefficients between the combined frequency coefficients and frequency coefficients of the individual channels within the first frequency range. The results of the estimating process may vary according to temporal changes of input audio signals.
In block 1015, the estimated spatial parameters may be applied to the second set of frequency coefficients, to generate a modified second set of frequency coefficients. In some implementations, the process of applying the estimated spatial parameters to the second set of frequency coefficients may be part of a decorrelation process. The decorrelation process may involve generating a reverb signal or a decorrelation signal and applying it to the second set of frequency coefficients. In some implementations, the decorrelation process may involve applying a decorrelation algorithm that operates entirely on real-valued coefficients. The decorrelation process may involve selective or signal-adaptive decorrelation of specific channels and/or specific frequency bands.
A more detailed example will now be described with reference to
In this example, the first set of frequency coefficients is in an individual channel frequency range. The second set of frequency coefficients corresponds to a coupling channel that is received by an audio processing system. The second set of frequency coefficients is in a received coupling channel frequency range, which is above the individual channel frequency range in this example.
Accordingly, block 1022 involves receiving audio data for the individual channels and for the received coupling channel. In some implementations, the audio data may have been encoded according to a legacy encoding process. Applying spatial parameters that are estimated according to method 1000 or method 1020 to audio data of the received coupling channel may yield a more spatially accurate audio reproduction than that obtained by decoding the received audio data according to a legacy decoding process that corresponds with the legacy encoding process. In some implementations, the legacy encoding process may be a process of the AC-3 audio codec or the Enhanced AC-3 audio codec. Accordingly, in some implementations, block 1022 may involve receiving real-valued frequency coefficients but not frequency coefficients having imaginary values. However, method 1020 is not limited to these codecs, but is broadly applicable to many audio codecs.
In block 1025 of method 1020, at least a portion of the individual channel frequency range is divided into a plurality of frequency bands. For example, the individual channel frequency range may be divided into 2, 3, 4 or more frequency bands. In some implementations, each of the frequency bands may include a predetermined number of consecutive frequency coefficients, e.g., 6, 8, 10, 12 or more consecutive frequency coefficients. In some implementations, only part of the individual channel frequency range may be divided into frequency bands. For example, some implementations may involve dividing only a higher-frequency portion of the individual channel frequency range (relatively closer to the received coupling channel frequency range) into frequency bands. According to some E-AC-3-based examples, a higher-frequency portion of the individual channel frequency range may be divided into 2 or 3 bands, each of which includes 12 MDCT coefficients. According to some such implementations, only that portion of the individual channel frequency range that is above 1 kHz, above 1.5 kHz, etc. may be divided into frequency bands.
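A minimal Python sketch of such a banding step follows. The function name, the half-open bin ranges and the example bin indices are assumptions; an actual codec defines its own bin-to-band mapping.

```python
def split_into_bands(k_start, k_end, band_size=12):
    """Divide the bins [k_start, k_end) into consecutive bands of `band_size`
    coefficients. A trailing partial band, if any, is dropped (an assumption
    of this sketch). Each band is returned as (first_bin, end_bin_exclusive).
    """
    return [
        (k, k + band_size)
        for k in range(k_start, k_end - band_size + 1, band_size)
    ]


# Hypothetical example: 36 bins directly below a coupling begin bin of 97.
bands = split_into_bands(61, 97)
```

Here the 36 bins below the hypothetical coupling begin bin are divided into three bands of 12 coefficients each.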
In this example, block 1030 involves computing the energy in the individual channel frequency bands. In this example, if an individual channel has been excluded from coupling, then the banded energy of the excluded channel will not be computed in block 1030. In some implementations, the energy values computed in block 1030 may be smoothed.
In this implementation, a composite coupling channel, based on audio data of the individual channels in the individual channel frequency range, is created in block 1035. Block 1035 may involve calculating frequency coefficients for the composite coupling channel, which may be referred to herein as “combined frequency coefficients.” The combined frequency coefficients may be created using frequency coefficients of two or more channels in the individual channel frequency range. For example, if the audio data has been encoded according to the E-AC-3 codec, block 1035 may involve computing a local downmix of MDCT coefficients below the “coupling begin frequency,” which is the lowest frequency in the received coupling channel frequency range.
The energy of the composite coupling channel, within each frequency band of the individual channel frequency range, may be determined in block 1040. In some implementations, the energy values computed in block 1040 may be smoothed.
In this example, block 1045 involves determining cross-correlation coefficients, which correspond to the correlation between frequency bands of the individual channels and corresponding frequency bands of the composite coupling channel. Here, computing cross correlation coefficients in block 1045 also involves computing the energy in the frequency bands of each of the individual channels and the energy in the corresponding frequency bands of the composite coupling channel. The cross-correlation coefficients may be normalized. According to some implementations, if an individual channel has been excluded from coupling, then frequency coefficients of the excluded channel will not be used in the computation of the cross-correlation coefficients.
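The composite coupling channel and a normalized cross-correlation coefficient for one band can be sketched in Python as follows. This sketch operates on a single block of real-valued coefficients and uses plain sums in place of time-smoothed expectations, which is a simplification; the function names are hypothetical.

```python
import math


def composite_coupling_channel(channels, gx=1.0):
    """Sum the per-channel coefficients bin by bin and apply the gain gx."""
    return [gx * sum(values) for values in zip(*channels)]


def normalized_cross_correlation(s_band, x_band):
    """Correlation between one channel band and the matching band of the
    composite coupling channel, normalized by the two band energies."""
    num = sum(s * x for s, x in zip(s_band, x_band))
    den = math.sqrt(sum(s * s for s in s_band) * sum(x * x for x in x_band))
    return num / den if den > 0.0 else 0.0


s1 = [1.0, 0.0, 1.0, 0.0]
s2 = [0.0, 1.0, 0.0, 1.0]
x = composite_coupling_channel([s1, s2])
cc1 = normalized_cross_correlation(s1, x)
```

Because the two toy channels do not overlap, each has a normalized cross-correlation of 1/sqrt(2) with the composite channel. A channel excluded from coupling would simply be left out of the `channels` list.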
Block 1050 involves estimating spatial parameters for each channel that has been coupled into the received coupling channel. In this implementation, block 1050 involves estimating the spatial parameters based on the cross-correlation coefficients. The estimating process may involve averaging normalized cross-correlation coefficients across all of the individual channel frequency bands. The estimating process also may involve applying a scaling factor to the average of the normalized cross-correlation coefficients to obtain the estimated spatial parameters for individual channels that have been coupled into the received coupling channel. In some implementations, the scaling factor may decrease with increasing frequency.
In this example, block 1055 involves adding noise to the estimated spatial parameters. The noise may be added to model the variance of the estimated spatial parameters. The noise may be added according to a set of rules corresponding to an expected prediction of the spatial parameter across frequency bands. The rules may be based on empirical data. The empirical data may correspond to observations and/or measurements derived from a large set of audio data samples. In some implementations, the variance of the added noise may be based on the estimated spatial parameter for a frequency band, a frequency band index and/or a variance of the normalized cross-correlation coefficients.
Some implementations may involve receiving or determining tonality information regarding the first or second set of frequency coefficients. According to some such implementations, the process of block 1050 and/or 1055 may be varied according to the tonality information. For example, if the control information receiver/generator 640 of
In some implementations, the estimated spatial parameters may be estimated alphas for the received coupling channel frequency bands. Some such implementations may involve applying the alphas to audio data corresponding to the coupling channel, e.g., as part of a decorrelation process.
More detailed examples of the method 1020 will now be described. These examples are provided in the context of the E-AC-3 audio codec. However, the concepts illustrated by these examples are not limited to the context of the E-AC-3 audio codec, but instead are broadly applicable to many audio codecs.
In this example, the composite coupling channel is computed as a mixture of discrete sources:
In Equation 8, sDi represents the row vector of a decoded MDCT transform of a specific frequency range (kstart . . . kend) of channel i, with kend=KCPL, the bin index corresponding to the E-AC-3 coupling begin frequency, the lowest frequency of the received coupling channel frequency range. Here, gx represents a normalization term that does not impact the estimation process. In some implementations, gx may be set to 1.
The decision regarding the number of bins analyzed between kstart and kend may be based on a trade-off between complexity constraints and the desired accuracy of estimating alpha. In some implementations, kstart may correspond to a frequency at or above a particular threshold (e.g., 1 kHz), such that audio data in a frequency range that is relatively closer to the received coupling channel frequency range are used, in order to improve the estimation of alpha values. The frequency region (kstart . . . kend) may be divided into frequency bands. In some implementations, cross-correlation coefficients for these frequency bands may be computed as follows:
In Equation 9, sDi(l) represents that segment of sDi that corresponds to band l of the lower frequency range, and xD(l) represents the corresponding segment of xD. In some implementations, the expectation E{ } may be approximated using a simple pole-zero infinite impulse response (“IIR”) filter, e.g., as follows:
$$\hat{E}\{y\}(n) = y(n)\cdot a + \hat{E}\{y\}(n-1)\cdot(1-a) \qquad \text{(Equation 10)}$$
In Equation 10, Ê{y}(n) represents the estimate of E{y} using samples up to block n. In this example, cci(l) is only computed for those channels that are in coupling for the current block. For the purpose of smoothing out the power estimation given only real-valued MDCT coefficients, a value of a=0.2 was found to be sufficient. For transforms other than the MDCT, and specifically for complex transforms, a larger value of a may be used. In such cases, a value of a in the range of 0.2<a<0.5 would be reasonable. Some lower-complexity implementations may involve time smoothing of the computed correlation coefficient cci(l) instead of the powers and cross-correlation coefficients. Though not mathematically equivalent to estimating the numerator and denominator separately, such lower-complexity smoothing was found to provide a sufficiently accurate estimate of the cross-correlation coefficients. The particular implementation of the estimation function as a first-order IIR filter does not preclude the implementation via other schemes, such as one based on a first-in-last-out (“FILO”) buffer. In such implementations, the oldest sample in the buffer may be subtracted from the current estimate E{ }, while the newest sample may be added to the current estimate E{ }.
In some implementations, the smoothing process takes into consideration whether the coefficients sDi were in coupling for the previous block. For example, if channel i was not in coupling in the previous block, then a may be set to 1.0 for the current block, since the MDCT coefficients for the previous block would not have been included in the coupling channel. Also, the previous MDCT transform could have been coded using the E-AC-3 short block mode, which further validates setting a to 1.0 in this case.
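The recursion of Equation 10, together with the reset of a to 1.0 described above, can be illustrated with a minimal sketch. The function name, the argument names, and the choice to treat the very first block as having no history are assumptions made for illustration and are not taken from the E-AC-3 codec or from the pseudo-code of this disclosure.

```python
def smooth_expectation(samples, a=0.2, prev_in_coupling=None):
    """Approximate E{y} block by block with the recursion of Equation 10.

    samples: per-block values y(n) for one channel and band.
    prev_in_coupling: optional list of booleans; entry n is False when the
    channel was not in coupling in block n-1 (hypothetical input format).
    """
    estimate = 0.0
    history = []
    for n, y in enumerate(samples):
        coeff = a
        # Assumption: with no usable previous block, a is forced to 1.0 so
        # that the estimate restarts from the current sample.
        if n == 0 or (prev_in_coupling is not None and not prev_in_coupling[n]):
            coeff = 1.0
        estimate = y * coeff + estimate * (1.0 - coeff)
        history.append(estimate)
    return history
```

With a=0.2, a single block carries only a fifth of the weight, so the estimate reacts slowly; forcing a to 1.0 discards the stale history in one step.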
At this stage, cross-correlation coefficients between individual channels and a composite coupling channel have been determined. In the example of
In one example, using the cross-correlation coefficients for the frequency bands below KCPL (the lowest frequency of the received coupling channel frequency range), an estimate of the alphas to be used for decorrelation of MDCT coefficients above KCPL may be generated. The pseudo-code for computing the estimated alphas from the cci(l) values according to one such implementation is as follows:
A principal input to the above extrapolation process that generates alphas is CCm, which represents the mean of the correlation coefficients (cci(l)) over the current region. A “region” may be an arbitrary grouping of consecutive E-AC-3 blocks. An E-AC-3 frame could be composed of more than one region. However, in some implementations regions do not straddle frame boundaries. CCm may be computed as follows (indicated as the function MeanRegion( ) in the above pseudo-code):
In Equation 11, i represents the channel index, L represents the number of low-frequency bands (below KCPL) used for estimation, and N represents the number of blocks within the current region. Here we extend the notation cci(l) to include the block index n. The mean cross-correlation coefficient may next be extrapolated to the received coupling channel frequency range via repeated application of the following scaling operation to generate a predicted alpha value for each coupling channel frequency band:
fAlphaRho=fAlphaRho*MAPPED_VAR_RHO (Equation 12)
When applying Equation 12, fAlphaRho for the first coupling channel frequency band may be CCm(i)*MAPPED_VAR_RHO. In the pseudo-code example, the variable
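As a hedged illustration of the averaging and extrapolation steps, the following sketch averages the cross-correlation coefficients of one channel over a region and then applies the scaling of Equation 12 once per coupling channel frequency band. The 1/(N·L) normalization, the function names and the numeric value used for MAPPED_VAR_RHO in any call are assumptions; the disclosure's own pseudo-code may differ.

```python
def mean_region(cc):
    """Mean of cc[n][l] over N blocks and L low-frequency bands (one channel).

    Assumes a plain arithmetic mean with 1/(N*L) normalization.
    """
    n_blocks = len(cc)
    n_bands = len(cc[0])
    return sum(sum(row) for row in cc) / (n_blocks * n_bands)


def extrapolate_alphas(cc_mean, num_cpl_bands, mapped_var_rho):
    """Apply the Equation 12 scaling repeatedly, one step per coupling band.

    The first band receives cc_mean * mapped_var_rho, as stated in the text.
    """
    alphas = []
    f_alpha_rho = cc_mean
    for _ in range(num_cpl_bands):
        f_alpha_rho = f_alpha_rho * mapped_var_rho
        alphas.append(f_alpha_rho)
    return alphas
```

For a scaling factor below 1, the predicted alpha shrinks geometrically with band index, so higher coupling bands are predicted to be less correlated with the downmix.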
At this stage, spatial parameters (alphas in this example) have been estimated. In the example of
Based on an analysis of how the prediction error varies with frequency for a large corpus of different types of multichannel input signals, the inventors have formulated heuristic rules that control the degree of randomization that is imposed on the estimated alpha values. The estimated spatial parameters in the coupling channel frequency range (obtained by correlation calculation from lower frequencies followed by extrapolation) may eventually have the same statistics as if these parameters had been calculated directly in the coupling channel frequency range from the original signal, when all the individual channels were available without being coupled. The goal of adding noise is to impart a statistical variation similar to that which was empirically observed. In the pseudo-code above, VB represents an empirically-derived scaling term that dictates how the variance changes as a function of band index. VM represents an empirically-derived feature that is based on the prediction for alpha before the synthesized variance is applied. This accounts for the fact that the variance of prediction error is actually a function of the prediction. For instance, when the linear prediction of the alpha for a band is close to 1.0 the variance is very low. The term CCv represents a control based on the local variance of the computed cci values for the current shared block region. CCv may be computed as follows (indicated by VarRegion( ) in the above pseudo-code):
In this example, VB controls the dither variance according to the band index. VB was derived empirically by examining the variance across bands of the alpha prediction error calculated from the source. The inventors discovered that the relationship between normalized variance and the band index l may be modeled according to the following equation:
The VM parameter was derived by examining the behavior of the alpha prediction error as a function of the prediction itself. In particular, the inventors discovered through analysis of a large corpus of multichannel content that when the predicted alpha value is negative the variance of prediction error increases, with a peak at alpha=−0.59375. This implies that when the current channel under analysis is negatively correlated to the downmix xD, the estimated alpha may generally be more chaotic. Equation 14, below, models the desired behavior:
In Equation 14, q represents the quantized version of the prediction (denoted by fAlphaRho in the pseudo-code), and may be computed according to:
q=floor(fAlphaRho*128)
In this implementation, the next step is to scale the random variable w by the three factors VM, VB and CCv. The geometric mean between VM and CCv may be computed and applied as the scaling factor to the random variable. In some implementations, w may be implemented as a very large table of random numbers with a zero mean unit variance Gaussian distribution.
After the scaling process, a smoothing process may be applied. For example, the dithered estimated spatial parameters may be smoothed across time, e.g., by using a simple pole-zero or FILO smoother. The smoothing coefficient may be set to 1.0 if the previous block was not in coupling, or if the current block is the first block in a region of blocks. Accordingly, the scaled random number from the noise record w may be low-pass filtered, which was found to better match the variance of the estimated alpha values to the variance of alphas in the source. In some implementations, this smoothing process may be less aggressive (i.e., IIR with a shorter impulse response) than the smoothing used for the cci(l) values.
As noted above, the processes involved in estimating alphas and/or other spatial parameters may be performed, at least in part, by a control information receiver/generator 640 such as the one that is illustrated in
In addition to the audio data elements 245 and 220, the transient control module 655 may receive other associated audio information, such as the decorrelation information 240a and 240b, in block 1105. In this example, the decorrelation information 240a may include explicit decorrelator-specific control information. For example, the decorrelation information 240a may include explicit transient information such as that described below. The decorrelation information 240b may include information from a bitstream of a legacy audio codec. For example, the decorrelation information 240b may include time segmentation information that is available in a bitstream encoded according to the AC-3 audio codec or the E-AC-3 audio codec. For example, the decorrelation information 240b may include coupling-in-use information, block-switching information, exponent information, exponent strategy information, etc. Such information may have been received by an audio processing system in a bitstream along with audio data 220.
Block 1110 involves determining audio characteristics of the audio data. In various implementations, block 1110 involves determining transient information, e.g., by the transient control module 655. Block 1115 involves determining an amount of decorrelation for the audio data based, at least in part, on the audio characteristics. For example, block 1115 may involve determining decorrelation control information based, at least in part, on transient information.
In block 1115, the transient control module 655 of
In some implementations, block 1110 of
In some implementations, the explicit transient information may indicate a transient value corresponding to a definite transient event. Such a transient value may be a relatively high (or maximum) transient value. A high transient value may correspond to a high likelihood and/or a high severity of a transient event. For example, if possible transient values range from 0 to 1, a range of transient values between 0.9 and 1 may correspond to a definite and/or a severe transient event. However, any appropriate range of transient values may be used, e.g., 0 to 9, 1 to 100, etc.
The explicit transient information may indicate a transient value corresponding to a definite non-transient event. For example, if possible transient values range from 1 to 100, a value in the range of 1-5 may correspond to a definite non-transient event or a very mild transient event.
In some implementations, the explicit transient information may have a binary representation, e.g., either 0 or 1. For example, a value of 1 may correspond with a definite transient event. However, a value of 0 may not indicate a definite non-transient event. Instead, in some such implementations, a value of 0 may simply indicate the lack of a definite and/or a severe transient event.
However, in some implementations, the explicit transient information may include intermediate transient values between a minimum transient value (e.g., 0) and a maximum transient value (e.g., 1). An intermediate transient value may correspond to an intermediate likelihood and/or an intermediate severity of a transient event.
The decorrelation filter input control module 1125 of
Based on the decorrelation information 240a and/or 240b, the decorrelation filter input control module 1125 may sometimes determine a transient value corresponding to a definite transient event in block 1110. If so, in some implementations the decorrelation filter input control module 1125 may determine in block 1115 that a decorrelation process (and/or a decorrelation filter dithering process) should be temporarily halted. Accordingly, in block 1120 the decorrelation filter input control module 1125 may generate decorrelation signal generator control information 625e indicating that a decorrelation process (and/or a decorrelation filter dithering process) should be temporarily halted. Alternatively, or additionally, in block 1120 the soft transient calculator 1130 may generate decorrelation signal generator control information 625f, indicating that a decorrelation filter dithering process should be temporarily halted or slowed down.
In alternative implementations, block 1110 may involve receiving no explicit transient information with the audio data. However, whether or not explicit transient information is received, some implementations of method 1100 may involve detecting a transient event according to an analysis of the audio data 220. For example, in some implementations, a transient event may be detected in block 1110 even when explicit transient information does not indicate a transient event. A transient event that is determined or detected by a decoder, or a similar audio processing system, according to an analysis of the audio data 220 may be referred to herein as a “soft transient event.”
In some implementations, whether a transient value is provided as an explicit transient value or determined as a soft transient value, the transient value may be subject to an exponential decay function. For example, the exponential decay function may cause the transient value to smoothly decay from an initial value to zero over a period of time. Subjecting a transient value to an exponential decay function may prevent artifacts associated with abrupt switching.
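A minimal sketch of such a decay follows. The per-block decay factor of 0.5 and the rule that a newly detected value overrides a smaller decayed value are assumptions made for illustration; the disclosure does not specify either.

```python
def decayed_transient_track(detected, decay=0.5):
    """Return per-block transient values after exponential decay.

    detected: per-block transient values in [0, 1], explicit or soft.
    decay: hypothetical per-block decay factor in (0, 1).
    """
    state = 0.0
    track = []
    for value in detected:
        # Assumption: keep whichever is larger, the fresh detection or the
        # decayed value carried over from earlier blocks.
        state = max(value, state * decay)
        track.append(state)
    return track
```

A single detection of 1.0 followed by quiet blocks therefore fades as 1.0, 0.5, 0.25 and so on, rather than dropping to zero in one block.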
In some implementations, detecting a soft transient event may involve evaluating the likelihood and/or the severity of a transient event. Such evaluations may involve calculating a temporal power variation in the audio data 220.
The method 1150 begins with block 1152, wherein upmixed audio data in a coupling channel frequency range are received. In
Block 1156 involves computing the frequency-band-weighted logarithmic power (“WLP”) for each channel and block of the upmixed audio data. To compute the WLP, the power of each power band may be determined. These powers may be converted into logarithmic values and then averaged across the power bands. In some implementations, block 1156 may be performed according to the following expression:
WLP[ch][blk]=mean_pwr_bnd{log(P[ch][blk][pwr_bnd])} (Equation 15)
In Equation 15, WLP[ch][blk] represents the weighted logarithmic power for a channel and block, [pwr_bnd] represents a frequency band or “power band” into which the received coupling channel frequency range has been divided, and mean_pwr_bnd{log(P[ch][blk][pwr_bnd])} represents a mean of the logarithms of power across the power bands of the channel and block.
Banding may pre-emphasize the power variation in higher frequencies, for the following reasons. If the entire coupling channel frequency range were one band, then P[ch][blk][pwr_bnd] would be the arithmetic mean of the power at each frequency in the coupling channel frequency range, and the lower frequencies, which typically have higher power, would tend to swamp the value of P[ch][blk][pwr_bnd] and hence the value of log(P[ch][blk][pwr_bnd]). (In this case log(P[ch][blk][pwr_bnd]) would have the same value as mean log(P[ch][blk][pwr_bnd]), because there would be only one band.) Accordingly, the transient detection would be based to a large extent on the temporal variation in the lower frequencies. Dividing the coupling channel frequency range into, for example, a lower frequency band and a higher frequency band and then averaging the power of the two bands in the log domain is equivalent to calculating the geometric mean of the power of the lower frequencies and the power of the higher frequencies. Such a geometric mean would be closer to the power of the higher frequencies than would be an arithmetic mean. Therefore banding, determining the log (power) and then determining the mean would tend to result in a quantity that is more sensitive to temporal variation at the higher frequencies.
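The effect of banding can be checked numerically with a short sketch of Equation 15. The use of the natural logarithm, the per-band power defined as the mean of squared coefficients, the small floor that avoids log(0), and the band layout passed by the caller are all assumptions made for illustration.

```python
import math


def weighted_log_power(coeffs, band_edges, floor=1e-12):
    """Mean over power bands of the log of the per-band power (Equation 15).

    coeffs: real-valued transform coefficients of one channel and block,
            restricted to the coupling channel frequency range.
    band_edges: list of (start, end) index pairs, end exclusive (hypothetical).
    """
    logs = []
    for start, end in band_edges:
        band = coeffs[start:end]
        power = sum(c * c for c in band) / len(band)
        logs.append(math.log(max(power, floor)))
    return sum(logs) / len(logs)
```

For coefficients [10, 10, 1, 1], a single band gives a power of 50.5, dominated by the low band. Splitting into two bands gives exp(WLP) = 10, the geometric mean of 100 and 1, which lies closer to the high-band power.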
In this implementation, block 1158 involves determining an asymmetric power differential (“APD”) based on the WLP. For example, the APD may be determined as follows:
In Equation 16, dWLP[ch][blk] represents the differential weighted logarithmic power for a channel and block, and WLP[ch][blk-2] represents the weighted logarithmic power for the channel two blocks ago. The example of Equation 16 is useful for processing audio data encoded via audio codecs such as E-AC-3 and AC-3, in which there is a 50% overlap between consecutive blocks. Accordingly, the WLP of the current block is compared to the WLP two blocks ago. If there is no overlap between consecutive blocks, the WLP of the current block may be compared to the WLP of the previous block.
This example takes advantage of the possible temporal masking effect of prior blocks. Accordingly, if the WLP of the current block is greater than or equal to that of the prior block (in this example, the WLP two blocks prior), the APD is set to the actual WLP differential. However, if the WLP of the current block is less than that of the prior block, the APD is set to half of the actual WLP differential. Accordingly, the APD emphasizes increasing power and de-emphasizes decreasing power. In other implementations, a different fraction of the actual WLP differential may be used, e.g., ¼ of the actual WLP differential.
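The asymmetric treatment of rising and falling power described above can be sketched as follows. The function and parameter names are hypothetical, and returning 0.0 when fewer than `lag` earlier blocks exist is an assumption.

```python
def asymmetric_power_differential(wlp, blk, lag=2, falling_scale=0.5):
    """APD for one channel from its per-block WLP values.

    wlp: list of WLP values indexed by block.
    lag: 2 for codecs with 50% block overlap, 1 when blocks do not overlap.
    falling_scale: fraction applied to negative differentials (0.5 or 0.25
                   in the examples given in the text).
    """
    if blk < lag:
        return 0.0  # assumption: no differential without enough history
    d_wlp = wlp[blk] - wlp[blk - lag]
    # Rising power passes through unchanged; falling power is de-emphasized.
    return d_wlp if d_wlp >= 0.0 else falling_scale * d_wlp
```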
Block 1160 may involve determining a raw transient measure (“RTM”) based on the APD. In this implementation, determining the raw transient measure involves calculating a likelihood function of transient events based on an assumption that the temporal asymmetric power differential is distributed according to a Gaussian distribution:
In Equation 17, RTM[ch][blk] represents a raw transient measure for a channel and block, and SAPD represents a tuning parameter. In this example, when SAPD is increased, a relatively larger power differential will be required to produce the same value of RTM.
A transient control value, which may also be referred to herein as a “transient measure,” may be determined from the RTM in block 1162. In this example, the transient control value is determined according to Equation 18:
In Equation 18, TM[ch][blk] represents the transient measure for a channel and block, TH represents an upper threshold and TL represents a lower threshold.
If a raw transient value is less than or equal to the lower threshold TL, the transient control value is set to its minimum value, which is 0.0 in this example. In some implementations, a minimum transient control value may correspond with a definite non-transient event.
However, if a raw transient value is within the range 1166 between the lower threshold TL and the upper threshold TH, the transient control value may be scaled to an intermediate transient control value, which is between 0.0 and 1.0 in this example. The intermediate transient control value may correspond with a relative likelihood and/or a relative severity of a transient event.
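One possible reading of this thresholding is sketched below. The linear scaling between the thresholds and the saturation at 1.0 for raw values at or above TH are assumptions; the text states only the minimum value and that intermediate values lie between 0.0 and 1.0.

```python
def transient_measure(rtm, t_low, t_high):
    """Map a raw transient measure to a transient control value in [0, 1].

    t_low, t_high: lower and upper thresholds (TL and TH), with t_low < t_high.
    """
    if rtm <= t_low:
        return 0.0  # minimum transient control value
    if rtm >= t_high:
        return 1.0  # assumption: saturate at the maximum value
    # Assumption: linear interpolation inside the range between TL and TH.
    return (rtm - t_low) / (t_high - t_low)
```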
Referring again to
Transient information, whether received along with other audio data or determined by a decoder, may be used to control decorrelation processes. The transient information may include transient control values such as those described above. In some implementations, an amount of decorrelation for the audio data may be modified (e.g., reduced), based at least in part on such transient information.
As described above, such decorrelation processes may involve applying a decorrelation filter to a portion of the audio data, to produce filtered audio data, and mixing the filtered audio data with a portion of the received audio data according to a mixing ratio. Some implementations may involve controlling the mixer 215 according to transient information. For example, such implementations may involve modifying the mixing ratio based, at least in part, on transient information. Such transient information may, for example, be included in the mixer control information 645 by the mixer transient control module 1145. (See
According to some such implementations, transient control values may be used by the mixer 215 to modify alphas in order to suspend or reduce decorrelation during transient events. For example, the alphas may be modified according to the following pseudo code:
In the foregoing pseudo code, alpha[ch][bnd] represents an alpha value of a frequency band for one channel. The term decorrelationDecayArray[ch] represents an exponential decay variable that takes a value ranging from 0 to 1. In some examples, the alphas may be modified toward +/−1 during transient events. The extent of modification may be proportional to decorrelationDecayArray[ch], which would reduce the mixing weights for the decorrelation signals toward 0 and thus suspend or reduce decorrelation. The exponential decay of decorrelationDecayArray[ch] slowly restores the normal decorrelation process.
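The behavior described in this paragraph can be sketched in a few lines. This is not the pseudo code referred to above; it is one plausible reading in which each alpha moves toward +1 or −1, whichever has the same sign, by a fraction equal to the decay variable. Treating an alpha of exactly zero as positive is an assumption.

```python
def modify_alphas(alpha, decorrelation_decay):
    """Pull alphas toward +/-1 in proportion to a per-channel decay variable.

    alpha: alpha[ch][bnd] values in [-1, 1].
    decorrelation_decay: per-channel values in [0, 1]; 0 leaves alphas as-is,
    1 moves them fully to +/-1, which suspends decorrelation.
    """
    modified = []
    for ch, bands in enumerate(alpha):
        d = decorrelation_decay[ch]
        row = []
        for a in bands:
            target = 1.0 if a >= 0.0 else -1.0
            row.append(a + d * (target - a))
        modified.append(row)
    return modified
```

As the decay variable falls back toward 0 over successive blocks, the returned values converge on the unmodified alphas, which restores normal decorrelation gradually.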
In some implementations, the soft transient calculator 1130 may provide soft transient information to the spatial parameter module 665. Based at least in part on the soft transient information, the spatial parameter module 665 may select a smoother either for smoothing spatial parameters received in the bitstream or for smoothing energy and other quantities involved in spatial parameter estimation.
Some implementations may involve controlling the decorrelation signal generator 218 according to transient information. For example, such implementations may involve modifying or temporarily halting a decorrelation filter dithering process based, at least in part, on transient information. This may be advantageous because dithering the poles of the all-pass filters during transient events may cause undesired ringing artifacts. In some such implementations, the maximum stride value for dithering poles of a decorrelation filter may be modified based, at least in part, on transient information.
For example, the soft transient calculator 1130 may provide the decorrelation signal generator control information 625f to the decorrelation filter control module 405 of the decorrelation signal generator 218 (see also
For example, the maximum stride value may be multiplied by the foregoing expression when transient events are detected in any channel. The dithering process may be halted or slowed accordingly.
In some implementations, a gain may be applied to filtered audio data based, at least in part, on transient information. For example, the power of the filtered audio data may be matched with the power of the direct audio data. In some implementations, such functionality may be provided by the ducker module 1135 of
The ducker module 1135 may receive transient information, such as transient control values, from the soft transient calculator 1130. The ducker module 1135 may determine the decorrelation signal generator control information 625h according to the transient control values. The ducker module 1135 may provide the decorrelation signal generator control information 625h to the decorrelation signal generator 218. For example, the decorrelation signal generator control information 625h may include a gain value that the decorrelation signal generator 218 can apply to the decorrelation signals 227 in order to maintain the power of the filtered audio data at a level that is less than or equal to the power of the direct audio data. The ducker module 1135 may determine the decorrelation signal generator control information 625h by calculating, for each received channel in coupling, the energy per frequency band in the coupling channel frequency range.
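A minimal sketch of such a gain computation for one channel follows. It assumes that per-band energies of the direct and filtered audio data are already available and that the gain is applied to signal amplitudes, so the square root of the energy ratio is used. The function name and input layout are hypothetical.

```python
import math


def ducker_gains(direct_energy, filtered_energy):
    """Per-band gains that keep filtered power at or below direct power.

    direct_energy, filtered_energy: non-negative per-band energies for one
    channel in the coupling channel frequency range.
    """
    gains = []
    for e_direct, e_filtered in zip(direct_energy, filtered_energy):
        if e_filtered <= e_direct:
            gains.append(1.0)  # already within the limit: leave unchanged
        else:
            # Amplitude gain whose square scales the filtered energy down
            # to the direct energy.
            gains.append(math.sqrt(e_direct / e_filtered))
    return gains
```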
The ducker module 1135 may, for example, include a bank of duckers. In some such implementations, the duckers may include buffers for temporarily storing the energy per frequency band in the coupling channel frequency range determined by the ducker module 1135. A fixed delay may be applied to the filtered audio data and the same delay may be applied to the buffers.
The ducker module 1135 also may determine mixer-related information and may provide the mixer-related information to the mixer transient control module 1145. In some implementations, the ducker module 1135 may provide information for controlling the mixer 215 to modify the mixing ratio based on a gain to be applied to the filtered audio data. According to some such implementations, the ducker module 1135 may provide information for controlling the mixer 215 to suspend or reduce decorrelation during transient events. For example, the ducker module 1135 may provide the following mixer-related information:
In the foregoing pseudo code, TransCtrlFlag represents a transient control value and DecorrGain[ch][bnd] represents the gain to apply to a band of a channel of filtered audio data.
In some implementations, a power estimation smoothing window for the duckers may be based, at least in part, on transient information. For example, a shorter smoothing window may be applied when a transient event is relatively more likely or when a relatively stronger transient event is detected. A longer smoothing window may be applied when a transient event is relatively less likely, when a relatively weaker transient event is detected or when no transient event is detected. For example, the smoothing window length may be dynamically adjusted based on the transient control values such that the window length is shorter when the flag value is close to a maximum value (e.g., 1.0) and longer when the flag value is close to a minimum value (e.g., 0.0). Such implementations may help to avoid time smearing during transient events while resulting in smooth gain factors during non-transient situations.
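The dependence of window length on the transient control value can be sketched as a simple interpolation. The linear mapping and the window limits of 1 and 8 blocks are purely illustrative assumptions; the disclosure does not state specific lengths.

```python
def smoothing_window_length(transient_value, min_len=1, max_len=8):
    """Choose a power-estimation window length, in blocks.

    transient_value: transient control value, clamped to [0, 1].
    Returns max_len for a value of 0.0 and min_len for a value of 1.0.
    """
    t = min(max(transient_value, 0.0), 1.0)
    # Assumption: interpolate linearly and round to a whole number of blocks.
    return int(round(max_len - t * (max_len - min_len)))
```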
As noted above, in some implementations transient information may be determined by an encoding device.
In block 1176, audio characteristics, including transient information, are determined. For example, the transient information may be determined as described above with reference to
In some implementations, the audio characteristics determined in block 1176 may include spatial parameters, which may be determined substantially as described elsewhere herein. However, instead of calculating correlations outside of the coupling channel frequency range, the spatial parameters may be determined by calculating correlations within the coupling channel frequency range. For example, alphas for an individual channel that will be encoded with coupling may be determined by calculating correlations between transform coefficients of that channel and the coupling channel on a frequency band basis. In some implementations, the encoder may determine the spatial parameters by using complex frequency representations of the audio data.
Block 1178 involves coupling at least a portion of two or more channels of the audio data into a coupled channel. For example, frequency domain representations of the audio data for the coupled channel, which are within a coupling channel frequency range, may be combined in block 1178. In some implementations, more than one coupled channel may be formed in block 1178.
In block 1180, encoded audio data frames are formed. In this example, the encoded audio data frames include data corresponding to the coupled channel(s) and encoded transient information determined in block 1176. For example, the encoded transient information may include one or more control flags. The control flags may include a channel block switch flag, a channel out-of-coupling flag and/or a coupling-in-use flag. Block 1180 may involve determining a combination of one or more of the control flags to form encoded transient information that indicates a definite transient event, a definite non-transient event, the likelihood of a transient event or the severity of a transient event.
Whether or not formed by combining control flags, the encoded transient information may include information for controlling a decorrelation process. For example, the transient information may indicate that a decorrelation process should be temporarily halted. The transient information may indicate that an amount of decorrelation in a decorrelation process should be temporarily reduced. The transient information may indicate that a mixing ratio of a decorrelation process should be modified.
The encoded audio data frames also may include various other types of audio data, including audio data for individual channels outside the coupling channel frequency range, audio data for channels not in coupling, etc. In some implementations, the encoded audio data frames also may include spatial parameters, coupling coordinates, and/or other types of side information such as that described elsewhere herein.
In this example, the device includes an interface system 1205. The interface system 1205 may include a network interface, such as a wireless network interface. Alternatively, or additionally, the interface system 1205 may include a universal serial bus (USB) interface or another such interface.
The device 1200 includes a logic system 1210. The logic system 1210 may include a processor, such as a general purpose single- or multi-chip processor. The logic system 1210 may include a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, or combinations thereof. The logic system 1210 may be configured to control the other components of the device 1200. Although no interfaces between the components of the device 1200 are shown in
The logic system 1210 may be configured to perform various types of audio processing functionality, such as encoder and/or decoder functionality. Such encoder and/or decoder functionality may include, but is not limited to, the types of encoder and/or decoder functionality described herein. For example, the logic system 1210 may be configured to provide the decorrelator-related functionality described herein. In some such implementations, the logic system 1210 may be configured to operate (at least in part) according to software stored on one or more non-transitory media. The non-transitory media may include memory associated with the logic system 1210, such as random access memory (RAM) and/or read-only memory (ROM). The non-transitory media may include memory of the memory system 1215. The memory system 1215 may include one or more suitable types of non-transitory storage media, such as flash memory, a hard drive, etc.
For example, the logic system 1210 may be configured to receive frames of encoded audio data via the interface system 1205 and to decode the encoded audio data according to the methods described herein. Alternatively, or additionally, the logic system 1210 may be configured to receive frames of encoded audio data via an interface between the memory system 1215 and the logic system 1210. The logic system 1210 may be configured to control the speaker(s) 1220 according to decoded audio data. In some implementations, the logic system 1210 may be configured to encode audio data according to conventional encoding methods and/or according to encoding methods described herein. The logic system 1210 may be configured to receive such audio data via the microphone 1225, via the interface system 1205, etc.
The display system 1230 may include one or more suitable types of display, depending on the manifestation of the device 1200. For example, the display system 1230 may include a liquid crystal display, a plasma display, a bistable display, etc.
The user input system 1235 may include one or more devices configured to accept input from a user. In some implementations, the user input system 1235 may include a touch screen that overlays a display of the display system 1230. The user input system 1235 may include buttons, a keyboard, switches, etc. In some implementations, the user input system 1235 may include the microphone 1225: a user may provide voice commands for the device 1200 via the microphone 1225. The logic system 1210 may be configured for speech recognition and for controlling at least some operations of the device 1200 according to such voice commands.
The power system 1240 may include one or more suitable energy storage devices, such as a nickel-cadmium battery or a lithium-ion battery. The power system 1240 may be configured to receive power from an electrical outlet.
Various modifications to the implementations described in this disclosure may be readily apparent to those having ordinary skill in the art. The general principles defined herein may be applied to other implementations without departing from the spirit or scope of this disclosure. For example, while various implementations have been described in terms of Dolby Digital and Dolby Digital Plus, the methods described herein may be implemented in conjunction with other audio codecs. Thus, the claims are not intended to be limited to the implementations shown herein, but are to be accorded the widest scope consistent with this disclosure, the principles and the novel features disclosed herein.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2014/012457 | 1/22/2014 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2014/126683 | 8/21/2014 | WO | A |
Number | Date | Country |
---|---|---|
20160005413 A1 | Jan 2016 | US |
Number | Date | Country |
---|---|---|
61764869 | Feb 2013 | US |