This disclosure relates to digital processing or coding of two or more audio channels of a sound program, for bit rate reduction.
Two-channel stereo is an audio format that conveys a stereo “image” to the listener. The image is the perceptual product evoked by similarities between the audio signals in the two channels. Several methods have been applied to take advantage of these signal similarities for bit rate reduction. From a signal processing point of view, the similarities are associated with redundant signal components. Furthermore, the limited ability of human listeners to perceive all details of the image can be exploited, thereby achieving further bit rate reduction. For instance, with Intensity Stereo coding, only a sum channel of the left and right channels is transmitted along with a panning value, which is used at the receiver to pan the mono image of the sum signal back to the position of the original image. If the original stereo channels are highly correlated, then a strong bit rate reduction is possible. In another technique, called Sum-Difference coding, it is possible to fully reconstruct the stereo channels because the difference signal is transmitted in addition to the sum signal. Sum-Difference coding is also referred to as Mid-Side coding.
One aspect of the disclosure here is a new method for stereo and multichannel coding in which i) a single selected channel or a sum of two or more channels, ii) one or more residual signals, and iii) one or more parameters, are transmitted to a decoder side process that uses the parameters to undo the coding, to recover the audio channels of the sound program. The method may achieve bit rate reduction even though the two or more input audio channels differ by delay (time delay) and gain. It may achieve similar performance (bit rate reduction) as Sum-Difference coding when the channels are identical, and similar performance as Intensity Stereo coding for stereo signals which only differ by a gain factor. Other aspects are also described.
The above summary does not include an exhaustive list of all aspects of the present disclosure. It is contemplated that the disclosure includes all systems and methods that can be practiced from all suitable combinations of the various aspects summarized above, as well as those disclosed in the Detailed Description below and particularly pointed out in the Claims section. Such combinations may have advantages not specifically recited in the above summary.
Several aspects of the disclosure here are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. In the interest of conciseness and reducing the total number of figures, a given figure may be used to illustrate the features of more than one aspect of the disclosure, and not all elements in the figure may be required for a given aspect.
Several aspects of the disclosure are now explained with reference to the figures in the drawings. It should be noted that references to “an” or “one” aspect in this disclosure are not necessarily to the same aspect, and they mean at least one. Whenever the shapes, relative positions and other aspects of the parts described are not explicitly defined, the scope of the invention is not limited only to the parts shown, which are meant merely for the purpose of illustration. Also, while numerous details are set forth, it is understood that some aspects of the disclosure may be practiced without these details. In other instances, well-known circuits, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description. Note that all of the operations described below which are part of an encoder-side method or a decoder-side method may be performed by a suitable, programmed digital computing system, for example a server, a computer workstation, or a consumer electronics product such as a television, a set top box, or a digital media player. In such systems, one or more digital processors (generically referred to here as “a” processor) are executing instructions stored in a machine-readable medium such as solid state memory, to perform the encoding and decoding methods described below.
A diagram of Sum-Difference (SD) Coding, also known as Mid-Side (MS) Coding, is shown in
However, the efficiency of conventional SD coding for highly correlated signals is drastically reduced when the two input audio channels deviate by even only a small time-delay or level difference. For instance, consider the arrangement in
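The efficiency loss can be illustrated numerically. This is a sketch with an assumed test signal (not from the disclosure): two otherwise identical channels, one delayed by a single sample, still produce a large difference-signal energy.

```python
import math

# A 1-sample delay between two otherwise identical channels makes the
# Sum-Difference residual large, even though the channels remain highly
# similar perceptually.
N = 256
x = [math.sin(2 * math.pi * 8 * n / N) for n in range(N)]

aligned = x
delayed = [0.0] + x[:-1]          # same signal, delayed by one sample

def diff_energy(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

e_identical = diff_energy(x, aligned)   # zero: SD coding is maximally efficient
e_delayed = diff_energy(x, delayed)     # substantial residual energy remains
```

The residual energy for the delayed case is far from zero, which is why plain SD coding loses its advantage for time-shifted channels.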
One aspect of the disclosure here is a new, Channel-Aligned Coding (CAC) method that may have better efficiency than conventional SD coding. The CAC method is based on aligning one input channel to the other before the difference signal is calculated.
In one aspect, the Channel-Aligned Coding (CAC) method may be as shown in
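The alignment step can be sketched as follows. This is a minimal illustration under assumptions (an exhaustive delay search and a least-squares gain; the disclosure's exact parameter estimation is not specified here): find the delay d and gain g that minimize the energy of the residual B − g·delay(A, d).

```python
# Channel-Aligned Coding sketch: align channel A to channel B by a delay
# and a gain before taking the difference (residual) signal.

def delay(x, d):
    return [0.0] * d + x[: len(x) - d]

def best_alignment(A, B, max_delay=4):
    best = None
    for d in range(max_delay + 1):
        Ad = delay(A, d)
        num = sum(a * b for a, b in zip(Ad, B))
        den = sum(a * a for a in Ad)
        g = num / den if den else 0.0   # least-squares gain for this delay
        e = sum((b - g * a) ** 2 for a, b in zip(Ad, B))
        if best is None or e < best[0]:
            best = (e, d, g)
    return best  # (residual energy, delay, gain)

A = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
B = [0.0, 0.0, 0.5, 1.0, 1.5, 2.0]    # B is 0.5 * A delayed by one sample

e, d, g = best_alignment(A, B)
```

For this constructed input, the search recovers the delay of one sample and the gain of 0.5, and the residual energy drops to zero, so only A and the two parameters need to be transmitted.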
Modern perceptual audio codecs take advantage of coding the audio signal in a sub-band domain. For example, the modified discrete cosine transform (MDCT) domain is used in many recent audio codecs, such as MPEG-4 AAC, to represent each audio channel, and the coding process is applied to the sub-band signals of each channel. The following description is based on the MDCT representation, but it is also applicable to other filter banks and transforms.
CAC coding in sub-bands has several advantages. The codec can selectively apply CAC only to those bands that can be more efficiently encoded by CAC, rather than by other coding methods. Furthermore, the side information, which contains a gain parameter and in some cases also a delay parameter (used on the decoder side to control the CAC method), can be shared across several or all sub-bands, which reduces the side information bit rate. In one aspect of the disclosure here, there are two parameters, delay and gain, which are chosen to be transmitted as side information because they are likely to be consistent across several sub-bands.
The bitrate of the side information may be reduced by quantizing the parameters that it contains.
The side information bit rate for the parameters can be further reduced by entropy coding. For example, differential Huffman coding can be applied to the parameters (before transmitting them in the side information.) When implemented in sub-band domain, the parameter differences can be calculated based on neighboring sub-bands or based on the same sub-band in subsequent audio frames, for example.
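The differential step can be sketched as follows (illustrative helper names; the Huffman stage itself is omitted). Deltas between neighboring sub-bands concentrate the symbol statistics around zero, which is what makes the subsequent entropy coding effective.

```python
# Differential coding of parameter indices across sub-bands: transmit the
# first index and then only the differences to the previous sub-band.

def to_deltas(indices):
    deltas = [indices[0]]
    deltas += [b - a for a, b in zip(indices, indices[1:])]
    return deltas

def from_deltas(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

idx = [7, 7, 8, 8, 8, 6]      # per-sub-band parameter indices
d = to_deltas(idx)            # mostly zeros: cheap to entropy code
```

The same scheme applies across subsequent audio frames instead of neighboring sub-bands, as noted above.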
The CAC method described above can be extended to multi-channel signals (more than two audio channels), as shown in
For some multichannel audio signals, it can be advantageous to divide the total number of channels into channel groups and to apply CAC to each group independently. For example, if the rear channels of a 5.1 surround signal have only a small similarity to the front channels, the rear channels can be treated as an independent channel group and one of the rear channels will be aligned to the other rear channel to minimize the difference (residual) signal energy.
It is not necessary to always align the same channel to the others. Given that the digital signal processing here is on a per frame or window basis (e.g., where each frame or window contains the samples of a digital audio signal that span a few milliseconds or a few tens of milliseconds), which may also be on a per sub-band basis within each frame or window, the roles of the channels can be switched dynamically from one audio frame to a subsequent audio frame. For example, for a stereo signal with L and R channels, in frame n, A=L and B=R while in frame n+1, A=R and B=L. Also, CAC may be applied only to selected audio frames and sub-bands where it is beneficial depending on the audio signal.
Channel-Aligned Sum-Difference Coding
When comparing SD coding as shown in
SD coding is most efficient when L and R are similar. This is the case when the cross-correlation between L and R is exceedingly high and the stereo image is very narrow and focused, like a point source. For such a signal, uncorrelated noise may be easier to hear due to the spatial unmasking effect because the noise is spread across a wider angle than the signal. Therefore, it may be advantageous to use a quantization method that generates correlated noise between the two channels, as is possible for SD coding as shown below.
To combine the advantages of correlated quantization noise and aligned-channel coding, the basic SD coding system of
In mathematical terms, the sum and residual signals are calculated by:
S=0.5(A+B+[A−align(A)])=0.5(2A+B−align(A))
R=0.5(B−A+[A−align(A)])=0.5(B−align(A))
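The sum and residual equations can be checked numerically. This sketch assumes a gain-only alignment align(A)=g·A and illustrative signal values (not from the disclosure); the decoder recovers A′=S−R and then B′=2R+align(A′).

```python
# Numeric check of the channel-aligned SD equations, with a gain-only
# alignment align(A) = g*A.

g = 0.8
A = [1.0, -2.0, 4.0]
B = [g * a for a in A]            # B differs from A only by the gain g

align_A = [g * a for a in A]
S = [0.5 * (2 * a + b - al) for a, b, al in zip(A, B, align_A)]
R = [0.5 * (b - al) for b, al in zip(B, align_A)]   # residual vanishes here

# Decoder side: A' = S - R, then B' = 2*R + align(A')
A_dec = [s - r for s, r in zip(S, R)]
B_dec = [2 * r + g * a for r, a in zip(R, A_dec)]
```

Because B is exactly the gain-aligned copy of A in this example, the residual is zero and both channels are recovered from S and the gain parameter alone.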
The alignment parameters to minimize the side signal energy are identical to the ones derived for the L/R coding case. In other words:
The quantization noise correlation can be approximated by assuming that we can replace each quantizer by an independent additive noise source, NS for the sum signal, and NR for the residual signal. This is shown in
ρA′B′(NS)=1
ρA′B′(NR)=−1
The overall quantization noise correlation can be controlled by adjusting the relative noise levels of the quantizers, for example by using different quantization step sizes.
For channel-aligned SD coding of
For the generic case, the normalized cross-correlation (NCC) of the quantization noise components in the output channels is written as:
ρA′B′(NS)=NCC(NS,align(NS))
ρA′B′(NR)=NCC(NR,align(NR)−2NR)
For the special case when B==A, the optimum alignment is align(A)==A, which means that the channel-aligned SD system behaves identically to the basic SD system. In that case the quantization noise correlation is:
ρA′B′(NS)=NCC(NS,NS)=1
ρA′B′(NR)=NCC(NR,−NR)=−1
For the special case where B==gA, with the constant gain g, the optimum alignment is align(A)==gA. In that case the quantization noise correlation is (for g<2):
ρA′B′(NS)=NCC(NS,gNS)=1
ρA′B′(NR)=NCC(NR,(g−2)NR)=−1
These cases illustrate that the noise component of NS in the output signal closely approaches the cross-correlation and panning of the signal, which is advantageous in terms of avoiding spatial unmasking. The noise component of NR in the output signal usually has negative cross-correlation, but the correlation can be positive if the gain is larger than 2 or when a nonzero delay results in a phase inversion. The noise level of NR can be reduced relative to NS to avoid spatial unmasking.
In addition to the negative cross-correlation of the NR related noise, unmasking may be caused when g&lt;1 because there is more NR related noise energy located on the opposite side of the sound source. For example, for g=0.5 the sound source is expected to be located close to the location of A since B is approximated as B=0.5A. However, the noise NR is panned the opposite way, i.e., more of its energy appears in the B channel, on the opposite side of the sound source.
Simplified Channel-Aligned Sum-Difference Coding
Many audio productions contain mono audio objects that are placed into the stereo image by panning (a common technique used in mixing on digital audio workstations). When applied to a single object, this technique results in a certain gain and zero delay between the channels. For stereo signals that contain such panned objects, a simplified alignment method with less complexity can therefore achieve good performance. In this case the alignment block uses a delay of zero, so that it can be simplified to just a multiplier for the gain factor. This is shown in
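The panned-object case can be illustrated as follows. This sketch assumes constant-power panning gains and a least-squares estimate of the alignment gain (illustrative, not the disclosure's exact procedure): with zero delay, a gain-only alignment drives the residual to zero.

```python
import math

# A mono object panned into the stereo image: L and R differ only by a
# constant gain, so a zero-delay, gain-only alignment suffices.

x = [0.3, -1.2, 0.7, 2.0]          # mono object
theta = math.pi / 8                # panning angle
L = [math.cos(theta) * s for s in x]
R = [math.sin(theta) * s for s in x]

# Least-squares gain aligning L to R; equals tan(theta) for this signal.
g = sum(l * r for l, r in zip(L, R)) / sum(l * l for l in L)
residual = [r - g * l for l, r in zip(L, R)]   # numerically zero
```

Since the residual vanishes, only one channel and the single gain factor carry all the information, which is why the simplified (delay-free) alignment performs well on such material.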
Channel-Aligned Coding For Multichannel Audio
The channel-aligned SD coding approach can be extended to audio formats with more than two channels. Since the core SD structure has two input channels and transmits a sum and residual channel, it can be extended for multichannel signals by cascading multiple SD structures. An example for four channels is shown in
Other topologies are possible that also result in a single transmitted sum channel and residual channels. This approach can be applied to any number of channels from a multi-channel signal. Since only a single sum channel is transmitted that has a comparable bitrate of a regular audio channel, we expect a significant bitrate reduction because each residual channel is expected to consume less bitrate than a regular audio channel for highly correlated input channels.
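One possible cascade topology can be sketched as follows; this particular pairing of channels is an assumption for illustration (the figure referenced above is not reproduced here). Pair-wise SD stages combine four channels into a single sum channel plus three residual channels, and the decoder inverts the stages in reverse order.

```python
# Cascaded SD structures for four channels: two first-stage SD pairs,
# then one second-stage SD pair on the two intermediate sums.

def sd(a, b):
    return 0.5 * (a + b), 0.5 * (a - b)

def sd_inv(s, r):
    return s + r, s - r

def encode4(c1, c2, c3, c4):
    s12, r12 = sd(c1, c2)
    s34, r34 = sd(c3, c4)
    s, r_top = sd(s12, s34)
    return s, (r_top, r12, r34)   # one sum channel, three residuals

def decode4(s, residuals):
    r_top, r12, r34 = residuals
    s12, s34 = sd_inv(s, r_top)
    c1, c2 = sd_inv(s12, r12)
    c3, c4 = sd_inv(s34, r34)
    return c1, c2, c3, c4

frame = (1.0, 1.5, -0.5, -0.25)   # one sample per channel
s, res = encode4(*frame)
out = decode4(s, res)
```

For highly correlated channels, each residual carries far less energy than a full channel, which is the source of the expected bit rate reduction.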
Channel-Aligned Coding Based On Adaptive Mixing Matrix
As mentioned above, bitrate reduction and audio quality optimization approaches may include the following strategies:
Both approaches are considered above, by combining CAC with L/R or M/S coding. Here we propose a combination of CAC with an adaptive mixing matrix, where the CAC does not include time-delay compensation.
The matrix coefficients for a 2×2 mixing matrix M may be defined as

M=[a b; c d] (1)

where the first row of M is [a b] and the second row is [c d].
An encoder-side process computes the mixing matrix coefficients so that the energy of R, the residual signal, is reduced or even minimized. In such a process, a single parameter, for example a gain parameter g, is sufficient for a complementary decoder-side process to compute the inverse of the mixing matrix. The inverse matrix may be defined as

M⁻¹=[d −b; −c a] (2)
where one can assume that the determinant |M|=ad−bc=1. The vector notation of the signal pairs in
X=[A B] (3)
X′=[A′ B′] (4)
Y=[S R] (5)
Y′=[S′ R′] (6)
Each vector contains a pair of samples. The samples can represent the time domain signal or frequency domain signal (sub-band sample), such as an MDCT sub-band sample. With the vector notation, the matrix multiplication operations can be written as:
X·M=Y and (7)
Y′·M⁻¹=X′ (8)
To minimize the energy of R using the gain parameter g, one can use the same approach as in the previous sections to compute the residual as the difference of the gain-aligned channels:
R=−gA+B (9)
According to (7), R is calculated from the input signal X by
R=bA+dB (10)
By comparing the coefficients in (9) and (10), two matrix coefficients are determined:
b=−g and d=1 (11)
Using (8) to compute A′ and B′, one can eliminate one more variable by using (9) and (11):
Comparison of the coefficients in (13) and (15) results in
a=1−cg
At this point, the matrix coefficient c is the only remaining free parameter. It is used to minimize the quantization noise energy that originates from the residual signal quantization (NR). Two quantization noise sources NS and NR can be used to model the quantizers Q in
NY=[NS NR] (16)

NX=[NA NB] (17)

NY·M⁻¹=NX (18)
Using (18) results in the following expressions for the output noise signals:
NA=NS−cNR (19)

NB=gNS+(1−cg)NR (20)
The noise energy originating from NR is therefore:
ER=(c²+(1−cg)²)NR² (21)
The minimum energy is reached for any c that fulfills
(c²+(1−cg)²)→min. (22)
The solution is

c=g/(1+g²) (23)
hence, the minimum energy is

ER,min=NR²/(1+g²) (24)
With that, the output quantization noise is

NA=NS−(g/(1+g²))NR

NB=gNS+(1/(1+g²))NR
With (23) and a=1−cg, we obtain the matrix coefficient

a=1/(1+g²)
In summary, the adaptive matrix and its inverse are

M=[1/(1+g²) −g; g/(1+g²) 1] (25)

M⁻¹=[1 g; −g/(1+g²) 1/(1+g²)] (26)
It can be shown that |M|==1. It is interesting to note that the adaptive matrix is equivalent to L/R coding if g=0 and it is equivalent to M/S coding if g=1.
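The claims in this section can be verified numerically. This sketch uses the coefficients b=−g and d=1 from (11) together with the energy-minimizing choice c=g/(1+g²) and a=1−cg derived above; the test values of g, A, and B are illustrative.

```python
# Numeric check of the adaptive mixing matrix: |M| = 1, perfect
# reconstruction through the forward/inverse matrices, and the L/R (g=0)
# and M/S-like (g=1) special cases.

def matrix(g):
    c = g / (1 + g * g)
    a = 1 - c * g
    return a, -g, c, 1.0                    # coefficients a, b, c, d

def encode(A, B, g):
    a, b, c, d = matrix(g)
    return a * A + c * B, b * A + d * B     # S, R  (row-vector X.M = Y)

def decode(S, R, g):
    a, b, c, d = matrix(g)                  # inverse is [d -b; -c a] / |M|
    return d * S - c * R, -b * S + a * R    # A', B'

g = 0.7
a, b, c, d = matrix(g)
det = a * d - b * c                         # determinant, should be 1

A, B = 1.25, -0.5
S, R = encode(A, B, g)
A2, B2 = decode(S, R, g)
```

At g=0 the transform passes A and B through unchanged (L/R coding), and at g=1 it produces the half-sum and the difference (M/S-like behavior), consistent with the observation above.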
Given the matrix coefficients, the stepwise processing of the channel signals can be written in scalar notation:
To limit the range of the gain so that |g|≤1, the L/R input channels can be swapped (when mapped to A/B) if necessary. A limited range of g is advantageous because it reduces the range that needs to be considered for parameter tuning of a codec to achieve the best bit rate versus quality tradeoff.
A comparison of the output noise level gain for each channel for noise that originates from the residual signal quantizer (NR) can be plotted based on the noise analysis for the enhanced SD coding above and the matrix-based coding in (25) and (26). For this comparison, the residual channel signal may be normalized to
for all coding methods. The comparison shows that the noise gain of the matrix-based coding is significantly lower.
Normalized Adaptive Mixing Matrix
In one aspect, the adaptive mixing matrix introduced above can be normalized such that the forward and inverse matrix are identical:
When applying the channel-swapping method similarly as described above to achieve |g|≤1, the symmetric matrices result in a similar energy ES,R for the sum and residual signals compared with the energy EA,B of the input signals.
As was suggested earlier, g may be computed on a per audio frame basis, and on a per sub-band basis in the case of sub-band domain (as compared to time domain.) For example, the level difference between the two input audio channels (in the same frame), is measured by an encoder-side process, e.g., as a ratio, and this level difference may be used as (or may become a good estimate for) g; g increases as the level difference increases. The g values can be encoded as described below (for further bitrate reduction.)
It can be shown that the normalized matrix method is a superset of traditional stereo coding techniques, as summarized in Table 1 below.
Coding of CAC Parameters
The normalized mixing matrix is determined by a single coded gain parameter gc.
The channel swapping can be controlled depending on the value of gc. For example, as described above, the channels are swapped when |gc|&gt;1. Correspondingly, the gain used to determine the coefficients of the matrix is g=gc when |gc|≤1, and g=1/gc (with the channels swapped) when |gc|&gt;1.
To reduce the bitrate, the gain values gc can be quantized, for example by using a logarithmic scale which corresponds to uniform intervals on a loudness scale. The quantized values can then be encoded to further reduce the bitrate. For example, entropy coding can be used to take advantage of the statistics of the coded value. More frequently occurring values are coded with fewer bits than others; this is known as variable length coding. A common technique for entropy coding is Huffman coding. To further reduce the bitrate, run length coding can be applied, which encodes the number of repeated values instead of encoding the same value multiple times in a sequence. Run length coding can take advantage of the properties of CAC with respect to the expectation that gain values across sub-bands are similar or equal for a particular sound source.
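The quantization and run-length steps can be sketched as follows. The step size (in dB) and the helper names are assumptions for illustration; the Huffman stage that would follow is omitted.

```python
import math

# Quantize gains on a logarithmic scale (uniform steps in dB), then
# run-length encode repeated indices across sub-bands.

STEP_DB = 1.5   # hypothetical quantizer step in dB

def quantize_gain(g):
    return round(20.0 * math.log10(g) / STEP_DB)

def dequantize_gain(i):
    return 10.0 ** (i * STEP_DB / 20.0)

def run_length_encode(indices):
    runs = []
    for i in indices:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return [(i, n) for i, n in runs]

band_gains = [0.84, 0.85, 0.83, 0.84, 0.42, 0.42]   # per-sub-band gains
idx = [quantize_gain(g) for g in band_gains]
runs = run_length_encode(idx)
```

Similar gains in neighboring sub-bands collapse to the same index, so a few (index, count) pairs replace the whole list, exactly the redundancy run length coding exploits.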
Table 2 shows an example bitstream syntax for the CAC gain parameter encoding, for a single audio frame, which makes up an encoded audio bitstream that is being transmitted from the encoder side to the decoder side. The payload cacGain( ) may be present in every frame of the encoded audio bitstream, among other payloads, and it controls the application of the inverse CAC adaptive matrix in the decoder.
The decoder may be configured with a constant number of sub-bands for each channel, and a CAC mixing matrix can be applied to each sub-band with an individual (respective) CAC gain parameter. As described above, in this example it is the index of the quantized gain parameter, cacGainIndex, that is transmitted to the decoder side, not the actual gain parameter values. The index is encoded using, for example, Huffman coding. The decoder has stored therein a predefined table like Table 3 which contains a list of gain parameter values, e.g., between 10 and 20 different values, and their respective index values. The run length repeatCount is also Huffman encoded. Starting from the lowest sub-band, the run length indicates in how many additional sub-bands the same cacGainIndex is applied. For the next sub-band, the next cacGainIndex is applied and repeated in repeatCount sub-bands, and so on. The last value of repeatCount for the frame is kRepeatAllRemaining, which indicates that the last cacGainIndex is used for all remaining sub-bands. As an example, consider the case where there are a total of 30 sub-bands. If the decoder process receives cacGainIndex=4, repeatCount=10, cacGainIndex=5, repeatCount=43, then it will set the gains of the first 11 bands according to the cacGainIndex of 4, and the remaining 19 bands will be set to have a cacGainIndex of 5.
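The decoder-side run-length expansion described above can be sketched as follows. The numeric value of the kRepeatAllRemaining sentinel (43 here, taken from the example) is an assumption about the bitstream constants, not something the text specifies.

```python
# Expand (cacGainIndex, repeatCount) pairs into one gain index per
# sub-band. Each index is applied to one sub-band plus repeatCount
# repeats; the sentinel fills all remaining sub-bands.

K_REPEAT_ALL_REMAINING = 43   # assumed sentinel value
NUM_SUBBANDS = 30

def expand_gain_indices(pairs, num_subbands=NUM_SUBBANDS):
    out = []
    for gain_index, repeat in pairs:
        if repeat == K_REPEAT_ALL_REMAINING:
            out.extend([gain_index] * (num_subbands - len(out)))
        else:
            out.extend([gain_index] * (1 + repeat))
    return out

# The example from the text: first 11 bands use index 4, remaining 19 use 5.
bands = expand_gain_indices([(4, 10), (5, 43)])
```

Each expanded index is then looked up in the predefined gain table (Table 3) to obtain the actual gain used by the inverse CAC matrix for that sub-band.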
For a specific implementation, Table 3 is used for the coded gain parameter gc. For indices i>17, the gain parameter is gc(i)=−gc(i−17).
While certain aspects have been described and shown in the accompanying drawings, it is to be understood that such are merely illustrative of and not restrictive on the broad invention, and that the invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those of ordinary skill in the art. The description is to be regarded as illustrative instead of limiting.
Number | Date | Country
---|---|---
63332199 | Apr 2022 | US