1. Field of the Invention
This invention relates to lossless audio codecs and more specifically to a lossless multi-channel audio codec using adaptive segmentation with random access point (RAP) capability and multiple prediction parameter set (MPPS) capability.
2. Description of the Related Art
A number of low bit-rate lossy audio coding systems are currently in use in a wide range of consumer and professional audio playback products and services. For example, the Dolby AC3 (Dolby Digital) audio coding system is a world-wide standard for encoding stereo and 5.1 channel audio sound tracks for Laser Disc, NTSC coded DVD video, and ATV, using bit rates up to 640 kbit/s. The MPEG I and MPEG II audio coding standards are widely used for stereo and multi-channel sound track encoding for PAL encoded DVD video, terrestrial digital radio broadcasting in Europe and satellite broadcasting in the US, at bit rates up to 768 kbit/s. The DTS (Digital Theater Systems) Coherent Acoustics audio coding system is frequently used for studio quality 5.1 channel audio sound tracks for Compact Disc, DVD video, satellite broadcast in Europe and Laser Disc, at bit rates up to 1536 kbit/s.
Recently, many consumers have shown interest in so-called “lossless” codecs. “Lossless” codecs rely on algorithms which compress data without discarding any information and produce a decoded signal which is identical to the (digitized) source signal. This performance comes at a cost: such codecs typically require more bandwidth than lossy codecs, and compress the data to a lesser degree.
Framing 10 is introduced to provide for editability, since the sheer volume of data prohibits repetitive decompression of the entire signal preceding the region to be edited. The audio signal is divided into independent frames of equal time duration. This duration should not be too short, since significant overhead may result from the header that is prefixed to each frame. Conversely, the frame duration should not be too long, since this would limit the temporal adaptivity and would make editing more difficult. In many applications, the frame size is constrained by the peak bit rate of the media on which the audio is transferred, the buffering capacity of the decoder and the desirability of having each frame be independently decodable.
Intra-channel decorrelation 12 removes redundancy by decorrelating the audio samples in each channel within a frame. Most algorithms remove redundancy by some type of linear predictive modeling of the signal. In this approach, a linear predictor is applied to the audio samples in each frame, resulting in a sequence of prediction error samples. A second, less common, approach is to obtain a low bit-rate quantized or lossy representation of the signal, and then losslessly compress the difference between the lossy version and the original version. Entropy coding 14 removes redundancy from the residual signal without losing any information. Typical methods include Huffman coding, run length coding and Rice coding. The output is a compressed signal that can be losslessly reconstructed.
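As an illustration of the entropy coding stage, the following is a minimal sketch of Rice coding applied to signed prediction residuals. The function names, the signed-to-unsigned mapping and the use of a character string as the bit container are assumptions made for readability; they are not taken from any particular codec.

```python
def zigzag(v):
    """Map a signed residual to a non-negative integer (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return 2 * v if v >= 0 else -2 * v - 1


def unzigzag(u):
    """Inverse of zigzag()."""
    return u >> 1 if u % 2 == 0 else -((u + 1) >> 1)


def rice_encode(values, k):
    """Encode each value as a unary quotient, a '0' stop bit and a k-bit remainder."""
    out = []
    for v in values:
        u = zigzag(v)
        q, r = u >> k, u & ((1 << k) - 1)
        out.append("1" * q + "0")
        if k:
            out.append(format(r, "0%db" % k))
    return "".join(out)


def rice_decode(bits, k, count):
    """Decode `count` values from a string of '0'/'1' characters."""
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1  # skip the stop bit
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append(unzigzag((q << k) | r))
    return values
```

Small residuals produce short code words, which is why a good predictor directly reduces the encoded payload.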
The existing DVD specification and the preliminary HD DVD specification set a hard limit on the size of one data access unit, which represents a part of the audio stream that once extracted can be fully decoded and the reconstructed audio samples sent to the output buffers. What this means for a lossless stream is that the amount of time that each access unit can represent has to be small enough that, in the worst case of peak bit rate, the encoded payload does not exceed the hard limit. The time duration must also be reduced for increased sampling rates and increased numbers of channels, which increase the peak bit rate.
To ensure compatibility, these existing coders will have to set the duration of an entire frame to be short enough to not exceed the hard limit in a worst case channel/sampling frequency/bit width configuration. In most configurations, this will be overkill and may seriously degrade compression performance. Furthermore, this worst case approach does not scale well with additional channels.
The present invention provides an audio codec that generates a lossless variable bit rate (VBR) bitstream with random access point (RAP) capability to initiate lossless decoding at a specified segment within a frame and/or multiple prediction parameter set (MPPS) capability partitioned to mitigate transient effects.
This is accomplished with an adaptive segmentation technique that determines segment start points to satisfy boundary constraints on segments imposed by the existence of a desired RAP and/or one or more transients in the frame and selects an optimum segment duration in each frame to reduce encoded frame payload subject to an encoded segment payload constraint. In general, the boundary constraints specify that a desired RAP or transient must lie within a certain number of analysis blocks of the start of a segment. In an exemplary embodiment in which segments within a frame are of the same duration and a power of two of the analysis block duration, a maximum segment duration is determined to ensure the desired conditions are met. RAP and MPPS are particularly applicable to improve overall performance for longer frame durations.
In an exemplary embodiment, a lossless VBR audio bitstream is encoded with RAPs (RAP segments) aligned to within a specified tolerance of desired RAPs provided in an encoder timing code. Each frame is blocked into a sequence of analysis blocks with each segment having a duration equal to that of one or more analysis blocks. In each successive frame up to one RAP analysis block is determined from the timing code. The location of the RAP analysis block and a constraint that the RAP analysis block must lie within M analysis blocks of the start of the RAP segment fixes a start of a RAP segment. Prediction parameters are determined for the frame, two sets of parameters (per channel) if MPPS is enabled and a transient is detected in a channel. The samples in the audio frame are compressed with the prediction being disabled for the first samples up to the prediction order following the start of the RAP segment. Adaptive segmentation is employed on the residual samples to determine a segment duration and entropy coding parameters for each segment to minimize the encoded frame payload subject to the fixed start of the RAP segment and the encoded segment payload constraints. RAP parameters indicating the existence and location of the RAP segment and navigation data are packed into the header. In response to a navigation command to initiate playback such as user selection of a scene or surfing, the decoder unpacks the header of the next frame in the bitstream to read the RAP parameters until a frame including a RAP segment is detected. The decoder extracts segment duration and navigation data to navigate to the start of the RAP segment. The decoder disables prediction for the first samples until a prediction history is reconstructed and then decodes the remainder of the segments and subsequent frames in order, disabling the predictor each time a RAP segment is encountered. 
This construct allows a decoder to initiate decoding at or very near encoder-specified RAPs with a sub-frame resolution. This is particularly useful with longer frame durations when trying to sync audio playback to a video timing code that specifies RAPs at, for example, the beginning of chapters.
In another exemplary embodiment, a lossless VBR audio bitstream is encoded with MPPSs partitioned so that detected transients are located within the first L analysis blocks of a segment in their respective channels. In each successive frame up to one transient per channel per channel set and its location within the frame is detected. Prediction parameters are determined for each partition considering the segment start point(s) imposed by the transient(s). The samples in each partition are compressed with the respective parameter set. Adaptive segmentation is employed on the residual samples to determine a segment duration and entropy coding parameters for each segment to minimize the encoded frame payload subject to the segment start constraints imposed by the transient(s) (and RAP) and the encoded segment payload constraints. Transient parameters indicating the existence and location of the first transient segment (per channel) and navigation data are packed into the header. A decoder unpacks the frame header to extract the transient parameters and additional set of prediction parameters. For each channel in a channel set, the decoder uses the first set of prediction parameters until the transient segment is encountered and switches to the second set for the remainder of the segment. Although the segmentation of the frame is the same across channels and multiple channel sets, the location of a transient (if any) may vary between sets and within sets. This construct allows a decoder to switch prediction parameter sets at or very near the onset of detected transients with a sub-frame resolution. This is particularly useful with longer frame durations to improve overall coding efficiency.
Compression performance may be further enhanced by forming M/2 decorrelation channels for M-channel audio. The triplet of channels (basis, correlated, decorrelated) provides two possible pair combinations (basis, correlated) and (basis, decorrelated) that can be considered during the segmentation and entropy coding optimization to further improve compression performance. The channel pairs may be specified per segment or per frame. In an exemplary embodiment, the encoder frames the audio data and then extracts ordered channel pairs including a basis channel and a correlated channel and generates a decorrelated channel to form at least one triplet (basis, correlated, decorrelated). If the number of channels is odd, an extra basis channel is processed. Adaptive or fixed polynomial prediction is applied to each channel to form residual signals. For each triplet, the channel pair (basis, correlated) or (basis, decorrelated) with the smallest encoded payload is selected. Using the selected channel pair, a global set of coding parameters can be determined for each segment over all channels. The encoder selects the global set or distinct sets of coding parameters based on which has the smallest total encoded payload (header and audio data).
In either approach, once the optimal set of coding parameters and channel pairs for the current partition (segment duration) have been determined, the encoder calculates the encoded payload in each segment across all channels. Assuming the constraints on segment start and maximum segment payload size for any desired RAPs or detected transients are satisfied, the encoder determines whether the total encoded payload for the entire frame for the current partition is less than the current optimum for an earlier partition. If true, the current set of coding parameters and encoded payload is stored and the segment duration is increased. The segmentation algorithm suitably starts by partitioning the frame into the minimum segment sizes equal to the analysis block size and increases the segment duration by a power of two at each step. This process repeats until either the segment size violates the maximum size constraint or the segment duration grows to the maximum segment duration. The enablement of the RAP or MPPS features and the existence of a desired RAP or detected transient within a frame may cause the adaptive segmentation routine to choose a smaller segment duration than it otherwise would.
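The search over segment durations described above can be sketched as follows. This is a simplified illustration only: the per-block bit counts, the fixed per-segment header cost and the function name are hypothetical, and a real encoder would re-optimize the coding parameters and channel pairs for every candidate partition rather than add up fixed block sizes.

```python
def choose_segment_duration(block_bits, header_bits, max_seg_blocks, max_seg_bits):
    """Return (duration_in_blocks, frame_bits) for the cheapest allowed partition.

    block_bits[i] is a hypothetical encoded size of analysis block i.
    max_seg_blocks is the maximum segment duration (e.g. imposed by a RAP or
    transient); max_seg_bits is the encoded segment payload constraint.
    """
    n = len(block_bits)
    best = None
    dur = 1  # start with segments equal to the analysis block size
    while dur <= min(max_seg_blocks, n):
        seg_bits = [header_bits + sum(block_bits[i:i + dur]) for i in range(0, n, dur)]
        if max(seg_bits) > max_seg_bits:
            break  # a segment exceeds the payload constraint; stop growing
        total = sum(seg_bits)
        if best is None or total < best[1]:
            best = (dur, total)  # new optimum: remember this partition
        dur *= 2  # next power-of-two multiple of the analysis block
    return best
```

With eight blocks of 100 bits, a 32-bit segment header and a 500-bit segment limit, the sketch settles on 4-block segments: longer segments save header bits until the payload limit stops the search.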
These and other features and advantages of the invention will be apparent to those skilled in the art from the following detailed description of preferred embodiments, taken together with the accompanying drawings, in which:
FIGS. 2a and 2b are block diagrams of a lossless audio encoder and decoder, respectively, in accordance with the present invention;
FIGS. 4a and 4b are block diagrams of the analysis window processing and inverse analysis window processing;
FIGS. 6a and 6b are block diagrams of adaptive prediction analysis and processing and inverse adaptive prediction processing;
FIGS. 7a and 7b are a flow chart of optimal segmentation and entropy code selection;
FIGS. 8a and 8b are flow charts of entropy code selection for a channel set;
FIGS. 11a and 11b are diagrams of additional header information related to the specification of RAPs and MPPSs;
FIGS. 15a and 15b are diagrams illustrating the bitstream and decoding of the bitstream at a RAP segment and a transient; and
The present invention provides an adaptive segmentation algorithm that generates a lossless variable bit rate (VBR) bitstream with random access point (RAP) capability to initiate lossless decoding at a specified segment within a frame and/or multiple prediction parameter set (MPPS) capability partitioned to mitigate transient effects. The adaptive segmentation technique determines and fixes segment start points to ensure that boundary conditions imposed by desired RAPs and/or detected transients are met and selects an optimum segment duration in each frame to reduce encoded frame payload subject to an encoded segment payload constraint and the fixed segment start points. In general, the boundary constraints specify that a desired RAP or transient must lie within a certain number of analysis blocks of the start of a segment. The desired RAP can be plus or minus the number of analysis blocks from the segment start. The transient lies within the first number of analysis blocks of the segment. In an exemplary embodiment in which segments within a frame are of the same duration and a power of two of the analysis block duration, a maximum segment duration is determined to ensure the desired conditions are met. RAP and MPPS are particularly applicable to improve overall performance for longer frame durations.
The bitstream includes header information and encoded data for at least one and preferably multiple different channel sets. For example, a first channel set may be a 2.0 configuration, a second channel set may be an additional 4 channels constituting a 5.1 channel presentation, and a third channel set may be an additional 2 surround channels constituting an overall 7.1 channel presentation. An 8-channel decoder would extract and decode all 3 channel sets, producing a 7.1 channel presentation at its outputs. A 6-channel decoder will extract and decode channel set 1 and channel set 2, completely ignoring channel set 3, producing the 5.1 channel presentation. A 2-channel decoder will only extract and decode channel set 1 and ignore channel sets 2 and 3, producing a 2-channel presentation. Having the stream structured in this manner allows for scalability of decoder complexity.
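The scalability rule in this example can be sketched as a simple selection over the ordered channel sets. The list layout and the function name are hypothetical; only the channel counts (2, 4 and 2) come from the example above.

```python
# Channel sets in bitstream order: (name, number of channels added by the set).
CHANNEL_SETS = [("channel set 1", 2), ("channel set 2", 4), ("channel set 3", 2)]


def sets_to_decode(decoder_channels):
    """Return the leading channel sets that a decoder with this many outputs decodes."""
    chosen, total = [], 0
    for name, channels in CHANNEL_SETS:
        if total + channels > decoder_channels:
            break  # this set and all later ones are ignored
        chosen.append(name)
        total += channels
    return chosen
```

A 6-channel decoder therefore stops after channel set 2, while an 8-channel decoder consumes all three sets.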
During the encode, a time encoder performs so called “embedded down-mixing” such that 7.1->5.1 down-mix is readily available in 5.1 channels that are encoded in channel sets 1 and 2. Similarly a 5.1->2.0 down-mix is readily available in 2.0 channels that are encoded as a channel set 1. A 6-channel decoder by decoding channel sets 1 and 2 will obtain 5.1 down-mix after undoing the operation of 5.1->2.0 down-mix embedding performed on the encode side. Similarly a full 8-channel decoder will obtain original 7.1 presentation by decoding channel sets 1, 2 and 3 and undoing the operation of 7.1->5.1 and 5.1->2.0 down-mix embedding performed on the encode side.
In the case of MPPS, the header further includes transient parameters 532 in the channel set header information. In this embodiment, each channel set header includes an ExtraPredSetsPrsent[ch] flag=TRUE if a transient is detected in channel ch, StartSegment[ch]=index indicating the transient start segment for channel ch, and AdPredOrder[1][ch]=order of the Adaptive Predictor or FixedPredOrder[1][ch]=order of the Fixed Predictor for channel ch applicable to the second partition in the frame, which follows and includes the transient. When adaptive prediction is selected (AdPredOrder[1][ch]>0), a second set of adaptive prediction coefficients is encoded and packed into AdPredCodes[1][ch][AdPredOrder[1][ch]]. The existence and location of a transient may vary across the channels within a channel set and across channel sets.
Cross-Channel Decorrelation
In accordance with the present invention, compression performance may be further enhanced by implementing cross channel decorrelation 54, which orders the M input channels into channel pairs according to a correlation measure between the channels (a different “M” than the M analysis block constraint on a desired RAP point). One of the channels is designated as the “basis” channel and the other is designated as the “correlated” channel. A decorrelated channel is generated for each channel pair to form a “triplet” (basis, correlated, decorrelated). The formation of the triplet provides two possible pair combinations (basis, correlated) and (basis, decorrelated) that can be considered during the segmentation and entropy coding optimization to further improve compression performance (see
The decision between (basis, correlated) and (basis, decorrelated) can be performed either prior to adaptive segmentation (based on some energy measure) or integrated with it. The former approach reduces complexity while the latter increases efficiency. A ‘hybrid’ approach may be used: for triplets that have a decorrelated channel with considerably (based on a threshold) smaller variance than the correlated channel, the correlated channel is simply replaced by the decorrelated channel prior to adaptive segmentation, while for all other triplets the decision about encoding the correlated or decorrelated channel is left to the adaptive segmentation process. This reduces the complexity of the adaptive segmentation process somewhat without sacrificing coding efficiency.
The original M-ch PCM 20 and the M/2-ch decorrelated PCM 56 are both forwarded to the adaptive prediction and fixed polynomial prediction operations, which generate residual signals for each of the channels. As shown in
An exemplary process for performing cross channel decorrelation 54 is illustrated in
The process starts a channel pair loop (step 82), and selects a “basis” channel as the one with the smaller zero-lag auto-correlation estimate, which is indicative of a lower energy (step 84). In this example, the L, Ls and C channels form the basis channels. The channel pair decorrelation coefficient (ChPairDecorrCoeff) is calculated as the zero-lag cross-correlation estimate divided by the zero-lag auto-correlation estimate of the basis channel (step 86). The decorrelated channel is generated by multiplying the basis channel samples by the ChPairDecorrCoeff and subtracting that result from the corresponding samples of the correlated channel (step 88). The channel pairs and their associated decorrelated channels define “triplets” (L,R,R-ChPairDecorrCoeff[1]*L), (Ls,Rs,Rs-ChPairDecorrCoeff[2]*Ls), (C,LFE,LFE-ChPairDecorrCoeff[3]*C) (step 89). The ChPairDecorrCoeff[ ] for each channel pair (and each channel set) and the channel indices that define the pair configuration are stored in the channel set header information (step 90). This process repeats for each channel set in a frame and then for each frame in the windowed PCM audio (step 92).
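Steps 84 through 88 can be sketched for a single channel pair as follows. The coefficient precision (COEFF_FRAC_BITS), the rounding of the scaled basis samples and all function names are assumptions introduced so that the example is exactly invertible in integer arithmetic; the text does not specify how the coefficient is quantized.

```python
COEFF_FRAC_BITS = 8  # hypothetical fixed-point precision of the stored coefficient


def energy(ch):
    """Zero-lag auto-correlation estimate of a channel."""
    return sum(x * x for x in ch)


def scaled(coeff_q, sample):
    """Basis sample multiplied by the quantized coefficient, rounded to an integer."""
    half = 1 << (COEFF_FRAC_BITS - 1)
    return (coeff_q * sample + half) >> COEFF_FRAC_BITS


def decorrelate_pair(ch_a, ch_b):
    """Return the triplet (basis, correlated, decorrelated) and the quantized coefficient."""
    # The basis channel is the one with the smaller zero-lag auto-correlation.
    basis, corr = (ch_a, ch_b) if energy(ch_a) <= energy(ch_b) else (ch_b, ch_a)
    auto = energy(basis)
    cross = sum(b * c for b, c in zip(basis, corr))  # zero-lag cross-correlation
    coeff = cross / auto if auto else 0.0
    coeff_q = int(round(coeff * (1 << COEFF_FRAC_BITS)))
    decorr = [c - scaled(coeff_q, b) for b, c in zip(basis, corr)]
    return basis, corr, decorr, coeff_q


def recorrelate(basis, decorr, coeff_q):
    """Decoder side: rebuild the correlated channel from basis and decorrelated channels."""
    return [d + scaled(coeff_q, b) for b, d in zip(basis, decorr)]
```

Because encoder and decoder apply the same integer rounding, the correlated channel is recovered exactly from the basis channel, the decorrelated channel and the stored coefficient.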
Determine Segment Start Point for RAP and Transients
An exemplary approach for determining segment start and duration constraints to accommodate desired RAPs and/or detected transients is illustrated in
An exemplary embodiment of the Start Adaptive/Fixed Prediction Analysis in a Channel Set routine (step 608) is provided in
More specifically, the routine performs a frame-based prediction analysis by calling the adaptive prediction routine diagrammed in
In parallel, the routine detects whether any transients exist in the original signal for each channel within the current frame (step 708). A threshold is used to balance between false detection and missed detection. The indices of the analysis block containing a transient are recorded. If a transient is detected, the routine fixes the start point of a transient segment that is positioned to ensure that the transient lies within the first L analysis blocks of the segment (step 709) and partitions the frame into first and second partitions with the second partition coincident with the start of the transient segment (step 710). The routine then calls the adaptive prediction routine diagrammed in
The routine compares the frame-based residual norm to the partition-based residual norm multiplied by a threshold to account for the increased header information required for multiple partitions for each channel (step 716). If the frame-based residual norm is smaller, then the frame-based residuals and prediction parameters are returned (step 718); otherwise the partition-based residuals, two sets of prediction parameters and the indices of the recorded transients are returned for that channel (step 720). The Channel Loop indexed by channel (step 722) and Adaptive/Fixed Prediction Analysis in a channel set (step 724) iterate over the channels in a set and all of the channel sets before ending.
The determination of the segment start points or maximum segment duration for a single frame 800 is illustrated in
In the constrained case, the routine determines a maximum segment duration that, in this example, satisfies the conditions on each of the desired RAP and the two transients. Since the desired RAP 806 falls within the 9th analysis block, the max segment duration that ensures the RAP would lie in the 1st analysis block of the RAP segment is 8× (scaled by the duration of the analysis block). Therefore, the allowable segment sizes (as a power-of-two multiple of the analysis block) are 1, 2, 4 and 8. Similarly, since the Ch 1 transient 808 falls within the 5th analysis block, the maximum segment duration is 4. Transient 810 in Ch 2 is more problematic in that ensuring that it occurs in the first analysis block requires a segment duration equal to the analysis block (1×). However, if the transient can be positioned in the second analysis block, then the max segment duration is 16×. Under these constraints, the routine may select a max segment duration of 4, thereby allowing the adaptive segmentation algorithm to select from 1×, 2× and 4× to minimize frame payload and satisfy the other constraints.
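The arithmetic of this example can be checked with a short sketch. It assumes equal power-of-two segments that start at block 0 of the frame, 0-based block indices, a 16-block frame, and that transient 810 sits in the 2nd analysis block of the frame; the last two values are inferred from the numbers quoted above rather than stated, and the function name is hypothetical.

```python
def allowed_durations(frame_blocks, constraints):
    """Return the power-of-two segment durations (in analysis blocks) that are allowed.

    constraints is a list of (block_index, window) pairs: the 0-based analysis
    block holding a RAP or transient must fall within the first `window`
    blocks of the segment that contains it.
    """
    allowed = []
    dur = 1
    while dur <= frame_blocks:
        if all(block % dur < window for block, window in constraints):
            allowed.append(dur)
        dur *= 2
    return allowed


rap = (8, 1)             # desired RAP in the 9th block, must be in the 1st block
transient_ch1 = (4, 1)   # transient in the 5th block, must be in the 1st block
transient_ch2 = (1, 2)   # transient in the 2nd block, allowed in the first 2 blocks
combined = allowed_durations(16, [rap, transient_ch1, transient_ch2])
```

Under these assumptions `combined` holds the durations 1, 2 and 4, so the maximum segment duration is 4 analysis blocks, matching the example.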
In an alternative embodiment, the first segment of every nth frame may by default be a RAP segment unless the timing code specifies a different RAP segment in that frame. The default RAP may be useful, for example, to allow a user to jump around or “surf” within the audio bitstream rather than being constrained to only those RAPs specified by the video timing code.
Adaptive Prediction
Adaptive Prediction Analysis and Residual Generation
Linear prediction tries to remove the correlation between the samples of an audio signal. The basic principle of linear prediction is to predict a value of sample s(n) using the previous samples s(n−1), s(n−2), . . . and to subtract the predicted value ŝ(n) from the original sample s(n). The resulting residual signal e(n)=s(n)−ŝ(n) ideally will be uncorrelated and consequently have a flat frequency spectrum. In addition, the residual signal will have a smaller variance than the original signal, implying that fewer bits are necessary for its digital representation.
In an exemplary embodiment of the audio codec, a FIR predictor model is described by the following equation:
where Q{ } denotes the quantization operation, M denotes the predictor order and a_k are quantized prediction coefficients. A particular quantization Q{ } is necessary for lossless compression since the original signal is reconstructed on the decode side, using various finite precision processor architectures. The definition of Q{ } is available to both coder and decoder and reconstruction of the original signal is simply obtained by:
where it is assumed that the same a_k quantized prediction coefficients are available to both encoder and decoder. A new set of predictor parameters is transmitted for each analysis window (frame), allowing the predictor to adapt to the time varying audio signal structure. In the case of transient detection, two new sets of prediction parameters are transmitted for the frame for each channel in which a transient is detected: one to decode residuals prior to the transient and one to decode residuals including and subsequent to the transient.
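The property that matters here is that encoder and decoder compute bit-identical quantized predictions. The following minimal sketch illustrates this under stated assumptions: coefficients are held in Q16 fixed point, Q{ } is modelled as adding 2^15 and shifting right by 16 bits, the quantized prediction is subtracted from the sample to form the residual, and the first M samples are passed through unpredicted. None of these details is confirmed by the text, and the codec's own sign convention may differ.

```python
def predict(history, coeffs_q16):
    """Quantized prediction from the last M samples; history[-1] is s(n-1)."""
    acc = sum(a * s for a, s in zip(coeffs_q16, reversed(history)))
    return (acc + (1 << 15)) >> 16  # assumed form of Q{ }: round and drop 16 bits


def encode(samples, coeffs_q16):
    """Return residuals e(n); the first M samples are sent unpredicted."""
    m = len(coeffs_q16)
    residuals = []
    for n, s in enumerate(samples):
        if n < m:
            residuals.append(s)
        else:
            residuals.append(s - predict(samples[n - m:n], coeffs_q16))
    return residuals


def decode(residuals, coeffs_q16):
    """Rebuild s(n) from e(n) using the same quantized prediction."""
    m = len(coeffs_q16)
    out = []
    for n, e in enumerate(residuals):
        if n < m:
            out.append(e)
        else:
            out.append(e + predict(out[n - m:n], coeffs_q16))
    return out
```

For a linear ramp and the coefficients (2, −1) in Q16, the residual is zero after the first two samples and the decoder output equals the input exactly.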
The prediction coefficients are designed to minimize the mean-squared prediction residual. The quantization Q{ } makes the predictor a nonlinear predictor. However in the exemplary embodiment the quantization is done with 24-bit precision and it is reasonable to assume that the resulting non-linear effects can be ignored during predictor coefficient optimization. Ignoring the quantization Q{ }, the underlying optimization problem can be represented as a set of linear equations involving the lags of signal autocorrelation sequence and the unknown predictor coefficients. This set of linear equations can be efficiently solved using the Levinson-Durbin (LD) algorithm.
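A textbook form of the Levinson-Durbin recursion is sketched below in floating point. It returns the predictor coefficients, the reflection coefficients and the residual variance estimate at every order. The sign convention (prediction ŝ(n) = Σ a_k s(n−k)) and the function name are assumptions of this sketch and need not match the codec's own conventions.

```python
def levinson_durbin(r, order):
    """Solve the normal equations from autocorrelation lags r[0..order].

    Returns (a, rcs, errs): predictor coefficients a[0..order-1] for
    s(n-1)..s(n-order), reflection coefficients per order, and the residual
    variance estimate after each order (errs[0] is r[0]).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    rcs, errs = [], [err]
    for i in range(1, order + 1):
        if err == 0:
            break  # signal is perfectly predictable; stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of order i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)  # residual variance shrinks with each order
        rcs.append(k)
        errs.append(err)
    return a[1:], rcs, errs
```

For the autocorrelation lags 1, 0.5, 0.25 of a first-order process, the second reflection coefficient is zero and the variance estimate stops decreasing after order 1, which is the kind of information an order-selection step can use.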
The resulting linear prediction coefficients (LPC) need to be quantized, such that they can be efficiently transmitted in an encoded stream. Unfortunately, direct quantization of LPC is not the most efficient approach since small quantization errors may cause large spectral errors. An alternative representation of LPCs is the reflection coefficient (RC) representation, which exhibits less sensitivity to quantization errors. This representation can also be obtained from the LD algorithm. By definition of the LD algorithm the RCs are guaranteed to have magnitude ≤ 1 (ignoring numerical errors). When the absolute value of the RCs is close to 1, the sensitivity of linear prediction to the quantization errors present in quantized RCs becomes high. The solution is to perform non-uniform quantization of RCs with finer quantization steps around unity. This can be achieved in two steps:
The first step is to calculate the autocorrelation sequence over the duration of the analysis window (entire frame or partitions before and after a detected transient) (step 100). To minimize the blocking effects that are caused by discontinuities at the frame boundaries, the data is first windowed. The autocorrelation sequence for a specified number (equal to maximum LP order+1) of lags is estimated from the windowed block of data.
The Levinson-Durbin (LD) algorithm is applied to the set of estimated autocorrelation lags and the set of reflection coefficients (RC), up to the max LP order, is calculated (step 102). An intermediate result of the LD algorithm is a set of estimated variances of prediction residuals for each linear prediction order up to the max LP order. In the next block, using this set of residual variances, the linear predictor order (AdPredOrder) is selected (step 104).
For the selected predictor order the set of reflection coefficients (RC) is transformed to the set of log-area ratio parameters (LAR) using the above stated mapping function (step 106). A limiting of the RC is introduced prior to transformation in order to prevent division by 0:
where Tresh denotes a number close to but smaller than 1. The LAR parameters are quantized (step 108) according to the following rule:
where QLARInd denotes the quantized LAR indices, ⌊x⌋ indicates the operation of finding the largest integer value smaller than or equal to x, and q denotes the quantization step size. In the exemplary embodiment, the region [−8 to 8] is coded using 8 bits, i.e.,
and consequently QLARInd is limited according to:
QLARInd are translated from signed to unsigned values using the following mapping:
In the “RC LUT” block, an inverse quantization of LAR parameters and a translation to RC parameters is done in a single step using a look-up table (step 112). Look-up table consists of quantized values of the inverse RC->LAR mapping i.e., LAR->RC mapping given by:
The look-up table is calculated at quantized values of LARs equal to 0, 1.5*q, 2.5*q, . . . 127.5*q. The corresponding RC values, after scaling by 2^16, are rounded to 16-bit unsigned integers and stored as Q16 unsigned fixed point numbers in a 128-entry table.
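The limiting, quantization, index mapping and look-up table steps can be sketched together as follows. The formulas themselves are not reproduced in this text, so several details are assumptions: the standard log-area-ratio mapping LAR = ln((1+RC)/(1−RC)) and its inverse, a step size q = 16/256 derived from coding [−8, 8] on 8 bits, truncation toward zero (inferred from the table grid 0, 1.5*q, 2.5*q, ...), an index limit of ±127, an even/odd signed-to-unsigned mapping, and the value of THRESH. All function and constant names are hypothetical.

```python
import math

Q_STEP = 16.0 / 256   # assumed: range [-8, 8] coded on 8 bits
MAX_IND = 127         # assumed index limit so that the table has 128 entries
THRESH = 0.9999       # hypothetical limit; the text only says "close to but smaller than 1"


def rc_to_lar(rc):
    """Limit the RC, then map it to a log-area ratio."""
    rc = max(-THRESH, min(THRESH, rc))
    return math.log((1.0 + rc) / (1.0 - rc))


def quantize_lar(lar):
    """Signed quantization index, truncated toward zero and limited to +/-127."""
    ind = min(int(math.floor(abs(lar) / Q_STEP)), MAX_IND)
    return ind if lar >= 0 else -ind


def pack_index(ind):
    """Signed index -> unsigned code (assumed even/odd interleaving)."""
    return 2 * ind if ind >= 0 else -2 * ind - 1


def unpack_index(code):
    """Unsigned code -> signed index, using an integer right shift."""
    return code >> 1 if code % 2 == 0 else -(code >> 1) - 1


def lar_to_rc(lar):
    """Inverse mapping: (e^LAR - 1) / (e^LAR + 1) = tanh(LAR / 2)."""
    return math.tanh(lar / 2.0)


# 128-entry table of Q16 unsigned RC magnitudes at LAR = 0, 1.5*q, ..., 127.5*q.
TABLE = [0] + [int(round(lar_to_rc((i + 0.5) * Q_STEP) * (1 << 16)))
               for i in range(1, 128)]


def dequantize_rc(ind):
    """Quantized RC in Q16; the sign is restored from the signed index."""
    return TABLE[ind] if ind >= 0 else -TABLE[-ind]
```

Because the LAR grid is uniform, the corresponding RC grid becomes finer as the magnitude approaches 1, which is the non-uniform quantization the text calls for.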
Quantized RC parameters are calculated from the table and the quantization LAR indices QLARInd as
The quantized RC parameters QRC_ord for ord=1, . . . AdPredOrder are translated to the quantized linear prediction parameters (LP_ord for ord=1, . . . AdPredOrder) according to the following algorithm (step 114):
Since the quantized RC coefficients were represented in Q16 signed fixed point format, the above algorithm will generate the LP coefficients also in Q16 signed fixed point format. The lossless decoder computation path is designed to support up to 24-bit intermediate results. Therefore it is necessary to perform a saturation check after each C_{ord+1,m} is calculated. If saturation occurs at any stage of the algorithm, the saturation flag is set and the adaptive predictor order AdPredOrder, for a particular channel, is reset to 0 (step 116). For this particular channel with AdPredOrder=0, a fixed coefficient prediction will be performed instead of the adaptive prediction (see Fixed Coefficient Prediction). Note that the unsigned LAR quantization indices (PackLARInd[n] for n=1, . . . AdPredOrder[Ch]) are packed into the encoded stream only for the channels with AdPredOrder[Ch]>0.
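A minimal sketch of such a conversion with a saturation check is given below. It uses the textbook step-up recursion in Q16 arithmetic with rounding; the sign convention, the rounding constant and the function name are assumptions, since the algorithm itself is not reproduced in this text. Returning None stands in for setting the saturation flag.

```python
def rc_to_lp_q16(qrc, limit_bits=24):
    """Convert Q16 reflection coefficients to Q16 LP coefficients.

    Returns the list of LP coefficients, or None if any intermediate value
    does not fit in a signed `limit_bits`-bit integer (saturation), in which
    case the caller would fall back to fixed coefficient prediction.
    """
    lo, hi = -(1 << (limit_bits - 1)), (1 << (limit_bits - 1)) - 1
    c = []  # coefficients of the current order
    for order, k in enumerate(qrc):
        # c_new[m] = c[m] - k * c[order - 1 - m], with the product kept in Q16.
        nxt = [c[m] - ((k * c[order - 1 - m] + (1 << 15)) >> 16)
               for m in range(order)]
        nxt.append(k)  # the highest-order coefficient equals the new RC
        if any(v < lo or v > hi for v in nxt):
            return None  # saturation detected
        c = nxt
    return c
```

Running the check on the encode side means a decoder using the same recursion never has to test for saturation itself.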
Finally for each channel with AdPredOrder>0 the adaptive linear prediction is performed and the prediction residuals e(n) are calculated according to the following equations (step 118):
Since the design goal in the exemplary embodiment is that specific RAP segments of certain frames are “random access points”, the sample history is not carried over from the preceding segment to the RAP segment. Instead, the prediction is engaged only at the AdPredOrder+1 sample in the RAP segment.
The adaptive prediction residuals e(n) are further entropy coded and packed into the encoded bit-stream.
Inverse Adaptive Prediction on the Decode Side
On the decode side, the first step in performing inverse adaptive prediction is to unpack the header information (step 120). If the decoder is attempting to initiate decoding according to a playback timing code (e.g. user selection of a chapter or surfing), the decoder accesses the audio bitstream near but prior to that point and searches the header of the next frame until it finds a RAP_Flag=TRUE indicating the existence of a RAP segment in the frame. The decoder then extracts the RAP segment number (RAP ID) and navigation data (NAVI) to navigate to the beginning of the RAP segment, disables prediction until index > pred_order and initiates lossless decoding. The decoder decodes the remaining segments in the frame and subsequent frames, disabling prediction each time a RAP segment is encountered. If ExtraPredSetsPrsnt=TRUE is encountered in a frame for a channel, the decoder extracts the first and second sets of prediction parameters and the start segment for the second set.
The adaptive prediction orders AdPredOrder[Ch] for each channel Ch=1, . . . NumCh are extracted. Next, for the channels with AdPredOrder[Ch]>0, the unsigned version of the LAR quantization indices (AdPredCodes[n] for n=1, . . . AdPredOrder[Ch]) is extracted. For each channel Ch with prediction order AdPredOrder[Ch]>0, the unsigned AdPredCodes[n] are mapped to the signed values QLARInd[n] using the following mapping:
where the >> denotes an integer right shift operation.
An inverse quantization of LAR parameters and a translation to RC parameters is done in a single step using a Quant RC LUT (step 122). This is the same look-up table TABLE{ } as defined on the encode side. The quantized reflection coefficients for each channel Ch (QRC[n] for n=1, . . . AdPredOrder[Ch]) are calculated from the TABLE{ } and the quantization LAR indices QLARInd[n], as
For each channel Ch, the quantized RC parameters QRCord for ord=1, . . . AdPredOrder[Ch] are translated to the quantized linear prediction parameters (LPord for ord=1, . . . AdPredOrder[Ch]) according to the following algorithm (step 124):
Any possibility of saturation of intermediate results is removed on the encode side. Therefore, on the decode side there is no need to perform a saturation check after the calculation of each Cord+1, m.
Finally, for each channel with AdPredOrder[Ch]>0, an inverse adaptive linear prediction is performed (step 126). Assuming that the prediction residuals e(n) have previously been extracted and entropy decoded, the reconstructed original signals s(n) are calculated according to the following equations:
Since the sample history is not kept at a RAP segment, the inverse adaptive prediction shall start from the (AdPredOrder[Ch]+1)-th sample in the RAP segment.
Fixed Coefficient Prediction
A very simple fixed coefficient form of the linear predictor has been found to be useful. The fixed prediction coefficients are derived according to a very simple polynomial approximation method first proposed by Shorten (T. Robinson, "SHORTEN: Simple lossless and near lossless waveform compression," Technical Report 156, Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, UK, December 1994). In this case the prediction coefficients are those specified by fitting a p order polynomial to the last p data points. Expanding the first four approximations gives:
ŝ0[n]=0
ŝ1[n]=s[n−1]
ŝ2[n]=2s[n−1]−s[n−2]
ŝ3[n]=3s[n−1]−3s[n−2]+s[n−3]
An interesting property of these polynomial approximations is that the resulting residual signal, ek[n]=s[n]−ŝk[n], can be efficiently computed in the following recursive manner:
e0[n]=s[n]
e1[n]=e0[n]−e0[n−1]
e2[n]=e1[n]−e1[n−1]
e3[n]=e2[n]−e2[n−1]
. . .
The fixed coefficient prediction analysis is applied on a per frame basis and does not rely on samples calculated in the previous frame (ek[−1]=0). The residual set with the smallest sum of magnitudes over the entire frame is defined as the best approximation. The optimal residual order is calculated for each channel separately and packed into the stream as the Fixed Prediction Order (FPO[Ch]). The residuals eFPO[Ch][n] in the current frame are further entropy coded and packed into the stream.
The reverse fixed coefficient prediction process, on the decode side, is defined by an order recursive formula for the calculation of the k-th order residual at sampling instance n:
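The encode-side analysis above can be sketched as follows. The function names and the default maximum order of 3 are illustrative choices for this example; the recursion and the selection rule (smallest sum of magnitudes, with ek[−1]=0 at the frame start) follow the text.

```python
def fixed_prediction_residuals(s, max_order=3):
    """Return [e0, e1, ..., e_max_order] for one frame of samples s.

    Each e_k is obtained by differencing e_{k-1}; the sample before the
    frame start is taken as zero (ek[-1] = 0).
    """
    residuals = [list(s)]
    for _ in range(max_order):
        prev = residuals[-1]
        cur = [prev[n] - (prev[n - 1] if n > 0 else 0) for n in range(len(prev))]
        residuals.append(cur)
    return residuals


def best_fixed_order(s, max_order=3):
    """Pick the order whose residual has the smallest sum of magnitudes."""
    residuals = fixed_prediction_residuals(s, max_order)
    sums = [sum(abs(x) for x in e) for e in residuals]
    fpo = min(range(len(sums)), key=sums.__getitem__)
    return fpo, residuals[fpo]


frame = [10, 12, 14, 16, 18, 20]        # a linear ramp
fpo, e = best_fixed_order(frame)        # order 2 wins for this frame
```

For the ramp, the sums of magnitudes are 90, 20, 18 and 36 for orders 0 to 3, so the second-order residual is selected even though the start-up samples are costly.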
ek[n]=ek+1[n]+ek[n−1]
where the desired original signal s[n] is given by
s[n]=e0[n]
and where for each k-th order residual ek[−1]=0.
As an example, the recursions for the 3rd order fixed coefficient prediction are presented, where the residuals e3[n] are coded, transmitted in the stream and unpacked on the decode side:
e2[n]=e3[n]+e2[n−1]
e1[n]=e2[n]+e1[n−1]
e0[n]=e1[n]+e0[n−1]
s[n]=e0[n]
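The decode-side recursion above amounts to one running sum per prediction order. A minimal sketch, with an illustrative function name and ek[−1]=0 at the start of the frame:

```python
def inverse_fixed_prediction(residual, order):
    """Rebuild s[n] from the order-th residual of one frame.

    Applies ek[n] = ek+1[n] + ek[n-1] for k = order-1 down to 0,
    with ek[-1] = 0 at the frame start.
    """
    signal = list(residual)
    for _ in range(order):
        running = 0
        for n in range(len(signal)):
            running += signal[n]      # ek[n] = ek+1[n] + ek[n-1]
            signal[n] = running
    return signal


e3 = [10, -18, 8, 0, 0, 0]            # 3rd-order residual of a linear ramp
s = inverse_fixed_prediction(e3, order=3)
```

Because only integer additions are involved, the reconstruction is exact.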
The inverse linear prediction, adaptive or fixed, performed in step 126 is illustrated for a case where the m+1 segment is a RAP segment 900 in
Because the segment start conditions and max segment duration are set based on the allowable location of a desired RAP or detected transient within a segment, the selection of the optimal segment duration may generate a bitstream in which the desired RAP or detected transient actually lie within segments subsequent to the RAP or transient segments. This might happen if the bounds M and L are relatively large and the optimal segment duration is less than M and L. The desired RAP may actually lie in a segment preceding the RAP segment but still be within the specified tolerance. The conditions on alignment tolerance on the encode side are still maintained and the decoder does not know the difference. The decoder simply accesses the RAP and transient segments.
The constrained optimization problem addressed by the adaptive segmentation algorithm is illustrated in
As shown in
An exemplary embodiment of segmentation and entropy code selection 24 for the constrained case (uniform segments, power of two of analysis block duration) is illustrated in
The exemplary process starts by initializing segment parameters (step 150) such as the minimum number of samples in a segment, the maximum allowed encoded payload size of a segment, the maximum number of segments, the maximum number of partitions and the maximum segment duration. Thereafter, the process starts a partition loop that is indexed from 0 to the maximum number of partitions minus one (step 152) and initializes the partition parameters including the number of segments, the number of samples in a segment and the number of bytes consumed in a partition (step 154). In this particular embodiment, the segments are of equal time duration and the number of segments scales as a power of two with each partition iteration. The number of segments is preferably initialized to the maximum, hence minimum time duration, which is equal to one analysis block. However, the process could use segments of varying time duration, which might provide better compression of audio data but at the expense of additional overhead and additional complexity to satisfy the RAP and transient conditions. Furthermore, the number of segments does not have to be limited to powers of two or searched from the minimum to maximum duration. In this case, the segment start points determined by the desired RAP and detected transients are additional constraints on the adaptive segmentation algorithm.
Once initialized, the process starts a channel set loop (step 156) and determines the optimal entropy coding parameters and channel pair selection for each segment and the corresponding byte consumption (step 158). The coding parameters PWChDecorrFlag[ ][ ], AllChSameParamFlag[ ][ ], RiceCodeFlag[ ][ ][ ], CodeParam[ ][ ][ ] and ChSetByteCons[ ][ ] are stored (step 160). This is repeated for each channel set until the channel set loop ends (step 162).
The process starts a segment loop (step 164) and calculates the byte consumption (SegmByteCons) in each segment over all channel sets (step 166) and updates the byte consumption (ByteConsInPart) (step 168). At this point, the size of the segment (encoded segment payload in bytes) is compared to the maximum size constraint (step 170). If the constraint is violated the current partition is discarded. Furthermore, because the process starts with the smallest time duration, once a segment size is too big the partition loop terminates (step 172) and the best solution (time duration, channel pairs, coding parameters) to that point is packed into the header (step 174) and the process moves on to the next frame. If the constraint fails on the minimum segment size (step 176), then the process terminates and reports an error (step 178) because the maximum size constraint cannot be satisfied. Assuming the constraint is satisfied, this process is repeated for each segment in the current partition until the segment loop ends (step 180).
Once the segment loop has been completed and the byte consumption for the entire frame calculated as represented by ByteConsInPart, this payload is compared to the current minimum payload (MinByteInPart) from a previous partition iteration (step 182). If the current partition represents an improvement then the current partition (PartInd) is stored as the optimum partition (OptPartind) and the minimum payload is updated (step 184). These parameters and the stored coding parameters are then stored as the current optimum solution (step 186). This is repeated until the partition loop ends with the maximum segment duration (step 172), at which point the segmentation information and the coding parameters are packed into the header (step 174) as shown in
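The partition search described in steps 150 to 186 can be sketched as follows for a single channel set. This is an illustration under stated assumptions: seg_cost is a hypothetical callable standing in for the entropy coding analysis, and the toy cost model (2 header bytes plus 1 byte per sample) is invented for the example.

```python
def choose_partition(frame, min_seg_len, max_partitions, max_seg_bytes, seg_cost):
    """Search uniform, power-of-two segmentations of one frame.

    seg_cost(samples) is a hypothetical callable returning the encoded
    payload size in bytes of one segment, overhead included.
    Returns (segment_length, total_bytes) of the cheapest valid partition.
    """
    best = None
    for part_ind in range(max_partitions):
        seg_len = min_seg_len << part_ind           # duration doubles each pass
        if seg_len > len(frame):
            break
        byte_cons_in_part = 0
        too_big = False
        for start in range(0, len(frame), seg_len):
            segm_byte_cons = seg_cost(frame[start:start + seg_len])
            if segm_byte_cons > max_seg_bytes:      # size constraint violated
                too_big = True
                break
            byte_cons_in_part += segm_byte_cons
        if too_big:
            if part_ind == 0:
                raise ValueError("maximum segment size cannot be satisfied")
            break                                   # stop: durations only increase
        if best is None or byte_cons_in_part < best[1]:
            best = (seg_len, byte_cons_in_part)
    return best


def toy_cost(samples):
    """Invented cost model: 2 header bytes plus 1 byte per sample."""
    return 2 + len(samples)


result = choose_partition(list(range(16)), min_seg_len=2, max_partitions=4,
                          max_seg_bytes=10, seg_cost=toy_cost)
```

With this toy model, longer segments amortize the header better, so the search settles on the longest duration whose segments still fit in 10 bytes.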
An exemplary embodiment for determining the optimal coding parameters and associated bit consumption for a channel set for a current partition (step 158) is illustrated in
Ch1: L
Ch2: R
Ch3: R-ChPairDecorrCoeff[1]*L
Ch4: Ls
Ch5: Rs
Ch6: Rs-ChPairDecorrCoeff[2]*Ls
Ch7: C
Ch8: LFE
Ch9: LFE-ChPairDecorrCoeff[3]*C
The process determines the type of entropy code, the corresponding coding parameter and the corresponding bit consumption for the basis and correlated channels (step 194). In this example, the process computes optimum coding parameters for a binary code and a Rice code and then selects the one with the lowest bit consumption for each channel and each segment (step 196). In general, the optimization can be performed for one, two or more possible entropy codes. For the binary codes the number of bits is calculated from the max absolute value of all samples in the segment of the current channel. The Rice coding parameter is calculated from the average absolute value of all samples in the segment of the current channel. Based on the selection, the RiceCodeFlag is set, the BitCons is set and the CodeParam is set to either the NumBitsBinary or the RiceKParam (step 198).
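The selection between a binary code and a Rice code can be sketched as follows. The exact formulas used by the codec are not given in the text, so the ones below are textbook estimates and should be read as assumptions: the binary width is taken from the maximum magnitude plus a sign bit, and the Rice parameter k is approximated as the base-2 logarithm of the mean magnitude.

```python
def binary_cost(samples):
    """Bits for fixed-length signed binary coding of a segment."""
    max_abs = max(abs(x) for x in samples)
    num_bits = max_abs.bit_length() + 1             # magnitude bits + sign bit
    return num_bits, num_bits * len(samples)


def rice_cost(samples):
    """Bits for Rice coding with k estimated from the mean magnitude."""
    mean_abs = sum(abs(x) for x in samples) // len(samples)
    k = max(mean_abs.bit_length() - 1, 0)           # roughly floor(log2(mean))
    total = 0
    for x in samples:
        u = 2 * x if x >= 0 else -2 * x - 1         # fold the sign into the LSB
        total += (u >> k) + 1 + k                   # unary part, stop bit, k LSBs
    return k, total


def select_entropy_code(samples):
    """Return (RiceCodeFlag, CodeParam, BitCons) for one channel segment."""
    num_bits, bin_bits = binary_cost(samples)
    k, rice_bits = rice_cost(samples)
    if rice_bits < bin_bits:
        return True, k, rice_bits
    return False, num_bits, bin_bits


peaky = select_entropy_code([0, 1, -1, 2, 0, -1, 1, 40])
flat = select_entropy_code([100, -100, 100, -100])
```

A segment dominated by small values with an occasional peak favors the Rice code, while a segment of uniformly large values favors the binary code.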
If the current channel being processed is a correlated channel (step 200) then the same optimization is repeated for the corresponding decorrelated channel (step 202), the best entropy code is selected (step 204) and the coding parameters are set (step 206). The process repeats until the channel loop ends (step 208) and the segment loop ends (step 210).
At this point, the optimum coding parameters for each segment and for each channel have been determined. These coding parameters and payloads could be returned for the channel pairs (basis, correlated) from original PCM audio. However, compression performance can be improved by selecting between the (basis, correlated) and (basis, decorrelated) channels in the triplets.
To determine which channel pairs, (basis, correlated) or (basis, decorrelated), to use for the three triplets, a channel pair loop is started (step 211) and the contribution of each correlated channel (Ch2, Ch5 and Ch8) and each decorrelated channel (Ch3, Ch6 and Ch9) to the overall frame bit consumption is calculated (step 212). The frame consumption contribution of each correlated channel is compared against the frame consumption contribution of the corresponding decorrelated channel, i.e., Ch2 to Ch3, Ch5 to Ch6, and Ch8 to Ch9 (step 214). If the contribution of the decorrelated channel is greater than that of the correlated channel, the PWChDecorrFlag is set to false (step 216). Otherwise, the correlated channel is replaced with the decorrelated channel (step 218), the PWChDecorrFlag is set to true and the channel pairs are configured as (basis, decorrelated) (step 220).
Based on these comparisons the algorithm will select:
1. Either Ch2 or Ch3 as the channel that will get paired with corresponding basis channel Ch1;
2. Either Ch5 or Ch6 as the channel that will get paired with corresponding basis channel Ch4; and
3. Either Ch8 or Ch9 as the channel that will get paired with corresponding basis channel Ch7.
These steps are repeated for all channel pairs until the loop ends (step 222).
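The pair selection in steps 211 to 222 can be sketched as follows. The dictionary of per-channel frame bit counts and the numbers in it are invented for illustration; the comparison rule follows the text (the decorrelated channel is used unless it costs more than the correlated one).

```python
TRIPLETS = [("Ch1", "Ch2", "Ch3"), ("Ch4", "Ch5", "Ch6"), ("Ch7", "Ch8", "Ch9")]


def select_channel_pairs(frame_bits):
    """Choose (basis, correlated) or (basis, decorrelated) for each triplet.

    frame_bits maps a channel name to its total bit consumption over the
    frame. Returns (pairs, pw_ch_decorr_flags).
    """
    pairs, flags = [], []
    for basis, corr, decorr in TRIPLETS:
        use_decorr = not (frame_bits[decorr] > frame_bits[corr])
        flags.append(use_decorr)
        pairs.append((basis, decorr if use_decorr else corr))
    return pairs, flags


# Invented frame bit counts for the correlated and decorrelated channels.
frame_bits = {"Ch2": 1000, "Ch3": 800, "Ch5": 500, "Ch6": 650, "Ch8": 300, "Ch9": 300}
pairs, flags = select_channel_pairs(frame_bits)
```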
At this point, the optimum coding parameters for each segment and each distinct channel and the optimal channel pairs have been determined. These coding parameters for each distinct channel, the channel pairs and the payloads could be returned to the partition loop. However, additional compression performance may be available by computing a set of global coding parameters for each segment across all channels. At best, the encoded data portion of the payload will be the same size as with the coding parameters optimized for each channel, and most likely it will be somewhat larger. However, the reduction in overhead bits may more than offset the loss in coding efficiency of the data.
Using the same channel pairs, the process starts a segment loop (step 230), calculates the bit consumptions (ChSetByteCons[seg]) per segment for all the channels using the distinct sets of coding parameters (step 232) and stores ChSetByteCons[seg] (step 234). A global set of coding parameters (entropy code selection and parameters) is then determined for the segment across all of the channels (step 236) using the same binary code and Rice code calculations as before except across all channels. The best parameters are selected and the byte consumption (SegmByteCons) is calculated (step 238). The SegmByteCons is compared to the ChSetByteCons[seg] (step 240). If using global parameters does not reduce bit consumption, the AllChSameParamFlag[seg] is set to false (step 242). Otherwise, the AllChSameParamFlag[seg] is set to true (step 244) and the global coding parameters and corresponding bit consumption per segment are saved (step 246). This process repeats until the end of the segment loop is reached (step 248). The entire process repeats until the channel set loop terminates (step 250).
The encoding process is structured in a way that different functionality can be disabled by the control of a few flags. For example, a single flag controls whether the pairwise channel decorrelation analysis is to be performed or not. Another flag controls whether the adaptive prediction (yet another flag for fixed prediction) analysis is to be performed or not. In addition, a single flag controls whether the search for global parameters over all channels is to be performed or not. Segmentation is also controllable by setting the number of partitions and minimum segment duration (in the simplest form it can be a single partition with predetermined segment duration). A flag indicates the existence of a RAP segment and another flag indicates the existence of a transient segment. In essence, by setting a few flags in the encoder, the encoder can collapse to simple framing and entropy coding.
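The trade-off between distinct and global coding parameters for one segment can be sketched as follows. For brevity the sketch considers only Rice codes, picks k by exhaustive search, and charges a hypothetical overhead of PARAM_BITS bits per transmitted parameter set; all three choices are assumptions for illustration rather than the codec's actual procedure.

```python
PARAM_BITS = 5   # hypothetical header cost of one set of coding parameters


def rice_bits(samples, k):
    """Total Rice-coded size in bits of samples for parameter k."""
    total = 0
    for x in samples:
        u = 2 * x if x >= 0 else -2 * x - 1         # fold the sign into the LSB
        total += (u >> k) + 1 + k
    return total


def best_k(samples, max_k=15):
    """Exhaustively pick the Rice parameter with the lowest bit count."""
    return min(range(max_k + 1), key=lambda k: rice_bits(samples, k))


def choose_params(channels):
    """Return (AllChSameParamFlag, bits) for one segment.

    channels is a list of per-channel sample lists for the segment.
    """
    distinct = sum(rice_bits(ch, best_k(ch)) + PARAM_BITS for ch in channels)
    pooled = [x for ch in channels for x in ch]
    shared = rice_bits(pooled, best_k(pooled)) + PARAM_BITS
    if shared < distinct:                            # global set reduces bits
        return True, shared
    return False, distinct


similar = choose_params([[1, -2, 3, 0], [2, -1, 0, 3]])
dissimilar = choose_params([[0, 0, 0, 0], [100, -100, 100, -100]])
```

When the channels have similar statistics the saved overhead wins; when they differ strongly, one shared parameter costs more data bits than the overhead it saves.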
The lossless codec can be used as an “extension coder” in combination with a lossy core coder. A “lossy” core code stream is packed as a core bitstream and a losslessly encoded difference signal is packed as a separate extension bitstream. Upon decoding in a decoder with extended lossless features, the lossy and lossless streams are combined to construct a lossless reconstructed signal. In a prior-generation decoder, the lossless stream is ignored, and the core “lossy” stream is decoded to provide a high-quality, multi-channel audio signal with the bandwidth and signal-to-noise ratio characteristic of the core stream.
Meanwhile, the input digitized audio signal 402 in the parallel path undergoes a compensating delay 416, substantially equal to the delay introduced into the reconstructed audio stream (by modified encode and modified decoders), to produce a delayed digitized audio stream. The audio stream 400 is subtracted from the delayed digitized audio stream 414 at summing node 420.
Summing node 420 produces a difference signal 422 which represents the difference between the original signal and the reconstructed core signal. To accomplish purely “lossless” encoding, it is necessary to encode and transmit the difference signal with lossless encoding techniques. Accordingly, the difference signal 422 is encoded with a lossless encoder 424, and the extension bitstream 426 is packed with the core bitstream 408 in packer 410 to produce an output bitstream (not shown).
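The principle of the core plus extension arrangement can be sketched as follows. The lossy core is replaced here by a stand-in (coarse requantization with a hypothetical step size), and the compensating delay is omitted because the stand-in introduces none; the point is only that adding the losslessly coded difference back to the reconstructed core output restores the original exactly.

```python
def lossy_core(samples, step=16):
    """Stand-in for the lossy core encode/decode chain: coarse requantization."""
    return [step * round(x / step) for x in samples]


original = [3, -250, 1021, 77, -8]
reconstructed = lossy_core(original)                       # legacy decoder output
difference = [o - r for o, r in zip(original, reconstructed)]
restored = [r + d for r, d in zip(reconstructed, difference)]
```

A legacy decoder would output only the reconstructed core signal, while a decoder with the lossless extension adds the decoded difference signal to it.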
Note that the lossless coding produces an extension bitstream 426 which is at a variable bit rate, to accommodate the needs of the lossless coder. The packed stream is then optionally subjected to further layers of coding including channel coding, and then transmitted or recorded. Note that for purposes of this disclosure, recording may be considered as transmission through a channel.
The core encoder 404 is described as “modified” because in an embodiment capable of handling extended bandwidth the core encoder would require modification. A 64-band analysis filter bank 430 within the encoder discards half of its output data 432 and a core sub-band encoder 434 encodes only the lower 32 frequency bands. This discarded information is of no concern to legacy decoders that would be unable to reconstruct the upper half of the signal spectrum in any case. The remaining information is encoded as per the unmodified encoder to form a backwards-compatible core output stream. However, in another embodiment operating at or below 48 kHz sampling rate, the core encoder could be a substantially unmodified version of a prior core encoder. Similarly, for operation above the sampling rate of legacy decoders, the modified core decoder 412 includes a core sub-band decoder 436 that decodes samples in the lower 32 sub-bands. The modified core decoder takes the sub-band samples from the lower 32 sub-bands and zeros out the un-transmitted sub-band samples for the upper 32 bands 438 and reconstructs all 64 bands using a 64-band QMF synthesis filter 440. For operation at conventional sampling rate (e.g., 48 kHz and below) the core decoder could be a substantially unmodified version of a prior core decoder or equivalent. In some embodiments the choice of sampling rate could be made at the time of encoding, and the encode and decode modules reconfigured at that time by software as desired.
Since the lossless encoder is being used to code the difference signal, it may seem that a simple entropy code would suffice. However, because of the bit rate limitations on the existing lossy core codecs, a considerable amount of the total bits required to provide a lossless bitstream still remain. Furthermore, because of the bandwidth limitations of the core codec the information content above 24 kHz in the difference signal is still correlated.
For example, the harmonic components of many instruments, including trumpet, guitar and triangle, reach far beyond 30 kHz. Therefore, more sophisticated lossless codecs that improve compression performance add value. In addition, in some applications the core and extension bitstreams must still satisfy the constraint that the decodable units must not exceed a maximum size. The lossless codec of the present invention provides both improved compression performance and improved flexibility to satisfy these constraints.
By way of example, 8 channels of 24-bit 96 kHz PCM audio require 18.5 Mbps. Lossless compression can reduce this to about 9 Mbps. DTS Coherent Acoustics would encode the core at 1.5 Mbps, leaving a difference signal of 7.5 Mbps. For a 2 kByte max segment size, the average segment duration is 2048*8/7500000=2.18 msec or roughly 209 samples at 96 kHz. A typical frame size for the lossy core to satisfy the max size is between 10 and 20 msec.
At a system level, the lossless codec and the backward compatible lossless codec may be combined to losslessly encode extra audio channels at an extended bandwidth while maintaining backward compatibility with existing lossy codecs. For example, 8 channels of 96 kHz audio at 18.5 Mbps may be losslessly encoded to include 5.1 channels of 48 kHz audio at 1.5 Mbps. The core plus lossless encoder would be used to encode the 5.1 channels. The lossless encoder will be used to encode the difference signals in the 5.1 channels. The remaining 2 channels are coded in a separate channel set using the lossless encoder. Since all channel sets need to be considered when trying to optimize segment duration, all of the coding tools will be used in one way or another. A compatible decoder would decode all 8 channels and losslessly reconstruct the 96 kHz 18.5 Mbps audio signal. An older decoder would decode only the 5.1 channels and reconstruct the 48 kHz 1.5 Mbps audio signal.
In general, more than one pure lossless channel set can be provided for the purpose of scaling the complexity of the decoder. For example, for a 10.2 original mix the channel sets could be organized such that:
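The arithmetic in this example can be checked with a few lines. The variable names are illustrative; the 7.5 Mbps difference rate and 2 kByte segment size are the figures quoted in the text. The raw PCM product comes to about 18.43 Mbps, which the text rounds to 18.5 Mbps.

```python
channels, bits_per_sample, sample_rate = 8, 24, 96_000

pcm_bps = channels * bits_per_sample * sample_rate     # raw PCM bit rate
difference_bps = 7_500_000                             # figure quoted in the text
max_segment_bytes = 2048                               # 2 kByte limit

segment_seconds = max_segment_bytes * 8 / difference_bps
segment_samples = segment_seconds * sample_rate

print(f"PCM rate: {pcm_bps / 1e6:.2f} Mbps")
print(f"Segment duration: {segment_seconds * 1e3:.2f} msec")
print(f"Samples per segment at 96 kHz: {segment_samples:.1f}")
```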
A decoder that is capable of decoding just 5.1 will only decode CHSET1 and ignore all other channel sets. A decoder that is capable of decoding just 7.1 will decode CHSET1 and CHSET2 and ignore all other channel sets, and so on.
Furthermore, the lossy plus lossless core is not limited to 5.1. Current implementations support up to 6.1 using lossy (core+XCh) and lossless and can support a generic m.n channels organized in any number of channel sets. The lossy encoding will have a 5.1 backward compatible core and all other channels that are coded with the lossy codec will go into the XXCh extension. This provides the overall lossless codec with considerable design flexibility to remain backward compatible with existing decoders while supporting additional channels.
While several illustrative embodiments of the invention have been shown and described, numerous variations and alternate embodiments will occur to those skilled in the art. Such variations and alternate embodiments are contemplated, and can be made without departing from the spirit and scope of the invention as defined in the appended claims.
This application claims benefit of priority under 35 U.S.C. 120 as a continuation-in-part (CIP) of Ser. No. 10/911,067 filed Aug. 4, 2004 now U.S. Pat. No. 7,392,195 issued Jun. 24, 2008, the entire contents of which are incorporated by reference.
Number | Date | Country | |
---|---|---|---|
20080215317 A1 | Sep 2008 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 10911067 | Aug 2004 | US |
Child | 12011899 | US |