The present invention relates to the processing of signals and, more particularly, to encoding and decoding of signals such as digital visual or auditory data.
Perceptual coding is a known technique for reducing the bit rate of a digital signal by utilizing an advantageous model of the destination, e.g., by specifying the removal of portions of the signal that are unlikely to be perceived by a human user.
The simplest approach to addressing this issue is to use predefined quantization intervals, based on a priori information known about the coefficients, such as the frequencies and orientations of the corresponding basis functions. The quantization of a coefficient, accordingly, depends only on the position of that coefficient in the transform and is independent of the surrounding context. See, e.g., ITU-T Rec. T.81, “Digital Compression and Coding of Continuous-Tone Still Images - Requirements and Guidelines,” International Telecommunication Union, CCITT (September 1992) (JPEG standard, ISO/IEC 10918-1). Although this approach is very efficient, it is very limited and cannot take advantage of any perceptual phenomena beyond those that are separated out by the transform.

A more powerful approach is to define a perceptual model that can be applied in the decoder during decompression. During compression, the encoder dynamically computes a quantization interval for each coefficient based on information that will be available during decoding; the decoder uses the same model to recompute the quantization interval for each coefficient based on the values of the coefficients decoded so far. See, e.g., ISO/IEC 15444-1:2000, “JPEG2000 Part I: Image Coding System,” Final Committee Draft Version 1.0 (Mar. 16, 2000) (JPEG2000 standard); ISO/IEC 15444-2:2000, “JPEG2000 Part II: Extensions,” Final Committee Draft (Dec. 7, 2000) (point-wise extended masking extension). While a well-designed system using such recomputed quantization can yield dramatic improvements over predefined quantization, it is still limited in that the perceptual model utilized cannot involve any information lost during quantization, and the quantization of a coefficient cannot depend on any information that is transmitted after that coefficient in the bitstream.

The most flexible approach in the prior art is to include additional side information in the coded bitstream, thereby giving the decoder hints about how the coefficient values were quantized. Unfortunately, side information adds bits to the bitstream and, thus, lowers the compression ratio.
Accordingly, there is a need for a new approach that can fully exploit perceptual modeling techniques while avoiding the need for side information.
An encoding system and method are disclosed which utilize a modified quantization approach that advantageously foregoes the need for inverse quantization at the decoder. A plurality of coefficients is obtained from an input signal, e.g., by a transformation or from sampling, and, for each coefficient, a range of quantized values is determined that will not produce unacceptable perceptual distortion, preferably in accordance with an arbitrary perceptual model. This range of values is referred to herein as the “perceptual slack” for the coefficient. A search is then conducted for code values, based on a selected entropy code, that lie within the perceptual slack for each of the coefficient values. A sequence of code values is selected which minimizes the number of bits emitted by the entropy code. The modified quantizer thereby maps the coefficient values into a sequence of code values that can be encoded in such a way that the resulting perceptual distortion is within some prescribed limit and such that the resulting entropy-coded bit sequence is as short as possible. The perceptual model is advantageously not directly involved in the entropy code and, thus, it is unnecessary to limit the perceptual model to processes that can be recomputed during decoding.
In accordance with another aspect of the invention, an embodiment is disclosed in which the entropy code utilized with the modified quantizer can be optimized for a corpus of data. The corpus is utilized to obtain coefficient values and their respective perceptual slack ranges as determined by the perceptual model. At a first iteration, the code value to which the largest number of coefficients can be quantized is identified; all coefficients whose slack ranges overlap with this code value are removed from the corpus. The probability of this value in the probability distribution is set to the frequency with which coefficients can be quantized to it. At the next iteration, the second-most common value in the quantized data is recorded, and so on, until the corpus is empty. The resulting probability distribution can be utilized to construct the entropy codes, as well as to guide the modified quantization.
In accordance with another aspect of the invention, a new technique for constructing codes for the entropy coder is disclosed. A conventional Huffman code is constructed for the strings in the code list. If the number of extra bits required to code each string exceeds a threshold, then a selection of strings in the code list is replaced by longer strings. Another set of Huffman codes is then constructed, and the processing is iterated until the extra bits do not exceed the threshold. A number of heuristics can be utilized for selecting the strings to replace, including selecting the string with the highest probability, selecting the string that is currently encoded most inefficiently, or selecting the string with the most potential for reducing the extra bits.
The above techniques can be combined together and with a range of advanced perceptual modeling techniques to create a transform coding system of very high performance. These and other advantages of the invention will be apparent to those of ordinary skill in the art by reference to the following detailed description and the accompanying drawings.
In FIG. 2, encoding processing in accordance with an embodiment of the invention is illustrated in flow chart form. A plurality of coefficients is first obtained from the input signal, e.g., by a transformation or from sampling, and a perceptual slack range is determined for each coefficient in accordance with the perceptual model, as described above.
At step 223, a search is conducted for code values based on the selected entropy code that lie within the perceptual slack for each of the coefficient values. Then, at step 224, a sequence of code values is selected which minimizes the number of bits emitted by the entropy code. For example, consider the situation in which the entropy code is optimized for a sequence of independent and identically distributed (i.i.d.) coefficient values. Assume that the code will yield optimal results when each coefficient value is drawn independently from a stationary distribution P, such that P(x) is the probability that a coefficient will have value x. The entropy code being optimal for this distribution means that the average number of bits required for a given value, x, is just −log2(P(x)). Thus, for each coefficient, step 224 selects the allowable value that requires the fewest bits:

xq = arg min{ −log2(P(x)) : xmin <= x <= xmax }

where xmin and xmax are the ends of the range of values allowable for that coefficient and xq is the selected value.
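The selection can be sketched in a few lines of Python. This is a minimal illustration rather than the disclosed implementation: it assumes integer code values and a probability table prob mapping code values to P(x), both of which are hypothetical.

    import math

    def modified_quantize(coeffs, slacks, prob):
        # For each coefficient, choose the code value inside its slack
        # range that the entropy code represents most cheaply; maximizing
        # P(x) is equivalent to minimizing -log2(P(x)).
        out = []
        for x, (xmin, xmax) in zip(coeffs, slacks):
            candidates = range(math.ceil(xmin), math.floor(xmax) + 1)
            # if no code value falls inside the range, fall back to the
            # integer nearest the original value
            best = max(candidates, key=lambda v: prob.get(v, 0.0),
                       default=round(x))
            out.append(best)
        return out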
It is helpful to contrast the approach illustrated above with prior art quantization. Using a conventional quantization approach, the arbitrary coefficient values would be replaced with discrete symbols by applying a real-valued function and rounding the real-valued results to the nearest integer. In other words, a quantization function Q(x) is typically defined as Q(x)=round(f(x)) where f(x) is the arbitrary real-valued function that defines the manner in which quantization is performed. As f(x) changes from coefficient to coefficient, in accordance with the specific perceptual coding strategy, the transform decoder needs to follow these changes. The prior art transform decoder accomplishes this by performing the process of “inverse quantization,” namely by applying the inverse of f(x). This process of inverse quantization does not actually invert Q(x), since information is lost during rounding.
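For contrast, the conventional arrangement can be sketched as follows; the function names are hypothetical.

    def conventional_quantize(x, f):
        # Prior art: Q(x) = round(f(x)). Information is lost in rounding,
        # so this mapping cannot be truly inverted.
        return round(f(x))

    def conventional_dequantize(xq, f_inv):
        # "Inverse quantization": the decoder must track f and apply its
        # inverse, so f cannot depend on information lost in quantization.
        return f_inv(xq)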
The transform coefficients are processed in this manner, coefficient by coefficient, until the entire transform has been quantized and entropy coded.
ENTROPY CODE DESIGN. Although the above-mentioned modified quantization can be utilized with any entropy encoder, it is preferable to select an entropy code that is optimized for use with the modified quantization approach. Assuming that the entropy codes are designed for i.i.d. coefficient values, this amounts to seeking the best probability distribution, P, for which to optimize the code. In the absence of any quantization, P(x) would simply be the frequency with which x appears in the transforms of a large corpus of sample data. When applying the above quantization approach, however, these frequencies will change. Moreover, the changes made will depend on P itself. What is preferable, then, is a P that matches the distribution resulting from the modified quantization when that quantization is applied using P itself. This distribution preferably should have as low an entropy as can be managed, given the limits imposed by the perceptual model.
As an example, such a distribution can be obtained by the iterative procedure described earlier: the corpus is used to obtain coefficient values and their slack ranges as determined by the perceptual model; at each iteration, the code value to which the largest number of remaining coefficients can be quantized is identified, the probability of that value is set to the frequency with which coefficients can be quantized to it, and all coefficients whose slack ranges overlap that value are removed from the corpus; the process repeats until the corpus is empty.
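One way to realize this procedure is sketched below in Python, assuming integer code values; coefficients whose slack ranges contain no integer are simply dropped from the corpus in this sketch.

    import math
    from collections import Counter

    def design_distribution(slacks):
        # Greedy approximation of the low-entropy distribution: at each
        # pass, find the integer code value covered by the most remaining
        # slack ranges, set its probability to that frequency, and remove
        # the covered coefficients from the corpus.
        ranges = [(math.ceil(lo), math.floor(hi)) for lo, hi in slacks]
        ranges = [(lo, hi) for lo, hi in ranges if lo <= hi]
        total = len(ranges)
        P = {}
        while ranges:
            votes = Counter()
            for lo, hi in ranges:
                votes.update(range(lo, hi + 1))
            value, count = votes.most_common(1)[0]
            P[value] = count / total
            ranges = [r for r in ranges if not r[0] <= value <= r[1]]
        return P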
Code Construction. Once a probability distribution of code words is obtained, the code utilized by the entropy coder can be readily constructed using any of a number of known techniques. See, e.g., D. A. Huffman, “A Method for the Construction of Minimum-Redundancy Codes,” Proceedings of the I.R.E., pp. 1098-1102 (September 1952); J. S. Vitter, “Design and Analysis of Dynamic Huffman Codes,” Journal of the ACM, pp. 825-45 (October 1987). In accordance with an embodiment of another aspect of the invention, a technique is disclosed below for constructing codes that encode strings of symbols rather than individual symbols.
At step 701, a set of strings S is initialized to {‘s1’, ‘s2’, . . . ‘sn’}, where ‘si’ is a string consisting of only the symbol si. This is the set of strings, each of which will be represented by a specific bit sequence. As the processing progresses, some of these strings will probably be replaced by longer strings. At step 702, a conventional Huffman code C(•) is constructed for the strings in S. Huffman's algorithm, and its variants, generate the best code that can be achieved using an integral number of bits to represent each string, but this is unlikely to be the most efficient code possible, because many symbols should be encoded with non-integral numbers of bits. The expected number of bits per symbol in a message encoded using C(•) is given by the following equation:

b = ( Σ P(S) len(C(S)) ) / ( Σ P(S) len(S) )

where the sums are taken over all strings S in S.
P(S) is the probability that the next several symbols in the sequence will match string S. As the symbols are assumed to be i.i.d., this is equal to the product of the probabilities of the individual symbols in S. The expression len(•) gives the length of a string of symbols or a sequence of bits. C(S) is an encoding of string S with a sequence of bits. Thus, the expression b is just the ratio between the expected number of bits and the expected string length. The theoretical minimum number of bits per symbol is given by the entropy of the symbol distribution:

H = −Σ P(si) log2(P(si))

where the sum is taken over all symbols si.
P(si) is the probability that the next symbol in the sequence will be si. This is independent of previous symbols in the sequence.
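Both quantities can be computed directly from these definitions, as in the following sketch (the dictionary-based helper names are hypothetical):

    import math

    def bits_per_symbol(S, P, C_len):
        # b: expected bits divided by expected string length, per the
        # equation above. P[s] and C_len[s] give the probability and the
        # Huffman code length of string s.
        num = sum(P[s] * C_len[s] for s in S)
        den = sum(P[s] * len(s) for s in S)
        return num / den

    def entropy(P_sym):
        # H: the theoretical minimum number of bits per symbol.
        return -sum(p * math.log2(p) for p in P_sym.values() if p > 0)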
Thus, at step 703, the extra bits per symbol, e = b − H, is computed and compared against a threshold. If e exceeds the threshold, then, at step 705, a string S is selected and replaced in S by the longer strings concat(S, si) for each symbol si, each new string being assigned the probability
P(concat(S, si))=P(S)P(si).
Then, the processing continues back at step 702, with the construction of a new Huffman code for the strings in S.
With regard to the strategy for selecting the string S to be replaced in step 705, a variety of heuristics can be utilized. The better the strategy utilized, the smaller the code books should be. The simplest heuristic is to select the S that has the highest probability. This is intuitive, because it will tend toward a set of strings that all have similar probabilities. However, it might be that the most probable string is already perfectly coded, in which case replacing it with longer strings is unlikely to improve the performance of the code. Another strategy is to select the S that is currently encoded most inefficiently. That is, one can pick the S that maximizes eS = len(C(S)) − (−log2(P(S))). This typically works better than picking the most probable S, but it does not consider all of the characteristics of the string that affect the calculation of e above, which is the value that needs to change. The approach that appears to work the best is to select the S that has the most potential for reducing e. A determination is made of how much e will be reduced if, by replacing S with longer strings, the first len(S) symbols of those strings are caused to be perfectly encoded. This would mean that the numerator in the equation for b above would be reduced by P(S)eS. At the same time, by replacing S with n strings that are one symbol longer, the denominator would be increased by P(S). Thus, it is desirable to seek the string that minimizes

( Σ P(S′) len(C(S′)) − P(S) eS ) / ( Σ P(S′) len(S′) + P(S) )

where the sums are taken over all strings S′ in S.
It is useful to terminate the processing in any event once the number of strings in S reaches a prescribed maximum, since for some distributions the extra bits e may never fall below the chosen threshold.
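The overall construction can be sketched as follows, assuming an i.i.d. symbol distribution P_sym; the threshold and the max_iters safeguard are illustrative values rather than parameters taken from the disclosure. The replacement heuristic is the third strategy above: minimize the projected value of b after replacement.

    import heapq, itertools, math

    def huffman_lengths(P):
        # Bit length assigned to each key by a standard Huffman code.
        if len(P) == 1:
            return {k: 1 for k in P}
        tie = itertools.count()          # tie-breaker for the heap
        heap = [(p, next(tie), [k]) for k, p in P.items()]
        heapq.heapify(heap)
        length = {k: 0 for k in P}
        while len(heap) > 1:
            p1, _, ks1 = heapq.heappop(heap)
            p2, _, ks2 = heapq.heappop(heap)
            for k in ks1 + ks2:
                length[k] += 1           # every merge adds one bit
            heapq.heappush(heap, (p1 + p2, next(tie), ks1 + ks2))
        return length

    def build_string_code(P_sym, threshold=0.05, max_iters=1000):
        # Grow the string set S (strings as symbol tuples) until the
        # extra bits per symbol e = b - H fall below the threshold.
        H = -sum(p * math.log2(p) for p in P_sym.values() if p > 0)
        S = {(sym,): p for sym, p in P_sym.items()}
        for _ in range(max_iters):
            L = huffman_lengths(S)
            num = sum(S[s] * L[s] for s in S)    # expected bits
            den = sum(S[s] * len(s) for s in S)  # expected symbols
            if num / den - H <= threshold:
                break
            def projected_b(s):
                e_s = L[s] + math.log2(S[s])     # actual minus ideal bits
                return (num - S[s] * e_s) / (den + S[s])
            s = min(S, key=projected_b)          # most potential gain
            p = S.pop(s)
            for sym, p_sym in P_sym.items():     # replace with longer strings
                S[s + (sym,)] = p * p_sym
        return S, huffman_lengths(S)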
By encoding strings of symbols, rather than individual symbols, it is possible to encode some symbols with non-integral numbers of bits. This is particularly important when the probability of some symbols is larger than 0.5, because such symbols should be encoded with less than one bit, on average. This occurs with values of 0, which typically arise far more than half the time in practical applications. The technique above will produce the equivalent of run-length codes in such cases.
Parametric Codes. The above description of the entropy coder has focused on a limited form of entropy coding that utilizes fixed sets of predefined codebooks. While the above-mentioned modified quantization approach serves to map the distribution of values into one that is appropriate for the given code, even further improvements in matching the distribution of coefficient values in a given data set can be obtained by using more flexible forms of entropy coding. For example, the probability distribution for the entropy code could be described with a small set of parameters. The encoder could then choose parameters that provide the best match to an ideal distribution, as determined by the distribution-design processing described above.
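As a sketch of how such parameters might be chosen, the following fits a one-parameter two-sided geometric distribution to an ideal code-value distribution by grid search over the cross-entropy; the parametric family is illustrative only, not one prescribed by the disclosure.

    import math

    def fit_geometric(P_ideal):
        # Fit P(x) proportional to r^|x| by minimizing cross-entropy with
        # the ideal distribution; the normalizer z runs over the finite
        # support of P_ideal (an assumption of this sketch).
        xs = list(P_ideal)
        best_r, best_ce = None, float("inf")
        for r in [i / 100 for i in range(1, 100)]:
            z = sum(r ** abs(x) for x in xs)
            ce = -sum(p * math.log2(r ** abs(x) / z)
                      for x, p in P_ideal.items() if p > 0)
            if ce < best_ce:
                best_r, best_ce = r, ce
        return best_r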
Exploiting Mutual Information. The above description has also assumed that the entropy code is optimized for i.i.d. coefficient values. This means that the average number of bits required for a given value is independent of the values around it. If there is significant mutual information between coefficients, however, then the code should be context-dependent, meaning that the number of bits should depend on surrounding values. For example, if successive coefficient values are highly correlated, a given coefficient value should require fewer bits if it is similar to the preceding coefficient, and more bits if it is far from the preceding coefficient. The above modified quantization approach can be applied with a context-dependent entropy code. The context can be examined to determine the number of bits required to represent each possible new value of a coefficient. That is, the new value of a coefficient is given by

xq = arg min{ B(x, C) : xmin <= x <= xmax }
where xmin and xmax describe the slack range for the coefficient, C is a neighborhood of coefficient values that affect the coding of the current coefficient, and B(x,C) gives the number of bits required to encode value x in context C (infinity if the code cannot encode x in that context). The improvement obtained using context-dependent coding may be dramatic, because there is substantial mutual information between coefficient slack-ranges. That is, a coefficient's neighborhood has a significant impact on its slack-range, and hence on its quantized value.
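In code, this is a small change to the earlier quantizer: the bit-cost function B now takes the context as an argument (the interface is hypothetical).

    import math

    def quantize_in_context(x, xmin, xmax, C, B):
        # Choose the code value in the slack range that is cheapest to
        # encode in context C; B(v, C) returns the bit cost, or infinity
        # when v cannot be coded in that context.
        candidates = range(math.ceil(xmin), math.floor(xmax) + 1)
        return min(candidates, key=lambda v: B(v, C), default=round(x))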
PERCEPTUAL MODEL. The following example perceptual model illustrates the flexibility afforded by the above-described modified quantization approach. It should be noted that the perceptual model described herein has not been selected as an example of an optimal design, but rather to illustrate the limitations that constrain prior art perceptual model design, and how those constraints can be overcome with the present approach, thereby allowing almost completely arbitrary design of future perceptual models.
The model assigns slack ranges to wavelet coefficients of images and is a variation on the perceptual model implicit in the visual optimization tools provided in JPEG 2000. The wavelet transform used here is the 9/7 transform used in JPEG 2000. The number of times the transform is applied to the image depends on the original image size: it is applied enough times to reduce the LL band to 16×16 coefficients or smaller. Thus, for example, if the original image is 256×256, the system uses a four-level transform. The model is controlled by a single parameter, q, which determines the amount by which the image may be distorted during quantization. When q=0, all coefficients are assigned slacks of 0, and no quantization takes place. As q increases, the slack ranges become progressively larger, and the image is more heavily quantized. No attempt is made to perform sophisticated perceptual modeling for the final LL band of the transform. This band has dramatically different perceptual qualities from the other bands and would require a different method of assigning slack ranges. However, as the band is small compared to the rest of the image, such a method is not necessary for present purposes. Instead, each coefficient in this band is given a slack range obtained by
xmin = x − min(q, 1)
xmax = x + min(q, 1)
where x is the original value of the coefficient and xmin and xmax give the slack range.
The method of assigning slack ranges for the remaining coefficients is described below as a succession of components. Each of the following describes a progressively more sophisticated aspect of the perceptual model.
Self masking. The model begins by replicating the JPEG 2000 tool of self-contrast masking. The idea behind this tool is that the amount by which a coefficient may be distorted increases with the coefficient's magnitude. This suggests that the quantization scale should be non-linear. In JPEG 2000, the non-linear quantization scale is implemented by applying a non-linear function to each coefficient before linear quantization in the encoder:
x1 = x/Csb
x2 = x1^α
xq = round(x2)
where α is a predefined constant, usually 0.7, and Csb is a constant associated with the subband being quantized, based on the contrast sensitivity function of the human visual system. This process is inverted (except for the rounding operation) in the decoder:

xhat = Csb xq^(1/α)
To find a slack range based on this tool, we want to find xmin and xmax such that xmin <= xhat <= xmax. This will give us the range of values that the above coding and decoding process might produce, which, implicitly, is the range of values that should yield acceptable distortion. As the rounding operation might add or subtract up to 0.5 (|x2 − xq| <= 0.5), the range of possible values for xhat is given by

(x^α − 0.5 Csb^α)^(1/α) <= xhat <= (x^α + 0.5 Csb^α)^(1/α)
We can replace 0.5 Csb^α with a different constant, also indexed by subband, Qsb. To control the amount of distortion, we multiply this latter constant by q. So the final mechanism for handling self-contrast masking in the present perceptual model is

xmin = (x^α − q Qsb)^(1/α)
xmax = (x^α + q Qsb)^(1/α)
A minor problem arises when x or x^α − q Qsb is less than zero, because this can lead to imaginary values of xmin. To solve this, one can simply clip the range at zero. If x >= 0, then

xmin = max(0, x^α − q Qsb)^(1/α)
xmax = (x^α + q Qsb)^(1/α)

and the case x < 0 is handled symmetrically, by applying the same formulas to |x| and negating and swapping the resulting bounds.
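The resulting slack computation can be sketched as follows; the symmetric handling of negative coefficients is an assumption of this sketch.

    def self_masking_slack(x, q, Q_sb, alpha=0.7):
        # Slack range from self-contrast masking, per the formulas above,
        # clipped at zero to avoid imaginary roots.
        sign, mag = (1, x) if x >= 0 else (-1, -x)
        lo = max(0.0, mag ** alpha - q * Q_sb) ** (1 / alpha)
        hi = (mag ** alpha + q * Q_sb) ** (1 / alpha)
        return (lo, hi) if sign > 0 else (-hi, -lo)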
Neighborhood masking. The next mechanism models the effect of a coefficient's local neighborhood on its slack range. If there is a lot of energy in the neighborhood, with the same frequency and orientation as the coefficient in question, then distortions will be less perceptible and the slack can be increased. This is handled in JPEG 2000 with what is called point-wise extended masking, wherein x2 (see above) is adjusted according to a function of the coefficient values in the neighborhood. Thus

x2′ = x2/n
n = 1 + ( a Σi∈N |xqi|^β ) / |N|
where a is a constant, N is the set of coefficient indices describing the neighborhood, |N| is the size of that set, xqi is the previously quantized value for coefficient i, and β is a small constant. As with the self-masking tool described above, this process must be inverted at the decoder, which means that n must be computable at the decoder. This is made possible by computing n from the quantized coefficient values in the neighborhood, rather than their original values, and by limiting the neighborhood to coefficients appearing earlier in the scanning order.
The above-described modified quantization approach removes the need for these prior art limitations. To illustrate this, the present model computes the masking adjustment from the original, unquantized values of all of the coefficients in a neighborhood surrounding the current coefficient, adjusting the coefficient value according to

x′ = x ( 1 + ( k Σi∈N |xi|^p ) / |N| )
where k and p are constants, and N describes the neighborhood. xmin and xmax are then computed from x′ instead of x, as described above.
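A sketch combining the neighborhood adjustment with the self-masking slack of the previous section follows; the constants k and p shown are illustrative, not values taken from the disclosure.

    def neighborhood_adjusted_slack(x, neighbors, q, Q_sb,
                                    alpha=0.7, k=0.01, p=0.2):
        # The adjustment uses the original values of all neighbors, so no
        # decoder-side recomputation is needed.
        n = 1 + k * sum(abs(v) ** p for v in neighbors) / max(1, len(neighbors))
        xp = x * n                      # masking-adjusted coefficient value
        sign, mag = (1, xp) if xp >= 0 else (-1, -xp)
        lo = max(0.0, mag ** alpha - q * Q_sb) ** (1 / alpha)
        hi = (mag ** alpha + q * Q_sb) ** (1 / alpha)
        return (lo, hi) if sign > 0 else (-hi, -lo)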
Calculating slacks before subsampling. One of the problems with perceptual modeling for wavelet transforms is that each subband is subsampled at a rate lower than the Nyquist frequency. The information lost in this sampling is recovered in lower-frequency subbands. This means that, if we try to estimate the local energy of a given frequency and orientation by looking at the wavelet coefficients (as the above perceptual model does so far), then aliasing can severely distort the estimates. This problem can be reduced by simply calculating slacks before subsampling each subband. Each level of the forward wavelet transform can be implemented by applying four filters to the image—a low-pass filter (LL), a horizontal filter (LH), a vertical filter (HL), and a filter with energy along both diagonals (HH)—and then subsampling each of the four resulting filtered images. The next level is obtained by applying the same process recursively to the LL layer. Slacks can be computed, using the above models for self masking and neighborhood masking, after applying the filters but before subsampling. The slacks themselves are then subsampled along with the subbands.
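The following sketch illustrates the idea for a single subband: band_full stands for one filtered but not yet subsampled subband of an undecimated transform (an assumed input), and the self-masking slack formula from above is applied at full resolution before subsampling.

    import numpy as np

    def slacks_before_subsampling(band_full, q, Q_sb, step=2, alpha=0.7):
        # Compute slacks at full resolution, where local-energy estimates
        # are free of aliasing, then subsample slacks and subband together.
        mag = np.abs(band_full)
        lo = np.maximum(0.0, mag ** alpha - q * Q_sb) ** (1.0 / alpha)
        hi = (mag ** alpha + q * Q_sb) ** (1.0 / alpha)
        xmin = np.where(band_full >= 0, lo, -hi)   # signed slack bounds
        xmax = np.where(band_full >= 0, hi, -lo)
        band = band_full[::step, ::step]           # subsample the subband
        return band, (xmin[::step, ::step], xmax[::step, ::step])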
Separating orientations in the diagonal band. Another perennial problem with perceptual modeling for wavelet transforms is that the HH subband contains energy in both diagonal directions. This is a problem because the two directions are perceptually independent: energy in one direction does not mask noise in the other. A model that calculates slacks from the local energy in the HH subband, however, cannot distinguish between the two directions. A large amount of energy along one diagonal will translate into a high slack, allowing large distortions in the HH subband that will introduce noise in both directions. To solve this problem, we can compute two sets of slacks, using two single-diagonal filters, and combine them by taking the intersection of the two ranges:

xmin = max(xmin[1], xmin[2])
xmax = min(xmax[1], xmax[2])
where xmin[1] and xmax[1] are the minimum and maximum values of the slack range computed for the first diagonal, and xmin[2] and xmax[2] are the slack range computed for the second diagonal. Basically, this says that the maximum amount a given HH coefficient may change is limited by the minimum masking available in the two diagonal directions.
EXAMPLE SYSTEM.
The encoder 1100, as further described above, first applies a transform at 1010 to the input signal. The encoder 1100 then computes the perceptual slack at 1022 for the coefficients in the transformed signal in accordance with the specified perceptual model 1070, such as the model described above. Then, the encoder 1100 at 1024 selects code values from the codebook 1025 that lie within the perceptual slack for each of the coefficients. The encoder 1100 then applies an entropy coder 1030 using the selected code values. The decoder 1200 can decode the coded signal 1005 by simply using an entropy decoder 1040 and applying an inverse transform 1060 without any inverse quantization. As discussed above, it is preferable to utilize a codebook 1025 that has been optimally generated at 1080 for use with the system. The code generator 1080, in the context of generating appropriate codes for the encoder 1100 and decoder 1200, can utilize the approximation processing described above for entropy code design and code construction.
It is also advantageous to incorporate techniques such as subband coding and zero tree coding in the system. Subband coding is basically the process of quantizing and coding each wavelet subband separately. In the context of the present system, this amounts to designing a separate codebook 1025, and applying a separate entropy code, for each subband, since the statistics of the different subbands can differ substantially.
It is also advantageous to incorporate zero tree coding into the construction of the codes. Zero tree coding is a method of compacting quantized wavelet transforms. It is based on the observation that, when a wavelet coefficient can be quantized to zero, higher-frequency coefficients in the same orientation and basic location can also be quantized to zero. As a coefficient at one level corresponds spatially with four coefficients at the next lower (higher-frequency) level, coefficients can be organized into trees that cover small blocks of the image, and in many of these trees all the coefficients can be quantized to zero. Such trees are referred to in the art as “zero trees.”
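A zero-tree test can be expressed directly in terms of slack ranges: a tree can be quantized entirely to zero exactly when every slack range in the tree contains zero. The slack indexing below (a list of per-level 2-D arrays, with four children per coefficient) is an assumption of the sketch.

    def zero_tree(slack, level, i, j, max_level):
        # slack[level][i][j] = (xmin, xmax); a coefficient can be quantized
        # to zero when its slack range contains zero.
        lo, hi = slack[level][i][j]
        if not (lo <= 0.0 <= hi):
            return False
        if level == max_level:
            return True
        # each coefficient has four children at the next finer level
        return all(zero_tree(slack, level + 1, 2 * i + di, 2 * j + dj, max_level)
                   for di in (0, 1) for dj in (0, 1))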
Scalable coding/decoding. Currently, there is much interest in arranging for a decoder to obtain images of different quality by decoding different subsets of the coded image. That is, if the decoder decodes the first N0 bits, it should obtain a very rough approximation of the image; if it decodes the first N1 > N0 bits, the approximation should be better; and so on. This is referred to in the art as scalable coding and decoding. The above modified quantization approach can be utilized to effectuate scalable coding/decoding. An image can first be quantized and encoded with very large perceptual slacks (e.g., a large value of q). Next, a narrower set of slack ranges is computed (a smaller value of q); before these slacks are used for the modified quantization, however, the previously quantized, lower-quality image is subtracted from them. That is, for each coefficient, xmin − xq0 and xmax − xq0 are used instead of xmin and xmax, where xq0 is the previous quantized value of the coefficient. Since xq0 is likely to be close to the original value of the coefficient, x, the new slack ranges will be tightly grouped around zero and can be highly compressed. To reconstruct the higher-quality layer upon decoding, the decoded refinement layer is simply added to the decoded lower-quality layer.
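The layered procedure can be sketched as follows; slack_fn and quantize are assumed helpers corresponding to the perceptual model and the modified quantizer described above.

    def scalable_layers(coeffs, slack_fn, qs, quantize):
        # qs: sequence of q values from large (coarse) to small (fine).
        # slack_fn(x, q) -> (xmin, xmax); quantize(x, lo, hi) picks a code
        # value in [lo, hi] (e.g., the modified quantizer above).
        recon = [0.0] * len(coeffs)
        layers = []
        for q in qs:
            layer = []
            for i, x in enumerate(coeffs):
                lo, hi = slack_fn(x, q)
                # shift the slack range by the running reconstruction so
                # refinement values are tightly grouped around zero
                v = quantize(x - recon[i], lo - recon[i], hi - recon[i])
                layer.append(v)
                recon[i] += v          # the decoder adds layers the same way
            layers.append(layer)
        return layers, recon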
While exemplary drawings and specific embodiments of the present invention have been described and illustrated, it is to be understood that the scope of the present invention is not to be limited to the particular embodiments discussed. Thus, the embodiments shall be regarded as illustrative rather than restrictive, and it should be understood that variations may be made in those embodiments by workers skilled in the art without departing from the scope of the present invention as set forth in the claims that follow and their structural and functional equivalents. As but one of many variations, it should be understood that transforms and entropy coders other than those specified above can be readily utilized in the context of the present invention.
This application claims the benefit of U.S. Provisional Application No. 60/643,417, entitled “TRANSFORM CODING SYSTEM AND METHOD,” filed on Jan. 12, 2005, the contents of which are incorporated by reference herein.