The advantages of the wavelet transform over conventional transforms, such as the Fourier transform, are now well recognized. In many application areas, the wavelet transform is more efficient at representing signal features that are localized in both time and frequency. Over the past 15 years, wavelet analysis has become a standard technique in such diverse areas as geophysics, meteorology, audio signal processing, and image compression. Significantly, the 2-D biorthogonal discrete wavelet transform (DWT) has been adopted in the recently established JPEG-2000 still image compression standard.
The classical DWT can be calculated using an approach known as Mallat's tree algorithm. Here, the lower resolution wavelet coefficients of each DWT stage are calculated recursively according to the following equations:
where
The corresponding tree structure for a two-level DWT is illustrated in
The structure of the corresponding separable 2-D DWT algorithm is shown in
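For orientation, a minimal software sketch of one stage of the tree algorithm (filtering followed by downsampling by two) is given below. The filter taps h and g are placeholders for whichever analysis pair is in use, and periodic extension is assumed at the boundaries; this is an illustration only, not the lifting hardware described later.

```python
# Minimal sketch of one stage of Mallat's tree algorithm: filter with the
# analysis pair (h, g), then downsample by two.  The taps h and g are
# placeholders for whichever analysis pair is in use; periodic extension is
# assumed at the boundaries.

def dwt_stage(x, h, g):
    """Return (approximation, detail) coefficients for one DWT stage."""
    n = len(x)
    approx, detail = [], []
    for m in range(0, n, 2):                      # keep every other output
        approx.append(sum(h[k] * x[(m - k) % n] for k in range(len(h))))
        detail.append(sum(g[k] * x[(m - k) % n] for k in range(len(g))))
    return approx, detail

def dwt_tree(x, h, g, levels):
    """Dyadic DWT: only the approximation band is decomposed further."""
    details = []
    for _ in range(levels):                       # len(x) assumed divisible by 2**levels
        x, d = dwt_stage(x, h, g)
        details.append(d)
    return x, details
```

Each recursion operates only on the approximation band, exactly as in the dyadic tree structure.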
In 1994, Sweldens proposed a more efficient way of constructing the biorthogonal wavelet bases, called the lifting scheme. Concurrently, similar ideas were also proposed by others. The basic structure of the lifting scheme is shown in
Daubechies and Sweldens showed that every FIR wavelet or filter bank can be factored into a cascade of lifting steps, that is, it can be represented as a finite product of upper and lower triangular matrices and a diagonal normalization matrix. The high-pass filter g(z) and low-pass filter h(z) in Equations 1 and 2 can thus be rewritten as:
where J is the filter length. We can split the high-pass and low-pass filters into even and odd parts:
$g(z) = g_e(z^2) + z^{-1} g_o(z^2)$  (5)
$h(z) = h_e(z^2) + z^{-1} h_o(z^2)$  (6)
The filters can also be expressed as a polyphase matrix as follows:
Using the Euclidean algorithm, which recursively finds the greatest common divisors of the even and odd parts of the original filters, the forward transform polyphase matrix $\tilde{P}(z)$ can be factored into lifting steps as follows:
where si(z) and ti(z) are Laurent polynomials corresponding to the update and prediction steps, respectively, and K is a non-zero constant. The inverse DWT is described by the following synthesis polyphase matrix:
As an example, the low-pass and high-pass filters corresponding to the Daubechies 4-tap wavelet can be expressed as:
$\tilde{h}(z) = h_0 + h_1 z^{-1} + h_2 z^{-2} + h_3 z^{-3}$, $\tilde{g}(z) = -h_3 z^2 + h_2 z - h_1 + h_0 z^{-1}$,  (10)
where
Following the above procedure, we can factor the analysis polyphase matrix of the Daubechies-4 wavelet filter as:
The corresponding synthesis polyphase matrix can be factored as:
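Equations 11 and 12 themselves are not reproduced in this text. For a concrete software reference, a commonly cited lifting factorization of the Daubechies-4 analysis filter (after Daubechies and Sweldens) is sketched below; its signs and normalization may differ from the factorization used in the hardware, and zero padding is assumed at the boundaries.

```python
import math

# A commonly cited lifting factorization of the Daubechies-4 analysis filter
# (after Daubechies and Sweldens).  Its signs and normalization may differ from
# the factorization of Equations 11 and 12; zero padding handles the boundaries.

SQRT3, SQRT2 = math.sqrt(3.0), math.sqrt(2.0)

def daub4_lifting(x):
    """Forward Daub-4 DWT of an even-length sequence via lifting."""
    even, odd = list(x[0::2]), list(x[1::2])                             # split
    n = len(even)
    s1 = [even[i] + SQRT3 * odd[i] for i in range(n)]                    # update 1
    d1 = [odd[i] - (SQRT3 / 4.0) * s1[i]
                 - ((SQRT3 - 2.0) / 4.0) * (s1[i - 1] if i > 0 else 0.0)
          for i in range(n)]                                             # predict
    s2 = [s1[i] - (d1[i + 1] if i + 1 < n else 0.0) for i in range(n)]   # update 2
    low = [((SQRT3 - 1.0) / SQRT2) * v for v in s2]                      # scaling
    high = [((SQRT3 + 1.0) / SQRT2) * v for v in d1]
    return low, high
```

A quick sanity check: for a constant input the high-frequency outputs are zero away from the zero-padded boundary, reflecting the vanishing moments of the Daub-4 filter.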
Similarly, the 9/7 analysis wavelet filter can be factored as:
The corresponding synthesis wavelet filter is factored as:
where the values of α, β, γ, δ, and ζ are shown in
To calculate a DWT using a lifting algorithm, the input signal must first be separated into even and odd samples. Each pair of input samples (one even and one odd) is then processed according to the specific analysis polyphase matrix. For many applications, the data can be read no faster than one input sample per clock cycle, so a new sample pair is available only every other clock cycle and the datapath is idle in between. This idle time limits the speed and efficiency of a direct implementation of the lifting scheme. To overcome this bottleneck, architectures are proposed here in which data streams are interleaved within the DWT. Recursive architectures exploit the available idle cycles and re-use the same hardware to recursively interleave the DWT stages, and dual scan architectures achieve an efficiency gain by keeping the datapath hardware busy with two different streams of data.
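In software terms, the per-pair processing of the factored lifting steps (the si(z) and ti(z) of Equation 8) can be sketched as follows. The representation of each step as a small table of delay/coefficient pairs, the strict alternation of predict and update steps, and the assignment of the K scaling to the low-pass branch are assumptions of this sketch; all of these depend on the particular factorization.

```python
# Generic sketch of applying factored lifting steps to an (even, odd) pair of
# sequences.  Representing each step as a {delay: coefficient} table,
# alternating predict and update steps, and scaling the low-pass branch by K
# are assumptions of this sketch.  Zero padding is assumed at the boundaries.

def lift(updated, other, taps):
    """In-place lifting step: updated[i] += sum over d of taps[d] * other[i - d]."""
    for i in range(len(updated)):
        acc = 0.0
        for d, c in taps.items():
            j = i - d
            if 0 <= j < len(other):
                acc += c * other[j]
        updated[i] += acc
    return updated

def forward_lifting(x, predict_steps, update_steps, K):
    """Split into even/odd, apply the t_i (predict) and s_i (update) steps, scale."""
    even, odd = list(x[0::2]), list(x[1::2])      # split step
    for t, s in zip(predict_steps, update_steps):
        odd = lift(odd, even, t)                  # predict: odd branch uses even
        even = lift(even, odd, s)                 # update: even branch uses odd
    return [K * e for e in even], [o / K for o in odd]
```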
There is therefore provided in accordance with an aspect of the invention, an apparatus for digital signal processing, the apparatus comprising a cascade of digital filters connected to receive a sampled input signal and having an output, in which the digital filters implement a transform decomposed into lifting steps, the cascade of digital filters operating on pairs of samples from the sampled input signal. A source of a data stream is also provided, where the data stream is also composed of samples. A multiplexer multiplexes the samples of the data stream with the sampled input signal for processing by the cascade of digital filters.
In a further aspect of the invention, there is provided a method of transforming a sampled input signal into a transformed output signal, the method comprising the steps of:
operating on pairs of the sampled input signal with a cascade of digital filters that implements a transform decomposed into lifting steps to provide an output; and
operating on samples from a data stream using the cascade of digital filters, where the samples from the data stream have been multiplexed with the sampled input signal.
In further aspects of the invention, the cascade of digital filters implements a one-dimensional discrete wavelet transform, such as a Daubechies-4 wavelet transform or 9/7 wavelet transform. The cascade of digital filters may implement filtering steps corresponding to Laurent polynomials. The cascade of digital filters may implement a two-dimensional transform that is decomposed into a first one-dimensional (row) transform followed by a second one-dimensional (column) transform. A buffer memory may be connected to receive samples from the data stream and output the samples to the cascade of digital filters for processing of the data stream by interleaving of the samples from the data stream with the sampled input signal. The data stream received by the buffer memory may be taken from the output of the cascade of digital filters to provide a recursive architecture. The cascade of digital filters may implement an N-dimensional transform, where N is greater than 2, and the number of digital filter cascades is N.
There will now be described preferred embodiments of the invention with reference to the figures by way of illustration, without intending to limit the invention to the precise embodiments disclosed, in which:
a depicts a controller emitting enabling signals to registers;
b depicts a controller emitting enabling signals to delay stages;
c illustrates how the circuit of
a shows detail of the architecture of the block PE in
This disclosure ends with Tables 1, 3, 5, 7, 9, 11, 13, 15, 17 and 19 that illustrate the manner of implementation of the recursive and dual scan architectures.
In the claims, the word “comprising” is used in its inclusive sense and does not exclude other elements being present. The use of the indefinite article “a” does not exclude more than one of the element being present.
As a preliminary matter, we consider a signal extension method for use in the proposed hardware architectures. To keep the number of wavelet coefficients the same as the number of data samples in the original signal, an appropriate signal extension method is necessary. Typical signal extension methods are zero padding, periodic extension, and symmetric extension. Zero padding is not normally acceptable for the classical wavelet algorithms because of the extra wavelet coefficients it introduces. Periodic extension is applicable to all (biorthogonal and orthogonal) wavelet filters, but symmetric extension is suitable only for (symmetric) biorthogonal wavelet filters. Since the lifting scheme is used to construct biorthogonal wavelets, symmetric extension can always be used when calculating the lifting scheme. However, lifting steps obtained by factoring the finite wavelet filter pairs can also be calculated using simple zero-padding extension. After a polyphase matrix representing a wavelet transform with finite filters is factored into lifting steps, each step becomes a Laurent polynomial, namely the si(z) or ti(z) from Equation 8. Since the difference between the degrees of the even and odd parts of a polynomial is never greater than two, we can always find a common divisor of first order or lower for the polynomials. Hence, a classical wavelet filter can always be factored into first-order or lower-order Laurent polynomials (i.e., si(z) or ti(z)). Lifting steps containing these short polynomials correspond to one- to three-tap FIR filters in the hardware implementations. Because signal extension is not necessary for a two-tap wavelet filter, such as the Haar wavelet, zero padding can be used in the lifting algorithm.
In the preferred embodiment disclosed here, the easily implemented zero extension is used in the proposed architectures. The sample overlap wavelet transform recommended in JPEG-2000 Part II can also be implemented in the proposed 2-D architecture.
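The three extension methods discussed above can be summarized with the following sketch; the helper names are for illustration only, and the symmetric variant shown is the whole-sample (non-edge-repeating) form.

```python
# Illustrative helpers for the three extension methods discussed above.  The
# helper names are for this sketch only; extension lengths are assumed to be
# shorter than the signal.

def zero_pad(x, left, right):
    return [0.0] * left + list(x) + [0.0] * right

def periodic_extend(x, left, right):
    n = len(x)
    return [x[(i - left) % n] for i in range(n + left + right)]

def symmetric_extend(x, left, right):
    n = len(x)
    mirror = lambda p: abs(p) if abs(p) < n else 2 * (n - 1) - abs(p)
    return [x[mirror(i - left)] for i in range(n + left + right)]
```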
Because of the down-sampling resulting from the splitting step at each stage in the lifting-based DWT, the number of low frequency coefficients is always half the number of input samples from the preceding stage. Further, because only the low frequency DWT coefficients are further decomposed in the dyadic DWT, the total number of the samples to be processed for an L-stage 1-D DWT is:
$N\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots + \tfrac{1}{2^{L-1}}\right) = N\left(2 - \tfrac{1}{2^{L-1}}\right) < 2N$
where N is the number of the input samples. For a finite-length input signal, the number of input samples is always greater than the total number of intermediate low frequency coefficients to be processed at the second and higher stages. Accordingly, there are time slots available to interleave the calculation of the higher stage DWT coefficients while the first-stage coefficients are being calculated.
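A quick numeric check of this bound, for example with N = 1024, is shown below.

```python
# Quick numeric check that an L-stage dyadic 1-D DWT processes fewer than 2N
# samples in total, leaving idle cycles in which the higher stages can be
# interleaved.

def total_samples(N, L):
    return sum(N // 2 ** j for j in range(L))     # N + N/2 + ... + N/2^(L-1)

N = 1024
for L in (1, 3, 5, 10):
    t = total_samples(N, L)
    print(L, t, t < 2 * N)                        # e.g. L = 5 gives 1984 < 2048
```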
The recursive architecture (RA) is a general scheme that can be used to implement any wavelet filter that is decomposable into lifting steps. As 1-D examples, we describe RA implementations of the Daub-4 and 9/7 wavelet filters. The RA can be extended to 2-D wavelet filters, and can be extended to even higher dimensions by using the methods set forward in this disclosure.
The RA is a modular scheme made up of basic circuits such as delay units, pipeline registers, multiplier-accumulators (MACs), and multipliers. Since the factored Laurent polynomials si(z) and ti(z) for symmetric (biorthogonal) wavelet filters are themselves symmetric, and those for asymmetric filters are normally asymmetric, we use two kinds of MACs to minimize the computational cost. The MAC for asymmetric filters, shown in
Different kinds of lifting-based DWT architectures can be constructed by combining the four basic lifting step circuits, shown in
Step 1: Decompose the given wavelet filter into lifting steps.
Step 2: Construct the corresponding cascade of lifting step circuits. Replace each delay unit in each circuit with an array of delay units. The number of delay units in the array is the same as the number of wavelet stages.
Step 3: At the beginning of the cascade, construct an array of delay units that will be used to split the inputs for all wavelet stages into even and odd samples. These delay units are also used to temporarily delay the samples so that they can be input into the lifting step cascade at the right time slot. Two multiplexer switches are used to select one even input and one odd input to be passed from the delay units to the first lifting step.
Step 4: Construct a data flow table that expresses how all of the switches are set and how the delay units are enabled in each time slot. There is latency as the initial inputs for the first wavelet stage propagate down through the cascade. A free time slot must then be selected to fix the time when the inputs for the second wavelet stage will be sent into the cascade. All higher order stages must also be scheduled into free time slots in the data flow table; a behavioral sketch of this scheduling is given after Step 5.
Step 5: Design the control sequencer to implement the data flow table.
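The scheduling described in Step 4 can be modeled behaviorally as follows. This is a simplified software model, with one shared datapath processing at most one (even, odd) pair per clock cycle and per-stage pipeline latency ignored; it is not the hardware controller itself.

```python
from collections import deque

# Simplified behavioral model of the Step 4 schedule: one shared datapath
# processes at most one (even, odd) pair per clock cycle, and higher-stage
# pairs are slotted into cycles left idle by the first stage.

def schedule(num_samples, stages):
    pending = [deque() for _ in range(stages)]    # unprocessed inputs per stage
    log = []                                      # (clock cycle, stage) assignments
    for cycle in range(2 * num_samples):
        if cycle < num_samples:
            pending[0].append(cycle)              # one new input sample per cycle
        for s in range(stages):                   # lowest stage has priority
            if len(pending[s]) >= 2:              # a complete (even, odd) pair
                pending[s].popleft()
                pending[s].popleft()
                if s + 1 < stages:
                    pending[s + 1].append(cycle)  # low-pass output feeds next stage
                log.append((cycle, s + 1))
                break
    return log

# Example: with 16 samples and 3 stages, the stage-2 and stage-3 pairs fall into
# cycles in which stage 1 has no new pair of its own to process.
print(schedule(16, 3))
```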
The RA in
In the proposed lifting scheme, a cascade of digital filters CDF (
In
The input registers $R_i$ also synchronize the even and odd samples of each stage. Since the first two stages can be immediately processed when the odd samples are ready, no input register is needed for the odd samples for these two stages. Register $D_i$ is a delay unit for the i-th stage. After splitting the input data into even and odd parts, the Daub-4 DWT is calculated step by step as shown in Table 1. In Table 1, $E_n$ and $O_n$ are the outputs of each lifting step; $e_{-i,j}$ and $o_{-i,j}$ denote the even and odd intermediate results of each lifting step. Since the architecture is pipelined by each MAC unit, the outputs of each lifting step are synchronized. As an example, the calculations of the first pair of DWT coefficients are given below:
E1: $x_{0,1} = x_{0,1}$
O1: $x_{0,2} = x_{0,2}$
E2: $e_{-1,1} = x_{0,1}$
O2: $o_{-1,1} = \alpha x_{0,1} + x_{0,2}$
E3: $e_{-1,1} = \beta o_{-1,1} + e_{-1,1}$
O3: $o_{-1,1} = o_{-1,1}$
E4: $e_{-1,1} = z^{-1} \gamma o_{-1,1} + e_{-1,1}$
O4: $o_{-1,1} = z^{-1} o_{-1,1}$
Low frequency DWT coefficient l: $l_{-1,1} = \upsilon \cdot e_{-1,1}$
High frequency DWT coefficient h: $h_{-1,1} = \omega(\lambda e_{-1,1} + o_{-1,1})$.
Therefore, the DWT coefficients of the first stage are generated five clock cycles after the first input sample is received. The first low frequency DWT coefficient $l_{-1,1}$ is also stored in register $R_2$. After the second low frequency DWT coefficient $l_{-1,2}$ is ready, $l_{-1,1}$ and $l_{-1,2}$ are further processed in the idle cycles, as shown in Table 1. The outputs at various stages of the lifting steps corresponding to the Daub-4 wavelet are shown in
Matrix 1:
$[A_1\ B_1] = [x_{0,1}\ x_{0,2}]$ (columns $[E_1\ O_1]$)
The $[E_p, O_p]$ denote the outputs at the p-th clock cycle. The $[A_p, B_p]$ are as defined in
The control signals for the switches in an RA can also be deduced from the corresponding data flow table (which is Table 1 in this case). The timing for the register enable signals is shown in Table 3. Switches S1, S2 and S3 steer the data flows at each stage. The timing of the switch control signals is shown in Table 5. Output switch S4 feeds back the low frequency DWT coefficients (except for the last stage) to be further decomposed. The switch timing for S4 is the same as for S1.
The calculations of the first pair of DWT coefficients for Daubechies 9/7 wavelet are given below:
E1: $x_{0,1} = x_{0,1}$
O1: $x_{0,2} = x_{0,2}$
E2: $e_{-1,1} = z^{-1} \cdot x_{0,1}$
O2: $o_{-1,1} = \alpha(\varphi + x_{0,1}) + x_{0,2}$; $o_{-1,2} = \alpha(x_{0,1} + x_{0,3}) + x_{0,4}$
E3: $e_{-1,1} = \beta(o_{-1,1} + o_{-1,2}) + e_{-1,1}$
O3: $o_{-1,1} = z^{-1} \cdot o_{-1,1}$
O4: $o_{-1,1} = \gamma(e_{-1,1} + e_{-1,2}) + o_{-1,1}$
E4: $e_{-1,1} = z^{-1} \cdot e_{-1,1}$
E5: $e_{-1,1} = \delta(o_{-1,1} + z^{-1} o_{-1,2}) + e_{-1,1}$
O5: $o_{-1,1} = z^{-1} \cdot o_{-1,1}$
Low frequency DWT coefficient l: $l = \zeta \cdot e_{-1,1}$
High frequency DWT coefficient h: $h = \zeta^{-1} \cdot o_{-1,1}$
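A full-signal software version of these 9/7 lifting steps is sketched below. Because the table of constants referenced above is not reproduced here, the commonly quoted JPEG-2000 irreversible 9/7 values are used for α, β, γ, δ, and ζ; zero padding is used at the boundaries for simplicity.

```python
# Full-signal version of the 9/7 lifting steps listed above.  The constants are
# the commonly quoted JPEG-2000 irreversible 9/7 values; zero padding is used
# at the boundaries for simplicity.

ALPHA, BETA = -1.586134342, -0.052980118
GAMMA, DELTA = 0.882911076, 0.443506852
ZETA = 1.149604398

def cdf97_lifting(x):
    """Forward 9/7 DWT of an even-length sequence via lifting."""
    s, d = list(x[0::2]), list(x[1::2])           # split into even and odd samples
    at = lambda seq, i: seq[i] if 0 <= i < len(seq) else 0.0
    d = [d[i] + ALPHA * (at(s, i) + at(s, i + 1)) for i in range(len(d))]  # predict 1
    s = [s[i] + BETA * (at(d, i - 1) + at(d, i)) for i in range(len(s))]   # update 1
    d = [d[i] + GAMMA * (at(s, i) + at(s, i + 1)) for i in range(len(d))]  # predict 2
    s = [s[i] + DELTA * (at(d, i - 1) + at(d, i)) for i in range(len(s))]  # update 2
    return [ZETA * v for v in s], [v / ZETA for v in d]                    # scaling
```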
The design of the controller is relatively simple, due to the regularity of the control signals for the RA, as shown in Table 3 and Table 5. All control signals are generated by counters and flip-flops controlled by a four-state finite state machine. The counters generate periodic signals for the longer period (T>4 clock cycles) control signals, and the flip-flops produce local delays. If externally-generated start and stop signals are provided, the long counter for keeping track of the number of input samples is unnecessary. Compared to other direct implementations of lifting-based DWTs, the overhead for the RA controller is very small. The controller should occupy less than 10% of the total silicon area of the 1-D RA.
The remaining elements of the RA include registers and switches (tri-state buffers). Since the area of the switches is negligible compared to the size of the whole architecture, the cost of the registers dominates. For implementing an L-stage DWT, the RA uses (L−1)(M+1) more registers than a conventional lifting-based architecture, where M is the number of delay registers. Considering that a conventional architecture needs an extra memory bank to store at least N/2 intermediate DWT coefficients, the RA architecture is more area-efficient in most applications, where (L−1)(M+1)<<N/2. The power consumption of the RA should be lower than that of a conventional architecture because the RA eliminates the memory read/write operations and because all data routing is local. By avoiding the fetching of data from memories and the driving of long wires, the power dissipated by the RA switches is small.
In
$T_P = N + (L \times T_d) + (1 + 2 + \cdots + 2^{L-2}) = N + L \times T_d + 2^{L-1} - 1$.
The hardware utilization can be defined as the ratio of the actual computation time to the total processing time, with time expressed in numbers of clock cycles. At each section of the pipeline structure, the actual clock cycle count TC is the number of sample pairs to be processed.
$T_C = \left(N + N(1 - 2^{1-L})\right)/2$.
Note that $N(1 - 2^{1-L})$ is the number of samples being processed at the second or higher stages. The busy time $T_B$ of the corresponding section can be expressed as:
$T_B = T_P - T_d = N + (L - 1) \times T_d + 2^{L-1} - 1$.
Consequently, the hardware utilization U of the L-stage RA is:
Because U is a continuous concave function of variable L when L≧1, the maximum hardware utilization can be achieved when ∂U/∂L=0. Ignoring the delay Td, ∂U/∂L=0 can be expressed as:
The above equation is true when $L = 2^{-1}\left(\log_2 N + \log_2(1 - 1/L) + 1\right)$. Assuming $L > 1$ and $N \gg \sqrt{N}$, the utilization reaches a maximum of about 90% when $L = 0.5 \log_2 N$, and gradually reduces to around 50% when $L = 1$ or $L = \log_2 N$. For a 5-stage DWT operating on 1024 input samples, the utilization approaches 92%. When the number of decomposition stages L increases, the processing time increases significantly and the utilization drops accordingly. As mentioned above, the delay of $2^L$ was due to the increasing separation ($2^L$ clock cycles) of the input values to each stage. If we decrease the sampling grid for each stage as soon as all previous stages have finished, we can speed up the computation. With a little additional controller overhead, the processing time in clock cycles of an L-stage DWT can be reduced to:
$N + (L \times T_d)$.
When N→∞, the hardware utilization of the 1-D RA approaches 100%. Compared to the conventional implementations of the lifting algorithm, the proposed architectures can achieve a speed-up of up to almost 100% as shown in Table 7.
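A numeric check of these utilization figures, using the expressions for TC and TB given above and an assumed pipeline delay Td, is sketched below. The trend, with low utilization at L = 1, a maximum near L = 0.5 log2 N, and a drop-off for large L, matches the discussion; the exact percentages depend on Td and on how the busy time is counted.

```python
import math

# Numeric check of the utilization trend, using the expressions for TC and TB
# given above.  U is approximated here as TC / TB because the closed form for U
# is not reproduced in this text, and the pipeline delay Td is an assumed value.

def utilization(N, L, Td=4):
    TC = (N + N * (1 - 2 ** (1 - L))) / 2.0       # sample pairs actually processed
    TB = N + (L - 1) * Td + 2 ** (L - 1) - 1      # busy time of a pipeline section
    return TC / TB

N = 1024
for L in (1, int(0.5 * math.log2(N)), int(math.log2(N))):
    print(L, round(100.0 * utilization(N, L), 1))
```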
To achieve higher hardware utilization for special cases, we also propose the dual scan architecture (DSA), which interleaves the processing of two independent signals simultaneously to increase the hardware utilization. The 1-D DSA is shown in
The 1-D DSA calculates the DWT as the input samples are being shifted in, and stores the low frequency coefficients in the internal memory. When all input samples have been processed, the stored coefficients are retrieved to start computing the next stage. The input switches SW2, SW3 in
As the 1-D DSA performs useful calculations in every clock cycle, the hardware utilization for the PE is 100%. The processing time for the L-stage DWT of two N-sample signals is $N + L \times T_d$. Compared to conventional implementations for computing two separate signals, the 1-D DSA requires only half the hardware. Hence, given an even number of equal-length signals to process, the speedup of the 1-D DSA is 100%. A recursive architecture RA will calculate a DWT in about half the time of the DSA, because it starts calculating higher-level DWT coefficients even before it completes the first-level decomposition. The DSA, on the other hand, calculates the DWT of two streams stage by stage, so its total computation time is double that of the RA. However, because it calculates the DWT of two arrays, on average it has a hardware utilization efficiency similar to that of the RA.
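The dual scan idea can be sketched behaviorally as follows: two independent streams share one datapath by alternating clock cycles, so a pair is processed on every cycle. The process_pair argument stands in for the lifting PE and is a placeholder for this sketch, not part of the disclosed hardware.

```python
# Behavioral sketch of the dual scan idea: two independent streams share one
# datapath by alternating clock cycles, so a pair is processed on every cycle.
# The process_pair argument is a placeholder for the lifting PE.

def dual_scan(stream_a, stream_b, process_pair):
    out_a, out_b = [], []
    pairs_a = zip(stream_a[0::2], stream_a[1::2])
    pairs_b = zip(stream_b[0::2], stream_b[1::2])
    for pa, pb in zip(pairs_a, pairs_b):
        out_a.append(process_pair(*pa))           # even cycle: a pair from stream A
        out_b.append(process_pair(*pb))           # odd cycle: a pair from stream B
    return out_a, out_b

# Example with a trivial Haar-like stand-in for the PE (average and difference).
haar_like = lambda e, o: ((e + o) / 2.0, o - e)
a_out, b_out = dual_scan(list(range(8)), list(range(8, 16)), haar_like)
```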
A conventional implementation of a separable 2-D lifting-based DWT is illustrated in
The basic strategy of the 2-D recursive architecture is the same as that of its 1-D counterpart: the calculations of all DWT stages are interleaved to increase the hardware utilization. Within each DWT stage, we use the processing sequence shown in
A schematic for the 2-D RA is shown in
A portion of the data flow for computing an 8×8 sample 2-D Daub-4 DWT is shown in Table 9. As described before, the first pair $e_{-1,1,1}$ and $o_{-1,1,1}$ of the first stage row transform coefficients are generated at the sixth clock cycle. They are immediately shifted into the high and low frequency FIFOs, respectively. The consecutive DWT coefficients of the same row are in turn pushed into their corresponding FIFOs in the subsequent clock cycles until the end of the row (the 12th clock cycle in this case). When the first pair of the row transform coefficients of the second row is ready, the low frequency coefficient $e_{-1,1,2}$ is sent to the odd input of the column processor, and the high frequency coefficient $o_{-1,1,2}$ is pushed into the corresponding FIFO. The first low frequency coefficient of the first row, $e_{-1,1,1}$, is also popped out of the FIFO and sent to the even input of the column processor; its high frequency counterpart $o_{-1,1,1}$ is pushed to the low frequency FIFO. After 4 clock cycles, the column processor Cp generates the first pair of the 2-D DWT coefficients, of which the low frequency one, $ll_{-1,1,1}$, is temporarily stored in register $R_2$. The row processor Rp starts further decomposing the low frequency DWT coefficients after the second low frequency coefficient $ll_{-1,2,1}$ is generated (at the 21st clock cycle in Table 9).
At the end of the row transform of the second row (at the 20th clock cycle in this case), both FIFOs for the first stage contain only the high frequency row transform coefficients of the first two rows, and start sending these coefficients to the column processor Cp after one clock cycle. As shown in Table 9, the calculation of the multiple stage 2-D DWT is continuous and periodic, so the control signals for the data flow are easy to generate with relatively simple logic circuits.
Similar to the 1-D RA case, the control signals for the 2-D RA are deduced from the data flow as shown in Table 9. The timing for the switch signals of the 2-D RA for lifting-based Daub-4 DWT are shown in Table 11, and the enable signals are fixed delay versions of these switch signals. Also, similar to the delay reduction method used in the 1-D RA, the delay time of the 2-D DWT can be minimized. The timing of control signals for other wavelets are similar, and can be achieved by changing the delay in Table 11.
Since the high-frequency components are processed one row after the low-frequency components, as shown in
$N \times N + N + 2 \times L \times T_d + 2^{L-1} - 1$.
Similar to the 1-D implementation, a hardware utilization of about 90% can be achieved when L is close to $\log_2 N$.
In a conventional 2-D DWT algorithm, the vertical DWT is carried out only after the horizontal DWT is finished. This delay between the row and column computations limits the processing speed. The 2-D DSA shortens the delay by adopting a new scan sequence. In applications that can read two pixels per clock cycle from a data buffer, the scan sequence of the 2-D DSA shown in
The structure of the 2-D DSA is shown in
The processing time for the $l$-th stage is:
$0.5 N^2 \left(\tfrac{1}{4}\right)^{l-1} + 2 T_d$
Because only a quarter of the coefficients are further decomposed, the total processing time for an L-stage 2-D DWT is:
$\left(\tfrac{2}{3}\right) N^2 \left(1 - \tfrac{1}{4^L}\right) + 2 T_d L$
Compared to a conventional implementation, the DSA uses roughly half of the time to compute the 2-D DWT, and the size of the memory for storing the row transform coefficients is reduced to M rows, where M is the number of delay units in a 1-D filter. The comparisons of the processing time and memory size are shown in Table 15 and Table 17, respectively. In Table 15, the timing for the RA is based on one input pixel per clock cycle, while the others are based on two input pixels per cycle.
As the dynamic range of the DWT coefficients increases with the number of decomposition stages, the number of bits used to represent the coefficients should be large enough to prevent overflow. Bits representing the fractional part can be added to improve the signal-to-noise ratio (SNR) of the calculated DWT coefficients. In the simulations described below, the filter coefficients and the DWT coefficients are represented in 16 bits (an 11-bit integer part and a 5-bit fractional part). Therefore, 16-bit multipliers are implemented in our designs, and their results are also rounded to 16 bits. The SNR and PSNR values for the 3-stage forward DWT of the test gray level images are listed in Table 19.
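A sketch of the 16-bit fixed-point format (11 integer bits, 5 fractional bits) and of the SNR measure is given below; round-to-nearest with saturation is assumed here, and the actual rounding behavior of the hardware multipliers may differ.

```python
import math

# Sketch of the 16-bit fixed-point format used in the simulations (11 integer
# bits, 5 fractional bits) and of the SNR measure.  Round-to-nearest with
# saturation is assumed; the rounding behavior of the hardware may differ.

FRAC_BITS = 5
Q_MAX, Q_MIN = 2 ** 15 - 1, -2 ** 15              # signed 16-bit range

def to_q11_5(value):
    q = int(round(value * (1 << FRAC_BITS)))
    return max(Q_MIN, min(Q_MAX, q))              # saturate rather than wrap

def from_q11_5(q):
    return q / float(1 << FRAC_BITS)

def snr_db(reference, quantized):
    signal = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, quantized)) or 1e-30
    return 10.0 * math.log10(signal / noise)

coeffs = [123.456, -78.9, 0.03125, 1024.5]        # example coefficient values
rounded = [from_q11_5(to_q11_5(c)) for c in coeffs]
print(rounded, round(snr_db(coeffs, rounded), 1))
```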
The proposed architectures were synthesized and implemented for Xilinx's Virtex II FPGA XC2V250. The 1-D RA implementing the 3-stage 9/7 lifting-based DWT uses 409 logic slices out of the 1536 slices available in the FPGA. The 2-D RA implementing the 3-stage Daub-4 DWT uses 879 logic slices, and can compute the DWT of 8-bit gray level images of sizes up to 6000×6000 at 50 MHz using the built-in RAM blocks and multipliers in the FPGA. To estimate the corresponding silicon areas for ASIC designs, we used Synopsys' Design Compiler to synthesize the above architectures with TSMC's 0.18-μm standard cell library, targeting 50 MHz operation. Since the MAC unit is the critical element in the designs, higher operating frequencies can be achieved by implementing faster multipliers or by pipelining the MAC units and minimizing the routing distance of each section of the pipeline. The synthesized designs were then placed and routed by Silicon Ensemble, and the final layouts were generated using Cadence DFII. The core size of the 1-D RA implementing the 3-stage 9/7 DWT is about 0.177 mm² (90% of which is the datapath, 10% is the controller, and the rest is memory), and the core size of the 2-D RA that calculates the 3-stage Daub-4 DWT of a 256×256 image is about 2.25 mm² (about 15% of which is the datapath, 5% is the controller, and the rest is memory). The core area could be reduced by reimplementing the delay units as register files instead of separate flip-flops, and the performance of the proposed architectures can be further improved by optimizing the circuit designs.
We have disclosed two recursive architectures and two dual scan architectures for computing the DWT based on the lifting scheme. Compared to previous implementations of the lifting-based DWT, the disclosed architectures have higher hardware utilization and shorter computation time. In addition, since the recursive architectures can continuously compute the DWT coefficients as soon as the samples become available, the memory size required for storing the intermediate results is minimized. Hence, the sizes and power consumptions of both the 1-D and 2-D recursive architectures are significantly reduced compared to other implementations. In addition, since the designs are modular, they can be easily extended to implement any separable multi-dimensional DWT by cascading N of the basic 1-D DWT processors, where N is the dimension of the DWT, using the principles set forward in this disclosure. We also believe, on reasonable grounds, that the proposed architectures may be used to implement lifting schemes for multiwavelets.
Applications in which wavelet processing and hence the principles in this disclosure are potentially useful include but are not limited to image processing, compression, texture analysis, and noise suppression; audio processing, compression, and filtering; radar signal processing, seismic data processing, and fluid mechanics; microelectronics manufacturing, glass, plastic, steel, inspection, web and paper products, pharmaceuticals, food and agriculture.
Immaterial modifications may be made to the embodiments disclosed here without departing from the invention.