Discrete wavelet transforms (DWTs) are relatively new tools for presenting signals in a decomposed form, where the signal is represented at different levels of detail in the time and frequency domains. During the last decade, DWTs have been intensively studied and successfully applied to a wide range of applications such as numerical analysis, biomedicine, different branches of image and video processing, signal processing techniques, speech compression/decompression, etc. [1]-[4]. DWTs have often been found preferable to other traditional transform techniques due to such useful features as inherent scalability, linear computational complexity, low aliasing distortion for signal processing applications, and adaptive time-frequency windows. DWTs have become the basis of the international image/video coding standards JPEG 2000 and MPEG-4.
Since most applications require real-time implementation of DWTs, the design of fast parallel VLSI ASICs for DWTs has recently attracted the attention of a number of researchers. Many architectures have already been proposed for implementing the classical (or Haar) DWT [5]-[17], while much less attention has been paid to architectures for Hadamard wavelets and wavelet packets. Some of these devices have been designed to have a low hardware complexity, but they require at least 2N clock cycles (cc's) to compute the Haar DWT of a sequence of length N (see e.g. [1]-[3]). Nevertheless, a large number of Haar DWT architectures having a period of approximately N cc's have also been designed [7]-[12]. Most of these architectures exploit the Recursive Pyramid Algorithm (RPA) [15], based on the tree-structured filter bank representation of DWTs (see
According to the invention there is provided a new approach to the implementation of DWTs (Haar wavelets, Hadamard wavelets, as well as wavelet packets). The approach is based on a representation of DWTs using flowgraphs similar to those known for traditional fast transforms such as the fast Fourier, Walsh, Haar and other transforms (see [18]).
Embodiments of the invention will now be described by way of example only with reference to the accompanying drawings in which:
The flowgraph representation of DWTs will now be described. Advantages of the new representation are then discussed from the architecture design point of view. As examples of such designs, we present several architectures (called FPP DWT, LPP DWT and LP DWT). The efficiency of these architectures is approximately 100%. They are very fast and provide excellent performance with respect to area-time characteristics. They are scalable, simple, regular, and free of long connections (i.e., connections whose length depends on the length of the input signal).
Several alternative definitions/representations of DWTs, such as the tree-structured filter bank [3], the lattice structure [21]-[23], the lifting scheme [24], [25], or the matrix representation, have been introduced during the last decade. Each of these representations has advantages from a certain point of view. These definitions/representations have been primarily targeted at easing the synthesis and analysis of wavelets, with simple implementation only a secondary aim. Nevertheless, the DWT architectures proposed in the literature so far are based on one of these representations. Below we present a new flowgraph representation of wavelets primarily targeted at designing parallel algorithms and architectures for implementing wavelets. The new representation is similar to those widely used for representing fast traditional transforms such as the fast Haar, Fourier, or Hadamard transforms.
Most of the existing DWT architectures (e.g. those in [6], [13], [14], [27]) are based on the tree-structured filter bank representation of DWTs (see
The Haar wavelets, the Hadamard wavelets and the wavelet packets differ in whether the results of both the low-pass and high-pass filtering, or the results of only the low-pass filtering, of a given octave are further processed in the next octave. Each of these cases is considered separately in the following subsections.
2.1. The Flowgraph Representation of Haar Wavelets.
In the case of the Haar wavelets (see
One can see a computational redundancy in the tree-structured representation of DWTs (for Haar wavelets as well as for Hadamard wavelets and wavelet packets). This redundancy is related to downsampling, which is, however, not inherent to the DWT computation (naturally, no sophisticated computational scheme would directly implement downsampling; rather, it would simply not compute every second output of the low-pass and high-pass filters).
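This observation may be illustrated with a minimal sketch (ours, not part of the original disclosure): one analysis octave is computed by evaluating the low-pass and high-pass filters only at every second position, so the samples that downsampling would discard are never produced. Periodic extension of the input is assumed here purely for simplicity; the appending strategy of the invention may differ.

```python
import numpy as np

def analysis_octave(x, lp, hp):
    """One DWT octave: N inputs -> N/2 low-pass and N/2 high-pass outputs.

    The filters lp (low-pass) and hp (high-pass) are evaluated only at
    even positions 2k, so no downsampled-away output is ever computed.
    """
    x = np.asarray(x, dtype=float)
    n, L = len(x), len(lp)
    low = np.empty(n // 2)
    high = np.empty(n // 2)
    for k in range(n // 2):                      # only even positions 2k
        window = x[(2 * k + np.arange(L)) % n]   # periodic extension (assumed)
        low[k] = np.dot(lp, window)
        high[k] = np.dot(hp, window)
    return low, high
```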
Another obvious problem with the tree-structured representation of Haar wavelets is that the input signal (without the appended points) becomes twice shorter from one octave to the next. This creates difficulties in developing pipelined designs. In a straightforward pipelining scheme where the octaves are mapped onto similar pipeline stages, hardware underutilization would occur, since every next stage would have half as many computations to implement as the previous one. Some designs (see e.g. [10], [17]) overcome this difficulty by implementing the first octave in one pipeline stage and all the others in the second stage. However, interleaving several octaves in one stage leads to complicated control and data routing schemes or extensive memory requirements, as well as to a restricted pipelining where only two stages are used.
Let us also note that the tree-structured representation (for Haar wavelets as well as for Hadamard wavelets and wavelet packets) assumes digit-serial input signals and hides the parallelism of octaves which is, however, inherent to DWT computation, as we will see later from the flowgraph representation. Similar problems are also typical for the lattice structure and lifting scheme representations of DWTs.
Another widespread definition/representation of DWTs is based on the matrix approach. The schemes on
$y = H \cdot x$, (1)
where $x = [x_0, \ldots, x_{N-1}]^T$ and $y = [y_0, \ldots, y_{N-1}]^T$ are, respectively, the input and the output vectors of length $N = 2^m$, and $H$ is the DWT matrix of order $N \times N$, which is formed as the product of sparse matrices:
$H = H^{(J)} H^{(J-1)} \cdots H^{(1)}$, $1 \le J \le m$; (2)
In the case of the Haar wavelets (see

$H^{(j)} = \begin{bmatrix} D_j & 0 \\ 0 & I_k \end{bmatrix}$, (3)

where $D_j$ is the analysis $(2^{m-j+1} \times 2^{m-j+1})$ matrix at stage $j$, and $I_k$ is the identity $(k \times k)$ matrix, $k = 2^m - 2^{m-j+1}$. If the vector of coefficients of the low-pass and of the high-pass filters in the scheme on
where $P_j$ is the perfect unshuffle operator (see [18]) of size $(2^{m-j+1} \times 2^{m-j+1})$.
Adopting (1)-(4), the DWT is computed in J stages (J being the number of octaves):
$x^{(0)} = x$; $x^{(j)} = H^{(j)} \cdot x^{(j-1)}$, $j = 1, \ldots, J$; $y = x^{(J)}$, (5)
where $x^{(j)}$, $j = 1, \ldots, J$, is an $(N \times 1)$ vector of intermediate (scratch) results. Noting that the lower right corner of every matrix $H^{(j)}$ is an identity matrix, the algorithm of (5) can be rewritten as:
$\tilde{x}^{(0)} = x$; $\tilde{x}^{(j)} = D_j \cdot \tilde{x}^{(j-1)}(0 : 2^{m-j+1} - 1)$, $j = 1, \ldots, J$; $y = [\tilde{x}^{(J)}, \tilde{x}^{(J-1)}(2^{m-J+1} : 2^{m-J+2} - 1), \ldots, \tilde{x}^{(1)}(2^{m-1} : 2^m - 1)]^T$, (6)
where $\tilde{x}^{(j)} = [\tilde{x}^{(j)}_0, \ldots, \tilde{x}^{(j)}_{2^{m-j+1}-1}]^T$ is the vector of results of the $j$th octave.
Computation of (6) with the matrices $H^{(j)}$ of (3), (4) can be clearly demonstrated using a flowgraph representation. An example for the case $N = 2^4 = 16$, $L = 4$, $J = 3$ is shown in
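For illustration, the recursion of (6) may be sketched in code as follows, reusing the analysis_octave sketch given earlier. The function name haar_dwt and the periodic extension are our own assumptions; the output ordering follows (6).

```python
import numpy as np  # analysis_octave is the sketch given earlier

def haar_dwt(x, lp, hp, J):
    """The recursion of (6): only the low-pass half of each octave is
    processed further; the high-pass halves are collected into y."""
    x = np.asarray(x, dtype=float)
    highs = []
    for _ in range(J):
        x, high = analysis_octave(x, lp, hp)
        highs.append(high)
    # y = [stage-J output (low then high), stage-(J-1) high, ..., stage-1 high]
    return np.concatenate([x] + highs[::-1])
```

With $N = 16$, $L = 4$ and $J = 3$ this reproduces the structure of the example in the text: 8, 4, and 2 basic operations in the three consecutive octaves.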
The flowgraph representation of Haar wavelets, as presented so far, has the inconvenience of being very large for larger values of N. This inconvenience can be overcome based on the following observation. Assuming $J < \log_2 N$ (in most applications $J \ll \log_2 N$), one can see that the flowgraph of a Haar wavelet consists of $N/2^J$ similar patterns (see the two hatched regions on
$[x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7]^T = \tilde{x}^{(0,0)}$
$[x_8, x_9, x_{10}, x_{11}, x_{12}, x_{13}, x_{14}, x_{15}]^T = \tilde{x}^{(0,1)}$
$[\tilde{x}^{(1)}_0, \tilde{x}^{(1)}_1, \tilde{x}^{(1)}_2, \tilde{x}^{(1)}_3]^T = \tilde{x}^{(1,0)}$
$[\tilde{x}^{(1)}_4, \tilde{x}^{(1)}_5, \tilde{x}^{(1)}_6, \tilde{x}^{(1)}_7]^T = \tilde{x}^{(1,1)}$
$[\tilde{x}^{(2)}_0, \tilde{x}^{(2)}_1]^T = \tilde{x}^{(2,0)}$
$[\tilde{x}^{(2)}_2, \tilde{x}^{(2)}_3]^T = \tilde{x}^{(2,1)}$
$[y_8, y_9, y_{10}, y_{11}]^T = y^{(1,0)}$
$[y_{12}, y_{13}, y_{14}, y_{15}]^T = y^{(1,1)}$
$[y_4, y_5]^T = y^{(2,0)}$
$[y_6, y_7]^T = y^{(2,1)}$
$[y_0, y_2]^T = y^{(3,0)}$
$[y_1, y_3]^T = y^{(3,1)}$
Merging the $2^{m-J}$ patterns into one, we can now obtain the compact (or core) flowgraph representation of the DWT. An example of a DWT compact flowgraph representation for the case J=3, L=4 is shown on
Also, outputs are now distributed over the outgoing edges of the compact flowgraph not only spatially but also temporally. That is, every outgoing edge corresponding to a high-pass filtering result of a node, or to the low-pass filtering result of a node of the last stage, represents a set of $2^{m-J}$ output values.
Note that the structure of the compact flowgraph does not depend on the length of the DWT but only on the number of decomposition levels and the filter length. The DWT length is reflected only in the number of values represented by every node. Also note that the compact flowgraph has the structure of a $2^J$-point DWT with a slightly modified appending strategy.
2.2. The Flowgraph Representation of Hadamard Wavelets.
In the case of the Hadamard wavelets (see
Similarly to the Haar wavelets, the Hadamard wavelets can also be presented in the matrix representation of (1)-(2), where now the matrices $H^{(j)}$ are of the form of the matrices $D_j$ (see (4)) but of size $(2^m \times 2^m)$. Thus, the fast algorithm of (5) is directly applicable to the computation of Hadamard wavelets.
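The recursion may be illustrated as follows, under our blockwise-filtering reading of (5) and with the same periodic-extension assumption as before (a sketch only; the patent's exact matrices are defined by (4) and the figures): both the low-pass and high-pass results of each octave are processed further, so every octave filters the entire length-N vector.

```python
import numpy as np  # reuses analysis_octave from the earlier sketch

def hadamard_wavelet(x, lp, hp, J):
    """Hadamard-wavelet recursion: at every octave the entire length-N
    vector is filtered; both subbands of an octave feed the next one."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    for j in range(J):
        block = n >> j                    # subband length at octave j
        out = np.empty(n)
        for b in range(0, n, block):      # filter every current subband
            low, high = analysis_octave(x[b:b + block], lp, hp)
            out[b:b + block] = np.concatenate([low, high])
        x = out
    return x
```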
The flowgraph representation of the algorithm (5) can be constructed similarly to the case of the Haar wavelets. An example for the case $N = 2^3 = 8$, L=4, J=2 is shown in
The main difference between Haar wavelet flowgraphs and Hadamard wavelet flowgraphs is that the former are of a "semitriangular" form while the latter are of a "semirectangular" form. This means that a level of parallelism that halves from octave to octave is inherent to Haar DWTs, while a uniform level of parallelism across octaves is inherent to Hadamard DWTs. An arbitrarily reducing level of parallelism from octave to octave is inherent to wavelet packets.
Assuming $J < \log_2 N$, one can see that the flowgraph of a Hadamard wavelet consists of $N/2^J$ similar patterns (see the two hatched regions on
Merging the $2^{m-J}$ patterns into one, we can now obtain the compact (or core) flowgraph representation of Hadamard wavelets. An example of a compact flowgraph representation of a Hadamard wavelet for the case J=2, L=4 is shown on
2.3. The Flowgraph Representation of Wavelet Packets.
Wavelet packets occupy an intermediate place between Haar wavelets and Hadamard wavelets in the sense of how the inputs to octaves are formed. Some of the output signals of every octave are further processed in subsequent octaves and some of them form the outputs of the transform (see
A compact flowgraph representation of wavelet packet transforms, similar to the compact flowgraph representation of Hadamard transforms, may also be considered, where a sequence of ones and zeros must be associated with every node so that some nodes represent NOP operations at some time instants.
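The role of the ones/zeros sequence may be sketched as follows (our own illustrative convention, not the patent's encoding): a binary mask decides, node by node, whether a subband is decomposed further or passed unchanged (NOP) to the output.

```python
import numpy as np  # reuses analysis_octave from the earlier sketch

def wavelet_packet(x, lp, hp, J, split):
    """split[(j, i)] == True means subband i of octave j is decomposed
    further; False means the node is a NOP and the subband is passed
    to the output. The mask layout is our own illustrative convention."""
    leaves = []

    def descend(sub, j, i):
        if j == J or not split.get((j, i), False):
            leaves.append(sub)            # NOP node: subband goes to output
            return
        low, high = analysis_octave(sub, lp, hp)
        descend(low, j + 1, 2 * i)        # low-pass child
        descend(high, j + 1, 2 * i + 1)   # high-pass child

    descend(np.asarray(x, dtype=float), 0, 0)
    return np.concatenate(leaves)
```

With the mask true only on low-pass children one recovers the Haar wavelet; with the mask true everywhere, the Hadamard wavelet.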
2.4. Advantages of the Flowgraph Representation of DWTs.
Essentially, the flowgraph representation gives an alternative, rather illustrative and easy-to-understand definition of discrete wavelet transforms. It has several advantages, at least from an implementation point of view, compared to the conventional DWT representations such as the tree-structured filter bank, lifting scheme, or lattice structure representations. Some of these advantages are as follows.
In the next section, several architectures designed with an approach based on the flowgraph representation of DWTs are described. Other, perhaps even more sophisticated, architectures could be designed using the flowgraph approach. However, the most illustrative architectures are described here in order to better explain the approach of designing DWT architectures based on their flowgraph representation.
The DWT flowgraphs described in the previous section are very regular and provide a systematic approach to designing different parallel DWT algorithms and architectures, much as the well-known fast transform flowgraphs do for the traditional transforms [18]. Below we present some examples of DWT architectures designed by analyzing the corresponding flowgraphs. In fact, these architectures can be considered as extensions of the parallel-pipelined designs proposed in [28], [31] (see also [18]) for families of Haar-like and Fourier-like transforms.
It should be noted that the presented architectures are intended to demonstrate the power of the flowgraph-analysis-based approach to developing efficient parallel/pipelined DWT architectures. The architectures are therefore presented only in a general form to demonstrate principles; some architectural details (such as the PE structure, interconnections, etc.) that would be present in a complete design of a sophisticated DWT architecture are omitted for the purposes of clarity.
3.1. Fully Parallel-Pipelined (FPP) Architectures for DWTs.
These architectures, which we call fully parallel-pipelined or FPP DWT architectures, are obtained straightforwardly by a direct "one-to-one" mapping of the corresponding (Haar or Hadamard) DWT flowgraph to a processor architecture, where the nodes of the flowgraph represent processor elements (PEs) and the edges represent interconnections. Different structures of PEs implementing the basic DWT operation can be developed. Any PE capable of computing a pair of inner products of the vector on its inputs with a pair of predefined coefficient vectors can be employed. A simple example of a PE could be designed with a pair of multiply-accumulate (MAC) units as used in digital signal processors and shown in
To support a wavelet packet transform implementation, the PEs should also be able to operate in two modes: the first mode for implementing the basic DWT operation as discussed above, and the second mode for implementing NOP operations. This may simply be achieved by multiplexing the first two inputs of every PE with its outputs.
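A behavioural sketch of such a two-mode PE is given below; it is an illustration only, not the patent's PE circuit, and the function name and mode encoding are our own assumptions.

```python
def pe(inputs, lp, hp, mode=0):
    """Behavioural model of a PE built from two MAC units. mode 0: the basic
    DWT operation (a pair of inner products of the input vector with the
    fixed low-pass/high-pass coefficient vectors); mode 1: NOP, where the
    first two inputs are multiplexed straight to the outputs."""
    if mode == 1:
        return inputs[0], inputs[1]       # NOP: bypass via output multiplexers
    acc_low = acc_high = 0.0
    for v, cl, ch in zip(inputs, lp, hp): # two MAC chains over the L inputs
        acc_low += cl * v
        acc_high += ch * v
    return acc_low, acc_high
```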
The input to the FPP can be made word-parallel (consider the small circles at the first stage on
The architecture can be efficiently pipelined by considering small circles at intermediate stages (see
It is easy to see that, in the pipelined mode, for large values of M ($M \gg J$), approximately 100% hardware utilization of the FPP architecture is achieved regardless of which kind of PEs is used. Indeed, to process all M vector-signals, M+J−1 time units are needed, during which every PE operates for M time units. More formally, let us define the measure of hardware utilization (or efficiency) of a parallel/pipelined architecture as

$E = \dfrac{T(1)}{K \cdot T(K)} \cdot 100\%$, (7)
where T(1) is the time needed to implement an algorithm with one PE and T(K) is the time needed to implement the same algorithm with an architecture consisting of K PEs. In the case of the Haar DWT, $T(1) = M(2^m - 1)\tau$ and $K = 2^m - 1$; in the case of the Hadamard DWT, $T(1) = MJ2^{m-1}\tau$ and $K = J2^{m-1}$. In both cases, $T(K) = (M + J - 1)\tau$. Substituting these values into (7), we obtain

$E_{FPP\text{-}P} = \dfrac{M}{M + J - 1} \cdot 100\%$

as an estimate for the efficiency of the FPP architecture in the pipelined mode. Clearly, $E_{FPP\text{-}P} \approx 100\%$ if the number M of processed vector-signals is sufficiently high.
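A quick numeric check of this estimate, assuming the efficiency definition $E = T(1)/(K \cdot T(K)) \cdot 100\%$ of (7) and the values quoted in the text:

```python
def fpp_pipelined_efficiency(M, m, J):
    """E = T(1) / (K * T(K)) * 100%, with the Haar FPP values from the text."""
    T1 = M * (2**m - 1)                   # one-PE time, in units of tau
    K = 2**m - 1                          # number of PEs
    TK = M + J - 1                        # pipelined time for M vector-signals
    return 100.0 * T1 / (K * TK)          # simplifies to 100 * M / (M + J - 1)

print(fpp_pipelined_efficiency(M=1000, m=4, J=3))  # -> 99.80...
```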
The efficiency of the FPP in the non-pipelined mode is estimated as $E_{FPP\text{-}np} = (100/J)\%$, which is rather poor. Nevertheless, even in this case the architecture is very fast. Its delay is estimated as $T_{FPP\text{-}np} = J$ time units, while the known Haar DWT architectures require at least O(N) time units.
Table 1 summarizes Area-Time characteristics of the Haar DWT FPP architecture (both in pipelined and non-pipelined modes) assuming PEs of
However, FPP architectures require a large area for large values of N, which makes them impractical for processing long input signals. A more sophisticated architecture is considered in the next section.
3.2. Limited Parallel-Pipelined (LPP) Architectures for DWTs.
These architectures, which we call limited parallel-pipelined architectures for the Haar DWT (or Haar DWT LPP, for short), are obtained from the corresponding compact DWT flowgraphs. Let us note that the compact DWT flowgraphs for Haar or for Hadamard wavelets have, in fact, been obtained by decomposing the computational process of the corresponding $2^m$-point DWT into a set of $2^{m-J}$ computational processes, each for a $2^J$-point DWT with a slightly modified appending strategy. The main idea in designing an LPP architecture is to decompose the input $2^m$-point vector x into a set of $2^{m-J}$ subvectors $\tilde{x}^{(0,s)}$ or $x^{(0,s)}$, $s = 0, \ldots, 2^{m-J} - 1$, of length $2^J$ (as in Section 2.1 or Section 2.2) and process them in the pipelined mode. As we saw in the previous subsection, the FPP architecture is very efficient in the pipelined mode. However, we cannot directly compute DWTs of the stream of vectors $\tilde{x}^{(0,s)}$, $s = 0, \ldots, 2^{m-J} - 1$, on the FPP (for a $2^J$-point DWT) in order to obtain the DWT of the vector x. Some modification to support the modified appending strategy is needed. Several possibilities exist for doing this. Within the LPP architecture presented here, the modified appending strategy is supported by including delays and additional connections between the pipeline stages of the FPP according to the compact DWT flowgraph structure.
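The front-end decomposition may be sketched as follows (an illustration under a word-parallel input assumption; the function name is ours): the $2^m$-point input is split into $2^{m-J}$ subvectors of length $2^J$ that enter the pipeline one per time unit.

```python
import numpy as np

def lpp_input_stream(x, J):
    """Split the 2**m-point input into 2**(m-J) subvectors of length 2**J
    that enter the LPP pipeline one per time unit."""
    x = np.asarray(x, dtype=float)
    step = 2**J
    for s in range(len(x) // step):       # s = 0, ..., 2**(m-J) - 1
        yield x[s * step:(s + 1) * step]
```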
Consider an example of the LPP architecture corresponding to the case of a Haar wavelet with L=6 and J=3. The architecture in this case (see
In the case of
The high-pass outputs of the PEs of the first stage form the $(2^{J-1} = 4)$th to $(2^J - 1 = 7)$th outputs of the architecture, while their low-pass outputs form the inputs to the second stage and are connected to a group of $2^{J-1} = 4$ delays. Outputs of the delays and the first four inputs of the second stage are connected to the inputs of the four PEs of the second stage, similarly to the first pipeline stage. One half of the PE outputs forms the $(2^{J-2} = 2)$th to $(2^{J-1} - 1 = 3)$th outputs of the architecture and the other half forms the input to the third pipeline stage. This stage consists of two groups of delay elements, each consisting of two delay elements, the zeroth group with elements delaying for two time units and the first group with elements delaying for one time unit. Outputs of all four delays as well as the first two inputs to the stage are connected to the inputs of the single PE of the stage. Outputs of this $PE_{3,0}$ form the zeroth and the first outputs of the architecture. The operation of the LPP architecture corresponding to the computation of a 16-point Haar DWT with L=6 and J=3 is summarized in Table 2. At the zeroth time unit, the vector $\tilde{x}^{(0,0)} = [x_0, \ldots, x_7]$ enters the input registers. At the first step, the vector $\tilde{x}^{(0,1)} = [x_8, \ldots, x_{15}]$ enters the input registers, so that the components $x_0, \ldots, x_7, x_8, \ldots, x_{11}$ begin to be processed by the PEs of the first group. Computation then proceeds in a similar way according to Table 2.
In general, when implementing a $2^m$-point DWT on the LPP architecture, a subvector $\tilde{x}^{(0,s)}$ (in the case of Haar DWTs) or $x^{(0,s)}$ (in the case of Hadamard DWTs or wavelet packets) is formed at the input of the first pipeline stage every time unit $s = 0, \ldots, 2^{m-J} - 1$. With a delay of $s_J$ time units (the sum of the delays of the pipeline stages), output subvectors are formed at the same rate of one subvector per time unit. Since there are $2^{m-J}$ subvectors in total, when implementing a (Haar, Hadamard or wavelet packet) DWT of a vector of length $N = 2^m$, the total delay of the LPP architecture is given by
$T_{LPP} = (s_J + N/2^J - 1)\tau$. (8)
The LPP architecture consists of $K = 2^J - 1$ PEs in the case of Haar DWTs, or $K = J \cdot 2^{J-1}$ PEs in the case of Hadamard wavelets and wavelet packets. Substituting these values into (7), and also noting that the Haar DWT requires $T(1) = (2^m - 1)\tau$ and the Hadamard DWT requires $T(1) = J2^{m-1}\tau$ time units to be implemented with one PE, we obtain that the efficiency of the LPP architecture is given by

$E_{LPP} = \dfrac{2^m - 1}{(2^J - 1)(s_J + 2^{m-J} - 1)} \cdot 100\%$ (Haar), $\quad E_{LPP} = \dfrac{2^m}{2^J (s_J + 2^{m-J} - 1)} \cdot 100\%$ (Hadamard).
This means that the efficiency of the LPP is close to 100% for large values of m ($2^m \gg 2^J$ and $2^m \gg L$):

$E_{LPP} \approx 100\%$,
even when considering the computation of the DWT of a single, sufficiently long vector and computing the efficiency with respect to the time delay. It should be noted that, in the case where DWTs of a stream of vectors need to be computed, the period between computations of successive DWTs is
$T_{LPP}^{(p)} = 2^{m-J}\tau$.
Table 1 presents a comparative performance of the LPP architecture for the Haar DWT against some known architectures, demonstrating the excellent Area-Time characteristics of the proposed architecture. Among the other useful features of the architecture that should be noted are its regularity, ease of control, absence of long (N-dependent) connections, and the independence of the architecture from N, meaning that DWTs of different lengths can be computed with the same hardware. Input to the device can be made word-parallel as well as word-serial.
3.3. The Limited Parallel (LP) Architecture for Hadamard Wavelets and Wavelet Packets.
This architecture (see
The entire computation takes $J \cdot 2^{m-J} + \lceil (L-2)/2^J \rceil$ time units (the overhead delay of $\lceil (L-2)/2^J \rceil$ time units is introduced by the delays on the inputs to the PEs). It is easy to verify that the architecture operates at approximately 100% hardware utilization.
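A quick numeric check of this timing formula (illustrative parameter values only):

```python
import math

def lp_time_units(m, J, L):
    """J * 2**(m-J) + ceil((L - 2) / 2**J) time units, as quoted in the text."""
    return J * 2**(m - J) + math.ceil((L - 2) / 2**J)

print(lp_time_units(m=10, J=3, L=6))      # 3*128 + 1 = 385 time units
```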
It should be noted that although in the specific embodiments described in the foregoing, reference is made to a perfect unshuffle operator, in a more general form of the invention, in which the input signal is of length $r \times k^m$ rather than $2^m$ (representing PEs which carry out k filtering operations rather than two, and thus have k outputs rather than two), a stride permutation operation is used.
A flowgraph representation of discrete wavelet transforms (Haar wavelets, Hadamard wavelets, and wavelet packet transforms) has been suggested. This representation constitutes a new definition of DWTs. An approach for developing efficient parallel architectures for implementing DWTs has been suggested. Some examples of architectures designed with the proposed approach have been presented, demonstrating excellent area-time characteristics. However, the presented architectures are just some examples illustrating the approach. For example, the invention can be applied to inverse DWTs, including inverse Haar wavelets, inverse Hadamard wavelets, and inverse wavelet packets.
This application is a continuation of, and claims priority to and the benefit of, U.S. patent application Ser. No. 10/155,944 filed on May 24, 2002, the disclosure of which is incorporated by reference herein.
Number | Date | Country
--- | --- | ---
60295292 | Jun 2001 | US
Relation | Number | Date | Country
--- | --- | --- | ---
Parent | 10155944 | May 2002 | US
Child | 11442682 | May 2006 | US