The invention relates generally to the field of pipelined hardware architecture. More specifically, embodiments of the invention relate to systems and methods for implementing power efficient hardware solutions for streaming computations.
Low power consumption and high performance are important requirements for any signal processing hardware design. Mobile multimedia systems are becoming popular consumer items, but limited battery life continues to be a problem. Energy efficiency must be balanced against the fact that users demand a high quality of service. With the ever-increasing number of battery-operated devices, minimizing power consumption without compromising performance is essential.
The practice of using data pipelines for streaming computations leads to high performance. Pipelining breaks up a complex operation performed on a stream of data into smaller sequential stages or subprocesses where the output of one subprocess feeds into the next. When implemented properly, multiple operations can be performed concurrently even if one step normally would depend on the result of the preceding step before it can start. Pipelining improves performance by reducing the idle time of each piece of hardware. At the same time, the stages must be designed so that the pipeline is balanced, with the different stages taking approximately the same time to complete. With each clock cycle, new data is input to one end of the pipeline and a completed result is output from the other end.
Pipelining enables the realization of high-speed, high-efficiency complementary metal oxide semiconductor (CMOS) data paths by allowing for the reduction of supply voltages to the lowest possible levels while still satisfying throughput constraints. In deep pipelines, however, registers and corresponding clock trees are responsible for an increasingly large fraction of total dissipation, no matter how efficiently they may have been implemented.
One application that naturally lends itself to pipelining is video processing, a key component of streaming multimedia communications and an integral part of next-generation portable devices. Currently, there are several video standards established for different purposes such as MPEG, JPEG 2000 and others, and their implementations for mobile systems-on-a-chip (SoCs) provide substantial computing capabilities at low energy consumption levels. The requirements of these standards incorporate demanding computations that include the discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT), the discrete wavelet transform (DWT) and inverse discrete wavelet transform (IDWT), motion estimation, motion compensation, variable-length coding/decoding, quantization and inverse quantization. JPEG 2000 is a recently developed standard for digital image processing and individually compresses each frame in a moving picture. Implementations of JPEG 2000 may be used in applications ranging from battery-operated cameras where low-power consumption is desirable, to digital cinema which requires real-time decompression of high-resolution images.
Streaming computations are numeric operations in which data flow is unidirectional and uninterrupted from a primary input or inputs to a primary output or outputs. During computation, however, the data flow can experience transformations where the amount of data being processed changes. Data can increase progressively as it is processed through a plurality of stages, due to external inputs or to internal generation arising from signal processing requirements such as the Nyquist criterion. Most current implementations are synchronous, using a global clock to pace all operations of a system or device, where all components of the system operate once per clock cycle. However, a single global clock forces every component to run at the rate required by the most demanding stage, which reduces power efficiency.
To illustrate the association of power and frequency, the delay of a logic gate Td is given by

Td = (CL · Vdd) / (μ · Cox · (W/L) · (Vdd − Vth)²),   (1)
where CL is the load capacitance, Vdd the supply voltage, Vth the device threshold voltage, W and L the width and length of the transistor channels, Cox the oxide capacitance and μ the mobility. A CMOS transistor has a source-drain channel formed only when its gate voltage is larger than Vth. If the source-drain voltage Vdd is greater than the gate voltage, the transistor operates in a saturation mode where it exhibits the switch-like properties required for logic circuit design. Keeping all device parameters and circuit topology constant, Td is approximately inversely proportional to the supply voltage Vdd when operation is well above the threshold voltage.
The delay Td approximately doubles if the voltage is halved. Conversely, if the operating frequency is halved, the supply voltage can in practice be reduced by roughly a factor of two.
In addition to logic gate delay Td, the power P consumed by a CMOS device is
P = CL · Vdd² · f,   (2)
where f is the frequency. As can be seen, power has a quadratic dependence on the supply voltage Vdd and a linear dependence on the operating frequency f. Since power consumption is proportional to clock frequency, the potential savings from reducing voltage and frequency become more significant at higher operating frequencies.
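As a rough numeric illustration of equations (1) and (2), the following Python sketch (using arbitrary, illustrative device values that are not taken from this disclosure) computes how halving the supply voltage affects gate delay and how halving both voltage and frequency affects power:

# Illustrative-only device parameters (not from this disclosure).
C_L  = 10e-15     # load capacitance CL, farads
V_dd = 1.2        # supply voltage Vdd, volts
V_th = 0.1        # threshold voltage Vth, volts
f    = 200e6      # clock frequency f, hertz
beta = 100e-6     # mu * Cox * (W/L), amps per volt squared

def gate_delay(v):
    # Equation (1): Td = CL*Vdd / (mu*Cox*(W/L)*(Vdd - Vth)^2)
    return C_L * v / (beta * (v - V_th) ** 2)

def power(v, freq):
    # Equation (2): P = CL * Vdd^2 * f
    return C_L * v ** 2 * freq

print(round(gate_delay(V_dd / 2) / gate_delay(V_dd), 2))  # ~2.4: delay roughly doubles when Vdd >> Vth
print(power(V_dd / 2, f / 2) / power(V_dd, f))            # 0.125: halving V and f cuts power by 8x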
FIG. 1a shows a single computation block C transformed into two discrete computation blocks that can be evaluated in a parallel configuration (spatially parallel) as shown in FIG. 1b.
When the functional requirement of computation block C is decomposed into a system of parallel computation blocks C1 and C2 as in FIG. 1b, each block need only operate at half the original clock frequency, f/2, while maintaining the same data throughput. Voltages V1 and V2 supplied to blocks C1 and C2 can be reduced by approximately a factor of two, in proportion to the reduced frequency f/2, and are equal, V1 = V2. While voltage and frequency decrease by a factor of two, the total system capacitance increases approximately by a factor of two due to the parallel implementation. Power has a cubic relationship with voltage and frequency together, as shown in equations (1) and (2); the eightfold reduction from halving voltage and frequency, offset by the doubled capacitance, leads to a 4× reduction in power. In practice, the power reduction is not as great due to additional wiring capacitances and smaller voltage reductions imposed by threshold voltage restrictions.
When computation block C is functionally decomposed into a pipeline comprising serial computation blocks C3 and C4 as in FIG. 1c, the critical path is divided between the two stages. In FIG. 1c, each of the blocks C3 and C4 contains approximately half of the original logic with a critical path delay of approximately Td/2, yielding a total delay of approximately Td through the two stages, and the number of circuit elements in the critical path of each stage is reduced by a factor of two. The circuit elements within blocks C3 and C4 can therefore have a larger delay (up to approximately twice their original delay), and supply voltage V3 can be reduced (V3<V). Reducing the supply voltage V3 by approximately a factor of two while maintaining the clock frequency f leads to a 4× reduction in power. Capacitance remains essentially unchanged since the hardware for blocks C3 and C4 together constitutes computation block C. In practice, power reduction is not as great due to extra capacitance added by latches and smaller voltage reductions.
In terms of power consumption, the transformation of computation block C shown in FIG. 1b and the transformation shown in FIG. 1c therefore each offer an ideal reduction of approximately 4× relative to the original implementation.
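The parallel and pipelined cases above can be compared with a short sketch using the same P = CL·Vdd²·f model; the factor-of-two numbers are the idealized ones from the discussion and ignore wiring capacitance, latch overhead and threshold-voltage limits:

def power(c, v, f):
    return c * v ** 2 * f            # equation (2)

C, V, F = 1.0, 1.0, 1.0              # normalized reference block

baseline  = power(C, V, F)
# Parallel (FIG. 1b): capacitance doubles, voltage and frequency halve.
parallel  = power(2 * C, V / 2, F / 2)
# Pipelined (FIG. 1c): capacitance and frequency unchanged, voltage halves.
pipelined = power(C, V / 2, F)

print(baseline, parallel, pipelined)  # 1.0 0.25 0.25 -> both ideally 4x lower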
Most existing parallel and pipelined computations use a single global clock and voltage supply. To decrease power consumption, voltage scaling has been employed which uses software controlled voltage modulation based on run-time demands. Other current design efforts for low power operation lower voltage for portions of the circuit, i.e., voltage islands, which are removed from the critical path. A power efficient solution for stream-based pipelines having a plurality of stages but with different computational requirements in each stage has not yet been proposed.
A method for optimizing voltage and frequency for pipelined architectures that offers better power efficiency is not available. The inventors have discovered that it would be desirable to have a method of implementing pipelined architectures that result in reduced power consumption while maintaining high throughput by determining frequencies and voltages in conjunction with semiconductor parameters that are dependent upon the amount of streaming data processed in each stage of the pipeline.
One aspect of the invention provides methods for implementing a computation as a pipeline that processes streaming data. Methods according to this aspect of the invention preferably start with partitioning the computation into a plurality of temporal stages, each stage having at least one input and at least one output, wherein one of the stages is a first stage having at least one primary input and one of the stages is a last stage having at least one primary output, each stage defined by a clock frequency. Forming a pipeline by coupling at least one output from the first stage to at least one input of another one of the plurality of stages, and coupling at least one output from another one of the plurality of stages to at least one input for the last stage. Assigning a clock frequency to each one of the stages in the pipeline such that an overall throughput requirement is met and not all of the assigned stage clock frequencies are equal and assigning to each stage in the pipeline a supply voltage where not all of the assigned stage voltages are equal.
Another aspect of the method of the invention is inserting at least one storage element in at least one of the plurality of stages in the pipeline to allow for operational independence between the storage element stage and another one of the plurality of stages.
Yet another aspect of the method of the invention is an inverse discrete wavelet pipeline implementation having at least one reconstruction channel having a low input, a high input and an output, a row processing stage having a row reconstruction channel; the row reconstruction channel output coupled to a row stage storage element first input, the row storage element having a corresponding first output, and the row storage element having a second input and a corresponding second output, a third input and a corresponding third output, and a fourth input and a corresponding fourth output.
Other objects and advantages of the systems and methods will become apparent to those skilled in the art after reading the detailed description of the preferred embodiments.
FIG. 1a is a diagram of an exemplary single computation block.
FIG. 1b is a diagram of an exemplary parallel computation.
FIG. 1c is a diagram of an exemplary pipeline computation.
FIGS. 2a and 2b are diagrams of an exemplary method of the invention.
FIG. 3a is a diagram of an exemplary N row by M column array.
FIG. 3b is a diagram of an exemplary row decomposition of the array of FIG. 3a.
FIG. 3c is a diagram of an exemplary one level decomposition of the array of FIG. 3a.
FIG. 3d is a diagram of an exemplary two level decomposition of the array of FIG. 3a.
FIG. 3e is a diagram of an exemplary three level decomposition of the array of FIG. 3a.
FIG. 3f is a diagram of an exemplary four level decomposition of the array of FIG. 3a.
a is a schematic of an exemplary IDWT column stage in accordance with the invention.
b is a schematic of an exemplary IDWT row stage in accordance with the invention.
FIGS. 11a-11e are diagrams of an exemplary data flow of a five level IDWT using the column and row stages of the invention.
Embodiments of the invention will be described with reference to the accompanying drawing figures wherein like numbers represent like elements throughout. Before embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of the examples set forth in the following description or illustrated in the figures. The invention is capable of other embodiments and of being practiced or carried out in a variety of applications and in various ways. Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having” and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. The terms “mounted,” “connected,” and “coupled” are used broadly and encompass both direct and indirect mounting, connecting, and coupling. Further, “connected” and “coupled” are not restricted to physical or mechanical connections or couplings.
Shown in FIGS. 2a and 2b is an exemplary method of the invention for implementing a streaming computation as a power efficient pipeline. The performance and computation requirements of the streaming computation are determined (step 103), and the computation is partitioned and synthesized into a plurality of temporal pipeline stages (step 105).
A typical high-level synthesis algorithm comprises a number of steps. The operations within a computation are decomposed into a standard set of operations supported by the pipeline stages. For example, multiplications are broken up into addition and shift operations. Then, an interconnected network of standard operations is formed and allocated to available stages in the pipeline. One algorithm for performing this task is list scheduling, where the given network is topologically sorted and each operation is assigned to a component in the pipeline stage capable of executing it. An operation is assigned only after its predecessors in the network have been assigned. Based on granularity, different operations in the network may be allocated to the same pipeline stage or different stages. Operations in different pipeline stages are temporally divided from each other by latches between stages. Several practical heuristics exist to synthesize a pipeline with minimal stages, minimal latency, etc. A more detailed discussion of the synthesis step is beyond the scope of this disclosure. After synthesis, the operation(s) performed within each stage is translated into a hardware equivalent (step 107).
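As a rough illustration of the list-scheduling step described above, the sketch below packs topologically sorted operations into pipeline stages under a per-stage delay budget. It is a minimal sketch only; the operation network, delays and budget are assumed for illustration and this is not the synthesis algorithm of the invention.

from graphlib import TopologicalSorter

# Assumed operation network: op -> set of predecessor ops, with per-op delays (ns).
preds  = {"shift1": set(), "add1": {"shift1"}, "shift2": set(),
          "add2": {"add1", "shift2"}, "add3": {"add2"}}
delay  = {"shift1": 1.0, "add1": 2.0, "shift2": 1.0, "add2": 2.0, "add3": 2.0}
budget = 4.0                           # assumed per-stage delay budget (ns)

stage, finish = {}, {}                 # finish[op]: combinational delay within its stage
for op in TopologicalSorter(preds).static_order():   # predecessors come first
    # Start from the latest stage used by a predecessor; stay in that stage
    # if the accumulated delay fits the budget, otherwise open the next stage.
    s = max((stage[p] for p in preds[op]), default=0)
    arrival = max((finish[p] for p in preds[op] if stage[p] == s), default=0.0)
    if arrival + delay[op] > budget:
        s, arrival = s + 1, 0.0        # a latch separates the stages here
    stage[op], finish[op] = s, arrival + delay[op]

print(stage)   # e.g. {'shift1': 0, 'shift2': 0, 'add1': 0, 'add2': 1, 'add3': 1}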
Depending upon the performance/computation requirements (step 103) and synthesis (step 105), a storage element with write and read functionality may be inserted within a pipeline stage (steps 109, 111) if required. Storage elements are used to maintain continuous data flow and may or may not be required.
Once the hardware is synthesized and storage element allocation is complete, clock frequencies are assigned to each pipeline stage, starting with the final stage (step 113). The frequency of the final stage is determined to be as low as possible while maintaining the design throughput requirement. The clock frequency for each preceding stage is determined, set as low as possible while maintaining the design throughput (steps 115, 117, 119) until the clock frequencies for all stages in the pipeline are set to their lowest possible values.
After all stage clock frequencies have been assigned, the operating voltage for each pipeline stage is determined according to the respective clock frequencies (steps 121, 123). As discussed above, supply voltage Vdd and time delay Td are inversely proportional, which makes voltage Vdd and frequency f directly proportional. If the clock frequency for a preceding stage is halved, its supply voltage can likewise be halved so long as the stage supply voltage Vdd is higher than the hardware threshold voltage Vth as previously discussed.
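A minimal sketch of the frequency assignment (steps 113 through 119) and voltage assignment (steps 121, 123) described above, assuming a simple model in which each stage runs at the lowest clock that sustains its data rate and its voltage scales linearly with that clock down to a floor above the threshold voltage; the data rates, nominal voltage and Vth are placeholder values:

# Placeholder design parameters (illustrative only).
THROUGHPUT   = 100e6             # required output samples/s at the last stage
N_STAGES     = 4
DATA_RATIO   = 0.5               # each preceding stage handles half the data here
V_NOM, F_NOM = 1.2, 100e6        # reference voltage/frequency design point
V_TH         = 0.35              # assumed hardware threshold voltage

# Steps 113-119: assign the last stage first, then work backwards, setting each
# preceding stage's clock as low as its share of the data allows.
clocks = [THROUGHPUT]
for _ in range(N_STAGES - 1):
    clocks.insert(0, clocks[0] * DATA_RATIO)

# Steps 121-123: scale each stage voltage with its clock, kept above Vth.
volts = [max(V_NOM * fc / F_NOM, V_TH + 0.05) for fc in clocks]

for i, (fc, v) in enumerate(zip(clocks, volts), start=1):
    print(f"stage {i}: {fc/1e6:7.2f} MHz at {v:.2f} V")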
Depending upon the access of the read function swread, storage element str contents mem1 or mem2 can be read by stage C8. Depending upon the access of the write function swwrite, storage element str contents mem1 or mem2 can be written to by stage C7. In this example, the access of the write swwrite and read swread functions is controlled in opposite correspondence: one memory space mem2 is read from while the other memory space mem1 is written to.
Each stage C7, C8 can process data until it reads (stage C8) all data (mem2), or writes (stage C7) all data (mem1). The separation of stage operations using a storage element str is desirable when different stages have to write or read data in different patterns. The storage capacity of a memory space is greater than or equal to the latency of a following stage. A classic, prior art pipeline implementation only permits sequential dataflow, i.e., the output of a stage is accessed in the same order by the input of a subsequent stage. The operating frequency of each storage element str is that of its associated stage. The voltage V8 supplied to stage C8 is set as low as possible, corresponding to the clock f8 requirements of stage C8, but greater than the hardware threshold voltage Vth of stage C8. The voltage V7 supplied to stage C7 is then set as low as possible, corresponding to the clock f7 requirements of stage C7, but greater than the hardware threshold voltage Vth of stage C7.
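Behaviorally, the storage element acts as a double-buffered (ping-pong) memory: one space is written by the upstream stage while the other is read, in any pattern, by the downstream stage. A minimal sketch, with names chosen for illustration rather than taken from the hardware description:

class PingPongStore:
    """Two memory spaces; writes go to one while reads come from the other."""
    def __init__(self, capacity):
        self.mem = [[None] * capacity, [None] * capacity]
        self.write_sel = 0                 # space currently written (e.g. mem1)

    def write(self, addr, value):          # upstream stage (e.g. C7)
        self.mem[self.write_sel][addr] = value

    def read(self, addr):                  # downstream stage (e.g. C8)
        return self.mem[1 - self.write_sel][addr]

    def swap(self):                        # toggle once a full block is written
        self.write_sel = 1 - self.write_sel

store = PingPongStore(capacity=4)
for a in range(4):
    store.write(a, a * a)                  # C7 fills one memory space in order
store.swap()
print([store.read(a) for a in (3, 1, 0, 2)])   # C8 reads in a different pattern: [9, 1, 0, 4]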
The advantage of the method of the invention is reduced power consumption. As discussed above, power has a quadratic relationship with voltage and a linear relationship with frequency. Power therefore has a cubic relationship with voltage and frequency together. If frequency and voltage are both halved, power consumption reduces by a factor of 8. Another advantage is the use of storage elements providing for high throughput.
The invention can be used to efficiently realize operationally complex computations in hardware. What follows is an example of a low-power, high-throughput hardware implementation of a multi-stage digital signal transformation based upon the teachings of the invention. The example implements one of the more complex portions of JPEG 2000 image reconstruction, a 2-dimensional IDWT.
When reconstructing an image using a 2-dimensional IDWT, the amount of data increases with each successive level until the image is formed. To sustain the IDWT throughput, the hardware implementation requires resources that provide considerable storage, multipliers, and arithmetic logic units (ALUs). The method of the invention creates an efficient stream-based architecture employing polyphase reconstruction, multiple voltage levels, multiple clocked pipelines, and storage elements as will be described.
By way of background, the wavelet transform converts a time-domain signal to the frequency domain. The wavelet analysis filters different frequency bands and then sections each band into slices in time. Unlike a Fourier transform, the wavelet transform can provide time and location information for the frequencies, i.e., which frequency components exist at different time intervals. Image compression is achieved using a source encoder, a quantizer and an entropy encoder. Wavelet decomposition is the source encoder for image compression. Computation time for both the forward and inverse DWT is substantial and increases with signal size.
Wavelet analysis separates the smooth variations and details of an image by decomposing the image, using a DWT, into subband coefficients. The advantages of wavelet subband compression include gain control for image softening and sharpening, and a scalable compressed data stream. Wavelet image processing keeps an image intact once it is compressed, obviating distortions.
A typical digital image is represented as a two-dimensional array of pixels, with each pixel representing the brightness level at that point. In a color image, each pixel is a triplet of red, green and blue (RGB) subpixel intensities. The number of distinct colors that can be represented by a pixel depends on the color depth, i.e., the number of bits per pixel (bpp).
Images are transformed from an RGB color space to either a YCrCb or a reversible component transform (RCT) space leading to three components. After transformation, the image array can be processed.
A time-domain function f(t) can be expressed in terms of wavelets using the wavelet series

f(t) = Σ_S Σ_τ a_S,τ · ψ(S, τ, t),   (3)
where ψ(S, τ, t) represents the different wavelets obtained from the “mother wavelet” ψ, and S indicates dilations of the wavelet. A large S indicates a wide wavelet that can extract low frequency components when convolved with the input signal, while a small S indicates a narrow wavelet that can extract high frequency components. τ represents different translations of the mother wavelet in time and is used to extract frequency components at different time intervals of the input signal.
The coefficients a_S,τ of the wavelets are found using

a_S,τ = ∫ f(t) · ψ(S, τ, t) dt.   (4)
The discrete wavelet transform applies the wavelet transform to a discrete-time signal x(n) of finite length having N components. Filter banks are used to approximate the behavior of a continuous wavelet transform. Subband coefficients are found using a series of filtering operations.
Wavelet decomposition, applying a DWT in a forward direction, is performed using two-channel analysis filters: the signal is decomposed using a pair of filters, a half band low pass filter and a half band high pass filter, into high and low frequency components, followed by down-sampling.
Filtering a signal in the digital domain corresponds to the mathematical operation of convolution, where the signal is convolved with the impulse response of the filter. The half band low pass filter removes all frequencies that are above half of the highest frequency in the signal. The half band high pass filter removes all frequencies that are below half of the highest frequency in the signal. The low-frequency component usually contains most of the frequency of the signal and is referred to as the approximation. The high-frequency component contains the details of the signal.
Most natural images have smooth color variations with fine details represented as sharp edges in between the smooth variations. The smooth variations in color can be referred to as low frequency variations and the sharp variations as high frequency variations. The low frequency components constitute the base of an image, and the high frequency components add upon them to refine the image giving detail.
For image processing, digital high and low pass filters are commonly employed in the DWT and DCT processes as one or two-dimensional filters. One-dimensional filters operate on a serial stream of data, whereas two-dimensional filters comprise two one-dimensional filters that alternately operate on the data stream and its transpose.
The filters used for decomposition are typically transverse digital filters that compute
y[k] = Σ_{i=−∞}^{∞} H[i] · x[k−i] = Σ_{i=0}^{K} H[i] · x[k−i],   (5)
where H0, H1, H2, H3, . . . , HK are predefined filter coefficients or weights and the z^−1 elements are shift register positions temporarily storing incoming values. With each new value, the filter calculates an output value for a given instant in time by observing the input values surrounding that instant of time. As a new value arrives, the shift register values are displaced, discarding the oldest value. The process consists of multiplying each input value by the filter weights, which define the filtering action. By adjusting the weights, a low pass or a high pass filter can be obtained. Since the filters employed are half band low pass and half band high pass filters, the filter architectures are the same for each level of decomposition.
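Equation (5) is a standard finite impulse response (transverse) filter and can be sketched directly in Python; the weights below are arbitrary placeholders rather than the JPEG 2000 filter coefficients:

def fir_filter(x, h):
    """y[k] = sum over i of h[i] * x[k-i], per equation (5); samples before the
    start of the signal are taken as zero (the shift register starts empty)."""
    y = []
    for k in range(len(x)):
        acc = 0.0
        for i, weight in enumerate(h):
            if k - i >= 0:
                acc += weight * x[k - i]
        y.append(acc)
    return y

H = [0.25, 0.5, 0.25]                       # placeholder low pass weights H0, H1, H2
print(fir_filter([1, 2, 3, 4, 5, 6], H))    # [0.25, 1.0, 2.0, 3.0, 4.0, 5.0]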
Decomposition of an N×M color space is performed in levels, with each level performing a row-by-row (N) and a column-by-column (M) analysis. This type of wavelet decomposition is referred to as a 2-dimensional DWT; an example where N<M is shown in FIGS. 3a-3f.
Each row of pixels (sub pixels) is low and high pass filtered. After filtering, half of the samples can be eliminated or down-sampled, yielding two N × M/2 images referred to as L (low) and H (high) row subband coefficients. The intermediate results are indexed as an array in memory as shown in FIG. 3b.
The Nyquist theorem states that the minimum sampling rate needed to perfectly reconstruct a signal is twice the maximum frequency component of the signal. Therefore, if a half band low pass filter, which removes all frequency components larger than the median frequency, is applied to a signal, every other sample in the output can be discarded. Discarding every other sample subsamples the signal by two, whereby the signal has half the number of discrete samples, effectively doubling the scale. A variation of the theorem makes down-sampling applicable for a high pass filter that removes all frequency components smaller than the median frequency.
Decomposition halves the time resolution since half of the number of samples characterizes the entire signal. However, the operation doubles the frequency resolution since the frequency band of the signal now spans only half the previous frequency band, effectively reducing the uncertainty in the frequency by half. This is referred to as subband coding.
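One analysis level as described above (low pass and high pass filter the signal, then keep every other output sample) can be sketched as follows; the two-tap weights are chosen only for illustration:

def halfband(x, h):
    # Convolve x with weights h, per equation (5), with zero padding at the start.
    return [sum(h[i] * x[k - i] for i in range(len(h)) if k - i >= 0)
            for k in range(len(x))]

def analyze_1d(x, lp, hp):
    """One decomposition level: filter, then down-sample by two."""
    return halfband(x, lp)[::2], halfband(x, hp)[::2]   # (L, H) subbands

LP, HP = [0.5, 0.5], [0.5, -0.5]        # illustrative half band filter weights
L, H = analyze_1d([4, 6, 10, 12, 8, 6, 5, 5], LP, HP)
print(L)                                # [2.0, 8.0, 10.0, 5.5]  approximation
print(H)                                # [2.0, 2.0, -2.0, -0.5] detail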
From the data store, each column (M) of coefficients is low and high pass filtered, down-sampled, and stored, yielding four N/2 × M/2 sub images as shown in FIG. 3c.
JPEG 2000 supports pyramid decomposition. Pyramid decomposition only decomposes the LL sub image in subsequent levels, each level leading to four more sub images as shown in FIGS. 3d, 3e and 3f; after a four level decomposition, each of the fourth level subbands comprises 1/256 of the original image space. A fifth level decomposition would produce fifth level subbands L10, HL9, LHL8 and H2L8 (not shown). The subbands for a five level decomposition of one video frame are: L10, HL9, LHL8, H2L8; HL7, LHL6, H2L6; HL5, LHL4, H2L4; HL3, LHL2, H2L2; HL, LH and HH.
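For an N x M frame, the pyramid structure above means the subbands produced at each level halve in each dimension; a short bookkeeping sketch (the 1024 x 2048 size anticipates the example developed later in this description):

N, M, LEVELS = 1024, 2048, 5

rows, cols = N, M
for level in range(1, LEVELS + 1):
    rows, cols = rows // 2, cols // 2   # the LL sub image shrinks by half per dimension
    print(f"level {level}: subbands of {rows} x {cols} coefficients")
# level 5 prints 32 x 64, matching the fifth level subbands L10, HL9, LHL8 and H2L8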
The forward DWT analyzes the image data producing a series of subband coefficients. Rather than discarding some of the subband information and losing detail, all subband coefficients are kept and compression results from subsequent subband quantization and the compression scheme used in the entropy encoder. The quantizer reduces the precision of the values generated from the encoder reducing the number of bits required to save the transform coefficients.
Reconstruction of the original image is performed in reverse: entropy decoding, inverse quantization, and source decoding, the latter performing the DWT in an inverse direction (the IDWT).
A filter pair comprising high and low pass filters is used and is referred to as a synthesis filter. The inverse process begins using the subband coefficients output from the last level of a forward DWT, applying the filters column wise and then row wise for each level, with the number of levels corresponding to the number of levels used in the forward DWT until image reconstruction is complete. The inputs at each level of reconstruction are subband coefficients.
The IDWT can be implemented as a pipelined data path. Owing to up-sampling, successive stages of the pipeline operate on progressively larger amounts of data. For an N×M image, the last level of reconstruction operates on four subbands, each of size N/2 × M/2. The four subbands of the preceding level are each of size N/4 × M/4.
The input to each level of the IDWT consists of four subbands and the final output is an N×M image. Each level consists of column and row processing. The column stage which includes up-sampling produces two subbands. These subbands are row processed which includes up-sampling to produce another subband. For a given level of reconstruction, the rows cannot be processed until all of the columns are processed. For a high throughput, the row and column stages must be able to operate independently of each other to ensure continuous data flow.
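The column and row stages each perform the same one-dimensional synthesis operation: up-sample each input subband by inserting zeros, filter with the low pass and high pass synthesis filters, and add the results. A behavioral sketch, again with placeholder weights rather than the JPEG 2000 synthesis filters:

def upsample(x):
    """Insert a zero after every coefficient (doubles the sample count)."""
    out = []
    for v in x:
        out.extend([v, 0])
    return out

def halfband(x, h):
    # Convolve x with weights h, with zero padding at the start.
    return [sum(h[i] * x[k - i] for i in range(len(h)) if k - i >= 0)
            for k in range(len(x))]

def synthesize_1d(low, high, lp, hp):
    """One reconstruction step: up-sample both subbands, filter, and sum (the adder)."""
    a = halfband(upsample(low), lp)
    d = halfband(upsample(high), hp)
    return [x + y for x, y in zip(a, d)]

LP, HP = [1.0, 1.0], [1.0, -1.0]        # placeholder synthesis filter weights
print(synthesize_1d([1, 2, 3], [0.5, -0.5, 0.25], LP, HP))
# [1.5, 0.5, 1.5, 2.5, 3.25, 2.75] -- twice as many samples as each input subband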
Using the method of the invention shown in FIGS. 2a and 2b, the IDWT computation is partitioned and synthesized into a pipeline of temporal stages.
From the synthesis step (step 105), one stage is produced for column processing 17 and another stage is produced for row processing 33, as described below.
The column processing stage 17 is derived for each level of the IDWT according to the teachings of the invention. The column processing stage 17 comprises two reconstruction channels having four inputs cin1, cin2, cin3, cin4 and four up-samplers up1, up2, up3, up4, each coupled to an input, the up-sampler outputs coupled to two synthesis filters 191, 192, each synthesis filter comprising a low LPF1, LPF3 and a high HPF2, HPF4 pass filter, each filter having an input LPFin1, HPFin2, LPFin3, HPFin4 coupled to a respective up-sampler up1, up2, up3, up4. Each synthesis filter pair 191, 192 output LPFout1, HPFout2, LPFout3, HPFout4 is coupled to an adder 211, 212. Each adder 211, 212 output is coupled to a respective storage element strcol write function sw1write, sw2write.
As described above, each storage element strcol allocates memory spaces for storing data output from an upstream computation, while allowing a downstream computation to read previously written data in any pattern. For each pair of memory spaces, write/read functions are used to direct data exclusively to and from each memory space for simultaneous writing and reading, allowing upstream and downstream computation stages to function independently.
The storage element strcol for the column stage 17 has two pairs of allocated memory spaces mem1a, mem1b, mem2a, mem2b accessed by write/read functions sw1write, sw1read, sw2write, sw2read. The common pole of the write function sw1write is coupled to the output of the first channel adder 211. The common pole of the write function sw2write is coupled to the output of the second channel adder 212. The common poles of the two read functions sw1read, sw2read are coupled to stage outputs cout1, cout2, respectively. The column IDWT stage 17 is used in conjunction with the row IDWT stage 33 for 2-dimensional, n level IDWT reconstruction.
A voltage input Vcolx provides the operating voltage for the column x stage 17 based upon the clock 27 frequency. A controller 31 accepts an image information signal setting forth the size of the image, frame rate, color depth (bpp) and level of reconstruction, known a priori, from a common bus BUS coupling all stages in all levels, and controls the switching action of the storage element strcol write/read functions over line 29. The image information is obtained either from an external control, such as a user configurable setting, or, more advantageously, is decoded upstream from the incoming data stream header prior to entropy decoding. A maximum image size determines the required storage element capacity for each column 17 and row 33 stage. Image sizes less than the maximum can be processed. Each smaller image size has a correspondingly smaller memory footprint in the allocated memory spaces. The image information sets the write/read access pattern of each storage element memory space for each image size.
The row processing stage 33 is derived for each level of the IDWT according to the teachings of the invention. The row processing stage 33 comprises one reconstruction channel and five inputs rin1, rin2, rin3, rin4, rin5, with two up-samplers upL, upH coupled to inputs rin1, rin2, the up-sampler outputs coupled to a synthesis filter 19 comprising a low LPF and a high HPF pass filter, each filter having an input LPFin, HPFin coupled to a respective up-sampler upL, upH, and an output LPFout, HPFout coupled to the reconstruction channel adder 21. The adder 21 output is coupled to a storage element strrow write function swwrite.
The storage element strrow for the row stage 33 has four pairs of allocated memory spaces mema, memb, mem3a, mem3b, mem4a, mem4b, mem5a, mem5b accessed by four write/read functions swwrite, swread, sw3write, sw3read, sw4write, sw4read, sw5write, sw5read. Write function swwrite is coupled to the output of the adder 21. The three remaining write functions sw3write, sw4write, sw5write are coupled to stage inputs rin3, rin4, rin5 to receive subband coefficients available and waiting to be processed. The four read functions swread, sw3read, sw4read, sw5read couple to row stage outputs rout, rout3, rout4, rout5.
A voltage input Vrowx provides operating voltage for the row x stage 33 based upon clock 37 frequency. A controller 41 accepts a signal setting forth the size of the image, color depth (bpp) and level of reconstruction, known a priori, from a common bus BUS and controls the switching action of the storage element strrow write/read functions over line 39. The row processing stage 33 for the last level is simplified needing only the reconstruction channel.
FIGS. 11a-11e show a five level IDWT using the column 17 and row 33 stages. The beginning of the inverse transform is the fifth level.
By knowing the reconstructed image size, bpp and number of levels of reconstruction, the column strcol5, strcol4, strcol3, strcol2, strcol1 and row strrow5, strrow4, strrow3, strrow2 storage element memory spaces, the clock frequencies clkcol5, clkrow5, clkcol4, clkrow4, clkcol3, clkrow3, clkcol2, clkrow2, clkcol1, clkrow1 and the stage voltages Vcol5, Vrow5, Vcol4, Vrow4, Vcol3, Vrow3, Vcol2, Vrow2, Vcol1, Vrow1 can be determined.
Continuing with the example, for real-time reconstruction of one color plane of a moving picture having an image resolution of 1024 (2^10) × 2048 (2^11) pixels (i.e., sub pixels) at a frame rate of 48 frames per second, wavelet reconstruction of the 1024 (N) × 2048 (M) color space would assemble an image having 2,097,152 pixels, requiring the source decoder (IDWT) to process 100,663,296 pixels per second, with each pixel having an associated color depth. For this example, each pixel has a 16 bit value. The larger the color depth, the more storage element memory required. The clock rate supporting real-time reconstruction would be ~9.9 ns per pixel, or ~101 MHz, at the output of the last (1st) level (step 115).
For moving images having a frame rate of 48 fps, each frame of the moving image is processed for display every 0.0208 seconds. For the five level IDWT 51, the clock frequency of each preceding stage is then set as low as possible while maintaining this throughput (steps 117, 119). Since each stage processes approximately half as much data per frame as the stage that follows it, each preceding stage clock is approximately one half that of the following stage, ranging from approximately 197 kHz at the fifth level column stage (clkcol5) to approximately 101 MHz at the first level row stage (clkrow1).
The last step of the invention is assigning operating voltages (steps 121, 123) to each stage in the pipeline 51. The ten stage voltages Vcol5, Vrow5, Vcol4, Vrow4, Vcol3, Vrow3, Vcol2, Vrow2, Vcol1, Vrow1 can be determined since each stage voltage is proportional to the stage operating frequency. Each stage voltage must be greater than the threshold voltage Vth of the respective stage hardware. A theoretical value can be approximated for each stage threshold voltage Vth, or it can be obtained empirically. For the streaming computation to have maximum power efficiency, the stage in the pipeline having the fastest clock frequency clkrow1 will typically have the highest voltage Vrow1 and the stage having the slowest clock frequency clkcol5 will have the lowest voltage level Vcol5. The stage voltages residing between the maximum Vrow1 and minimum Vcol5 vary accordingly (Vrow5, Vcol4, Vrow4, Vcol3, Vrow3, Vcol2, Vrow2, Vcol1). Alternatively, each stage voltage in the pipeline can have the same value, or at least one or more different values, so long as the voltage threshold requirement for each stage is met.
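The ten stage clocks, and stage voltages proportional to them, can be tabulated with a short sketch that halves the clock at each step backwards from the ~101 MHz output rate; the nominal voltage and threshold floor are placeholder values, and small differences from the rounded figures quoted in the text (for example 197 kHz and 395 kHz) are due to rounding:

FRAME_RATE  = 48                      # frames per second
N, M        = 1024, 2048              # reconstructed image size (pixels)
LEVELS      = 5
V_NOM, V_TH = 1.2, 0.35               # placeholder nominal supply and threshold

pixels_per_s = N * M * FRAME_RATE     # 100,663,296 -> ~101 MHz at the output
stages = [f"{kind}{lvl}" for lvl in range(LEVELS, 0, -1) for kind in ("col", "row")]

# The last stage (row1) runs at the output rate; each preceding stage handles
# half as much data per frame, so its clock (steps 117, 119) is halved.
clocks, rate = {}, float(pixels_per_s)
for name in reversed(stages):         # row1, col1, row2, ... col5
    clocks[name] = rate
    rate /= 2

for name in stages:                   # col5 ... row1
    fc = clocks[name]
    v = max(V_NOM * fc / pixels_per_s, V_TH)   # steps 121, 123, floored at Vth
    print(f"clk{name}: {fc/1e6:8.3f} MHz   V{name}: {v:.2f} V")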
After entropy decoding, inverse quantization and removal of any header information is complete, the subband pixel coefficients for each frame of the one color plane enter the source decoder 51 at a clock clkx rate of 98,600 Hz.
FIGS. 11a-11d show an incoming frame subband coefficient data stream L10, HL9, LHL8, H2L8; HL7, LHL6, H2L6; HL5, LHL4, H2L4; HL3, LHL2, H2L2; HL, LH and HH, and their respective storage element memory spaces 53a, 53b, 55a, 55b, 57a, 57b, 59a, 59b, 61a, 61b. Each storage element memory space alternately stores subband coefficients for one incoming frame for reconstruction. For this example, the incoming frame subband coefficients would be continuously written 48 times per second in alternate a, b memory spaces of the incoming frame 53a, 53b, and fifth 55a, 55b, fourth 57a, 57b, third 59a, 59b, and second 61a, 61b level row storage elements strrowx. The fifth level subband coefficients L10, HL9, LHL8, H2L8, fourth level subband coefficients HL7, LHL6, H2L6, third level subband coefficients HL5, LHL4, H2L4, second level subband coefficients HL3, LHL2, H2L2 and first level subband coefficients HL, LH and HH for frame 1 are written into one of the memory spaces (a) of the storage elements, completing all subband coefficients for one frame. The coefficients arrive in time for each level of reconstruction. A discussion of inverse quantization, which controls the incoming subband coefficients, is beyond the scope of this disclosure. The process continues by writing the fifth level subband coefficients L10, HL9, LHL8, H2L8 for the next frame (2) into the other memory space (b) of the incoming frame storage element 53.
Fifth, fourth, third, second and first level reconstruction proceed in turn, each level's stages beginning to process a frame's coefficients as soon as the subbands it requires become available; the per-level processing is described in more detail below.
The entire five level IDWT 51 is filled and busy, with each stage of each level processing coefficients belonging to a subsequent frame. Column 17 and row 33 stages of each level of the IDWT 51 contain storage elements strcolx, strrowx for allocating memory spaces mema, memb for the fifth level 71a, 71b, 63a, 63b, 55a, 55b, fourth level 73a, 73b, 65a, 65b, 57a, 57b, third level 75a, 75b, 67a, 67b, 59a, 59b, second level 77a, 77b, 69a, 69b, 61a, 61b, and first level 79a, 79b, for holding the results of column processing 17 before row processing 33 and allowing the row processing stages 33 to access the memory spaces in a transpose read.
The fifth level subband coefficients L10, HL9, LHL8 and H2L8 each comprise 32×64 values.
The four subbands L10, HL9, LHL8 and H2L8 are read by column, up-sampled up1, up2, up3, up4 by inserting a zero between each coefficient, and low pass and high pass filtered using the two synthesis filters 191, 192. Up-sampling increases the clock rate by a factor of two, transitioning from 98,600 Hz (clkx) to 197 kHz (clkcol5). The synthesis filter 191, 192 outputs are summed 211, 212, forming two subbands L9 and HL8, each comprising 64×64 coefficients, which are written into a fifth level column storage element 71. The memory required would be 65,536 bits, or 8,192 bytes, for all coefficients of one subband. Since there are two subbands L9 and HL8, and two memory spaces are employed, the total subband memory required for the fifth level column storage element 71 is approximately (8,192 bytes)×(2 subbands)×(2 memory spaces)≅32 KB.
The coefficients of subbands L9 and HL8 are read by rows in a row stage 33, up-sampled upL, upH, and low pass and high pass filtered using one synthesis filter 19. The 197 kHz clock rate (clkcol5) transitions to 395 kHz (clkrow5). The values are summed 21 forming subband coefficients L8 and are written into a fourth level row storage element 63, 55.
The amount of memory required to store subband coefficients for each level of the IDWT progressively increases by a factor of four. The fourth level subbands L8, HL7, LHL6 and H2L6 each comprise 64×128 coefficients. For a sixteen bit color depth, 131,072 bits or 16,384 bytes are required. Using two memory spaces, (16,384 bytes)×(4 subbands)×(2 memory spaces)≅131 KB are required.
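The storage element sizing above follows directly from the subband dimensions, the 16 bit color depth, and the two (a, b) memory spaces; a short sketch reproducing the fifth and fourth level figures (131,072 bytes is reported as ~131 KB in the text and as 128 KB here because of the 1024-byte kilobyte):

BITS_PER_COEFF = 16
MEM_SPACES     = 2                    # alternate a/b memory spaces

def store_bytes(rows, cols, subbands):
    """Bytes for `subbands` subbands of rows x cols coefficients in both spaces."""
    return rows * cols * BITS_PER_COEFF // 8 * subbands * MEM_SPACES

# Fifth level column storage element 71: two subbands (L9, HL8) of 64 x 64 each.
print(store_bytes(64, 64, 2) // 1024, "KB")    # 32 KB
# Fourth level: four subbands (L8, HL7, LHL6, H2L6) of 64 x 128 each.
print(store_bytes(64, 128, 4) // 1024, "KB")   # 128 KB (131,072 bytes)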
At the fourth level, subbands L8, HL7, LHL6 and H2L6 are up-sampled and column 17 processed, and the resulting two subbands are row 33 processed to produce subband L6 for the third level, with the stage clock rates again doubling at each up-sampling.
At the third level, subbands L6, HL5, LHL4 and H2L4 are up-sampled and column processed 17, and the resulting two subbands are row processed 33 to produce subband L4 for the second level.
At the second level, subbands L4, HL3, LHL2 and H2L2 are column processed 17 and row processed 33 in the same manner to produce subband L2 for the first level.
At the first level, subbands L2, HL, LH and HH are column processed 17 and row processed 33 to produce the reconstructed N×M image at approximately 101 MHz (clkrow1).
The above example shows the method of the invention as applied to one type of signal processing transform, the IDWT, requiring multiple temporal stages, each stage having a storage element allocating memory spaces and its own operating frequency and voltage for maximum power efficiency. The invention can likewise be used to derive pipeline stages for a DWT, DCT, IDCT and other signal processing streaming calculations.
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims.