The invention relates generally to image processing. More specifically, the invention relates to image reconstruction using the Discrete Periodic Radon Transform (DPRT) and related computer hardware architecture.
While a computer system is operating, resources are spent to manage various aspects of the system. Systems may be configured before operation to prioritize certain resources of the system over others. Other systems may include an embedded system that configures system resources. Embedded systems typically maintain a dedicated function within a larger mechanical, electrical or computer system, often with real-time computing constraints.
Embedded systems include, for example, hardware components such as programmable logic devices (PLDs). PLDs, including field-programmable gate arrays (FPGAs), are integrated circuits (ICs) that can be programmed to implement user-defined logic functions.
FPGAs are an important and commonly used circuit element in conventional electronic systems. FPGAs are attractive for use in many designs in view of their low non-recurring engineering costs and rapid time to market. FPGA circuitry is also being increasingly integrated within other circuitry to provide a desired amount of programmable logic. Many applications can be implemented using FPGA devices without the need of fabricating a custom integrated circuit.
The Discrete Radon Transform (DRT) is an essential component of a wide range of applications in image processing. Applications of the DRT include the classic application of reconstructing objects from projections such as in computed tomography, radar imaging and magnetic resonance imaging. More recently, the DRT has also been applied in image denoising, image restoration, texture analysis, line detection in images, and encryption.
A popular method for computing the DRT involves the use of the Fast Fourier Transform (FFT). First, the 2-D FFT is used for computing the 2-D FFT spectrum. Second, the 2-D FFT spectrum is sampled along different radial lines through the origin. Then, the 1-D inverse FFT is used for estimating the DRT. This direct approach based on the FFT suffers from many artifacts. Assuming that the DRT is computed directly, an exact inversion algorithm has been proposed including improvements related to the elimination of interpolation calculations. However, the DRT requires the use of expensive floating point units for implementing the FFTs. Floating point units require significantly larger amounts of hardware resources than fixed point implementations.
The Discrete Periodic Radon Transform (DPRT) has been extensively used in applications that involve image reconstructions from projections. Beyond classic applications, the DPRT can also be used to compute fast convolutions that avoid the use of floating-point arithmetic associated with the use of the Fast Fourier Transform. Unfortunately, the use of the DPRT has been limited by the need to compute a large number of additions and the need for a large number of memory accesses.
Fixed point implementations of the DRT can be based on the DPRT. Previous research as resulted in the introduction of the forward DPRT algorithm for computing the 2-D Discrete Fourier Transform as well as a sequential algorithm for computing the DPRT and its inverse for prime sized images.
Similar to the continuous-space Radon Transform, the DPRT satisfies discrete and periodic versions of the Fourier slice theorem and the convolution property. Thus, the DPRT can lead to efficient, fixed-point arithmetic methods for computing circular and linear convolutions. The discrete version of the Fourier slice theorem provides a method for computing 2-D Discrete Fourier Transforms based on the DPRT and a minimal number of 1-D FFTs.
Scalable solutions for embedded systems are desired that can deliver different solutions based on one or more constraints. As an example, it is desirable to have a low-energy solution when there is a requirement for long-time operation and a high-performance solution when there is no power (or energy) constraint. More specifically, scalable solutions are desired that can allocate hardware resources based on constraints, for example minimize energy consumption, minimize bitrate requirements, increase accuracy, increase performance, and/or improve image quality.
There is a demand for DPRT algorithms that are both fast—the computation provides the result in the minimum number of cycles—and scalable—the approach provides the fastest implementation based on the amount of available resources. The invention satisfies this demand.
The invention introduces a fast and scalable approach for computing the forward and inverse DPRT that is based on parallel shift and add operations. As demonstrated in an FPGA implementation, for a wide range of computational resources, the invention provides the fastest possible computation of the DPRT. For purposes of this application, the term “SFDPRT” refers to scalable and fast Discrete Periodic Radon Transform, “iSFDPRT” refers to the inverse scalable and fast Discrete Periodic Radon Transform, “FDPRT” refers fast Discrete Periodic Radon Transform, and “iFDPRT” refers to the inverse fast Discrete Periodic Radon Transform.
Specifically, the invention introduces a fast and scalable approach for computing the forward and inverse DPRT that uses: (i) a parallel array of fixed-point adder trees to compute the additions, (ii) circular shift registers to remove the need for accessing external memory components, (iii) an image block-based approach to DPRT computation that can fit the proposed architecture to available resources, and (iv) fast transpositions that are computed in one or a few clock cycles that do not depend on the size of the input image. Compared to previous approaches, the scalable approach provides the fastest known implementations for different amounts of computational resources.
One advantage of the invention is that fast and scalable DPRT approaches lead to algorithms and architectures that are optimal in the multi-objective sense. As an example, the invention may be connected to integrated circuits and dynamic management of hardware resources in an embedded system via multi-objective optimization as described in U.S. patent application Ser. No. 14/069,822 filed on Nov. 11, 2013, which is incorporated by reference in its entirety.
Another advantage of the invention is that the fast and scalable architecture can be adapted to available resources. The invention is scalable, allowing application to larger images with limited computational resources. For example, the required hardware resources can be reduced so that the architecture fits in smaller devices while sacrificing speed. Alternatively, larger devices can provide faster implementations while possibly requiring more resources. Specifically, the invention is designed to be fast in the sense that column sums are computed on every clock cycle. In the fastest implementation, a prime direction is computed on every clock cycle.
Another advantage of the invention is that it is Pareto-optimal in terms of the required cycles and required resources. Thus, the invention provides the fastest known implementations for the given computational resources. As an example, in the fastest case, for an N×N image (N prime), the DPRT is computed in linear time (6N+┌log2 N┐+3 clock cycles) requiring resources that grow quadratically (0(N2)). In the most limited resources case, the running time is quadratic (┌N/2┐(N+8)+N+4 clock cycles) requiring resources that grow linearly (0(N). A Pareto-front of optimal solutions is given for resources that fall within these two extreme cases. Furthermore, when sharing comparable computational resources, the invention is better than previous approaches.
For the fastest case, the scalable architecture can be reduced to obtain the FDPRT in 2N+┌log2 N┐+1 and iFDPRT in 2N+3┌log2 N┐+B+2 cycles with B being the number of bits used to represent each input pixel.
Another advantage of the invention provides parallel and pipelined implementation that provides an improvement over sequential algorithms. The invention computes N×H additions in a single clock cycle. Furthermore, shift registers are used to make data available to the adders in every clock cycle. Then, additions and shifts are performed in parallel in the same clock cycle. To keep the flow of the addition data at high speed, the use of RAMs may be suppressed in favor of using shift registers to provide the data to every adder per clock cycle. Furthermore, in the same clock cycle, addition and shifting can be performed simultaneously.
Another advantage of the invention is that fast transpositions are based on parallel Random Access Memory (RAM) access. A unique RAM architecture and associated algorithm provides a complete row or column of the input image in one clock cycle. Using this architecture, transposition is avoided since the image can be accessed by either rows or columns.
In certain embodiments of the invention, the usage of RAMs is eliminated, the complete image is loaded in registers and the DPRT is computed thereby eliminating the constraint of how much data can be transferred to the adders per clock cycle. In other embodiments, RAMs may be used as a buffer to hold the complete input image and the output DPRT. The invention provides an architecture and associated algorithm that can provide a complete row or column of the input image in one clock cycle (i.e. a throughput of N pixels per clock cycle), and without increasing the amount of RAM required to hold the complete image.
Another advantage of the invention is that the architectures are not tied to any particular hardware. They can be applied to any existing hardware (e.g., FPGA or VLSI) since they are developed in VHDL and are fully parametrized for any prime N.
The invention and its attributes and advantages may be further understood and appreciated with reference to the detailed description below of one contemplated embodiment, taken in conjunction with the accompanying drawings.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an implementation of the invention and, together with the description, serve to explain the advantages and principles of the invention:
The invention is directed to a scalable approach providing the fastest known implementations for different amounts of computational resources for computing the forward and inverse Discrete Periodic Radon Transform (DPRT). According to the invention, for a N×N image (N prime), the roposed approach can compute up to N2 additions per clock cycle. As an example, for a 251×251 image, for approximately 25% less resources, the fast and scalable DPRT (SFDPRT) is computed 39 times faster than the fastest known implementation. For the fastest case, the scalable architecture can be further reduced to obtain the FDPRT and iFDPRT in 2N+┌log2 N┐+1 and 2N+3 ┌log2N┐+B+2 cycles respectively (B is the number of bits used to represent each input pixel). Generic and parametrized architectures are developed in VHDL and validated using an FPGA implementation as described more fully below.
For purposes of this application, the following notation is introduced. N×N images are considered where N is prime. ZN denotes the non-negative integers: {0, 1, 2, . . . , N−1}, and l2(ZN2) be the set of square-summable functions over ZN2 with fεl2(ZN2) being a 2-D discrete function that represents an N×N image, where each pixel is a positive integer value represented with B bits. Subscripts are used to represent rows. For example, fk(j) denotes the vector that consists of the elements of f where the value of k is fixed. Similarly, for R(r,m,d), Rr,m(d) denotes the vector that consists of the elements of R with fixed values for r, m. It is noted that all are fixed but the last index.
N
H≠H since N is prime. With K being the number of strips, K=┌N/H┐. With r denoting the r-th strip, the DPRT can be computed over each strip using:
where
R′(r, m, d) denotes the r-th partial DPRT defined by:
where r=0, . . . , K−1 is the strip number. The DPRT is computed by accumulating the partial sums from each strip. Specifically, the DPRT is computed as a summation of partial DPRTs using:
Similarly, the partial inverse DPRT, or iDPRT, of Nm, d) uses
which allows the computation of iDPRT of R(m, d) using a summation of partial iDPRTs according to:
Scalability is achieved by controlling the number of rows used in each rectangular strip. Thus, for the fastest performance, the largest strip size is chosen that can be implemented using available hardware resources. The final result is computed by combining the DPRTs.
Each row of the SFDPRT core register array is implemented using a Circular Left Shift (CLS) register that can be used to align the image samples along each column. Each column of this array has a H-operand fully pipelined adder tree capable to add the complete column in one clock cycle. The output of the adder trees provide the output of the SFDPRT core 204, which represents the partial DPRT of f. The combination of shift registers and adders allows the computation of H×N additions per clock cycle with a latency of ┌log2 H┐. At the end, the outputs of the SFDPRT core 204 are accumulated using MEM_OUT 206. The finite state machine (FSM) 208 implements the algorithm for processing the SFDPRT according to one embodiment of the invention as shown in
According to the algorithm shown in
There are differences between SFDPRT and iSFDPRT with respect to input size, transposition, and circular right shifting. With respect to input size, the input is R(m, d) with a size of (N+1)×N pixels. Since less terms are needed for computation, there is no transposition of the input image. Thus, the horizontal sums that required fast transposition are no longer needed. As a result, use of MEM_IN is optional and only needed to buffer/sync the incoming data. In specific implementations, MEM_IN may be removed provided that the data can be input to the hardware in strips as described in the algorithm as shown in
Another difference between between SFDPRT and iSFDPRT pertains to the shift array. Specifically, the circular right shifting (CRS) replaces circular left shifts (CLS) since the iDPRT index requires j−mi
N as opposed to
d+mi
N for the DPRT.
The fastest possible implementations can be derived when the need to divide the input into strips (H=N) is eliminated resulting in the fast Discrete Periodic Radon Transform (FDPRT) and its inverse (iFDPRT). For both cases, the use of RAM is completely removed since the entire input can be held inside the register array. For the FDPRT, the register array is also be modified to implement the fast transposition that is required for the last projection (transposition time=1 clock cycle).
Overall, after accounting for the time to load the image into the register array, the FDPRT can be computed in 2N+n+1 cycles. For the iFDPRT, the architecture is basically reduced to the CRS registers plus the adder trees. For the iFDPRT, there is an additional CLS(1), the subtraction of S and the normalizing factor (1/N) that is be embedded inside the adder trees (see
For scalable architectures, t that are optimal in the multi-objective sense are considered. Two objectives considered by the invention include: (i) running time and (ii) hardware resources. An implementation is considered to be sub-optimal if another (different) implementation can be found that runs at the same time or faster for the same or less hardware resources, excluding the case where both the running time and computational resources are equal. The set of realizations that are not sub-optimal form the Pareto-front of realizations that are optimal in the multi-objective sense. As an example, the invention may be connected to integrated circuits and dynamic management of hardware resources in an embedded system via multi-objective optimization as described in U.S. patent application Ser. No. 14/069,822 filed on Nov. 11, 2013, which is incorporated by reference in its entirety.
Considering the set of Pareto-optimal architectures for a fixed image size N. The image is partitioned into K strips where each one contains H rows of the input image. Since N is prime, then N=KH cannot be had. This implies all of the strips will not be completely filled up, i.e., there will always be rows that do not contain image data. Thus, the last strip uses H−1 rows.
Furthermore, it is easy to see that having the maximum of H−1 rows in the last strip provides the best use of hardware resources. More generally, it is desired to minimize the number of unused image rows in the last strip. Formally, for any given N, the entire set of Pareto-optimal realizations can be derived as given by:
where the number of strips can vary
The upper limit of
comes from the requirement that at least two rows in each strip are kept, i.e., H=2.
For the parallel load, refer to
The memory allows transposition to be avoided as described in
The memory components of the invention include RAM-blocks that are standard Random Access Memory (RAM) with separate address, data read, and data write buses. The MODE signal is used to select between row and column access. For row access, the addresses are set to the value stored in ARAM[0]. Column access is only supported for MEM_IN. The addresses for column access are determined using: ARAM[i]=ARAM[0]+i
N, i=1, . . . , N−1.
The algorithm for processing the SFDPRT of
Second, image strips are loaded into SFDPRT core, shifted and written back to MEM_IN as described in
Third, image strips are loaded into the SFDPRT core and left-shifted once as described in
Lastly, to avoid transposition the input image is accessed in column mode for the last projection. The rest of the process is the same as for the previous N projections. The transform is computed using exact arithmetic using NO=B+┌log2 N┐ bits to represent the output where the input uses B-bits per pixel.
According to the invention, inverse DPRT implementations include inverse scalable and fast Discrete Periodic Radon Transform (iSFDPRT) and inverse fast Discrete Periodic Radon Transform (iFDPRT). The iFDPRT hardware implementation is shown in
As shown in
Multiplexers (MUXes) are used to support loading and shifting as separate functions. The vertical adder trees generate the Z(i) signals. A new row of Z(0), Z(1), . . . , Z(N−1) is generated for every cycle. The horizontal adder tree computes SR. The SR computation is the same for all rows. The latency of the horizontal adder tree is ┌log2 N┐ cycles. SR is ready when Z(i) is ready, as the latency of the vertical adder trees is ┌log2(N+1)┐. The SR value is fed to the ‘extra units’, where all Z(i)'s subtract SR and then divide by N. The term R(N,j) is included by loading the last input row on the last register row, where the shift is one to the left. It is always the same element (e.g., the left-most one) that goes to all vertical adders.
A summary of bit width requirements for perfect reconstruction is now provided. It is assumed that the Radon transform coefficients use B′-bits. The number of bits of the vertical adder tree outputs Z(i) are then set to BO=B′+┌log2(N+1)┐. The number of bits of SR need to be BQ=B′+┌log2N┐. Assuming that the input image f is B bits, only B bits are needed to reconstruct the image and the relationship between B′ and B needs to be: B′=B+┌log2N┘ bits. For the subtractor, Z(i)=Σm=0N-1R(m,j−mi
N)+R(N, i) and Z(i)≧SR since f(i,j)≧0. The result of Z(i)−SR is always positive requiring BO bits. Thus, for perfect reconstruction, the result F(i) needs to be represented using BO bits.
For each strip, different amounts of right shifting need to be implemented. This is implemented using (K+1)-input MUXes. Since N is always prime, at least one row of the register array is unused during computations for the last strip. The unused row is used to load the term R(N,j). The vertical MUXes located on the last valid row of the last strip ensure that the term R(N,j) is considered only when the last strip is being processed. Here, for the last row of the last strip, the shift is required to be one to the left. Also, the remaining unused rows are fed with zeros.
Beyond the iSFDPRT core, there is also the input and output memories, an array of adders and divisors, and ancillary logic. According to the invention, the diagonal mode of the memories, flipping the data, or rearrangement of the input memory is not needed. The basic process consists of loading each strip, processing it on the iSFDPRT core, accumulating it to the previous result, and storing it in the output memory. For the last strip, the result is accumulated in addition to subtracting SR and dividing by N.
Overall, the invention results in the fastest running times. Even in the slowest case, the running time is better than any previous implementation. However, in some cases, the better running times come at a cost of increased resource usage. Thus, both running times and required resources are compared. As can be seen from
The invention provides fast and scalable methods for computing the DPRT and its inverse. Overall, at comparable resources, the invention provides much faster computation of the DPRT. Furthermore, the scalable DPRT methods provide fast execution times that can be fitted in available resources.
The described embodiments are to be considered in all respects only as illustrative and not restrictive, and the scope of the invention is not limited to the foregoing description. Those of skill in the art may recognize changes, substitutions, adaptations and other modifications that may nonetheless come within the scope of the invention and range of the invention.
This application claims the benefit of U.S. Provisional Application No. 61/916,523 filed on Dec. 16, 2013, and U.S. Provisional Application No. 61/968,613 filed on Mar. 21, 2014, each of which are incorporated by reference in their respective entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US14/70371 | 12/15/2014 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
61968613 | Mar 2014 | US | |
61916523 | Dec 2013 | US |