This invention relates to a method for partial bitstream relocation on Field Programmable Gate Arrays.
Partial bitstream relocation (PBR) on Field Programmable Gate Arrays (FPGAs) is a technique to scale the parallelism of accelerator architectures at run time and to enhance fault tolerance. PBR techniques have focused on reading inactive bitstreams stored in memory, on-chip or off-chip, whose contents are generated for a specific partial reconfiguration region (PRR) and modified on demand for configuration into a PRR at a different location. As an alternative, we disclose a PRR-PRR relocation method to generate source and destination addresses, read the bitstream from an active PRR (source) in a nonintrusive manner, and write it to a destination PRR. We describe two embodiments of realizing this on Xilinx Virtex 4 FPGAs: (a) a hardware-based accelerated relocation circuit (ARC) and (b) a software solution executed on Microblaze. A comparative performance analysis highlighting the speed-up obtained using ARC is presented. The measured performance of the current embodiments is compared to the estimated performance of two state-of-the-art methods.
Emerging reconfiguration techniques, including partial dynamic reconfiguration (PDR) and partial bitstream relocation (PBR), have been studied as ways to expose the flexibility of FPGAs at run time. PBR is a technique used to target a partial bitstream of a PRR onto other identical PRRs inside an FPGA, while PDR is used to target a single PRR. Fast PBR techniques are required to support certain fault-tolerant applications, where the time to replace a faulty circuit with the correct circuit (using relocation) and restart the computation is critical to performance. Other applications that require fast PBR include rapid rescaling of kernels for navigation and image processing in satellites. Another application is the ability to move circuits around in a 3D FPGA stack to mitigate hot-spot formation. Techniques for PBR can be classified based on the following five criteria: (a) location of the processor that manipulates the bitstream: on-chip or off-chip; (b) type of on-chip processor: hardware or software; (c) bitstream storage for on-chip processing: on-chip Block RAMs (BRAMs) or off-chip Flash memory; (d) type of wrapper used to communicate with the Internal Configuration Access Port (ICAP): Xilinx-provided hardware ICAP (HWICAP) or a custom wrapper; and (e) type of relocation supported: relocation to identical or non-identical PRRs. Existing works on PBR are analyzed based on these criteria. PARBIT is one of the earliest tools developed to support PBR. This tool runs on an off-chip processor. It extracts a partial bitstream from a bitstream file and transforms it to be relocated to a new PRR. pBITPOS is one of the earliest tools that can relocate BRAMs and 18×18 Multipliers. This tool is similar to PARBIT and targets the Virtex II and Virtex II Pro families of FPGAs. REPLICA is a dedicated hardware relocation filter that transforms the bitstream while it is being downloaded from off-chip memory.
REPLICA targets Virtex-E devices, can relocate to identical PRRs, and has no support for relocating PRRs containing BRAMs or 18×18 Multipliers. The next version, REPLICA2Pro, is similar to REPLICA, but supports relocating PRRs containing BRAMs and 18×18 Multipliers, and targets the Virtex II and Virtex II Pro families of FPGAs. While REPLICA is implemented using an additional FPGA device, REPLICA2Pro is implemented on the same FPGA as the one containing the source and destination PRRs. Both use a custom wrapper to communicate with ICAP. BiRF is yet another hardware-based relocation filter that communicates with the ICAP via a custom wrapper. In addition to Virtex II Pro FPGAs, this approach can target the Virtex 4 and 5 series of FPGAs. A software-based approach that performs relocation by using an embedded processor (Microblaze) to transform the relocatable bitstream has also been proposed. Communication with ICAP is provided via the Xilinx HWICAP wrapper. Prior work has also transformed the relocatable bitstream on an embedded Microblaze processor; however, it relies on on-chip BRAM to store a copy of the bitstream, and it targets the Virtex 4 series of FPGAs. Another method differs from all of the above techniques in that it can relocate to non-identical regions on a device. It reads a bitstream from off-chip Flash memory and relocates it using software running on an embedded Microblaze processor that talks to the HWICAP wrapper. All of the above techniques rely on reading a copy of a bitstream residing in memory. Memory requirements are satisfied in two ways: (i) using on-chip BRAMs, which are limited and expensive, and (ii) using off-chip memories, which are slow. We disclose a novel PRR-PRR relocation technique that reads frame data (not the entire partial bitstream) directly from an active PRR and relocates it to a destination PRR on the fly, thus accelerating the relocation and removing the need to store any temporary copies of bitstreams.
We have realized embodiments of this technique both in hardware and in software. An analytical model is used to evaluate the performance of the PRR-PRR relocation algorithm and to highlight the speed-up obtained by the proposed hardware implementation.
The FAR Generator 208 is responsible for decoding SrcPRR and DestPRR and using the decoded information to generate the complete sequence of frame addresses for the source 204 and destination PRRs 205, 206. Functionality of the FAR Generator 208 is shown in
The architecture for the Relocator module 101 is governed by a state machine 207. Based on the values of FARSrc and FARDest, the Relocator module 101 reads one frame from the source PRR and writes the frame to the destination PRR. Functionality of the Relocator module is split into two phases: (i) a readback phase (Read_Done=0) and (ii) a write phase (Read_Done=1). During the readback phase, the Relocator module sets the mode of ICAP 209 operation (ICAP_MODE) to “write” and then sends the Readback Command Sequence (RCS) to ICAP. The RCS consists of the following: (a) commands to synchronize with the ICAP, (b) a command to set the command register (CMD) to read configuration, (c) FARSrc, and (d) the number of words to read from ICAP. After sending the RCS, the Relocator sets the ICAP into “read” mode to read one frame. Reading one frame from ICAP requires reading a combination of 83 words that includes one dummy word, one pad frame (41 words) and one data frame (41 words). This combination is referred to as Frame Data (FD). A Block RAM (BRAM) module is used to temporarily store the FD. After the FD is read, the Relocator sets the ICAP_MODE to “write” 402 and sends the de-sync commands to ICAP. The readback phase is now complete and the write phase begins. In this phase, a Write Command Sequence (WCS), which contains FARDest, is written to the ICAP 408. The FD is then fetched from BRAM and sent to the ICAP in a specific order 407: the data frame is written first, followed by the pad frame. The de-sync commands are then sent to the ICAP, after which the Relocatordone signal is sent to the FAR Generator, which generates the next pair of FARs. This process continues until all the frames in the source PRR are relocated to the destination PRR, after which the FAR Generator sends a ‘done’ signal to the Microblaze. Additional processing is required to relocate the design if the source and destination regions are located on opposite halves of the chip.
Data coming out of the ICAP needs to be bit reversed 103 and stored in the BRAM as a mirror image to the actual frame. In the proposed architecture, this processing is performed on the fly, thereby removing any possible timing overhead at the cost of minimal area overhead (for bit reversal).
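The mirror-image storage described above can be illustrated with a minimal Python sketch. It assumes 32-bit ICAP words and reads "mirror image" as reversing both the bit order within each word and the word order within the 41-word frame; that reading, along with the names reverse_bits32 and mirror_frame, is an assumption made for illustration and is not taken from the disclosed circuit, which performs the reversal in hardware on the fly.

```python
WORD_BITS = 32      # assumed ICAP word width (not stated in the text)
FRAME_WORDS = 41    # words per frame, as stated in the text


def reverse_bits32(word: int) -> int:
    """Return the 32-bit value whose bit i equals bit (31 - i) of `word`."""
    result = 0
    for _ in range(WORD_BITS):
        result = (result << 1) | (word & 1)  # shift in the current LSB
        word >>= 1
    return result


def mirror_frame(frame: list[int]) -> list[int]:
    """Mirror a frame: reverse the word order and bit-reverse each word.

    This is one plausible reading of "mirror image"; it is an assumption.
    """
    if len(frame) != FRAME_WORDS:
        raise ValueError("expected a 41-word frame")
    return [reverse_bits32(w) for w in reversed(frame)]
```

Under this reading, mirroring is its own inverse, so applying mirror_frame twice returns the original frame.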
The ICAP wrapper acts as an interface between the Relocator and the ICAP ports (data and control). It decodes the information sent by the Relocator (ICAP_MODE) to generate the control signals for ICAP.
A partial bitstream associated with a PRR can be described as a combination of two components: (i) frame data (FD) and (ii) commands to synchronize/desynchronize with the ICAP, write a frame, and perform cyclic redundancy check (CRC) processing. We access FD from an active PRR, and write it back to an identical destination PRR. Source and destination addresses are generated on the fly.
Overall time taken to relocate all the frames in the source PRR is calculated as shown in Equation 1.
$$T_{Overall} = nFrames \times \left(T_{readFD} + T_{writeFD} + isOppHalf + T_{bitReversal}\right) \qquad \text{(Equation 1)}$$
Reading FD from ICAP is a three-step process. First, a sequence of set-up commands to synchronize with ICAP and set it in “read” mode is generated and written to ICAP. This is followed by the actual process of reading the FD from ICAP and storing it in a buffer. Finally, a sequence of desynchronization commands is generated and sent to ICAP to terminate the reading process. Writing data to ICAP is a similar process; the only difference lies in the sequence of set-up commands sent to the ICAP. The time taken to read FD is computed as the sum of the last five variables listed in Table 1. The time taken to write FD is computed similarly.
There are three fundamental components of the proposed performance model: Tgenα, TwriteICAPβ, and TreadICAPγ. Each of these fundamental components (e.g., TwriteICAPβ) depends on the number of words in the data being processed (β) and is computed as the sum of ToverheadW and Twrite(χ). Here, ToverheadW is the time taken to write zero words to ICAP; in other words, it is the time taken to start writing to the ICAP. Twrite(χ) is the time taken to write χ words to the ICAP, where χ is the number of words in the data being written to ICAP (β). Both ToverheadW and Twrite(χ) depend on the type of implementation and the type of interface used to communicate with ICAP. Similar formulas are used to compute Tgenα and TreadICAPγ.
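As a rough illustration of how the model composes, the following Python sketch computes one component as a fixed overhead plus a per-word cost and then combines the per-frame terms of Equation 1. The function and parameter names are hypothetical, and the sketch treats isOppHalf as a flag that includes the bit-reversal term only when source and destination lie on opposite halves of the chip, which is an interpretation based on the statement that bit reversal is needed only in that case. The default per-word cost of χ cycles corresponds to the ARC case; a software implementation would pass a different function.

```python
from typing import Callable


def t_component(overhead_cycles: int, words: int,
                per_word: Callable[[int], int] = lambda chi: chi) -> int:
    """Time for one component: start-up overhead plus the cost of `words` words.

    `per_word` maps the word count (chi) to clock cycles. The default of
    chi cycles models the hardware (ARC) case described in the text.
    """
    return overhead_cycles + per_word(words)


def overall_time(n_frames: int, t_read_fd: int, t_write_fd: int,
                 t_bit_reversal: int, is_opp_half: bool) -> int:
    """Total cycles to relocate `n_frames` frames (interpretation of Equation 1).

    Assumption: the bit-reversal time is counted only when `is_opp_half`
    is true.
    """
    per_frame = t_read_fd + t_write_fd
    if is_opp_half:
        per_frame += t_bit_reversal
    return n_frames * per_frame
```

For example, t_component(5, 83) models writing or reading the 83-word FD with a placeholder overhead of 5 cycles; the overhead value is illustrative and is not taken from Table 2.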
Based on the values of FARSrc and FARDest, the Relocator module reads one frame from the source PRR and writes the frame to the destination PRR 102. Functionality of the Relocator module is split into two phases: (i) a read phase and (ii) a write phase. During the read phase, the Relocator module sets the mode of ICAP operation (ICAP_MODE) to “write” 402 and then sends the sequence of commands to set up the ICAP for reading. After sending this sequence, the Relocator sets the ICAP into “read” mode to read one frame. To read one frame from ICAP, we read a combination of 83 words that includes one dummy word, one pad frame (41 words) and one data frame (41 words). In this disclosure, this combination is referred to as FD. A BRAM module 211 is used to temporarily store the FD. After the FD is read, the Relocator sets the ICAP_MODE to “write” and sends the de-synchronization commands to ICAP. The read phase is now complete and the write phase begins. In this phase, a sequence of commands to set up the ICAP for writing is sent to the ICAP. The FD is then fetched from BRAM and sent to the ICAP in a specific order: the data frame is written first, followed by the pad frame. The de-synchronization commands are then sent to the ICAP, after which the Relocatordone signal is sent to the FAR Generator, which generates the next pair of FARs. This process continues until all the frames in the source PRR are relocated to the destination PRR, after which the FAR Generator sends a ‘done’ signal to the top-level controller. Additional processing is required to relocate the design if the source and destination regions are located on opposite halves of the chip. Data coming out of the ICAP needs to be bit reversed 103 and stored in the BRAM as a mirror image of the actual frame 104. In the proposed architecture, this processing is performed on the fly, thereby removing any possible timing overhead at the cost of minimal area overhead (for bit reversal).
The ICAP wrapper 202 acts as a simple interface between the Relocator and the ICAP ports (data and control). It decodes the information sent by the Relocator (ICAP_MODE) to generate the control signals for ICAP 203.
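The read and write phases described above can be modeled in software as a simple loop over pairs of frame addresses. The Python sketch below is a behavioral illustration only: MockConfigMemory is a hypothetical stand-in for the configuration memory behind the ICAP, the synchronization, set-up and de-synchronization command sequences are omitted, and the readback word order (dummy word, then pad frame, then data frame) is assumed to follow the order in which the text lists them.

```python
FRAME_WORDS = 41
FD_WORDS = 1 + 2 * FRAME_WORDS  # dummy word + pad frame + data frame = 83


class MockConfigMemory:
    """Hypothetical stand-in for the configuration memory behind the ICAP."""

    def __init__(self):
        self.frames = {}  # frame address -> list of 41 data words

    def read_fd(self, far):
        """Return 83 words: one dummy word, a pad frame, then the data frame."""
        data = self.frames.get(far, [0] * FRAME_WORDS)
        return [0] + [0] * FRAME_WORDS + list(data)

    def write_fd(self, far, data_frame, pad_frame):
        """Accept the data frame first, followed by the pad frame."""
        if len(data_frame) != FRAME_WORDS or len(pad_frame) != FRAME_WORDS:
            raise ValueError("each frame must contain 41 words")
        self.frames[far] = list(data_frame)  # the pad frame only flushes the write


def relocate(mem, far_pairs):
    """Copy one frame per (FARSrc, FARDest) pair from source to destination."""
    for far_src, far_dest in far_pairs:
        fd = mem.read_fd(far_src)              # read phase: buffer the FD
        if len(fd) != FD_WORDS:
            raise ValueError("expected 83 words of frame data")
        pad_frame = fd[1:1 + FRAME_WORDS]      # skip the dummy word
        data_frame = fd[1 + FRAME_WORDS:]
        mem.write_fd(far_dest, data_frame, pad_frame)  # write phase
```

In the disclosed hardware, the buffer corresponds to the BRAM module and the address pairs are produced one at a time by the FAR Generator; here they are simply passed in as a list.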
This embodiment is executed on a Xilinx Microblaze processor that talks to the ICAP using a proprietary hardware ICAP (HWICAP) core via the on-chip peripheral bus (OPB). Low-level device drivers are provided by Xilinx to communicate with HWICAP, and we use these drivers to read all the frames from the source PRR and write them to an identical destination PRR.
A comparative performance analysis of the hardware and software implementations of the PRR-PRR relocation algorithm is provided here. Performance is estimated using the proposed analytical model for relocating a single frame. Table 2 shows a comparative listing (software vs. ARC) of the various timing estimates for the variables defined in the proposed model.
At different stages in the relocation process, a sequence of commands is generated. In the software implementation, the commands are generated in sequence and written to a buffer before being written to ICAP 203. In hardware, the commands are hardcoded and written directly to ICAP 203. As a result, Tgenα values for the software implementation are much higher (for different values of α). The software implementation also incurs considerable overhead in communicating with ICAP (ToverheadW and ToverheadR); the corresponding numbers for the hardware implementation are much smaller. Once the ICAP is ready, the time taken to write (or read) χ words is χ clock cycles in the case of ARC and is some function of χ in the case of software. Table 2 lists the values for the other variables in the performance model and also lists the overall time. In this table, some values are represented as fi(χ), which indicates that the value is a function of the number of words (χ) and is much larger than χ. When relocating to the opposite half of the FPGA, bit reversal needs to be performed. This is a time-consuming process in software, as it involves reading the sequence of bits from the frame buffer into a temporary buffer, reversing the bits, and then storing the result back into the original buffer. In a software implementation, this process involves a large number of sequential memory transactions and takes 13310 clock cycles. In hardware, bit reversal is performed on the fly and does not require any additional clock cycles. The overall time taken for software is estimated to be 68× larger than that of ARC.
The disclosed method and hardware approaches were implemented and tested to run at 100 MHz on a Virtex 4 SX35 FPGA. Xilinx ISE tool flow is used to synthesize, map, place and route the design. Test cases used to evaluate the different approaches are of two types, as listed below.
The method is applicable to any FPGA as long as the source and destination PRRs are floorplanned to have an identical set of device primitives and routing resources. Accelerating relocation can have a major impact on performance under two conditions: (i) relocation time is comparable to actual execution time and (ii) fast relocation is required to respond to a particular event.
This specification fully discloses the invention including preferred embodiments thereof. The examples and embodiments disclosed herein are to be construed as merely illustrative and not a limitation of the scope of the present invention in any way. It will be obvious to those having skill in the art that many changes may be made to the details of the above-described embodiments without departing from the underlying principles of the invention.
This application claims priority to U.S. Provisional Patent Application No. 61/249,071 titled “Accelerated Relocation Circuit” filed on Oct. 6, 2009, which is hereby incorporated herein by reference.
The U.S. Government has a paid-up license in this invention and the right in limited circumstances to require the patent owner to license others on reasonable terms as provided for by the terms of NNGO6GE54G awarded by NASA.
Number | Date | Country
---|---|---
61249071 | Oct 2009 | US