The present invention relates to processing systems on an integrated circuit that include an array processor as a functional unit or coprocessor, and particularly to integrated systems that include a reconfigurable array processor.
An embedded system is some combination of hardware or software that is specifically designed for a particular purpose or application within an overall system, and may be fixed in capability or programmable. A mobile phone may, for example, have a power saving integrated circuit (IC) or “chip” operable only with its respective type of phone and devoted exclusively to controlling the display and other elements to conserve power.
The same mobile phone typically includes a digital signal processing integrated circuit, which executes the functions on a digital portion of the radio. In order to adapt to different and/or changing radio broadcast formats of an incoming signal, programmable radios would be desirable. However, digital radio processing functions can entail high data sample rates, along with high computational loads, that are typically impractical to implement on programmable hardware.
A typical approach to accommodate the computational load within the capabilities of the programmable hardware is to design hardware acceleration modules that specialize in efficient computation of algorithms with high data rates and/or high computational rates. The accelerators may be interfaced with the programmable processor using a number of techniques, each of which allows the programmable processor to control the operation of the accelerator, as well as to properly schedule the data to be exchanged with the accelerator. For instance, a general purpose DSP or other host may have a set of internal register addresses that are visible within the instruction set of the processor, but are mapped to input and output ports of a coprocessor interface. The accelerator inputs and outputs may be connected to this interface, and process data under control of the programmable processor. In this way proper data exchange is programmable by the general purpose device.
In another approach the general purpose programmable host or DSP allows new, high-speed functional units to be inserted into its datapath. The functional unit responds to instruction operation codes provided by the hierarchical controller, and exchanges data with internal register files and other units according to the datapath configuration specified by the hierarchical controller.
While these approaches succeed in offloading excess computational loads from a programmable processor, they rely on accelerators with limited or no programmability to execute the computation-intensive tasks. In this manner an important element of the programmability has been lost.
The present invention is directed to the integration of an array processor as a reconfigurable accelerator to a host or main processor, the array processor greatly exceeding the execution processing capacity of the host processor. The coprocessor includes a two-dimensional array of processing cells. The coprocessor is communicatively connected to the host processor by an interface module that has a mechanism for reconfiguring information paths between itself and respective cells on a periphery of the array.
In another aspect, this invention relates to a host or main processor's functional unit, where the host processor is preferably a very long instruction word (VLIW) processor, and the functional unit preferably embodies a two-dimensional array of processing cells having an interface by which information paths to the array through respective cells on a periphery of the array can be reconfigured.
Details of the invention disclosed herein shall be described below, with the aid of the figures listed below, in which same or similar components are denoted by the same reference numbers over the several views:
The IC 102 may, for example, be configured in accordance with the arrangement 10 in
Preferably, inter-cell connection within the array 108 is such that each cell 112 is connected only to cells 112 whose column is the same and whose row is immediately adjacent, and only to cells 112 whose row is the same and whose column is immediately adjacent, to realize a “nearest neighbor” connection architecture, as shown in
In one embodiment, the interface 110 has border cells 114 connected to each respective processing cell 112 on the periphery of the array 108, each border cell 114 having a buffer 116. The periphery preferably consists of those processing cells 112 which are located on the array edges, i.e., in at least one of the first row, last row, first column and last column. Since internal array connection cell-to-cell, under the nearest neighbor scheme, leaves two neighbors missing for each corner cell 112 and one neighbor missing for each other cell 112 on array edges, the missing connections are each made to a corresponding border cell 114.
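The nearest neighbor scheme and the resulting border-cell connections can be illustrated with a minimal sketch. The zero-based (row, column) addressing, the function names and the link labels are assumptions made for illustration only and do not appear in the disclosure; the sketch also assumes an array of at least two rows and two columns.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (row, column), zero-based; hypothetical addressing


def neighbor_links(rows: int, cols: int) -> Dict[Cell, Dict[str, str]]:
    """Classify each cell's four links (N, S, W, E) as going to another
    processing cell ("cell") or off the array edge ("border")."""
    links: Dict[Cell, Dict[str, str]] = {}
    for r in range(rows):
        for c in range(cols):
            candidates = {
                "N": (r - 1, c),
                "S": (r + 1, c),
                "W": (r, c - 1),
                "E": (r, c + 1),
            }
            links[(r, c)] = {
                side: "cell" if 0 <= nr < rows and 0 <= nc < cols else "border"
                for side, (nr, nc) in candidates.items()
            }
    return links


def border_cell_count(rows: int, cols: int) -> int:
    """Count the missing neighbor links; each one is made to a border cell."""
    links = neighbor_links(rows, cols)
    return sum(kind == "border" for cell in links.values() for kind in cell.values())
```

Under these assumptions, an array of R rows and C columns has 2R + 2C missing neighbor links, so a 4-by-4 array would call for 16 border cells, two at each corner cell and one at each remaining edge cell.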
Further included in the interface 110 are input/output (I/O) pads 118, one for each border cell 114, and a crossbar network 120 for reconfigurably connecting each I/O pad 118 one-to-one to a corresponding border cell 114. For each such connection an information path is formed.
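The one-to-one, reconfigurable connection between I/O pads and border cells can be modeled as a permutation. The following is a toy sketch only; the class name `Crossbar`, its methods and the integer indexing of pads and border cells are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, List


class Crossbar:
    """Toy model: a one-to-one mapping from I/O pad index to border cell index."""

    def __init__(self, size: int) -> None:
        self.size = size
        # Assumed default: pad i is connected to border cell i.
        self.pad_to_border: Dict[int, int] = {i: i for i in range(size)}

    def reconfigure(self, mapping: List[int]) -> None:
        """Install a new pattern; mapping[pad] is the border cell for that pad."""
        if sorted(mapping) != list(range(self.size)):
            raise ValueError("each pad must connect to a distinct border cell")
        self.pad_to_border = dict(enumerate(mapping))

    def route(self, pad_values: List[int]) -> List[int]:
        """Deliver the value on each pad to its connected border cell."""
        out = [0] * self.size
        for pad, value in enumerate(pad_values):
            out[self.pad_to_border[pad]] = value
        return out
```

Each entry of the mapping corresponds to one information path. Rejecting any mapping that is not a permutation keeps the connection one-to-one, as the text requires.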
In a preferred embodiment, the array processor 106 is a systolic processing array, a special-purpose system which can be likened to an assembly line for input operands, although operations typically proceed not in a strictly linear direction but in changing directions. In a two-dimensional array of processing cells, differing mathematical operations are performed on the data by different cells, while data proceeds in an orderly, lock-step progression from one cell to another. An example of a systolic array would be one that multiplies matrices. Entries of a row are multiplied by corresponding entries of a column, and the products are summed to produce an ordered column of sums. Efficiency is achieved by arranging operations to be performed in parallel, so that the results are produced in the fewest clock cycles. The '904 application provides another example of a systolic processing array, implementing a 32-tap real finite impulse response (FIR) filter. The filter is enhanced by concatenating other levels, two-dimensional and otherwise, to the original two-dimensional array, border cells being connected to processing cells on the periphery of each level. Such an enhanced array, connected by the border cells 114, is also within the intended scope of the present invention.
In one embodiment, the border cells 114 not only provide input to the array 108 but also provide results of array processing to the I/O pads 118. The border cells 114 receive these results by neighbor-to-neighbor conveyance from the processing cells 112 producing the results. Optionally, the border cell 114 may validate the results and output a data valid signal to the external process, such as the DSP 20.
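The matrix multiplication example can be sketched as a cycle-by-cycle simulation. This is a common textbook arrangement, assumed here for illustration and not necessarily the one contemplated by the disclosure: operands of the first matrix enter at the west edge and move east, operands of the second matrix enter at the north edge and move south, and each cell accumulates one entry of the product. For simplicity every cell performs the same multiply-accumulate operation; the function and variable names are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def systolic_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Simulate an n-by-p grid of multiply-accumulate cells computing a @ b."""
    n, m, p = len(a), len(b), len(b[0])
    acc = [[0] * p for _ in range(n)]    # running sum held in each cell
    a_reg = [[0] * p for _ in range(n)]  # operand in each cell, moving east
    b_reg = [[0] * p for _ in range(n)]  # operand in each cell, moving south

    for t in range(n + m + p - 2):       # clock cycles until the last product
        # Sweep from the far corner so that each cell reads the value its
        # west/north neighbor held on the PREVIOUS cycle (lock-step shift).
        for i in reversed(range(n)):
            for j in reversed(range(p)):
                if j == 0:
                    # West edge: row i is fed with a delay of i cycles.
                    a_in = a[i][t - i] if 0 <= t - i < m else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    # North edge: column j is fed with a delay of j cycles.
                    b_in = b[t - j][j] if 0 <= t - j < m else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in
    return acc
```

With the skewed feeding assumed above, the cell at row i and column j sees the operand pair with inner index k on cycle i + j + k, so an n-by-m times m-by-p product completes in n + m + p - 2 cycles while all cells work in parallel.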
In a preferred embodiment, the IC 102 includes a memory, such as in memory system 50, from which array programs are downloaded by means of a bus 113 to corresponding processing cells 112. The memory is preferably a random access memory (RAM) or other writeable storage device so that updated array programs can be provided, as by an array generator external to the receiver 100.
The system controller, which may be an external processor, passes array programs to a master cell 126 of the embedded array processor 106 over a configuration bus such as the random access configuration bus shown in
The array processor 106 performs mathematical operations whose timing is based on a flow of input operands along the paths providing the operands to the array 108.
Array programs may be prepared using a graphical user interface (GUI) that can edit and show the code to be downloaded to RAM on the IC 102 and then to each processing cell 112.
In an alternative exemplary implementation 300 of the embedded array processor 106 of
The VLIW processor 302 includes an instruction memory 316, an instruction issue register 318, and a shared, multiported register file 320. Also included within the processor 302, and connected to both the register file 320 and the issue register 318 at corresponding issue slots, are a plurality of functional units. Details of this VLIW architecture are provided in commonly owned U.S. Pat. No. 5,974,537, issued Oct. 26, 1999 (hereinafter the '537 patent), the entire disclosure of which is incorporated herein by reference. The functional unit 322 can be realized, for example, as the embedded array processor 106 of
When an array program is updated, as by a user of the chip development platform 309 through interactive utilization of the GUI 314 and by means of the array program generator 310 (steps 406, 408), changes in the program may affect the timing of functional unit 322 input and/or output. The compiler 312 needs to know this timing change for scheduling purposes in forming the VLIW instruction. The array program generator 310 therefore updates this I/O timing data and transmits it to the compiler 312 (step 410). The updated array program is downloaded (step 412), as described above with regard to system initialization. The array program generator 310 determines whether the program change affects a steady state connection pattern of the interface 110. The steady state pattern defines, for example, which I/O pads 118 are connected to which border cells 114 at which stages of a mathematical operation, i.e., the mathematical operation may accept input operands at the array periphery at multiple stages of the operation. If the program update changes the steady state pattern (step 414), the array program generator 310 sends a reconfigure signal to the functional unit 322 (step 416). Preferably, the signal is received by the master cell 126, which then effects the needed connection timings in the crossbar network 120.
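The update sequence of steps 410 through 416 can be summarized in a short sketch. The data structure, its field names and the action strings are hypothetical placeholders chosen for illustration; the disclosure does not specify how the I/O timing data or the steady state pattern is represented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ArrayProgram:
    code: str
    io_timing: int  # hypothetical: cycles between input and output
    # Hypothetical encoding: (stage, I/O pad, border cell) connections.
    steady_state_pattern: Tuple[Tuple[int, int, int], ...]


def update_array_program(old: ArrayProgram, new: ArrayProgram) -> List[str]:
    """Return the actions taken by the array program generator, in order."""
    actions = [
        "send_io_timing_to_compiler",  # step 410: compiler reschedules I/O
        "download_array_program",      # step 412: same path as initialization
    ]
    # Step 414: does the update change the steady state connection pattern?
    if new.steady_state_pattern != old.steady_state_pattern:
        # Step 416: signal the functional unit to reconfigure its interface.
        actions.append("send_reconfigure_signal")
    return actions
```

In this sketch the reconfigure signal is sent only when the connection pattern differs, whereas the timing data is always forwarded so that the compiler schedules against current values.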
Although array program functionality has been described in the context of the VLIW processor 302 of
While there have been shown and described what are considered to be preferred embodiments of the invention, it will, of course, be understood that various modifications and changes in form or detail could readily be made without departing from the spirit of the invention. For example, alternatively implemented, the system controller 104 and RAM may instead reside within the embedded array processor 106. It is therefore intended that the invention be not limited to the exact forms described and illustrated, but should be construed to cover all modifications that may fall within the scope of the appended claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 60432801 | Dec 2002 | US | national |
| 60478333 | Jun 2003 | US | national |

| Filing Document | Filing Date | Country | Kind | 371c Date |
|---|---|---|---|---|
| PCT/IB03/05625 | 11/28/2003 | WO | | 6/13/2005 |