I. Field of the Disclosure
The technology of the disclosure relates generally to execution of dataflow instruction blocks in computer processor cores based on block-based dataflow instruction set architectures (ISAs).
II. Background
Modern computer processors are made up of functional units that perform operations and calculations, such as addition, subtraction, multiplication, and/or logical operations, for executing computer programs. In a conventional computer processor, data paths connecting these functional units are defined by physical circuits, and thus are fixed. This enables the computer processor to provide high performance at the cost of reduced hardware flexibility.
One option for combining the high performance of conventional computer processors with the ability to modify dataflow between functional units is a coarse-grained reconfigurable array (CGRA). A CGRA is a computer processing structure consisting of an array of functional units that are interconnected by a configurable, scalable network (such as a mesh, as a non-limiting example). Each functional unit within the CGRA is directly connected to its neighboring units, and is capable of being configured to execute conventional word-level operations such as addition, subtraction, multiplication, and/or logical operations. By appropriately configuring each functional unit and the network that interconnects them, operand values may be generated by “producer” functional units and routed to “consumer” functional units. In this manner, a CGRA may be dynamically configured to reproduce the functionality of different types of compound functional units without requiring operations such as per-instruction fetching, decoding, register reading and renaming, and scheduling. Accordingly, CGRAs may represent an attractive option for providing high processing performance while reducing power consumption and chip area.
However, widespread adoption of CGRAs has been hampered by a lack of architectural support for abstracting and exposing CGRA configuration to compilers and programmers. In particular, conventional block-based dataflow instruction set architectures (ISAs) lack the syntactic and semantic capabilities to enable programs to detect the existence and configuration of a CGRA. As a consequence, a program that has been compiled to use a CGRA for processing is unable to execute on a computer processor that does not provide a CGRA. Moreover, even if a CGRA is provided by the computer processor, the resources of the CGRA must match exactly the configuration expected by the program for the program to be able to execute successfully.
Aspects disclosed in the detailed description include configuring coarse-grained reconfigurable arrays (CGRAs) for dataflow instruction block execution in block-based dataflow instruction set architectures (ISAs). In one aspect, a CGRA configuration circuit is provided in a block-based dataflow ISA. The CGRA configuration circuit is configured to dynamically configure a CGRA to provide the functionality of a dataflow instruction block. The CGRA comprises an array of tiles, each of which provides a functional unit and a switch. An instruction decoding circuit of the CGRA configuration circuit maps each dataflow instruction within the dataflow instruction block to one of the tiles of the CGRA. The instruction decoding circuit then decodes each dataflow instruction, and generates a function control configuration for the functional unit of the tile corresponding to the dataflow instruction. The function control configuration may be used to configure the functional unit to provide the functionality of the dataflow instruction. The instruction decoding circuit further generates a switch control configuration of the switch of each of one or more path tiles of the CGRA to route an output of the functional unit of the mapped tile to a destination tile of the CGRA corresponding to each consumer instruction of the dataflow instruction (i.e., other dataflow instructions within the dataflow instruction block that take an output of the dataflow instruction as input). In some aspects, before generating the switch control configuration, the instruction decoding circuit may determine destination tiles of the CGRA corresponding to each consumer instruction of the dataflow instruction. Path tiles that represent a path within the CGRA from the tile mapped to the dataflow instruction to each destination tile may then be determined. 
In this manner, the CGRA configuration circuit dynamically generates a configuration for the CGRA that reproduces the functionality of the dataflow instruction block, thus enabling the block-based dataflow ISA to exploit the processing functionality of the CGRA efficiently and transparently.
In another aspect, a CGRA configuration circuit of a block-based dataflow ISA is disclosed. The CGRA configuration circuit comprises a CGRA comprising a plurality of tiles, each tile of the plurality of tiles comprising a functional unit and a switch. The CGRA configuration circuit further comprises an instruction decoding circuit. The instruction decoding circuit is configured to receive, from a block-based dataflow computer processor core, a dataflow instruction block comprising a plurality of dataflow instructions. The instruction decoding circuit is further configured to, for each dataflow instruction of the plurality of dataflow instructions, map the dataflow instruction to a tile of the plurality of tiles of the CGRA, and decode the dataflow instruction. The instruction decoding circuit is also configured to generate a function control configuration of the functional unit of the mapped tile to correspond to a functionality of the dataflow instruction. The instruction decoding circuit is additionally configured to, for each consumer instruction of the dataflow instruction, generate a switch control configuration of the switch of each of one or more path tiles of the plurality of tiles of the CGRA to route an output of the functional unit of the mapped tile to a destination tile of the plurality of tiles of the CGRA corresponding to the consumer instruction.
In another aspect, a method for configuring a CGRA for dataflow instruction block execution in a block-based dataflow ISA is provided. The method comprises receiving, by an instruction decoding circuit from a block-based dataflow computer processor core, a dataflow instruction block comprising a plurality of dataflow instructions. The method further comprises, for each dataflow instruction of the plurality of dataflow instructions, mapping the dataflow instruction to a tile of a plurality of tiles of a CGRA, each tile of the plurality of tiles comprising a functional unit and a switch. The method also comprises decoding the dataflow instruction, and generating a function control configuration of the functional unit of the mapped tile to correspond to a functionality of the dataflow instruction. The method additionally comprises, for each consumer instruction of the dataflow instruction, generating a switch control configuration of the switch of each of one or more path tiles of the plurality of tiles of the CGRA to route an output of the functional unit of the mapped tile to a destination tile of the plurality of tiles of the CGRA corresponding to the consumer instruction.
In another aspect, a CGRA configuration circuit of a block-based dataflow ISA for configuring a CGRA comprising a plurality of tiles, each tile of the plurality of tiles comprising a functional unit and a switch, is provided. The CGRA configuration circuit comprises a means for receiving, from a block-based dataflow computer processor core, a dataflow instruction block comprising a plurality of dataflow instructions. The CGRA configuration circuit further comprises, for each dataflow instruction of the plurality of dataflow instructions, a means for mapping the dataflow instruction to a tile of a plurality of tiles of a CGRA, and a means for decoding the dataflow instruction. The CGRA configuration circuit also comprises a means for generating a function control configuration of the functional unit of the mapped tile to correspond to a functionality of the dataflow instruction. The CGRA configuration circuit additionally comprises, for each consumer instruction of the dataflow instruction, a means for generating a switch control configuration of the switch of each of one or more path tiles of the plurality of tiles of the CGRA to route an output of the functional unit of the mapped tile to a destination tile of the plurality of tiles of the CGRA corresponding to the consumer instruction.
With reference now to the drawing figures, several exemplary aspects of the present disclosure are described. The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
Before exemplary elements and operations of a CGRA configuration circuit are discussed, an exemplary block-based dataflow computer processor core based on a block-based dataflow ISA (e.g., the E2 microarchitecture, as a non-limiting example) is described. As discussed in greater detail below with respect to
In this regard,
As noted above, the block-based dataflow computer processor core 100 is based on a block-based dataflow ISA. As used herein, a “block-based dataflow ISA” is an ISA in which a computer program is divided into dataflow instruction blocks, each of which comprises multiple dataflow instructions that are executed atomically. Each dataflow instruction explicitly encodes information regarding producer/consumer relationships between itself and other dataflow instructions within the dataflow instruction block. The dataflow instructions are executed in an order determined by the availability of input operands (i.e., a dataflow instruction is allowed to execute as soon as all of its input operands are available, regardless of the program order of the dataflow instruction). All register writes and store operations within the dataflow instruction block are buffered until execution of the dataflow instruction block is complete, at which time the register writes and store operations are committed together.
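As a non-limiting illustration, the execution model described above may be sketched in software. The sketch below is a simplified model rather than a description of any particular hardware: the function name run_block, the tuple layout of each instruction, and the register name r3 are hypothetical and are assumed only for illustration. Each instruction fires once all of its operands are available, regardless of its position in the block, and register writes are buffered and committed together when the block completes.

```python
def run_block(instructions, block_inputs, registers):
    """Fire each instruction once its operands exist; commit writes at the end.

    instructions maps a name to (operation, operand names, destination
    register or None). All names here are hypothetical.
    """
    values = dict(block_inputs)   # operands produced so far
    pending = dict(instructions)  # instructions that have not fired yet
    buffered_writes = {}          # register writes held until the block completes
    fired_order = []
    while pending:
        ready = [name for name, (_, sources, _) in pending.items()
                 if all(s in values for s in sources)]
        if not ready:
            # Nothing is committed, so the register file is left untouched.
            raise ValueError("operands never became available")
        for name in ready:
            operation, sources, dest = pending.pop(name)
            values[name] = operation(*(values[s] for s in sources))
            fired_order.append(name)
            if dest is not None:
                buffered_writes[dest] = values[name]
    registers.update(buffered_writes)  # commit all writes together
    return fired_order


# Program order lists the multiply first, but it fires last.
block = {
    "I2": (lambda a, b: a * b, ["I0", "I1"], "r3"),
    "I0": (lambda a, b: a + b, ["x", "y"], None),
    "I1": (lambda a: a - 1, ["I0"], None),
}
registers = {}
order = run_block(block, {"x": 2, "y": 3}, registers)
print(order, registers)  # ['I0', 'I1', 'I2'] {'r3': 20}
```

In this sketch, a block whose operands never become available raises an error before any register is modified, which loosely mirrors the atomic commit behavior of a dataflow instruction block.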
In the example of
In exemplary operation, a dataflow instruction block (not shown) is fetched from the instruction cache 102, and the dataflow instructions (not shown) therein are loaded into one or more of the instruction windows 104(0)-104(3). In some aspects, the dataflow instruction block may have a variable size of between four (4) and 128 dataflow instructions. Each of the instruction windows 104(0)-104(3) forwards an opcode (not shown) corresponding to each dataflow instruction, along with any operands (not shown) and instruction target fields (not shown), to the associated ALUs 108(0)-108(3), the associated registers 110(0)-110(3), or the load/store queue 112, as appropriate. Any results (not shown) from executing each dataflow instruction are then sent to one of the operand buffers 106(0)-106(7) or registers 110(0)-110(3) based on the instruction target fields of the dataflow instruction. Additional dataflow instructions may be queued for execution as results from previous dataflow operations are stored in the operand buffers 106(0)-106(7). In this manner, the block-based dataflow computer processor core 100 may provide high-performance out-of-order (OOO) execution of dataflow instruction blocks.
Programs compiled to employ a CGRA may be able to achieve further performance enhancements when executed by the block-based dataflow computer processor core 100 of
In this regard,
As seen in
Each functional unit 210(0)-210(3) of the tiles 208(0)-208(3) of the CGRA 202 contains logic for implementing a number of conventional word-level operations such as addition, subtraction, multiplication, and/or logical operations, as non-limiting examples. Each functional unit 210(0)-210(3) may be configured using a corresponding function control configuration (FCTL) 214(0)-214(3) to perform one of the supported operations at a time. For example, the functional unit 210(0) first may be configured to operate as a hardware adder by the FCTL 214(0). The FCTL 214(0) later may be modified to configure the functional unit 210(0) to operate as a hardware multiplier for a subsequent operation. In this manner, the functional units 210(0)-210(3) may be reconfigured to perform different operations as specified by the FCTLs 214(0)-214(3).
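As a non-limiting illustration, the reconfiguration of a functional unit by its function control configuration may be modeled as follows. This is a minimal software sketch under stated assumptions: the class name FunctionalUnit, the table SUPPORTED_OPS, and the encoding of a function control configuration as a short string are hypothetical and are not drawn from the disclosure.

```python
import operator

# Hypothetical table of word-level operations a functional unit supports.
SUPPORTED_OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "and": operator.and_,
}


class FunctionalUnit:
    """Toy functional unit that performs one configured operation at a time."""

    def __init__(self):
        self.fctl = None  # current function control configuration

    def configure(self, fctl):
        if fctl not in SUPPORTED_OPS:
            raise ValueError(f"unsupported operation: {fctl}")
        self.fctl = fctl

    def execute(self, a, b):
        if self.fctl is None:
            raise RuntimeError("functional unit has not been configured")
        return SUPPORTED_OPS[self.fctl](a, b)


unit = FunctionalUnit()
unit.configure("add")         # operate as an adder first
sum_result = unit.execute(3, 4)
unit.configure("mul")         # later reconfigured as a multiplier
product_result = unit.execute(3, 4)
print(sum_result, product_result)  # 7 12
```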
The switches 212(0)-212(3) of the tiles 208(0)-208(3) are connected to their associated functional units 210(0)-210(3), as indicated by bidirectional arrows 216, 218, 220, and 222. In some aspects, each of the switches 212(0)-212(3) may be connected to the corresponding functional units 210(0)-210(3) via a local port (not shown). The switches 212(0)-212(3) may also be configured using corresponding switch control configurations (SCTLs) 224(0)-224(3) to connect to all neighboring switches 212(0)-212(3). Thus, in the example of
In some aspects, the switches 212(0)-212(3) may be connected via ports (not shown) referred to as north, east, south, and west ports. Accordingly, the switch control configurations 224(0)-224(3) may specify on which ports the corresponding switches 212(0)-212(3) receive input from and/or send output to other switches 212(0)-212(3). As a non-limiting example, the switch control configuration 224(1) may specify that the switch 212(1) will receive input for the functional unit 210(1) from the switch 212(0) via its west port, and may provide output from the functional unit 210(1) to the switch 212(3) via its south port. It is to be understood that the switches 212(0)-212(3) may provide more or fewer ports than illustrated in the example of
The CGRA configuration generated by the CGRA configuration circuit 200 to configure the CGRA 202 to provide the functionality of the dataflow instruction block 206 includes the function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3) of the tiles 208(0)-208(3) of the CGRA 202. To generate the function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3), the CGRA configuration circuit 200 includes an instruction decoding circuit 234. The instruction decoding circuit 234 is configured to receive the dataflow instruction block 206 from the block-based dataflow computer processor core 100, as indicated by arrows 236 and 238. The instruction decoding circuit 234 then maps each of the dataflow instructions 204(0)-204(X) to one of the tiles 208(0)-208(3) of the CGRA 202. It is to be understood that the CGRA 202 is configured to provide a number of tiles 208(0)-208(3) equal to or greater than a number of dataflow instructions 204(0)-204(X) within the dataflow instruction block 206. Some aspects may provide that mapping the dataflow instructions 204(0)-204(X) to the tiles 208(0)-208(3) comprises deriving a column coordinate and a row coordinate for one of the tiles 208(0)-208(3) within the CGRA 202 based on instruction slot numbers or other indices (not shown) for the dataflow instructions 204(0)-204(X). As a non-limiting example, a column coordinate may be calculated as the instruction slot number of one of the dataflow instructions 204(0)-204(X) modulo the width of the CGRA 202, while a row coordinate may be calculated as the integer quotient of the instruction slot number divided by the width of the CGRA 202. Thus, for instance, if the CGRA 202 is two (2) tiles wide and the instruction slot number of the dataflow instruction 204(2) is two (2), the column coordinate is zero (0) and the row coordinate is one (1), and the instruction decoding circuit 234 may map the dataflow instruction 204(2) to the tile 208(2) (i.e., the tile at column 0, row 1).
It is to be understood that other approaches for mapping each of the dataflow instructions 204(0)-204(X) to one of the tiles 208(0)-208(3) may be employed.
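As a non-limiting illustration, the coordinate calculation described above may be sketched as follows. The function name map_slot_to_tile is hypothetical, and the CGRA width of two tiles is an assumption chosen to match a four-tile array; the sketch shows only the modulo-and-divide approach and not any of the other mapping approaches that may be employed.

```python
def map_slot_to_tile(slot, width):
    """Return (column, row) for an instruction slot number.

    column = slot modulo width, row = integer quotient of slot / width.
    """
    if slot < 0 or width <= 0:
        raise ValueError("slot must be non-negative and width positive")
    row, column = divmod(slot, width)
    return column, row


# Assumed 2-wide CGRA: slots 0..3 fill the array row by row.
mapping = [map_slot_to_tile(slot, 2) for slot in range(4)]
print(mapping)  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Under these assumptions, instruction slot number two maps to column 0, row 1, consistent with the example given above.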
The instruction decoding circuit 234 next decodes each of the dataflow instructions 204(0)-204(X). In some aspects, the dataflow instructions 204(0)-204(X) are processed serially, while in other aspects the instruction decoding circuit 234 may be configured to process multiple dataflow instructions 204(0)-204(X) in parallel. Based on the decoding, the instruction decoding circuit 234 generates the function control configurations 214(0)-214(3) corresponding to the tiles 208(0)-208(3) to which the dataflow instructions 204(0)-204(X) are mapped. Each of the function control configurations 214(0)-214(3) configures the corresponding functional unit 210(0)-210(3) of the associated tile 208(0)-208(3) to perform the same operation as the dataflow instruction 204(0)-204(X) mapped to the tile 208(0)-208(3). The instruction decoding circuit 234 further generates the switch control configurations 224(0)-224(3) for the switches 212(0)-212(3) of the tiles 208(0)-208(3) to ensure that an output (not shown), if any, of each functional unit 210(0)-210(3) is routed to one of the tiles 208(0)-208(3) to which a consumer dataflow instruction 204(0)-204(X) is mapped. Operations for mapping and decoding the dataflow instructions 204(0)-204(X) and generating the function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3) are discussed in greater detail below with respect to
In some aspects, the function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3) may be streamed directly into the CGRA 202 by the instruction decoding circuit 234, as indicated by arrow 240. The function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3) may be provided to the CGRA 202 as they are generated by the instruction decoding circuit 234, or a subset or an entire set of the function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3) may be provided at the same time to the CGRA 202. Some aspects may provide that the function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3) generated by the instruction decoding circuit 234 may be output to a CGRA configuration buffer 242, as indicated by arrow 244. The CGRA configuration buffer 242 according to some aspects may comprise a memory array (not shown) indexed with coordinates of the tiles 208(0)-208(3), and configured to store the function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3) for the corresponding tiles 208(0)-208(3). The function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3) may then be provided to the CGRA 202 at a later time, as indicated by arrow 246.
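As a non-limiting illustration, a configuration buffer indexed by tile coordinates may be modeled as follows. This is a sketch under stated assumptions: the class name CgraConfigBuffer, its method names, the string and dictionary encodings of the configurations, and the representation of the CGRA as a dictionary keyed by (column, row) are all hypothetical. The sketch shows configurations being stored as they are generated and provided to the CGRA at a later time.

```python
class CgraConfigBuffer:
    """Toy memory array indexed by tile coordinates (column, row)."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        self._entries = [[None] * width for _ in range(height)]

    def store(self, column, row, fctl, sctl):
        # Reject out-of-range coordinates instead of wrapping around.
        if not (0 <= column < self.width and 0 <= row < self.height):
            raise IndexError("tile coordinates outside the CGRA")
        self._entries[row][column] = (fctl, sctl)

    def load_into(self, cgra):
        """Copy every stored entry into cgra, a dict keyed by (column, row)."""
        for row in range(self.height):
            for column in range(self.width):
                entry = self._entries[row][column]
                if entry is not None:
                    cgra[(column, row)] = entry


buffer = CgraConfigBuffer(width=2, height=2)
buffer.store(0, 0, "add", {"out": ["east", "south"]})
buffer.store(1, 0, "mul", {"in": ["west"]})

cgra = {}               # stands in for the tiles of the CGRA
buffer.load_into(cgra)  # configurations are provided at a later time
print(sorted(cgra))     # [(0, 0), (1, 0)]
```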
In the example of
Some aspects may provide that the CGRA configuration circuit 200 is configured to select, at runtime, either the CGRA 202 or the block-based dataflow computer processor core 100 to execute the dataflow instruction block 206. As a non-limiting example, the CGRA configuration circuit 200 may determine, at runtime, whether the instruction decoding circuit 234 was successful in generating the function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3). If the function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3) were successfully generated, the CGRA configuration circuit 200 selects the CGRA 202 to execute the dataflow instruction block 206. However, if the instruction decoding circuit 234 was unsuccessful in generating the function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3) (e.g., because of an error during decoding), the CGRA configuration circuit 200 selects the block-based dataflow computer processor core 100 to execute the dataflow instruction block 206. In some aspects, the CGRA configuration circuit 200 may also select the block-based dataflow computer processor core 100 to execute the dataflow instruction block 206 if it determines, at runtime, that the CGRA 202 does not provide a required resource needed to execute the dataflow instruction block 206. For instance, the CGRA configuration circuit 200 may determine that the CGRA 202 lacks a sufficient number of functional units 210(0)-210(3) that support a particular operation. In this manner, the CGRA configuration circuit 200 may provide a mechanism for ensuring that the dataflow instruction block 206 is successfully executed.
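As a non-limiting illustration, the runtime selection described above may be sketched as follows. The function name select_target, its parameters, and the returned strings are hypothetical. The resource check shown is a simplification assumed for illustration: it only counts, for each operation, the functional units that support that operation, and it does not attempt a full assignment of instructions to functional units.

```python
from collections import Counter


def select_target(block_ops, unit_capabilities, configs_generated):
    """Choose "cgra" or "core" for a dataflow instruction block.

    block_ops: operation names used by the block (hypothetical encoding).
    unit_capabilities: one set of supported operations per functional unit.
    configs_generated: True if configuration generation succeeded.
    """
    if not configs_generated:
        return "core"  # e.g., an error occurred during decoding
    for op, needed in Counter(block_ops).items():
        available = sum(1 for caps in unit_capabilities if op in caps)
        if available < needed:
            return "core"  # the CGRA lacks a required resource
    return "cgra"


units = [{"add", "mul"}, {"add"}, {"add", "sub"}, {"add"}]
print(select_target(["add", "mul", "sub"], units, True))   # cgra
print(select_target(["mul", "mul"], units, True))          # core
print(select_target(["add"], units, False))                # core
```

In each fallback case the block is still executed, here by the processor core, which reflects the stated goal of ensuring that the dataflow instruction block is successfully executed.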
To provide a simplified illustration of operations for mapping and decoding the dataflow instructions 204(0)-204(X) and generating the function control configurations 214(0)-214(3) and the switch control configurations 224(0)-224(3) of
In
As noted above, in dataflow instruction block execution, each of the dataflow instructions 204(0)-204(2) may execute as soon as all of its input operands are available. In the dataflow instruction block 206 shown in
Referring now to
The instruction decoding circuit 234 of the CGRA configuration circuit 200 next analyzes the dataflow instruction I0 204(0) to identify its consumer instructions. In this example, the dataflow instruction I0 204(0) provides its output to both the dataflow instruction I1 204(1) and the dataflow instruction I2 204(2) (also referred to as “consumer instructions 204(1) and 204(2)”). Based on its analysis, the CGRA configuration circuit 200 identifies the destination tiles 208(1) and 208(2), i.e., the tiles to which the consumer instructions 204(1) and 204(2), respectively, are mapped, and to which the output of the functional unit 210(0) should therefore be sent. The CGRA configuration circuit 200 then determines one or more tiles 208(0)-208(3) (referred to herein as “path tiles”) that comprise a path from the mapped tile 208(0) to each of the destination tiles 208(1) and 208(2). The “path tiles” represent each tile 208(0)-208(3) of the CGRA 202 for which a switch 212(0)-212(3) must be configured in order to route the output of the functional unit 210(0) to the destination tiles 208(1) and 208(2). In some aspects, the path tiles may be determined by finding a path of shortest Manhattan distance between the mapped tile 208(0) and each of the destination tiles 208(1) and 208(2).
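As a non-limiting illustration, one way of determining path tiles along a shortest Manhattan path may be sketched as follows. The sketch rests on several assumptions that are not taken from the disclosure: tiles are identified by (column, row) coordinates, the path travels along the row first and then along the column, the path is taken to include both the mapped tile and the destination tile, and the function names path_tiles and output_ports are hypothetical. Port names follow the north, east, south, and west convention, with increasing row numbers lying to the south.

```python
def path_tiles(src, dst):
    """Return the tiles on one shortest Manhattan path from src to dst."""
    col, row = src
    path = [(col, row)]
    while col != dst[0]:          # travel along the row first
        col += 1 if dst[0] > col else -1
        path.append((col, row))
    while row != dst[1]:          # then travel along the column
        row += 1 if dst[1] > row else -1
        path.append((col, row))
    return path


def output_ports(path):
    """Map each tile on the path (except the last) to its output port."""
    ports = {}
    for (c0, r0), (c1, r1) in zip(path, path[1:]):
        if c1 != c0:
            ports[(c0, r0)] = "east" if c1 > c0 else "west"
        else:
            ports[(c0, r0)] = "south" if r1 > r0 else "north"
    return ports


# Assumed 2x2 array: the producer at (0, 0) feeds consumers at (1, 0) and (0, 1).
for destination in [(1, 0), (0, 1)]:
    route = path_tiles((0, 0), destination)
    print(route, output_ports(route))
```

The number of hops on each returned path equals the Manhattan distance between the two tiles, so no shorter path exists in a mesh in which each switch connects only to its immediate neighbors.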
In the example of
In
As seen in
Referring now to
In some aspects, the instruction decoding circuit 234 may determine whether the CGRA 202 provides a required resource (block 505). Accordingly, the instruction decoding circuit 234 may be referred to herein as “a means for determining, at runtime, whether the CGRA provides a required resource.” The required resource may comprise, for example, a sufficient number of functional units 210(0)-210(3) within the CGRA 202 that support a particular operation. If it is determined at decision block 505 that the CGRA 202 does not provide the required resource, processing proceeds to block 506 of
Referring now to
In
Turning to
Configuring CGRAs for dataflow instruction block execution in block-based dataflow ISAs according to aspects disclosed herein may be provided in or integrated into any processor-based device. Examples, without limitation, include a set top box, an entertainment unit, a navigation device, a communications device, a fixed location data unit, a mobile location data unit, a mobile phone, a cellular phone, a computer, a portable computer, a desktop computer, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a digital video player, a video player, a digital video disc (DVD) player, and a portable digital video player.
In this regard,
Other devices can be connected to the system bus 608. As illustrated in
The CPU(s) 602 may also be configured to access the display controller(s) 620 over the system bus 608 to control information sent to one or more displays 626. The display controller(s) 620 sends information to the display(s) 626 to be displayed via one or more video processors 628, which process the information to be displayed into a format suitable for the display(s) 626. The display(s) 626 can include any type of display, including but not limited to a cathode ray tube (CRT), a liquid crystal display (LCD), a light emitting diode (LED) display, a plasma display, etc.
Those of skill in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithms described in connection with the aspects disclosed herein may be implemented as electronic hardware. The devices described herein may be employed in any circuit, hardware component, integrated circuit (IC), or IC chip, as examples. Memory disclosed herein may be any type and size of memory and may be configured to store any type of information desired. For clarity, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. How such functionality is implemented depends upon the particular application, design choices, and/or design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
It is also noted that the operational steps described in any of the exemplary aspects herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary aspects may be combined. It is to be understood that the operational steps illustrated in the flow chart diagrams may be subject to numerous different modifications as will be readily apparent to one of skill in the art. Those of skill in the art will also understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.